Message 142136 - Python tracker


Author	tchrist
Recipients	mrabarnett, tchrist
Date	2011-08-15.17:48:32
Message-id	<1313430514.3.0.983525514499.issue12753@psf.upfronthosting.co.za>
In-reply-to

Content
Unicode character names share a common namespace with formal aliases and with named sequences, but Python recognizes only the original name. That means not everything in the namespace is accessible from Python.  (If this is construed to be an extant bug rather than an absent feature, you probably want to change this from a wish to a bug in the ticket.)

This is a problem because aliases correct errors in the original names, and are the preferred versions.  For example, ISO screwed up when they called U+01A2 LATIN CAPITAL LETTER OI.  It is actually LATIN CAPITAL LETTER GHA according to the file NameAliases.txt in the Unicode Character Database.  However, Python blows up when you try to use this:

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER GHA}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name
    Exit 1
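
The same check can be made at run time, without tripping a SyntaxError at compile time. This is only a sketch: it relies on `unicodedata.lookup`, which raises `KeyError` for a name the interpreter's database does not know. The helper name `lookup_or_none` is made up for illustration, and what it reports for the alias depends on the Python version and the Unicode data bundled with it.

```python
import unicodedata


def lookup_or_none(name):
    """Return the character for `name`, or None if this interpreter rejects it."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


# Probe the original name and its formal alias side by side.
for name in ("LATIN CAPITAL LETTER OI", "LATIN CAPITAL LETTER GHA"):
    char = lookup_or_none(name)
    if char is None:
        print(f"{name}: not recognized")
    else:
        print(f"{name}: U+{ord(char):04X}")
```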

This is unfortunate, because the formal aliases correct egregious blunders, such as the Standard reading "BRAKCET" instead of "BRACKET":

$ uninames '^\s+%'
 Ƣ  01A2        LATIN CAPITAL LETTER OI
        % LATIN CAPITAL LETTER GHA
 ƣ  01A3        LATIN SMALL LETTER OI
        % LATIN SMALL LETTER GHA
        * Pan-Turkic Latin alphabets
 ೞ  0CDE        KANNADA LETTER FA
        % KANNADA LETTER LLLA
        * obsolete historic letter
        * name is a mistake for LLLA
 ຝ  0E9D        LAO LETTER FO TAM
        % LAO LETTER FO FON
        = fo fa
        * name is a mistake for fo sung
 ຟ  0E9F        LAO LETTER FO SUNG
        % LAO LETTER FO FAY
        * name is a mistake for fo tam
 ຣ  0EA3        LAO LETTER LO LING
        % LAO LETTER RO
        = ro rot
        * name is a mistake, lo ling is the mnemonic for 0EA5
 ລ  0EA5        LAO LETTER LO LOOT
        % LAO LETTER LO
        = lo ling
        * name is a mistake, lo loot is the mnemonic for 0EA3
 ࿐  0FD0        TIBETAN MARK BSKA- SHOG GI MGO RGYAN
        % TIBETAN MARK BKA- SHOG GI MGO RGYAN
        * used in Bhutan
 ꀕ A015        YI SYLLABLE WU
        % YI SYLLABLE ITERATION MARK
        * name is a misnomer
 ︘ FE18        PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
        % PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
        * misspelling of "BRACKET" in character name is a known defect
        # <vertical> 3017
 𝃅  1D0C5       BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
        % BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
        * misspelling of "FTHORA" in character name is a known defect
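
To give a rough idea of how little data is involved, here is a sketch of loading such aliases into a table. It assumes NameAliases.txt uses a semicolon-separated "code point;alias" layout with `#` comments; the sample lines are copied from the listing above rather than read from the real file, and `parse_aliases` is a hypothetical helper, not an existing API.

```python
# Sample lines in the assumed "code point;alias" layout. The entries come
# from the listing above; they are not read from the real NameAliases.txt.
SAMPLE_ALIASES = """\
# illustrative excerpt
01A2;LATIN CAPITAL LETTER GHA
01A3;LATIN SMALL LETTER GHA
FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
"""


def parse_aliases(text):
    """Map each alias name to its single code point."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split(";")
        code_point, alias = fields[0], fields[1]  # ignore any later fields
        aliases[alias.strip()] = int(code_point, 16)
    return aliases


ALIASES = parse_aliases(SAMPLE_ALIASES)
print(f"U+{ALIASES['LATIN CAPITAL LETTER GHA']:04X}")
```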

In Perl, \N{...} grants access to the single, shared, common namespace of Unicode character names, formal aliases, and named sequences without distinction:

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")'
    Ƣ

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'  | uniquote -x
    \x{1A2}
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")' | uniquote -x
    \x{1A2}

It is my suggestion that Python do the same thing. There are currently only 11 of these formal aliases.

The third element in this shared namespace of names, named sequences, consists of multiple code points masquerading under one name.  They come from the NamedSequences.txt file in the Unicode Character Database.  An example entry is:

    LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300
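
An entry like that one is easy to turn into a string. The following is a minimal sketch, assuming every entry has the "NAME;XXXX YYYY ..." shape shown above; `parse_named_sequence` is a name invented here for illustration.

```python
def parse_named_sequence(entry):
    """Split a 'NAME;XXXX YYYY ...' entry into (name, string of code points)."""
    name, code_points = entry.split(";", 1)
    # Each space-separated field is one hexadecimal code point.
    chars = "".join(chr(int(cp, 16)) for cp in code_points.split())
    return name.strip(), chars


name, chars = parse_named_sequence(
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300"
)
print(name, [f"U+{ord(c):04X}" for c in chars])
```

The point to notice is that the result is a two-character string, so whatever \N{...} returns for a named sequence cannot be assumed to have length 1.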

There are 418 of these named sequences as of Unicode 6.0.0.  This shows that Perl can also access named sequences:

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
  Ā̀

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
  \x{100}\x{300}

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")'            
  ㇷ゚

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")' | uniquote -x
  \x{31F7}\x{309A}


Since it is a single namespace, it makes sense that all members of that namespace should be accessible using \N{...} as a sort of equal-opportunity accessor mechanism, and it does not make sense that they not be.
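
As a sketch of what such an equal-opportunity accessor could look like, the function below consults one namespace made of three sources. It is not a proposal for the actual implementation: `resolve`, `ALIAS_TABLE` and `SEQUENCE_TABLE` are hypothetical names, and the two tables are hard-coded with examples from this message where a real version would build them from NameAliases.txt and NamedSequences.txt.

```python
import unicodedata

# Hypothetical tables, hard-coded for illustration only.
ALIAS_TABLE = {"LATIN CAPITAL LETTER GHA": "\u01a2"}
SEQUENCE_TABLE = {
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300",
    "KATAKANA LETTER AINU P": "\u31f7\u309a",
}


def resolve(name):
    """Resolve `name` against names, then aliases, then named sequences."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        pass  # not known to the built-in database; try the extra tables
    for table in (ALIAS_TABLE, SEQUENCE_TABLE):
        if name in table:
            return table[name]
    raise KeyError(f"unknown Unicode character name: {name}")


for name in (
    "LATIN CAPITAL LETTER OI",
    "LATIN CAPITAL LETTER GHA",
    "KATAKANA LETTER AINU P",
):
    print(name, [f"U+{ord(c):04X}" for c in resolve(name)])
```

Because the namespace is shared, no name can appear in more than one source, so the order in which the sources are consulted does not change the result.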

Just make sure you take only the approved named sequences from the NamedSequences.txt file. It would be unwise to give users access to the provisional sequences located in a neighboring file I shall not name :) because those are not guaranteed never to be withdrawn the way the others are, and so you would risk introducing an incompatibility.

If you look at the ICU UCharacter class, you can see that they provide a more

History
Date	User	Action	Args
2011-08-15 17:48:34	tchrist	set	recipients: + tchrist, mrabarnett
2011-08-15 17:48:34	tchrist	set	messageid: <1313430514.3.0.983525514499.issue12753@psf.upfronthosting.co.za>
2011-08-15 17:48:33	tchrist	link	issue12753 messages
2011-08-15 17:48:32	tchrist	create