classification
Title: Can't portably use Unicode in Python identifiers
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, daniel.urban, ezio.melotti, haypo, lemburg, loewis, mrabarnett, python-dev, tchrist, terry.reedy
Priority: normal Keywords:

Created on 2011-08-11 19:49 by tchrist, last changed 2011-08-13 03:18 by python-dev. This issue is now closed.

Files
File name Uploaded Description
badidents.python tchrist, 2011-08-11 19:49 demo of how unreliable python is due to narrow/wide build dependencies
Messages (4)
msg141923 - Author: Tom Christiansen (tchrist) Date: 2011-08-11 19:49
You cannot reliably use Unicode in Python identifiers because of the narrow/wide build issue.  The enclosed file is fine on wide builds but fails to compile on narrow ones.

Go, Ruby, Java, and Perl all handle this situation without any problem; only Python has the bug.
msg141996 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-12 23:05
Ouch!
Do the rejected characters qualify as identifier characters as defined in Reference 2.3 Identifiers and keywords?
http://docs.python.org/py3k/reference/lexical_analysis.html#identifiers
If some interpreter version accepts extra characters, beyond the definition (as happened in 2.x), it is not a bug for another version to only accept what is defined.

Side question: That section has "A non-normative HTML file listing all valid identifier characters for Unicode 4.1 can be found at http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html." Is the set of identifier characters now larger, and if so, has the table been enlarged?
msg142003 - Author: Tom Christiansen (tchrist) Date: 2011-08-13 01:13
"Terry J. Reedy" <report@bugs.python.org> wrote
   on Fri, 12 Aug 2011 23:05:27 -0000: 

> Ouch!

> Do the rejected characters qualify as identifier characters as defined
> in Reference 2.3 Identifiers and keywords?

> http://docs.python.org/py3k/reference/lexical_analysis.html#identifiers

Yes, that's right, they do.  You're using the standard IDS and IDC, and
XIDS and XIDC, definitions.  Here were the three identifiers that were
a problem:

    𝔘𝔫𝔦𝔠𝔬𝔡𝔢     = "super"
    𐐔𐐯𐑅𐐨𐑉𐐯𐐻     = "Deseret"
    𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂  = "Gothic our father"

If you cannot read those, then when piped through `uniquote -v` they are:

    \N{MATHEMATICAL FRAKTUR CAPITAL U}\N{MATHEMATICAL FRAKTUR SMALL N}\N{MATHEMATICAL FRAKTUR SMALL I}\N{MATHEMATICAL FRAKTUR SMALL C}\N{MATHEMATICAL FRAKTUR SMALL O}\N{MATHEMATICAL FRAKTUR SMALL D}\N{MATHEMATICAL FRAKTUR SMALL E}     = "super"
    \N{DESERET CAPITAL LETTER DEE}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER ES}\N{DESERET SMALL LETTER LONG I}\N{DESERET SMALL LETTER ER}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER TEE}     = "Deseret"
    \N{GOTHIC LETTER AHSA}\N{GOTHIC LETTER TEIWS}\N{GOTHIC LETTER TEIWS}\N{GOTHIC LETTER AHSA}\N{UNDERTIE}\N{GOTHIC LETTER URUS}\N{GOTHIC LETTER NAUTHS}\N{GOTHIC LETTER SAUIL}\N{GOTHIC LETTER AHSA}\N{GOTHIC LETTER RAIDA}  = "Gothic our father"

I'm not sure whether you recognize the scripts they belong to, but they're
all in the astral planes.  Using `uniquote -x` on them shows:

    \x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522}     = "super"
    \x{10414}\x{1042F}\x{10445}\x{10428}\x{10449}\x{1042F}\x{1043B}     = "Deseret"
    \x{10330}\x{10344}\x{10344}\x{10330}\x{203F}\x{1033F}\x{1033D}\x{10343}\x{10330}\x{10342}  = "Gothic our father"

As to whether they're proper identifiers per your reference above, I
will take the first letter from each of 𝔘𝔫𝔦𝔠𝔬𝔡𝔢, 𐐔𐐯𐑅𐐨𐑉𐐯𐐻, and 𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂,
which are respectively 𝔘, 𐐔, and 𐌰, or

    MATHEMATICAL FRAKTUR CAPITAL U
    DESERET CAPITAL LETTER DEE
    GOTHIC LETTER AHSA

or 

    1D518
    10414
    10330

and show you the full Unicode properties of these rejected code points.
This requires the uniprops command; with it, these three commands
are completely equivalent:

    % uniprops -ga "𝔘" "𐐔" "𐌰"
    % uniprops -ga 1D518 10414 10330
    % uniprops -ga "MATHEMATICAL FRAKTUR CAPITAL U" "DESERET CAPITAL LETTER DEE" "GOTHIC LETTER AHSA"

and produce this output:

U+1D518 ‹𝔘› \N{MATHEMATICAL FRAKTUR CAPITAL U}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded
       CWKCF Common Zyyy Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Math
       Mathematical_Alphanumeric_Symbols Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
       X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
    Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Mathematical_Alphanumeric_Symbols Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR General_Category=Cased_Letter Script=Common
       Decomposition_Type=Font DT=Font Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon
       East_Asian_Width=Neutral GC=LC General_Category=L General_Category=Letter General_Category=L_ General_Category=LC GC=L
       General_Category=Lu General_Category=Uppercase_Letter GC=Lu Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
       Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
       Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
       Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
       Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy
       Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
U+10414 ‹𐐔› \N{DESERET CAPITAL LETTER DEE}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped
       CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase
       ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS
       X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
    Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR General_Category=Cased_Letter
       Decomposition_Type=None DT=None Script=Deseret East_Asian_Width=Neutral GC=LC General_Category=L General_Category=Letter
       General_Category=L_ General_Category=LC GC=L General_Category=Lu General_Category=Uppercase_Letter GC=Lu
       Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable
       HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL
       Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2
       Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
       Present_In=6.0 IN=6.0 SC=Dsrt Script=Dsrt Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE
       Word_Break=LE _X_Begin
U+10330 ‹𐌰› \N{GOTHIC LETTER AHSA}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InGothic Gothic Is_Gothic L Lo Goth Gr_Base Grapheme_Base Graph GrBase ID_Continue
       IDC ID_Start IDS Letter L_ Other_Letter Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
       X_POSIX_Graph X_POSIX_Print X_POSIX_Word
    Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Gothic Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None
       East_Asian_Width=Neutral General_Category=L General_Category=Letter General_Category=L_ GC=L General_Category=Lo
       General_Category=Other_Letter GC=Lo Script=Gothic Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
       Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
       Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
       Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
       Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 Script=Goth SC=Goth
       Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE Word_Break=LE _X_Begin

As you see, all three are all of IDS, IDC, XIDS, and XIDC.
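For readers without uniprops, roughly the same check can be sketched in Python itself. This is only an illustration, not part of the original report: it assumes Python 3.3 or later (where the narrow/wide distinction no longer exists), and it relies on str.isidentifier(), which tests Python's own XID_Start/XID_Continue-based rule rather than printing the raw Unicode properties.

```python
import unicodedata

# The first code point of each rejected identifier, taken from the report.
CODE_POINTS = (0x1D518, 0x10414, 0x10330)

rows = []
for cp in CODE_POINTS:
    ch = chr(cp)
    # Character name, general category, and whether the character alone
    # is accepted as an identifier (i.e. it is a valid start character).
    rows.append((cp, unicodedata.name(ch), unicodedata.category(ch), ch.isidentifier()))

for cp, name, category, ok in rows:
    print(f"U+{cp:05X} {name} GC={category} identifier_start={ok}")
```

On such an interpreter all three are reported as valid identifier start characters, which agrees with the uniprops output above.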

The reason they're failing is that on a narrow build, Python bizarrely
splits the code points into two surrogates, and then rejects them because
surrogates aren't IDanything.
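The split described here can be reproduced by hand. The sketch below is an illustration only (the helper name to_surrogates is made up for this example and is not the interpreter's code): it applies the standard UTF-16 arithmetic to each astral code point and then asks whether the whole character, and each surrogate half on its own, would pass as an identifier.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

for cp in (0x1D518, 0x10414, 0x10330):
    high, low = to_surrogates(cp)
    print(
        f"U+{cp:05X} -> {high:04X} {low:04X}",
        "whole:", chr(cp).isidentifier(),
        "high:", chr(high).isidentifier(),
        "low:", chr(low).isidentifier(),
    )
```

The whole character passes while each half fails, so a tokenizer that tests UTF-16 code units one at a time would reject these identifiers.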

This strikes me as super crazy because you are reading from a UTF-8 source,
which therefore can handle all of Unicode.  And you actually split the things
into two UTF-16 code units, too; you don't blow up just because there is
something that UCS-2 can't cope with.  Indeed, I can switch things
around into literals and there is no choking:

    % cat astral-literals.python
    #!/usr/bin/env python3.2
    # -*- coding: UTF-8 -*-
    super = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢"
    Deseret = "𐐔𐐯𐑅𐐨𐑉𐐯𐐻"
    Gothic_our_father = "𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂"
    print(super)
    print(Deseret)
    print(Gothic_our_father)

because watch what happens when I run that: mirabile visu, it behaves
completely correctly despite having code points in the narrow-build-
forbidden astral planes:

    % python3.2 astral-literals.python
    𝔘𝔫𝔦𝔠𝔬𝔡𝔢
    𐐔𐐯𐑅𐐨𐑉𐐯𐐻
    𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂

Again, I uniquote those for you (in case you can't read them because you
haven't gotten George Douros's free Symbola font for Unicode 6.0.0 yet from
http://users.teilar.gr/~g1951d/ . BTW, I also recommend his Alfios font for
general text, and he has several others that may be of interest):

    % python3.2 astral-literals.python | uniquote -x
    \x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522}
    \x{10414}\x{1042F}\x{10445}\x{10428}\x{10449}\x{1042F}\x{1043B}
    \x{10330}\x{10344}\x{10344}\x{10330}\x{203F}\x{1033F}\x{1033D}\x{10343}\x{10330}\x{10342}

    % python3.2 astral-literals.python | uniquote -v
    \N{MATHEMATICAL FRAKTUR CAPITAL U}\N{MATHEMATICAL FRAKTUR SMALL N}\N{MATHEMATICAL FRAKTUR SMALL I}\N{MATHEMATICAL FRAKTUR SMALL C}\N{MATHEMATICAL FRAKTUR SMALL O}\N{MATHEMATICAL FRAKTUR SMALL D}\N{MATHEMATICAL FRAKTUR SMALL E}
    \N{DESERET CAPITAL LETTER DEE}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER ES}\N{DESERET SMALL LETTER LONG I}\N{DESERET SMALL LETTER ER}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER TEE}
    \N{GOTHIC LETTER AHSA}\N{GOTHIC LETTER TEIWS}\N{GOTHIC LETTER TEIWS}\N{GOTHIC LETTER AHSA}\N{UNDERTIE}\N{GOTHIC LETTER URUS}\N{GOTHIC LETTER NAUTHS}\N{GOTHIC LETTER SAUIL}\N{GOTHIC LETTER AHSA}\N{GOTHIC LETTER RAIDA}

Isn't that the weirdest thing?  You're acting like a perfectly fine
full-unicode language even on a narrow build, and yet you reject
as identifiers those same strings that you have just faithfully
reproduced above.  

I really do not understand this.  It must be a bug.  Because they are
fine on a wide build.  And because the literals are fine even on a
narrow one.

> If some interpreter version accepts extra characters, beyond the
> definition (as happened in 2.x), it is not a bug for another version
> to only accept what is defined.

> Side question: That section has 

>	"A non-normative HTML file listing all valid identifier characters 
>        for Unicode 4.1 can be found 
>        at http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html ." 

Gosh that's old.  Unicode *4.1*, really?  I sure hope you don't rely 
on something like that to know what's what!!

> Is the set of identifier characters now larger, and if so, has the table been enlarged?

Certainly!  I know this for certain science because I know that more
letters have been added.  But here is the proof:

    % unichars -ua '\p{ID_Start}' | wc -l
    100747

    % unichars -ua '\p{ID_Continue}' | wc -l
    102675
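A rough Python counterpart of these counts, for anyone without unichars. This is a sketch under assumptions: it counts what str.isidentifier() accepts (XID_Start plus the underscore, and XID_Continue), not the ID_Start/ID_Continue properties queried above, and the totals depend on the Unicode version bundled with the interpreter, so no expected numbers are given.

```python
import sys
import unicodedata

# Characters accepted as the first character of an identifier.
start_count = sum(
    1 for cp in range(sys.maxunicode + 1) if chr(cp).isidentifier()
)

# Characters accepted after a valid first character ("a" here).
continue_count = sum(
    1 for cp in range(sys.maxunicode + 1) if ("a" + chr(cp)).isidentifier()
)

print("Unicode version:", unicodedata.unidata_version)
print("identifier start characters:   ", start_count)
print("identifier continue characters:", continue_count)
```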

Notice how Unicode has been a bit oopsy about making sure that
all IDS/IDC code points also have the word property:

    % unichars -ua '\w' | wc -l
    102724

I'm pretty sure a few of those are slated for "fixing", because
I've seen proposals to that effect.  Here's one "letter" they 
even left out:

    % unichars -gs '\w' '\pL' '\P{IDC}'
    ⸯ  U+02E2F GC=Lm SC=Common       VERTICAL TILDE

That says to show me the code points which are

    word chars
    letters
    *not* IDC

Before I started poking at them, nobody ran these sort of analyses
so things could slip through.  Even more annoyingly, here are non-word
characters that are indeed IDC characters:

    % unichars -gs '\W' '\p{IDC}'
    ·  U+000B7 GC=Po SC=Common       MIDDLE DOT
    ·  U+00387 GC=Po SC=Common       GREEK ANO TELEIA
    ፩  U+01369 GC=No SC=Ethiopic     ETHIOPIC DIGIT ONE
    ፪  U+0136A GC=No SC=Ethiopic     ETHIOPIC DIGIT TWO
    ፫  U+0136B GC=No SC=Ethiopic     ETHIOPIC DIGIT THREE
    ፬  U+0136C GC=No SC=Ethiopic     ETHIOPIC DIGIT FOUR
    ፭  U+0136D GC=No SC=Ethiopic     ETHIOPIC DIGIT FIVE
    ፮  U+0136E GC=No SC=Ethiopic     ETHIOPIC DIGIT SIX
    ፯  U+0136F GC=No SC=Ethiopic     ETHIOPIC DIGIT SEVEN
    ፰  U+01370 GC=No SC=Ethiopic     ETHIOPIC DIGIT EIGHT
    ፱  U+01371 GC=No SC=Ethiopic     ETHIOPIC DIGIT NINE
    ᧚  U+019DA GC=No SC=New_Tai_Lue  NEW TAI LUE THAM DIGIT ONE
    ℘  U+02118 GC=Sm SC=Common       SCRIPT CAPITAL P
    ℮  U+0212E GC=So SC=Common       ESTIMATED SYMBOL
    ゛ U+0309B GC=Sk SC=Common       KATAKANA-HIRAGANA VOICED SOUND MARK
    ゜ U+0309C GC=Sk SC=Common       KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Digits are supposed to be GC=Nd not GC=No, so I don't know what the story 
is there. Maybe they don't have a zero?

    % uniprops 'ETHIOPIC DIGIT ZERO'
    uniprops: no character named ‹ETHIOPIC DIGIT ZERO›

    % uniprops 'ETHIOPIC DIGIT ONE'
    U+1369 ‹፩› \N{ETHIOPIC DIGIT ONE}
        \pN \p{No}
        All Any Assigned InEthiopic Ethiopic Is_Ethiopic Ethi N No Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC Number
           Other_Number Print XID_Continue XIDC X_POSIX_Graph X_POSIX_Print

I guess that's it then: you can't assemble bigendian base-10 integers 
out of digits whose set lacks a zero, which is the key criterion for GC=Nd
and for NT=De (Numeric_Type=Decimal), as in regular digits:

    % uniprops -ga 1
U+0031 ‹1› \N{DIGIT ONE}
    \w \d \pN \p{Nd}
    AHex ASCII_Hex_Digit All Any Alnum ASCII Assigned Basic_Latin Common Zyyy Decimal_Number Digit Nd N Gr_Base Grapheme_Base
       Graph GrBase Hex XDigit Hex_Digit ID_Continue IDC Number PerlWord POSIX_Alnum POSIX_Digit POSIX_Graph POSIX_Print
       POSIX_Word POSIX_XDigit Print Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Digit X_POSIX_Graph X_POSIX_Print X_POSIX_Word
       X_POSIX_XDigit
    Age=1.1 Block=Basic_Latin Bidi_Class=EN Bidi_Class=European_Number BC=EN Block=ASCII BLK=ASCII Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common General_Category=Decimal_Number
       Decomposition_Type=None DT=None East_Asian_Width=Na East_Asian_Width=Narrow EA=Na GC=Nd General_Category=Digit
       General_Category=Number General_Category=Nd Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
       Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
       Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=NU Line_Break=Numeric LB=NU Numeric_Type=De Numeric_Type=Decimal
       NT=De Numeric_Value=1 NV=1 Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
       Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
       Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=NU
       Sentence_Break=Numeric SB=NU Word_Break=NU Word_Break=Numeric WB=NU _X_Begin
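The same Nd-versus-No distinction is visible from Python's unicodedata module. The snippet below is only a sketch for comparison, not something from the original message; it contrasts DIGIT ONE with ETHIOPIC DIGIT ONE using the general category and the decimal value lookup.

```python
import unicodedata

ASCII_ONE = "1"
ETHIOPIC_ONE = "\u1369"

report = {}
for ch in (ASCII_ONE, ETHIOPIC_ONE):
    report[ch] = {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        # decimal() only yields a value for Numeric_Type=Decimal characters;
        # the second argument is returned when there is no decimal value.
        "decimal": unicodedata.decimal(ch, None),
        "isdecimal": ch.isdecimal(),
    }

for info in report.values():
    print(info)
```

DIGIT ONE comes back as category Nd with a decimal value, while ETHIOPIC DIGIT ONE is category No and has no decimal value.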

But back to the mismatch between \w and IDC. Isn't that really annoying?

The first two occur in NFD, which is a pain.  So you start with
something that is all \w, then you decompose it, and whammo.
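A small Python illustration of this mismatch, offered as a sketch only. Two assumptions to keep in mind: Python's \w in the re module is defined by alphanumeric-ness plus underscore, so it is not identical to the Perl \w used in the listings above, and str.isidentifier() checks XID_Continue rather than IDC. The example sticks to the first two characters from the list.

```python
import re
import unicodedata

MIDDLE_DOT = "\u00b7"   # MIDDLE DOT
ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA

# MIDDLE DOT may continue an identifier ...
dot_continues_identifier = ("a" + MIDDLE_DOT).isidentifier()

# ... yet it is not a word character for the regex engine.
dot_is_word_char = re.fullmatch(r"\w", MIDDLE_DOT) is not None

# GREEK ANO TELEIA canonically decomposes to MIDDLE DOT, so
# normalization can turn one of these characters into the other.
ano_teleia_nfd = unicodedata.normalize("NFD", ANO_TELEIA)

print(dot_continues_identifier, dot_is_word_char, ano_teleia_nfd == MIDDLE_DOT)
```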

This brings up a whole nother issue.  Insofar as they may differ, should
one track \w or should one track IDS/IDC?  It is extremely vexing,
because whatever you do, you will change the lexical texture of your
language.  We recently swapped around from one to the other, and there
was one person whose module had to change.  He had no idea he was even
using one of these things in an ident, and was happy to change it.  But
we lucked out that time.  It is a terribly bothersome situation, and I do
not know what the best strategy actually is, or the safest one, nor
whether those differ.  It troubles us, too, I do promise you.

One thing I might do to trouble you less...

If you don't have the uniquote, uniprops, or unichars commands yet,
and would like them, then well *those* I can tell you where to find. :)

--tom
msg142007 - Author: Roundup Robot (python-dev) Date: 2011-08-13 03:18
New changeset 787ed1a7aba8 by Benjamin Peterson in branch '3.2':
in narrow builds, make sure to test codepoints as identifier characters (closes #12732)
http://hg.python.org/cpython/rev/787ed1a7aba8

New changeset 5af15f018e20 by Benjamin Peterson in branch 'default':
merge 3.2 (#12732)
http://hg.python.org/cpython/rev/5af15f018e20
History
Date                 User          Action  Args
2011-08-13 03:18:31  python-dev    set     status: open -> closed; nosy: + python-dev; messages: + msg142007; resolution: fixed; stage: needs patch -> resolved
2011-08-13 01:13:24  tchrist       set     messages: + msg142003
2011-08-13 00:57:26  mrabarnett    set     nosy: + mrabarnett
2011-08-12 23:05:27  terry.reedy   set     nosy: + terry.reedy, lemburg, haypo, loewis; messages: + msg141996; stage: needs patch
2011-08-12 18:03:32  Arfrever      set     nosy: + Arfrever
2011-08-12 16:43:18  daniel.urban  set     nosy: + daniel.urban
2011-08-12 00:19:09  ezio.melotti  set     nosy: + ezio.melotti
2011-08-11 19:49:35  tchrist       create