
classification
Title: tokenize.py parses unicode identifiers incorrectly
Type: behavior
Stage: resolved
Components: Library (Lib), Unicode
Versions: Python 3.8, Python 3.7, Python 3.6

process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars (issue 24194)
Assigned To:
Nosy List: cheryl.sabella, ezio.melotti, serhiy.storchaka, steve, terry.reedy, vstinner
Priority: normal
Keywords:

Created on 2018-03-02 23:32 by steve, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg313168 - (view) Author: Steve B (steve) Date: 2018-03-02 23:32
Here is an example involving the Unicode character MIDDLE DOT (·). The line

ab·cd = 7

is valid Python 3 code and is happily accepted by the CPython interpreter. However, tokenize.py does not like it. It says that the middle-dot is an error token. Here is an example you can run to see that:

    import tokenize
    from io import BytesIO

    # A valid Python 3 assignment whose target name contains U+00B7 MIDDLE DOT.
    test = 'ab·cd = 7'.encode('utf-8')

    # tokenize.tokenize() takes a readline callable over a bytes stream.
    x = tokenize.tokenize(BytesIO(test).readline)
    for i in x:
        print(i)
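
On the affected versions the middle dot comes back as an error token; the output includes a line like the following (abbreviated -- the exact repr and token numbers vary by Python version):

    TokenInfo(type=ERRORTOKEN, string='·', start=(1, 2), end=(1, 3), line='ab·cd = 7')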

For reference, the official definition of identifiers is: 

https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers

and details about MIDDLE DOT are at

https://www.unicode.org/Public/10.0.0/ucd/PropList.txt

MIDDLE DOT has the "Other_ID_Continue" property, so I think the interpreter is behaving correctly (i.e. consistent with the documented spec), while tokenize.py is wrong.
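
(A quick stdlib cross-check of the interpreter side; str.isidentifier() follows the same reference definition:)

>>> 'ab·cd'.isidentifier()
True
>>> import unicodedata
>>> unicodedata.category('·')   # Po, but listed under Other_ID_Continue
'Po'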
msg313496 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-09 20:19
I verified on Win10 with 3.5 (which cannot be patched) and 3.7.0b2 that ab·cd is accepted as a name and that tokenize fails as described.
msg313792 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2018-03-13 22:57
I believe this may be a duplicate of issue 12486.
msg313797 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-14 01:08
I think the issues are slightly different.  #12486 is about the awkwardness of the API.  This is about a false error after jumping through the hoops, which I think Steve B did correctly.

Following the link, the Other_ID_Continue chars are

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE

# Total code points: 12

The two Po chars fail; the two No entries work.  After looking at the tokenize module, I believe the problem is that the regular expression for Name is r'\w+' and the Po chars are not matched as \w word characters.

>>> import re
>>> r = re.compile(r'\w+', re.U)
>>> re.match(r, 'ab\u0387cd')   # U+0387 GREEK ANO TELEIA (Po) stops the match
<re.Match object; span=(0, 2), match='ab'>
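
(For contrast, a quick check shows the No chars are matched by \w, consistent with the observation above:)

>>> re.match(r, 'ab\u1369cd')   # U+1369 ETHIOPIC DIGIT ONE (No)
<re.Match object; span=(0, 5), match='ab፩cd'>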

I don't know if the bug is a too narrow definition of \w in the re module ("most characters that can be part of a word in any language, as well as numbers and the underscore") or of Name in the tokenize module.

Before patching anything, I would like to know if the 2 Po Other chars are the only 2 not matched by \w.  Unless someone has done so already, at least a sample of chars from each category included in the definition of 'identifier' should be tested.
msg313814 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-03-14 08:02
This issue and issue12486 don't have anything in common except that both are related to the tokenize module.

There are two bugs: a too narrow definition of \w in the re module (see issue12731 and issue1693050) and a too narrow definition of Name in the tokenize module.


>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata

>>> for c in regex.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'·'  U+0387  GREEK ANO TELEIA
'፩'  U+1369  ETHIOPIC DIGIT ONE
'፪'  U+136A  ETHIOPIC DIGIT TWO
'፫'  U+136B  ETHIOPIC DIGIT THREE
'፬'  U+136C  ETHIOPIC DIGIT FOUR
'፭'  U+136D  ETHIOPIC DIGIT FIVE
'፮'  U+136E  ETHIOPIC DIGIT SIX
'፯'  U+136F  ETHIOPIC DIGIT SEVEN
'፰'  U+1370  ETHIOPIC DIGIT EIGHT
'፱'  U+1371  ETHIOPIC DIGIT NINE
'᧚'  U+19DA  NEW TAI LUE THAM DIGIT ONE
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'ᢅ'  U+1885  MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ'  U+1886  MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'̀'  U+0300  COMBINING GRAVE ACCENT
'́'  U+0301  COMBINING ACUTE ACCENT
'̂'  U+0302  COMBINING CIRCUMFLEX ACCENT
'̃'  U+0303  COMBINING TILDE
...
[total 2177 characters]

The second bug can be solved by adding 14 more characters to the pattern for Name.

    Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'

or

    Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
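
A quick sanity check of the second pattern against the characters discussed above (a sketch only, not a test of the full identifier set):

>>> import re
>>> Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
>>> re.fullmatch(Name, 'ab\xb7cd')
<re.Match object; span=(0, 5), match='ab·cd'>
>>> re.fullmatch(Name, '\xb7ab') is None   # MIDDLE DOT cannot start a name
True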

But first the issue with \w should be resolved (if we don't want to add 2177 characters).

The other solution is implementing property support in re (issue12734).
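
For illustration, the third-party regex module already supports Unicode properties; assuming its \p{XID_Start}/\p{XID_Continue} property names (the properties the language reference is defined in terms of), something like this works today:

>>> import regex
>>> identifier = regex.compile(r'[_\p{XID_Start}]\p{XID_Continue}*')
>>> identifier.fullmatch('ab\xb7cd')
<regex.Match object; span=(0, 5), match='ab·cd'>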
msg313852 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-15 00:58
#24194 is about tokenize failing, including on middle dot.  There is another tokenize name issue, already closed.  I referenced Serhiy's analysis there and on the two \w issues, and closed one of them.
History
Date                 User              Action  Args
2022-04-11 14:58:58  admin             set     github: 77168
2018-03-15 00:58:29  terry.reedy       set     status: open -> closed; superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars; messages: + msg313852; resolution: duplicate; stage: needs patch -> resolved
2018-03-14 08:02:11  serhiy.storchaka  set     messages: + msg313814
2018-03-14 01:08:57  terry.reedy       set     nosy: + serhiy.storchaka; messages: + msg313797
2018-03-13 22:57:54  cheryl.sabella    set     nosy: + cheryl.sabella; messages: + msg313792
2018-03-09 20:19:50  terry.reedy       set     versions: + Python 3.7, Python 3.8; nosy: + terry.reedy; messages: + msg313496; stage: needs patch
2018-03-02 23:32:49  steve             create