
classification
Title: tokenize.py parses unicode identifiers incorrectly
Type: behavior
Stage: resolved
Components: Library (Lib), Unicode
Versions: Python 3.8, Python 3.7, Python 3.6

process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars (issue 24194)
Assigned To:
Nosy List: cheryl.sabella, ezio.melotti, serhiy.storchaka, steve, terry.reedy, vstinner
Priority: normal
Keywords:

Created on 2018-03-02 23:32 by steve, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg313168 - (view) Author: Steve B (steve) Date: 2018-03-02 23:32
Here is an example involving the Unicode character MIDDLE DOT (·). The line

ab·cd = 7

is valid Python 3 code and is happily accepted by the CPython interpreter. However, tokenize.py does not like it. It says that the middle-dot is an error token. Here is an example you can run to see that:

    import tokenize
    from io import BytesIO

    # A valid Python 3 assignment whose target name contains U+00B7 MIDDLE DOT.
    test = 'ab·cd = 7'.encode('utf-8')

    # tokenize.tokenize() takes a readline callable over a bytes stream.
    x = tokenize.tokenize(BytesIO(test).readline)
    for i in x:
        print(i)
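
On the affected versions the middle dot comes back as an error token; the output includes a line like the following (abbreviated -- the exact repr and token numbers vary by Python version):

    TokenInfo(type=ERRORTOKEN, string='·', start=(1, 2), end=(1, 3), line='ab·cd = 7')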

For reference, the official definition of identifiers is: 

https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers

and details about MIDDLE DOT are at

https://www.unicode.org/Public/10.0.0/ucd/PropList.txt

MIDDLE DOT has the "Other_ID_Continue" property, so I think the interpreter is behaving correctly (i.e. consistent with the documented spec), while tokenize.py is wrong.
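
(A quick stdlib cross-check of the interpreter side; str.isidentifier() follows the same reference definition:)

>>> 'ab·cd'.isidentifier()
True
>>> import unicodedata
>>> unicodedata.category('·')   # Po, but listed under Other_ID_Continue
'Po'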
msg313496 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-09 20:19
I verified on Win10 with 3.5 (which cannot be patched) and 3.7.0b2 that ab·cd is accepted as a name and that tokenize fails as described.
msg313792 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2018-03-13 22:57
I believe this may be a duplicate of issue 12486.
msg313797 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-14 01:08
I think the issues are slightly different.  #12486 is about the awkwardness of the API.  This is about a false error after jumping through the hoops, which I think Steve B did correctly.

Following the link, the Other_ID_Continue chars are

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE

# Total code points: 12

The two Po chars fail; the two No entries work.  After looking at the tokenize module, I believe the problem is that the regular expression for Name is r'\w+' and the Po chars are not matched as \w word characters.

>>> import re
>>> r = re.compile(r'\w+', re.U)
>>> re.match(r, 'ab\u0387cd')   # U+0387 GREEK ANO TELEIA (Po) stops the match
<re.Match object; span=(0, 2), match='ab'>
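
(For contrast, a quick check shows the No chars are matched by \w, consistent with the observation above:)

>>> re.match(r, 'ab\u1369cd')   # U+1369 ETHIOPIC DIGIT ONE (No)
<re.Match object; span=(0, 5), match='ab፩cd'>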

I don't know if the bug is a too narrow definition of \w in the re module ("most characters that can be part of a word in any language, as well as numbers and the underscore") or of Name in the tokenize module.

Before patching anything, I would like to know if the 2 Po Other chars are the only 2 not matched by \w.  Unless someone has done so already, at least a sample of chars from each category included in the definition of 'identifier' should be tested.
msg313814 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-03-14 08:02
This issue and issue12486 don't have anything in common except that both are related to the tokenize module.

There are two bugs: a too narrow definition of \w in the re module (see issue12731 and issue1693050) and a too narrow definition of Name in the tokenize module.


>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata

>>> for c in regex.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'·'  U+0387  GREEK ANO TELEIA
'፩'  U+1369  ETHIOPIC DIGIT ONE
'፪'  U+136A  ETHIOPIC DIGIT TWO
'፫'  U+136B  ETHIOPIC DIGIT THREE
'፬'  U+136C  ETHIOPIC DIGIT FOUR
'፭'  U+136D  ETHIOPIC DIGIT FIVE
'፮'  U+136E  ETHIOPIC DIGIT SIX
'፯'  U+136F  ETHIOPIC DIGIT SEVEN
'፰'  U+1370  ETHIOPIC DIGIT EIGHT
'፱'  U+1371  ETHIOPIC DIGIT NINE
'᧚'  U+19DA  NEW TAI LUE THAM DIGIT ONE
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'ᢅ'  U+1885  MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ'  U+1886  MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'̀'  U+0300  COMBINING GRAVE ACCENT
'́'  U+0301  COMBINING ACUTE ACCENT
'̂'  U+0302  COMBINING CIRCUMFLEX ACCENT
'̃'  U+0303  COMBINING TILDE
...
[total 2177 characters]

The second bug can be solved by adding 14 more characters to the pattern for Name.

    Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'

or

    Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
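
A quick sanity check of the second pattern against the characters discussed above (a sketch only, not a test of the full identifier set):

>>> import re
>>> Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
>>> re.fullmatch(Name, 'ab\xb7cd')
<re.Match object; span=(0, 5), match='ab·cd'>
>>> re.fullmatch(Name, '\xb7ab') is None   # MIDDLE DOT cannot start a name
True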

But first the issue with \w should be resolved (if we don't want to add 2177 characters).

The other solution is implementing property support in re (issue12734).
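
For illustration, the third-party regex module already supports Unicode properties; assuming its \p{XID_Start}/\p{XID_Continue} property names (the properties the language reference is defined in terms of), something like this works today:

>>> import regex
>>> identifier = regex.compile(r'[_\p{XID_Start}]\p{XID_Continue}*')
>>> identifier.fullmatch('ab\xb7cd')
<regex.Match object; span=(0, 5), match='ab·cd'>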
msg313852 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-15 00:58
#24194 is about tokenize failing, including on middle dot.  There is another tokenize name issue, already closed.  I referenced Serhiy's analysis there and on the two \w issues, and closed one of them.
History
Date                 User              Action  Args
2022-04-11 14:58:58  admin             set     github: 77168
2018-03-15 00:58:29  terry.reedy       set     status: open -> closed; superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars; messages: + msg313852; resolution: duplicate; stage: needs patch -> resolved
2018-03-14 08:02:11  serhiy.storchaka  set     messages: + msg313814
2018-03-14 01:08:57  terry.reedy       set     nosy: + serhiy.storchaka; messages: + msg313797
2018-03-13 22:57:54  cheryl.sabella    set     nosy: + cheryl.sabella; messages: + msg313792
2018-03-09 20:19:50  terry.reedy       set     versions: + Python 3.7, Python 3.8; nosy: + terry.reedy; messages: + msg313496; stage: needs patch
2018-03-02 23:32:49  steve             create