This issue and issue12486 have nothing in common except that both are related to the tokenize module.
There are two bugs: an overly narrow definition of \w in the re module (see issue12731 and issue1693050) and an overly narrow definition of Name in the tokenize module.
>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata
>>> for c in regex.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'·' U+00B7 MIDDLE DOT
'·' U+0387 GREEK ANO TELEIA
'፩' U+1369 ETHIOPIC DIGIT ONE
'፪' U+136A ETHIOPIC DIGIT TWO
'፫' U+136B ETHIOPIC DIGIT THREE
'፬' U+136C ETHIOPIC DIGIT FOUR
'፭' U+136D ETHIOPIC DIGIT FIVE
'፮' U+136E ETHIOPIC DIGIT SIX
'፯' U+136F ETHIOPIC DIGIT SEVEN
'፰' U+1370 ETHIOPIC DIGIT EIGHT
'፱' U+1371 ETHIOPIC DIGIT NINE
'᧚' U+19DA NEW TAI LUE THAM DIGIT ONE
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'ᢅ' U+1885 MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ' U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘' U+2118 SCRIPT CAPITAL P
'℮' U+212E ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r U+%04X %s' % (c, ord(c), unicodedata.name(c, '?')))
...
'·' U+00B7 MIDDLE DOT
'̀' U+0300 COMBINING GRAVE ACCENT
'́' U+0301 COMBINING ACUTE ACCENT
'̂' U+0302 COMBINING CIRCUMFLEX ACCENT
'̃' U+0303 COMBINING TILDE
...
[total 2177 characters]
The second bug can be fixed by adding 14 more characters to the pattern for Name:
Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'
or
Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
But the issue with \w should be resolved first (unless we want to add all 2177 characters to the pattern).
Another solution is to implement property support in re (issue12734).
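
A quick check of the second Name pattern against a few of the characters listed above might look like the following. This is only a sketch: the NAME variable, the sample strings and the check() helper are made up for illustration and are not taken from tokenize.py, and the result for plain \w depends on how the re module in the running Python version defines \w.

```python
import re

# Second pattern proposed above: the first character may additionally be
# U+2118 or U+212E; later characters may be any of the 14 extra ones.
NAME = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'

# Hypothetical sample identifiers built from characters in the lists above.
samples = ['\u2118', 'a\xb7', 'a\u0387', 'a\u1369', 'a\u19da', 'x\u212e']

def check(pattern, names):
    """Map each name to True if the pattern matches the whole string."""
    return {name: re.fullmatch(pattern, name) is not None for name in names}

patched = check(NAME, samples)   # proposed Name pattern
plain = check(r'\w+', samples)   # bare \w+ for comparison

for name in samples:
    print('%r identifier=%s plain=%s patched=%s'
          % (name, name.isidentifier(), plain[name], patched[name]))
```

Every sample is accepted by str.isidentifier() and by the extended pattern, while a bare \w+ rejects at least the ones containing U+2118 and U+00B7.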