This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author serhiy.storchaka
Recipients cheryl.sabella, ezio.melotti, serhiy.storchaka, steve, terry.reedy, vstinner
Date 2018-03-14.08:02:10
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1521014531.52.0.467229070634.issue32987@psf.upfronthosting.co.za>
In-reply-to
Content
This issue and issue12486 doesn't have any common except that both are related to the tokenize module.

There are two bugs: a too narrow definition of \w in the re module (see  issue12731 and issue1693050) and a too narrow definition of Name in the tokenize module.


>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata

>>> for c in regex.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'·'  U+0387  GREEK ANO TELEIA
'፩'  U+1369  ETHIOPIC DIGIT ONE
'፪'  U+136A  ETHIOPIC DIGIT TWO
'፫'  U+136B  ETHIOPIC DIGIT THREE
'፬'  U+136C  ETHIOPIC DIGIT FOUR
'፭'  U+136D  ETHIOPIC DIGIT FIVE
'፮'  U+136E  ETHIOPIC DIGIT SIX
'፯'  U+136F  ETHIOPIC DIGIT SEVEN
'፰'  U+1370  ETHIOPIC DIGIT EIGHT
'፱'  U+1371  ETHIOPIC DIGIT NINE
'᧚'  U+19DA  NEW TAI LUE THAM DIGIT ONE
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'ᢅ'  U+1885  MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ'  U+1886  MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'̀'  U+0300  COMBINING GRAVE ACCENT
'́'  U+0301  COMBINING ACUTE ACCENT
'̂'  U+0302  COMBINING CIRCUMFLEX ACCENT
'̃'  U+0303  COMBINING TILDE
...
[total 2177 characters]

The second bug can be solved by adding 14 more characters in the pattern for Name.

    Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'

or

    Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'

But first the issue with \w should be resolved (if we don't want to add 2177 characters).

The other solution is implementing property support in re (issue12734).
History
Date User Action Args
2018-03-14 08:02:11serhiy.storchakasetrecipients: + serhiy.storchaka, terry.reedy, vstinner, ezio.melotti, cheryl.sabella, steve
2018-03-14 08:02:11serhiy.storchakasetmessageid: <1521014531.52.0.467229070634.issue32987@psf.upfronthosting.co.za>
2018-03-14 08:02:11serhiy.storchakalinkissue32987 messages
2018-03-14 08:02:10serhiy.storchakacreate