Message 313814 - Python tracker


Author	serhiy.storchaka
Recipients	cheryl.sabella, ezio.melotti, serhiy.storchaka, steve, terry.reedy, vstinner
Date	2018-03-14.08:02:10
Message-id	<1521014531.52.0.467229070634.issue32987@psf.upfronthosting.co.za>

Content

This issue and issue12486 have nothing in common except that both are related to the tokenize module.

There are two bugs: an overly narrow definition of \w in the re module (see issue12731 and issue1693050) and an overly narrow definition of Name in the tokenize module.


>>> allchars = list(map(chr, range(0x110000)))
>>> start = [c for c in allchars if c.isidentifier()]
>>> cont = [c for c in allchars if ('a'+c).isidentifier()]
>>> import re, regex, unicodedata

>>> for c in regex.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in regex.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'·'  U+0387  GREEK ANO TELEIA
'፩'  U+1369  ETHIOPIC DIGIT ONE
'፪'  U+136A  ETHIOPIC DIGIT TWO
'፫'  U+136B  ETHIOPIC DIGIT THREE
'፬'  U+136C  ETHIOPIC DIGIT FOUR
'፭'  U+136D  ETHIOPIC DIGIT FIVE
'፮'  U+136E  ETHIOPIC DIGIT SIX
'፯'  U+136F  ETHIOPIC DIGIT SEVEN
'፰'  U+1370  ETHIOPIC DIGIT EIGHT
'፱'  U+1371  ETHIOPIC DIGIT NINE
'᧚'  U+19DA  NEW TAI LUE THAM DIGIT ONE
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(start)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'ᢅ'  U+1885  MONGOLIAN LETTER ALI GALI BALUDA
'ᢆ'  U+1886  MONGOLIAN LETTER ALI GALI THREE BALUDA
'℘'  U+2118  SCRIPT CAPITAL P
'℮'  U+212E  ESTIMATED SYMBOL
>>> for c in re.findall(r'\W', ''.join(cont)): print('%r  U+%04X  %s' % (c, ord(c), unicodedata.name(c, '?')))
... 
'·'  U+00B7  MIDDLE DOT
'̀'  U+0300  COMBINING GRAVE ACCENT
'́'  U+0301  COMBINING ACUTE ACCENT
'̂'  U+0302  COMBINING CIRCUMFLEX ACCENT
'̃'  U+0303  COMBINING TILDE
...
[total 2177 characters]
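
As a rough illustration that is not part of the original session, the 14 code points that regex's \W reports for identifier continuation characters can be collapsed into a compact character class. The helper names `esc` and `to_char_class` and the hard-coded `MISSED` list are assumptions made for this sketch; the list is copied from the regex-based output above.

```python
# Code points copied from the regex-based output above: characters that may
# continue an identifier but are not matched by \w in the regex module.
MISSED = [0x00B7, 0x0387, *range(0x1369, 0x1372), 0x19DA, 0x2118, 0x212E]


def esc(cp):
    """Format one code point as an escape usable inside a regex class."""
    if cp < 0x100:
        return '\\x%02x' % cp
    if cp < 0x10000:
        return '\\u%04x' % cp
    return '\\U%08x' % cp


def to_char_class(code_points):
    """Collapse code points into escaped ranges (hypothetical helper)."""
    cps = sorted(set(code_points))
    parts = []
    i = 0
    while i < len(cps):
        j = i
        # Extend the run while the next code point is consecutive.
        while j + 1 < len(cps) and cps[j + 1] == cps[j] + 1:
            j += 1
        if j == i:
            parts.append(esc(cps[i]))
        else:
            parts.append(esc(cps[i]) + '-' + esc(cps[j]))
        i = j + 1
    return ''.join(parts)


print(len(MISSED))            # 14
print(to_char_class(MISSED))  # \xb7\u0387\u1369-\u1371\u19da\u2118\u212e
```

The Ethiopic digits U+1369 through U+1371 form the only consecutive run, which is why they appear as a single range.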

The second bug can be solved by adding 14 more characters to the pattern for Name.

    Name = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'

or

    Name = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'
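
A minimal sketch of how the two proposed patterns behave next to a plain `\w+` pattern, using only the stdlib re module. The constant names and sample strings are chosen for illustration, and `CURRENT` is assumed to be what tokenize uses for Name today.

```python
import re

# Assumption: the existing tokenize pattern for Name is a plain \w+.
CURRENT = r'\w+'
# The two proposals from this message, copied verbatim.
ONE_CLASS = r'[\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]+'
START_CONT = r'[\w\u2118\u212e][\w\xb7\u0387\u1369-\u1371\u19da\u2118\u212e]*'

# Illustrative samples: three valid identifiers and one invalid one
# (MIDDLE DOT may continue an identifier but may not start it).
samples = ['spam', '\u2118', 'a\xb7b', '\xb7a']

for name in samples:
    print(
        ascii(name),
        'isidentifier=%s' % name.isidentifier(),
        'current=%s' % bool(re.fullmatch(CURRENT, name)),
        'one_class=%s' % bool(re.fullmatch(ONE_CLASS, name)),
        'start_cont=%s' % bool(re.fullmatch(START_CONT, name)),
    )
```

The two proposals differ only in how they treat the first character: the single-class form accepts a name that begins with a continuation-only character such as U+00B7, while the start/continue form rejects it. Neither form rejects a leading digit, since \w itself matches digits.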

But the issue with \w should be resolved first (unless we want to add 2177 characters to the pattern).
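
To see how large the gap is on a given interpreter, the count can be recomputed with the stdlib alone. This is a sketch along the lines of the session above; the variable names are illustrative, and the total depends on the Unicode database bundled with the running Python, so it may differ from 2177.

```python
import re
import sys

# Every character that str.isidentifier() accepts after the first position.
cont = [c for c in map(chr, range(sys.maxunicode + 1))
        if ('a' + c).isidentifier()]

# Of those, the ones that re's \w does not match.
missed = re.findall(r'\W', ''.join(cont))

# The total varies with the Python version, so no fixed value is shown here.
print(len(missed))
for c in missed[:5]:
    print('U+%04X' % ord(c))
```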

Another solution is to implement property support in re (issue12734).

History
Date	User	Action	Args
2018-03-14 08:02:11	serhiy.storchaka	set	recipients: + serhiy.storchaka, terry.reedy, vstinner, ezio.melotti, cheryl.sabella, steve
2018-03-14 08:02:11	serhiy.storchaka	set	messageid: <1521014531.52.0.467229070634.issue32987@psf.upfronthosting.co.za>
2018-03-14 08:02:11	serhiy.storchaka	link	issue32987 messages
2018-03-14 08:02:10	serhiy.storchaka	create