Message 297603 - Python tracker



Author	davidism
Recipients	davidism, ezio.melotti, mrabarnett, vstinner
Date	2017-07-03.15:39:29
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1499096369.93.0.459837201925.issue30838@psf.upfronthosting.co.za>
In-reply-to

Content

This came up while writing a regex for Jinja to match characters that are valid in Python identifiers: https://github.com/pallets/jinja/pull/731

`\w` matches all valid identifier characters except for four special cases, which this script finds:

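import re
import sys
import unicodedata

# str patterns are Unicode-aware by default in Python 3.
cre = re.compile(r'\w')

# Walk every code point and report those that form a valid one-character
# identifier but are not matched by `\w`.
for cp in range(sys.maxunicode + 1):
    s = chr(cp)

    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))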

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

In Python versions before 3.6, `\w` matches the two Mongolian characters; I'm not sure why 3.6 stopped matching them.

In our case, we just added them to a character set alongside `\w`: `[\w\u1885\u1886\u2118\u212e]`.
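
A minimal sketch of that workaround, assuming Python 3. The names `SPECIAL_CASES`, `IDENT_CHAR`, and `all_special_cases_match` are illustrative only and are not taken from the Jinja pull request.

```python
import re

# The four code points reported above: valid in identifiers, outside `\w`.
SPECIAL_CASES = "\u1885\u1886\u2118\u212e"

# `\w` extended with the four special cases (hypothetical name).
# The `\uXXXX` escapes are interpreted by the re module itself.
IDENT_CHAR = re.compile(r"[\w\u1885\u1886\u2118\u212e]")


def all_special_cases_match():
    """Return True if the extended set matches each special-case character."""
    return all(IDENT_CHAR.fullmatch(ch) is not None for ch in SPECIAL_CASES)


print(all_special_cases_match())
```

Listing the characters explicitly keeps the pattern independent of how a given Python version classifies them for `\w`.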

This can cause unexpected behavior when using `\b`, since that's defined as the transition between `\w` and `\W` and those 4 characters aren't in `\w`, so no boundary is found next to them. For example, `re.match(r'\b[\w\u212e]', '℮')` fails to match.
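
The boundary behavior can be reproduced with the snippet below. It also sketches one possible way around it: replacing a leading `\b` with explicit lookarounds over the extended character set. The lookaround approach and the names `WORD` and `BOUNDARY_START` are assumptions for illustration, not something proposed in this issue, and the pattern only covers the start-of-word boundary.

```python
import re

# Extended word class: `\w` plus the four special-case code points.
WORD = r"[\w\u1885\u1886\u2118\u212e]"

# Plain `\b` needs a `\w` character on one side of the position, so on
# interpreters where `\w` excludes U+212E it finds no boundary before it.
plain = re.match(r"\b" + WORD, "\u212e")

# Hypothetical stand-in for a leading `\b`: no extended word character
# before the position, and an extended word character right after it.
BOUNDARY_START = r"(?<!" + WORD + r")(?=" + WORD + r")"
fixed = re.match(BOUNDARY_START + WORD, "\u212e")

print("plain \\b matched:", plain is not None)
print("lookaround boundary matched:", fixed is not None)
```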

History
Date	User	Action	Args
2017-07-03 15:39:30	davidism	set	recipients: + davidism, vstinner, ezio.melotti, mrabarnett
2017-07-03 15:39:29	davidism	set	messageid: <1499096369.93.0.459837201925.issue30838@psf.upfronthosting.co.za>
2017-07-03 15:39:29	davidism	link	issue30838 messages
2017-07-03 15:39:29	davidism	create