Message297603
This came up while writing a regex to match characters that are valid in Python identifiers for Jinja (https://github.com/pallets/jinja/pull/731). `\w` matches all valid identifier characters except for 4 special cases:
import unicodedata
import re
import sys

cre = re.compile(r'\w')

for cp in range(sys.maxunicode + 1):
    s = chr(cp)
    # report characters that are valid identifiers but are not matched by \w
    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL
On Python < 3.6, `\w` matches the two Mongolian characters; I'm not sure why 3.6 stopped matching them.
For our case, we just added them to a character set, `[\w\u1885\u1886\u2118\u212e]`.
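As a minimal sketch of that workaround (the name `IDENT_CHAR` is made up for illustration and is not taken from the Jinja PR), the widened character set can be checked against the four code points listed above:

```python
import re

# The four code points from the list above, appended to \w in one character set.
# IDENT_CHAR is a hypothetical name used only for this example.
IDENT_CHAR = re.compile(r'[\w\u1885\u1886\u2118\u212e]')

for ch in '\u1885\u1886\u2118\u212e':
    # Each of these is a valid one-character identifier...
    assert ch.isidentifier()
    # ...that plain \w does not match (on 3.6 and later)...
    assert re.match(r'\w', ch) is None
    # ...but the widened character set does.
    assert IDENT_CHAR.match(ch) is not None
```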
This can cause unexpected behavior when using `\b`, since that's defined as the transition between `\w` and `\W` and those 4 characters aren't in `\w`. For example, `re.match(r'\b[\w\u212e]', '℮')` fails to match.
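One possible way around the `\b` problem, sketched here as an assumption and not as what Jinja actually does, is to replace `\b` with lookarounds over the widened character set. The names `WORD` and `WORD_RUN` are hypothetical:

```python
import re

# Assumption: "word" characters are \w plus the four code points listed above.
WORD = r'[\w\u1885\u1886\u2118\u212e]'

# Hypothetical stand-in for \b...\b: a run of WORD characters that is
# neither preceded nor followed by another WORD character.
WORD_RUN = re.compile(r'(?<!' + WORD + r')' + WORD + r'+(?!' + WORD + r')')

text = 'a \u212eb c'

# \b sees U+212E as a non-word character, so it splits before "b".
plain = re.findall(r'\b\w+\b', text)

# The lookaround version keeps U+212E attached to the following "b".
widened = WORD_RUN.findall(text)
```

Here `plain` comes out as `['a', 'b', 'c']`, while `widened` keeps `'℮b'` together as a single match.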
Date | User | Action | Args
2017-07-03 15:39:30 | davidism | set | recipients: + davidism, vstinner, ezio.melotti, mrabarnett
2017-07-03 15:39:29 | davidism | set | messageid: <1499096369.93.0.459837201925.issue30838@psf.upfronthosting.co.za>
2017-07-03 15:39:29 | davidism | link | issue30838 messages
2017-07-03 15:39:29 | davidism | create |