
classification
Title: re \w does not match some valid Unicode characters
Type: behavior Stage: resolved
Components: Regular Expressions, Unicode Versions: Python 3.7, Python 3.6, Python 3.3, Python 3.4, Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ThiefMaster, davidism, ezio.melotti, mrabarnett, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2017-07-03 15:39 by davidism, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (5)
msg297603 - (view) Author: David Lord (davidism) Date: 2017-07-03 15:39
This came up while writing a regex to match characters that are valid in Python identifiers for Jinja (https://github.com/pallets/jinja/pull/731). `\w` matches all valid identifier characters except for 4 special cases:

import unicodedata
import re
import sys

cre = re.compile(r'\w')

for cp in range(sys.maxunicode + 1):
    s = chr(cp)

    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Python < 3.6 matches the two Mongolian characters; I'm not sure why 3.6 stopped matching them.

For our case, we just added them to a character set, `[\w\u1885\u1886\u2118\u212e]`.

It can cause unexpected behavior when using `\b`, since that's defined as the transition from `\w` to `\W` and those 4 characters aren't in `\w`. `re.match(r'\b[\w\u212e]', '℮')` fails to match, because there is no word boundary in front of a character that `\w` does not accept.
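
A minimal sketch of the `\b` interaction described above, assuming Python 3.6 or later. The variable names are only illustrative. It compares plain `\w`, the widened character set, and the widened set preceded by `\b`:

```python
import re

ESTIMATED = "\u212e"  # ESTIMATED SYMBOL, one of the four characters listed above

plain = re.compile(r"\w")
widened = re.compile(r"[\w\u1885\u1886\u2118\u212e]")
bounded = re.compile(r"\b[\w\u1885\u1886\u2118\u212e]")

plain_match = plain.match(ESTIMATED)      # \w alone does not accept the character
widened_match = widened.match(ESTIMATED)  # the explicit character set does
bounded_match = bounded.match(ESTIMATED)  # \b still fails: no \w on either side of position 0

print(plain_match, widened_match, bounded_match)
```

Widening the character set does not change what `\b` treats as a word character, so a pattern that relies on `\b` needs a different anchor (for example a lookaround built from the same set).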
msg297604 - (view) Author: David Lord (davidism) Date: 2017-07-03 16:21
Adding `or ('a' + s).isidentifier()`, to catch valid id_continue characters, to the test in the previous script reveals many more characters that seem like valid word characters but aren't matched by `\w`.
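
The extended check could look roughly like this. It is a sketch rather than the exact script that was run. The function name `missed_by_w` and its parameters are made up for illustration, and the counts it prints depend on the Unicode database bundled with the interpreter:

```python
import re
import sys

cre = re.compile(r"\w")


def missed_by_w(include_continue=False, limit=sys.maxunicode + 1):
    """Return code points that are valid in identifiers but not matched by \\w.

    With include_continue=True, characters that are only valid after the
    first position (id_continue) are checked as well, by prefixing 'a'.
    """
    missed = []
    for cp in range(limit):
        s = chr(cp)
        valid = s.isidentifier() or (include_continue and ("a" + s).isidentifier())
        if valid and not cre.match(s):
            missed.append(cp)
    return missed


start_only = missed_by_w()
with_continue = missed_by_w(include_continue=True)
print(len(start_only), len(with_continue))
```

Combining marks such as U+0300 show up only in the second list, since they may continue an identifier but cannot start one.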
msg297613 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2017-07-03 20:21
In Unicode 9.0.0, U+1885 and U+1886 changed from being General_Category=Other_Letter (Lo) to General_Category=Nonspacing_Mark (Mn).

U+2118 is General_Category=Math_Symbol (Sm) and U+212E is General_Category=Other_Symbol (So).

\w doesn't include Mn, Sm or So.

The str.isidentifier() method uses the Unicode properties XID_Start and XID_Continue, which include these codepoints.
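
To see this from the interpreter, something like the following prints the general category of each of the four code points next to the result of `isidentifier()`. The `isalnum()` column is an assumption added for comparison, on the understanding that `\w` in `re` roughly means "alphanumeric or underscore". The category reported for U+1885 and U+1886 depends on the Unicode version the interpreter ships with:

```python
import unicodedata

CODE_POINTS = (0x1885, 0x1886, 0x2118, 0x212E)

# Map each code point to its Unicode general category (e.g. 'Mn', 'Sm', 'So').
categories = {cp: unicodedata.category(chr(cp)) for cp in CODE_POINTS}

for cp in CODE_POINTS:
    ch = chr(cp)
    print(f"U+{cp:04X}", categories[cp], ch.isidentifier(), ch.isalnum())
```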
msg297766 - (view) Author: David Lord (davidism) Date: 2017-07-05 15:19
After thinking about it more, I guess I misunderstood what \w was doing compared to isidentifier. Since Python just relies on the Unicode database, there's not much to be done anyway. Closing this.

For anyone interested, we ended up with a hybrid approach for lexing identifiers: build a regex group that includes all valid ranges not matched by \w, then validate with isidentifier later. https://github.com/pallets/jinja/pull/731/files
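
A rough sketch of that hybrid idea, not the code from the pull request. The helper names `extra_identifier_ranges`, `build_name_re` and `lex_identifier` are hypothetical. The ranges are computed at import time from whatever Unicode database the running interpreter has:

```python
import re
import sys

word_re = re.compile(r"\w")


def extra_identifier_ranges():
    """Collect [start, end] ranges of code points that can appear in an
    identifier (after the first character) but are not matched by \\w."""
    ranges = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ("a" + ch).isidentifier() and not word_re.match(ch):
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp  # extend the current run
            else:
                ranges.append([cp, cp])  # start a new run
    return ranges


def build_name_re():
    """Build a permissive pattern: \\w plus every extra range found above."""
    parts = []
    for start, end in extra_identifier_ranges():
        if start == end:
            parts.append("\\U%08x" % start)
        else:
            parts.append("\\U%08x-\\U%08x" % (start, end))
    return re.compile(r"[\w" + "".join(parts) + "]+")


name_re = build_name_re()


def lex_identifier(text, pos=0):
    """Match a candidate name with the regex, then validate it strictly."""
    m = name_re.match(text, pos)
    if m is None or not m.group().isidentifier():
        return None
    return m.group()


print(lex_identifier("\u212ex = 1"), lex_identifier("1abc"))
```

The regex is deliberately too permissive (it accepts a leading digit, for instance), which is why the `isidentifier()` check afterwards does the real validation.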
msg297773 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2017-07-05 17:22
Python identifiers match the regex:

    [_\p{XID_Start}]\p{XID_Continue}*

The standard re module doesn't support \p{...}, but the third-party "regex" module does.
History
Date                 User        Action  Args
2022-04-11 14:58:48  admin       set     github: 75021
2017-07-05 17:22:30  mrabarnett  set     messages: + msg297773
2017-07-05 15:19:42  davidism    set     status: open -> closed; resolution: not a bug; messages: + msg297766; stage: resolved
2017-07-03 20:21:45  mrabarnett  set     messages: + msg297613
2017-07-03 16:21:56  davidism    set     messages: + msg297604
2017-07-03 15:43:54  ThiefMaster set     nosy: + ThiefMaster
2017-07-03 15:41:45  vstinner    set     nosy: + serhiy.storchaka
2017-07-03 15:39:29  davidism    create