Author snoopjedi
Recipients docs@python, snoopjedi
Date 2019-10-23.16:28:37
Message-id <1571848118.32.0.881781017255.issue38566@roundup.psfhosted.org>
Content
The documentation for the `re` library¹ describes the specifier '\w' as matching "Unicode word characters," which is vague. The closest corresponding language I can find is the guidance in Unicode Technical Standard #18², which defines the class `<word_character>` to include all alphabetic and decimal code points, as well as U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER. That does not appear to describe `re` correctly, however, because these zero-width characters are not matched by '\w', e.g.:

```
>>> import re
>>> re.match(r'\w*', 'Auf\u200Clage')
<re.Match object; span=(0, 3), match='Auf'>
```

It seems from examining the CPython source³ that SRE treats '\w' as meaning any alphanumeric character or U+005F LOW LINE (the underscore), which does not match any Unicode class definition I've been able to find.
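
A minimal sketch of how this reading of the source could be checked from the interpreter, assuming that "alphanumeric" in SRE corresponds to what `str.isalnum()` reports for a single character. The helper `is_word_char` and the sample list are hypothetical and only illustrate the comparison; they are not part of `re`.

```python
import re

WORD = re.compile(r"\w")  # str pattern, so Unicode matching applies


def is_word_char(ch: str) -> bool:
    """Hypothetical stand-in for SRE's check: alphanumeric or underscore."""
    return ch.isalnum() or ch == "_"


# A few code points of interest, including the two join controls from UTS #18.
samples = ["A", "ß", "9", "_", "\u200c", "\u200d", "-", " "]

for ch in samples:
    matched = WORD.fullmatch(ch) is not None
    print(f"U+{ord(ch):04X}  re: {matched!s:5}  isalnum-or-underscore: {is_word_char(ch)}")

# Collect any sample where the two answers disagree.
mismatches = [ch for ch in samples if (WORD.fullmatch(ch) is not None) != is_word_char(ch)]
```

If `mismatches` comes back empty for these samples, that is consistent with the reading above, though a handful of code points is not proof for the whole Unicode range.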

Can anyone clarify which part of Unicode this documentation is referring to? If there is some other definition, the documentation should refer to it specifically, preferably with a link. If instead the documentation is incorrect, this language should be changed to describe the true meaning of \w.

¹ https://docs.python.org/3/library/re.html#index-32
² http://unicode.org/reports/tr18/
³ https://github.com/python/cpython/blob/master/Modules/_sre.c#L125
History
Date                 User       Action  Args
2019-10-23 16:28:38  snoopjedi  set     recipients: + snoopjedi, docs@python
2019-10-23 16:28:38  snoopjedi  set     messageid: <1571848118.32.0.881781017255.issue38566@roundup.psfhosted.org>
2019-10-23 16:28:38  snoopjedi  link    issue38566 messages
2019-10-23 16:28:37  snoopjedi  create