Message 355239 - Python tracker



Author	snoopjedi
Recipients	docs@python, snoopjedi
Date	2019-10-23.16:28:37
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1571848118.32.0.881781017255.issue38566@roundup.psfhosted.org>
In-reply-to

Content

The documentation for the `re` module¹ describes the specifier '\w' as matching "Unicode word characters," which is vague. The closest match I can find for this language is the guidance in Unicode Technical Standard #18², which defines the class `<word_character>` to include all alphabetic and decimal-digit code points, as well as U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER. This does not appear to be a correct description of `re`, however, because '\w' does not match these zero-width characters, e.g.:

```
>>> import re
>>> re.match(r'\w*', 'Auf\u200Clage')
<re.Match object; span=(0, 3), match='Auf'>
```

It seems from examining the CPython source³ that SRE treats '\w' as meaning any alphanumeric character OR U+005F LOW LINE (the underscore), which does not match any Unicode class definition I've been able to find.
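
That reading of the source can be sketched in Python. This is a rough check, not the actual SRE implementation: it assumes that "alphanumeric" corresponds to `str.isalnum()`, and the helper names `is_word_char_sketch` and `matches_w` are made up for illustration. It compares the sketch against what `re` reports for a few sample characters, including the two zero-width joiners.

```python
import re


def is_word_char_sketch(ch: str) -> bool:
    """Hypothetical helper: alphanumeric or underscore, per my reading of _sre.c."""
    # Assumption: str.isalnum() mirrors the alphanumeric test used by SRE.
    return ch.isalnum() or ch == "_"


def matches_w(ch: str) -> bool:
    """Return True if the single character ch is matched by \\w in a str pattern."""
    return re.fullmatch(r"\w", ch) is not None


# A few sample characters, including ZWNJ (U+200C) and ZWJ (U+200D).
samples = ["A", "é", "5", "_", "-", " ", "\u200c", "\u200d"]

for ch in samples:
    print(f"U+{ord(ch):04X}  sketch={is_word_char_sketch(ch)}  re={matches_w(ch)}")
```

If the two columns agree for every sample, that is consistent with the alphanumeric-or-underscore reading, though a handful of characters obviously does not prove it in general.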

Can anyone clarify which part of Unicode this documentation is referring to? If there is some other definition, the documentation should refer to it more specifically (preferably with a link). If instead the documentation is incorrect, this language should be changed to describe the true meaning of \w.

¹ https://docs.python.org/3/library/re.html#index-32
² http://unicode.org/reports/tr18/
³ https://github.com/python/cpython/blob/master/Modules/_sre.c#L125

History
Date	User	Action	Args
2019-10-23 16:28:38	snoopjedi	set	recipients: + snoopjedi, docs@python
2019-10-23 16:28:38	snoopjedi	set	messageid: <1571848118.32.0.881781017255.issue38566@roundup.psfhosted.org>
2019-10-23 16:28:38	snoopjedi	link	issue38566 messages
2019-10-23 16:28:37	snoopjedi	create