classification
Title: Clarify exactly what \w matches in UNICODE mode
Type: enhancement Stage: needs patch
Components: Documentation, Regular Expressions Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Andi McClure, docs@python, ezio.melotti, mrabarnett, zwol
Priority: normal Keywords:

Created on 2015-11-27 15:50 by zwol, last changed 2016-01-04 03:52 by ezio.melotti.

Messages (3)
msg255463 - Author: Zack Weinberg (zwol) * Date: 2015-11-27 15:50
The `re` module documentation does not do a good job of explaining exactly what `\w` matches.  Quoting https://docs.python.org/3.5/library/re.html :

> \w
> For Unicode (str) patterns:
> Matches Unicode word characters; this includes most characters
> that can be part of a word in any language, as well as numbers
> and the underscore.

Empirically, this appears to mean "everything in Unicode general categories L* and N*, plus U+005F (underscore)".  That is a perfectly sensible definition and the documentation should state it in those terms.  "Unicode word characters" could mean any number of different things; note for instance that UTS#18 gives a very different definition.

(Further reading: https://gist.github.com/zackw/3077f387591376c7bf67 plus links therefrom).
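
The empirical claim above can be re-checked with a short script. This is a minimal sketch for Python 3, not the method used in the linked gist. The helper names `in_hypothesized_set` and `find_mismatches` are made up for illustration. The result depends on the Unicode database bundled with the interpreter, so a different Python version may report disagreements.

```python
import re
import sys
import unicodedata

# A str pattern in Python 3 uses Unicode matching by default.
WORD = re.compile(r"\w")


def in_hypothesized_set(ch):
    """True if ch is in general category L* or N*, or is U+005F."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def find_mismatches(code_points):
    """Return the code points where \\w and the hypothesized set disagree."""
    mismatches = []
    for cp in code_points:
        ch = chr(cp)
        if bool(WORD.match(ch)) != in_hypothesized_set(ch):
            mismatches.append(cp)
    return mismatches


if __name__ == "__main__":
    # Scan every code point and report where the hypothesis fails, if anywhere.
    bad = find_mismatches(range(sys.maxunicode + 1))
    print(len(bad), "disagreements")
    for cp in bad[:20]:
        print("U+%04X" % cp, unicodedata.category(chr(cp)))
```

An empty result supports the "L*, N*, plus underscore" reading for that interpreter only. Any code points that are printed are the cases the documentation would need to call out.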
msg255464 - Author: Andi McClure (Andi McClure) Date: 2015-11-27 16:14
I would also like to request a clear explanation in the documentation for the 2.7 branch. From https://docs.python.org/2.7/library/re.html :

"\w ... If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database"

This is ambiguous. Does it mean the "Alphabetic" property from UAX#44? Does it mean something else?
msg255465 - Author: Zack Weinberg (zwol) * Date: 2015-11-27 16:40
FWIW, the actual behavior of \w matching "everything in Unicode general categories L* and N*, plus U+005F (underscore)" is consistent across all versions I can conveniently test (2.7, 3.4, 3.5).

In 2.7, there are four characters in general category Nl that \w doesn't match, but I believe that is just a bug, not an intentional difference of behavior.
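
To look for per-category gaps like the Nl one described above, a sketch along these lines could list the characters of one general category that `\w` fails to match. The function name `unmatched_in_category` and the default limit to the Basic Multilingual Plane are assumptions made here. The limit keeps the sketch usable on narrow 2.7 builds, so it would miss any affected characters above U+FFFF. The message does not say which four characters are involved, and this sketch does not assume it.

```python
import re
import unicodedata

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3

# re.UNICODE is required on 2.7 and harmless for str patterns on 3.x.
WORD = re.compile(u"\\w", re.UNICODE)


def unmatched_in_category(category, limit=0x10000):
    """List code points below `limit` in `category` that \\w does not match."""
    result = []
    for cp in range(limit):
        ch = unichr(cp)
        if unicodedata.category(ch) == category and not WORD.match(ch):
            result.append(cp)
    return result


if __name__ == "__main__":
    for cp in unmatched_in_category("Nl"):
        print("U+%04X %s" % (cp, unicodedata.name(unichr(cp), "<unnamed>")))
```

Running the same file under 2.7 and 3.x and comparing the output would show whether the discrepancy is specific to one branch.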
History
Date                 User          Action  Args
2016-01-04 03:52:01  ezio.melotti  set     versions: - Python 3.2, Python 3.3, Python 3.4
                                           nosy: + ezio.melotti, mrabarnett
                                           components: + Regular Expressions
                                           type: enhancement
                                           stage: needs patch
2015-11-27 16:40:30  zwol          set     messages: + msg255465
2015-11-27 16:14:25  Andi McClure  set     nosy: + Andi McClure
                                           messages: + msg255464
2015-11-27 15:50:58  zwol          create