The `re` module documentation does not do a good job of explaining exactly what `\w` matches. Quoting the documentation:

> \w
> For Unicode (str) patterns:
> Matches Unicode word characters; this includes most characters
> that can be part of a word in any language, as well as numbers
> and the underscore.

Empirically, this appears to mean "everything in Unicode general categories L* and N*, plus U+005F (underscore)". That is a perfectly sensible definition, and the documentation should state it in those terms. "Unicode word characters" could mean any number of different things; note for instance that UTS #18 gives a very different definition (the Alphabetic property, plus general categories M*, Nd, and Pc, plus Join_Control).
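For anyone who wants to reproduce the empirical check, here is a minimal sketch. The helper names (`predicted_word_char`, `find_mismatches`) are made up for illustration and are not part of `re`. It compares what `\w` actually matches against the "L*, N*, or underscore" hypothesis, and the result may vary with the Unicode database version bundled with the interpreter.

```python
import re
import unicodedata

# str pattern, so Unicode matching is the default in Python 3
WORD_RE = re.compile(r"\w")


def predicted_word_char(ch):
    """Hypothesis under test: general category L* or N*, or the underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"


def find_mismatches(code_points):
    """Return (code point, category) pairs where the regex and the hypothesis disagree."""
    mismatches = []
    for cp in code_points:
        ch = chr(cp)
        actual = WORD_RE.match(ch) is not None
        if actual != predicted_word_char(ch):
            mismatches.append((cp, unicodedata.category(ch)))
    return mismatches


# Full scan of every code point (takes a few seconds):
# for cp, cat in find_mismatches(range(0x110000)):
#     print(f"U+{cp:04X} {cat}")
```

An empty result from the full scan would support the description above; any entries it prints are exactly the exceptions the documentation would need to mention.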

(Further reading: plus links therefrom).
