Message 190326 - Python tracker


Author	vstinner
Recipients	BreamoreBoy, ezio.melotti, l0nwlf, lemburg, loewis, mrabarnett, nathanlmiles, rsc, terry.reedy, timehorse, vstinner
Date	2013-05-29.20:33:57
Message-id	<1369859638.44.0.439395953588.issue1693050@psf.upfronthosting.co.za>
In-reply-to

Content

Let's look at Modules/_sre.c:

#define SRE_UNI_IS_ALNUM(ch) Py_UNICODE_ISALNUM(ch)
#define SRE_UNI_IS_WORD(ch) (SRE_UNI_IS_ALNUM(ch) || (ch) == '_')

>>> [ch.isalpha() for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
[True, False, True, False, True, False]
>>> import unicodedata
>>> [unicodedata.category(ch) for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']

So the matching stops at U+093F: its category is Mc ("Mark, Spacing Combining"), which is part of the Mark category, whereas the re module only accepts an alphanumeric character or '_' for \w.
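
To make the effect concrete, here is a minimal Python sketch of the macro logic quoted above. It assumes that str.isalnum() is a fair stand-in for Py_UNICODE_ISALNUM; the helper names sre_uni_is_word and word_prefix are hypothetical and only illustrate the scan, they are not part of the re module.

```python
def sre_uni_is_word(ch: str) -> bool:
    """Approximate SRE_UNI_IS_WORD: alphanumeric or underscore.

    Assumption: str.isalnum() stands in for Py_UNICODE_ISALNUM.
    """
    return ch.isalnum() or ch == "_"


def word_prefix(text: str) -> str:
    """Return the leading run of characters accepted by sre_uni_is_word."""
    end = 0
    while end < len(text) and sre_uni_is_word(text[end]):
        end += 1
    return text[:end]


# The six code points from the session above.
HINDI = "\u0939\u093f\u0928\u094d\u0926\u0940"

flags = [sre_uni_is_word(ch) for ch in HINDI]
prefix = word_prefix(HINDI)

print(flags)                              # one bool per code point
print([hex(ord(ch)) for ch in prefix])    # code points in the accepted prefix
```

Under this approximation the scan accepts U+0939 and stops at the first combining mark, U+093F.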

msg76557:

"""
Unicode TR#18 defines \w as a shorthand for

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
"""

So if we want to respect this standard, the re module needs to be modified so that \w also accepts these other Unicode categories (Mark and Connector_Punctuation).
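
As a rough illustration of what such a check could look like, here is a sketch of a TR#18-style word test built on unicodedata. It is only an approximation under stated assumptions: \p{alpha} (the Alphabetic property) is approximated with str.isalpha(), and \p{digit} with general category Nd. The name tr18_is_word is hypothetical, and this is not how the re module is implemented.

```python
import unicodedata


def tr18_is_word(ch: str) -> bool:
    """Approximate the TR#18 definition of \\w for a single character.

    Assumptions: str.isalpha() approximates \\p{alpha}, and general
    category 'Nd' approximates \\p{digit}.
    """
    cat = unicodedata.category(ch)
    return (
        ch.isalpha()             # ~ \p{alpha}
        or cat.startswith("M")   # \p{gc=Mark}: Mn, Mc, Me
        or cat == "Nd"           # ~ \p{digit}
        or cat == "Pc"           # \p{gc=Connector_Punctuation}, e.g. '_'
    )


HINDI = "\u0939\u093f\u0928\u094d\u0926\u0940"
hindi_flags = [tr18_is_word(ch) for ch in HINDI]
print(hindi_flags)
```

With the Mark categories included, every code point of the example string is accepted, so a \w+ built on this test would cover the whole word instead of stopping at U+093F.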

History
Date	User	Action	Args
2013-05-29 20:33:58	vstinner	set	recipients: + vstinner, lemburg, loewis, terry.reedy, nathanlmiles, rsc, timehorse, ezio.melotti, mrabarnett, l0nwlf, BreamoreBoy
2013-05-29 20:33:58	vstinner	set	messageid: <1369859638.44.0.439395953588.issue1693050@psf.upfronthosting.co.za>
2013-05-29 20:33:58	vstinner	link	issue1693050 messages
2013-05-29 20:33:58	vstinner	create