
Author tchrist
Recipients tchrist
Date 2011-08-11.19:18:30
Message-id <1313090311.62.0.0473644856742.issue12731@psf.upfronthosting.co.za>
Content
You cannot use Python's re library for handling Unicode regular expressions, because it violates the standard set out in UTS#18, Unicode Regular Expressions, in requirement RL1.2a on compatibility properties.  What \w is allowed to match is clearly explained there, but Python has its own idea.  Because re is in clear violation of the standard, it is misleading and wrong for Python to claim that the re.UNICODE flag makes \w and friends match Unicode.  Here are the failed test cases when the attached file is run under v3.2; there are further failures when run under v2.7.

FAIL lib re    found non alphanumeric string café
FAIL lib re    found non alphanumeric string Ⓚ
FAIL lib re    found non alphanumeric string ͅ
FAIL lib re    found non alphanumeric string ְ
FAIL lib re    found non alphanumeric string 𝟘
FAIL lib re    found non alphanumeric string 𐍁
FAIL lib re    found non alphanumeric string 𝔘𝔫𝔦𝔠𝔬𝔡𝔢
FAIL lib re    found non alphanumeric string 𐐔𐐯𐑅𐐨𐑉𐐯𐐻
FAIL lib re    found non alphanumeric string connector‿punctuation
FAIL lib re    found non alphanumeric string Ὰͅ_Στο_Διάολο
FAIL lib re    found non alphanumeric string 𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂‿𐌸𐌿‿𐌹𐌽‿𐌷𐌹𐌼𐌹𐌽𐌰𐌼
FAIL lib re    found all alphanumeric string ¹²³
FAIL lib re    found all alphanumeric string ₁₂₃
FAIL lib re    found all alphanumeric string ¼½¾
FAIL lib re    found all alphanumeric string ⑶

Note that Matthew Barnett's regex library for Python handles all of these cases in conformance with The Unicode Standard.
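
For anyone who wants to reproduce a few of these without the attached file, here is a minimal sketch. It is not the attached test program: the sample strings, the helper name is_all_word, and the choice of fullmatch are my own assumptions for illustration. It only uses BMP characters written as explicit escapes, so normalization form and narrow-build surrogate handling do not affect the result. As I read UTS#18, \w is supposed to cover \p{alpha}, \p{gc=Mark}, \p{digit} (decimal digits only), \p{gc=Connector_Punctuation} and \p{Join_Control}; the General_Category printed next to each string shows where re parts ways with that.

```python
import re
import unicodedata

# In Python 3, str patterns are Unicode-aware by default (no re.UNICODE needed).
WORD_RUN = re.compile(r"\w+")


def is_all_word(text):
    """Return True if re's word-character class matches every character of text."""
    return WORD_RUN.fullmatch(text) is not None


# Hypothetical samples, chosen to mirror three of the failures reported above.
SAMPLES = [
    ("decomposed cafe + U+0301", "cafe\u0301"),               # contains a Mark (Mn)
    ("undertie U+203F", "connector\u203fpunctuation"),        # Connector_Punctuation (Pc)
    ("superscript digits", "\u00b9\u00b2\u00b3"),             # Other_Number (No), not Nd
    ("precomposed caf\u00e9", "caf\u00e9"),                   # plain letters (Ll)
]

for label, text in SAMPLES:
    # Collect the distinct general categories present in the string.
    categories = sorted({unicodedata.category(ch) for ch in text})
    print(f"{label:28} categories={categories} all_word={is_all_word(text)}")
```

On a current CPython 3 build I would expect the decomposed string and the undertie string to be rejected and the superscript digits to be accepted, which is the opposite of what UTS#18 asks for in all three cases. Results on the v3.2 and v2.7 builds mentioned above may differ, in particular for characters outside the BMP.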
History
Date                 User     Action  Args
2011-08-11 19:18:31  tchrist  set     recipients: + tchrist
2011-08-11 19:18:31  tchrist  set     messageid: <1313090311.62.0.0473644856742.issue12731@psf.upfronthosting.co.za>
2011-08-11 19:18:31  tchrist  link    issue12731 messages
2011-08-11 19:18:30  tchrist  create