Message 141920
You cannot use Python's re library for handling Unicode regular expressions because it violates the standard laid out in UTS #18, Unicode Regular Expressions, in RL1.2a on compatibility properties. What \w is allowed to match is clearly explained there, but Python has its own idea. Because it is in clear violation of the standard, it is misleading and wrong for Python to claim that the re.UNICODE flag makes \w and friends match Unicode. Here are the failed test cases when the attached file is run under v3.2; there are further failures when run under v2.7.
FAIL lib re found non alphanumeric string café
FAIL lib re found non alphanumeric string Ⓚ
FAIL lib re found non alphanumeric string ͅ
FAIL lib re found non alphanumeric string ְ
FAIL lib re found non alphanumeric string 𝟘
FAIL lib re found non alphanumeric string 𐍁
FAIL lib re found non alphanumeric string 𝔘𝔫𝔦𝔠𝔬𝔡𝔢
FAIL lib re found non alphanumeric string 𐐔𐐯𐑅𐐨𐑉𐐯𐐻
FAIL lib re found non alphanumeric string connector‿punctuation
FAIL lib re found non alphanumeric string Ὰͅ_Στο_Διάολο
FAIL lib re found non alphanumeric string 𐌰𐍄𐍄𐌰‿𐌿𐌽𐍃𐌰𐍂‿𐌸𐌿‿𐌹𐌽‿𐌷𐌹𐌼𐌹𐌽𐌰𐌼
FAIL lib re found all alphanumeric string ¹²³
FAIL lib re found all alphanumeric string ₁₂₃
FAIL lib re found all alphanumeric string ¼½¾
FAIL lib re found all alphanumeric string ⑶
Note that Matthew Barnett's regex lib for Python handles all of these cases in conformance with The Unicode Standard.
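For readers who want to see roughly what RL1.2a asks of \w, here is a minimal sketch using only the standard unicodedata module. UTS #18 defines \w as Alphabetic, plus gc=Mark, plus Decimal_Number, plus Connector_Punctuation, plus Join_Control. Python's unicodedata does not expose the Alphabetic property, so this sketch approximates it with the general categories L* and Nl, which is an assumption: it misses Other_Alphabetic characters that have other categories, such as Ⓚ (category So). The function names are hypothetical and are not taken from the attached test file.

```python
import unicodedata

# Join_Control consists of ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
JOIN_CONTROLS = {"\u200c", "\u200d"}


def approx_uts18_word_char(ch):
    """Approximate the UTS #18 RL1.2a definition of \\w for one character.

    Assumption: Alphabetic is approximated by general categories L* and Nl,
    so Other_Alphabetic characters outside L, M and Nl are not recognised.
    """
    cat = unicodedata.category(ch)
    return (
        cat[0] in ("L", "M")          # letters and all marks (Mn, Mc, Me)
        or cat in ("Nl", "Nd", "Pc")  # letter numbers, decimal digits, connectors
        or ch in JOIN_CONTROLS
    )


def approx_uts18_word(s):
    """Return True if s is non-empty and every character is a word character."""
    return bool(s) and all(approx_uts18_word_char(ch) for ch in s)
```

Under this approximation, a decomposed "cafe\u0301" and "connector\u203fpunctuation" count as all-alphanumeric, while "¹²³", "¼½¾" and "⑶" do not, because those characters have category No rather than Nd. That matches the direction of the failures listed above, apart from the Other_Alphabetic cases the sketch cannot detect.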
Date | User | Action | Args
2011-08-11 19:18:31 | tchrist | set | recipients: + tchrist
2011-08-11 19:18:31 | tchrist | set | messageid: <1313090311.62.0.0473644856742.issue12731@psf.upfronthosting.co.za>
2011-08-11 19:18:31 | tchrist | link | issue12731 messages
2011-08-11 19:18:30 | tchrist | create |