Author tchrist
Recipients tchrist
Date 2011-08-11.18:48:19
Message-id <1313088501.39.0.822875623158.issue12728@psf.upfronthosting.co.za>
Content
The Python re library is broken in its approach to case-insensitive matches: it compares lowercase mappings. This is wrong. You must compare the Unicode casefolds, not the Unicode casemaps; otherwise you get wrong answers. For example, the long s (ſ) casefolds to s, but its lowercase mapping is itself, so comparing lowercase mappings never equates the two. I include a small test case that illustrates this bug. The bug exists on both 2.7 and 3.2, and on both wide and narrow builds. For comparison, I also show results using Matthew Barnett's regex library, which gets all 5 tests right where re gets all 5 wrong.

A sample run is:

FAIL: re    pattern Ι is    not the same as string ͅ
PASS: regex pattern Ι is indeed the same as string ͅ
FAIL: re    pattern Μ is    not the same as string µ
PASS: regex pattern Μ is indeed the same as string µ
FAIL: re    pattern ſ is    not the same as string s
PASS: regex pattern ſ is indeed the same as string s
FAIL: re    pattern ΣΤΙΓΜΑΣ is    not the same as string στιγμας
PASS: regex pattern ΣΤΙΓΜΑΣ is indeed the same as string στιγμας
FAIL: re    pattern POST is    not the same as string poſt
PASS: regex pattern POST is indeed the same as string poſt

re    lib passed 0 of 5 tests
regex lib passed 5 of 5 tests
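
The difference between comparing lowercase mappings and comparing casefolds can be sketched on plain strings, outside any regex engine. This is not the test case mentioned in the report and it does not call re or regex. It only compares the five pattern/string pairs from the sample run in two ways. The helper names are hypothetical, the per-character lowercasing is an assumption meant to mimic the behaviour the report describes, and str.casefold() requires Python 3.3 or later, so it was not available in the versions the report names.

```python
# Illustrative sketch only: helper names are hypothetical, and
# str.casefold() needs Python 3.3+ (it is not in 2.7 or 3.2).

# The five (pattern, string) pairs from the sample run, written as escapes
# so the combining character in the first pair stays unambiguous.
PAIRS = [
    ("\u0399", "\u0345"),   # GREEK CAPITAL LETTER IOTA vs COMBINING GREEK YPOGEGRAMMENI
    ("\u039c", "\u00b5"),   # GREEK CAPITAL LETTER MU vs MICRO SIGN
    ("\u017f", "s"),        # LATIN SMALL LETTER LONG S vs s
    ("\u03a3\u03a4\u0399\u0393\u039c\u0391\u03a3",   # capital letters, capital sigma at both ends
     "\u03c3\u03c4\u03b9\u03b3\u03bc\u03b1\u03c2"),  # small letters, final sigma at the end
    ("POST", "po\u017ft"),  # ASCII capitals vs a word containing long s
]


def lower_per_char(text):
    """Lowercase each character on its own (assumed stand-in for a casemap comparison)."""
    return "".join(ch.lower() for ch in text)


def same_by_lowercase(a, b):
    """Compare lowercase mappings, character by character."""
    return lower_per_char(a) == lower_per_char(b)


def same_by_casefold(a, b):
    """Compare Unicode casefolds."""
    return a.casefold() == b.casefold()


def compare_all(pairs=PAIRS):
    """Return a (lowercase_equal, casefold_equal) tuple for every pair."""
    return [(same_by_lowercase(a, b), same_by_casefold(a, b)) for a, b in pairs]


if __name__ == "__main__":
    for (a, b), (by_lower, by_fold) in zip(PAIRS, compare_all()):
        print(ascii(a), ascii(b), "lowercase:", by_lower, "casefold:", by_fold)
```

Under these assumptions every pair compares unequal by lowercase mapping and equal by casefold, which mirrors the FAIL/PASS split in the sample run. A real matcher also has to handle folds that change string length, which this whole-string comparison does not address.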