Message119995
Summary: Somewhere between 2.6.5 r79063 and 3.1 r79147, a regression in the Unicode NFC normalization was introduced. This regression leads to bot edit wars on Wikipedia [1]. It is reproducible with a simple script [2]. MediaWiki/PHP [3] and C# [4] test scripts both show the old behaviour, which leads me to believe this is a Python bug.
A search for older bugs shows bug #1054943 [5] which has commits in the suspected region.
The regression causes certain strings that are already NFC-normalized to be mangled when normalized again. Because of the wide range of Unicode strings on Wikipedia, this causes several problems. Details of those can be found at [1].
Example strings (all of these have been NFC-normalized by MediaWiki):
* u'Li\u030dt-s\u1e73\u0301'
* u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
* u'\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928'
The bug can be shown simply with
unicodedata.normalize('NFC', s) == s
where s is one of the strings above. This returns True on older Python versions and False on newer versions. A script that performs this check is available [2].
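For convenience, the check can be sketched as a self-contained Python 3 snippet. This is not the script from [2]; the names SAMPLES, is_nfc_stable and report are hypothetical and chosen only for illustration. It applies the comparison above to the three example strings and prints the interpreter version next to OK or FAIL.

```python
import sys
import unicodedata

# Example strings from this report (already NFC-normalized by MediaWiki).
SAMPLES = [
    'Li\u030dt-s\u1e73\u0301',
    '\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917',
    '\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928',
]


def is_nfc_stable(s):
    """Return True if NFC normalization leaves s unchanged."""
    return unicodedata.normalize('NFC', s) == s


def report(samples=SAMPLES):
    """Return 'OK' if every sample is NFC-stable, otherwise 'FAIL'."""
    return 'OK' if all(is_nfc_stable(s) for s in samples) else 'FAIL'


if __name__ == '__main__':
    # Print the interpreter version next to the verdict.
    print(repr(sys.version), report())
```

On Python 2.x the string literals would need a u prefix and print would be a statement; the comparison itself is unchanged.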
The bug has been tested on the following machines and Python versions. OK indicates the bug is not present; FAIL indicates the bug is present.
Host: SunOS willow 5.10 Generic_142910-17 i86pc i386 i86pc Solaris
'2.3.3 (#1, Dec 16 2004, 14:38:56) [C]' OK
'2.6.5 (r265:79063, Jul 10 2010, 17:50:38) [C]' OK
'2.7 (r27:82500, Aug 5 2010, 04:28:45) [C]' FAIL
'3.1.2 (r312:79147, Sep 24 2010, 05:34:04) [C]' FAIL
Host: Linux nightshade 2.6.26-2-amd64 #1 SMP Thu Sep 16 15:56:38 UTC 2010 x86_64 GNU/Linux
'2.4.6 (#2, Jan 24 2010, 12:20:41) \n[GCC 4.3.2]' OK
'2.5.2 (r252:60911, Jan 24 2010, 17:44:40) \n[GCC 4.3.2]' OK
'2.6.4+ (r264:75706, Feb 16 2010, 05:11:28) \n[GCC 4.4.3]' OK
Host: Linux dorthonion 2.6.22.18-co-0.7.4 #1 PREEMPT Wed Apr 15 18:57:39 UTC 2009 i686 GNU/Linux
'2.5.4 (r254:67916, Jan 20 2010, 21:44:03) \n[GCC 4.3.3]' OK
'2.6.2 (release26-maint, Apr 19 2009, 01:56:41) \n[GCC 4.3.3]' OK
'3.0.1+ (r301:69556, Apr 15 2009, 15:59:22) \n[GCC 4.3.3]' OK
[1] https://sourceforge.net/tracker/index.php?func=detail&aid=3081100&group_id=93107&atid=603138# ; http://fr.wikipedia.org/w/index.php?title=Mark_Zuckerberg&action=historysubmit&diff=57753004&oldid=57751674
[2] http://pastebin.ca/1977285 (py2.x), http://pastebin.ca/1977287 (py3.x)
[3] http://pastebin.ca/1977292 (PHP, placed in http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/normal/)
[4] http://pastebin.ca/1977261 (C#)
[5] http://bugs.python.org/issue1054943#
Date | User | Action | Args
2010-10-30 15:42:14 | valhallasw | set | recipients: + valhallasw
2010-10-30 15:42:14 | valhallasw | set | messageid: <1288453334.74.0.770473212932.issue10254@psf.upfronthosting.co.za>
2010-10-30 15:42:13 | valhallasw | link | issue10254 messages
2010-10-30 15:42:11 | valhallasw | create |