This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author valhallasw
Recipients valhallasw
Date 2010-10-30.15:42:11
SpamBayes Score 3.6409764e-13
Marked as misclassified No
Message-id <1288453334.74.0.770473212932.issue10254@psf.upfronthosting.co.za>
In-reply-to
Content
Summary: Somewhere between 2.6.5 r79063 and 3.1 r79147 a regression in the unicode NFC normalization has been introduces. This regression leads to bot edit wars on wikipedia [1]. It is reproducable with a simple script [2]. Mediawiki/PHP [3] and C# [4] test scripts both show the old behaviour, which leads me to believe this is a python bug.
A search for older bugs shows bug #1054943 [5] which has commits in the suspected region.

The regression causes certain NFC-normalized strings to become mangled. Because of the wide range of unicode strings on wikipedia, this causes several problems. Details of those can be found at [1].

Example strings include: (these strings have been NFC-normalized by mediawiki)
 * u'Li\u030dt-s\u1e73\u0301'
 * u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
 * u'\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928'

The bug can be shown simply with
unicodedata.normalize('NFC', s) == s
where s is one of the strings above. This will return True on older python versions, False on newer versions. There is a script available that does this [2].

The bug has been tested on the following machines and python versions. OK indicates the bug is not present, FAIL indicates the bug is present.

Host: SunOS willow 5.10 Generic_142910-17 i86pc i386 i86pc Solaris
'2.3.3 (#1, Dec 16 2004, 14:38:56) [C]' OK
'2.6.5 (r265:79063, Jul 10 2010, 17:50:38) [C]' OK
'2.7 (r27:82500, Aug  5 2010, 04:28:45) [C]' FAIL
'3.1.2 (r312:79147, Sep 24 2010, 05:34:04) [C]' FAIL

Host: Linux nightshade 2.6.26-2-amd64 #1 SMP Thu Sep 16 15:56:38 UTC 2010 x86_64 GNU/Linux
'2.4.6 (#2, Jan 24 2010, 12:20:41) \n[GCC 4.3.2]' OK
'2.5.2 (r252:60911, Jan 24 2010, 17:44:40) \n[GCC 4.3.2]' OK
'2.6.4+ (r264:75706, Feb 16 2010, 05:11:28) \n[GCC 4.4.3]' OK

Host: Linux dorthonion 2.6.22.18-co-0.7.4 #1 PREEMPT Wed Apr 15 18:57:39 UTC 2009 i686 GNU/Linux
'2.5.4 (r254:67916, Jan 20 2010, 21:44:03) \n[GCC 4.3.3]' OK
'2.6.2 (release26-maint, Apr 19 2009, 01:56:41) \n[GCC 4.3.3]' OK
'3.0.1+ (r301:69556, Apr 15 2009, 15:59:22) \n[GCC 4.3.3]' OK

[1] https://sourceforge.net/tracker/index.php?func=detail&aid=3081100&group_id=93107&atid=603138# ; http://fr.wikipedia.org/w/index.php?title=Mark_Zuckerberg&action=historysubmit&diff=57753004&oldid=57751674
[2] http://pastebin.ca/1977285 (py2.x), http://pastebin.ca/1977287 (py3.x)
[3] http://pastebin.ca/1977292 (PHP, placed in http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/normal/), 
[4] http://pastebin.ca/1977261 (C#)
[5] http://bugs.python.org/issue1054943#
History
Date User Action Args
2010-10-30 15:42:14valhallaswsetrecipients: + valhallasw
2010-10-30 15:42:14valhallaswsetmessageid: <1288453334.74.0.770473212932.issue10254@psf.upfronthosting.co.za>
2010-10-30 15:42:13valhallaswlinkissue10254 messages
2010-10-30 15:42:11valhallaswcreate