classification
Title: Invalid behavior of unicode.lower
Type: behavior
Stage: patch review
Components: Unicode
Versions: Python 3.0, Python 3.1, Python 2.7, Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: amaury.forgeotdarc, doerwalter, jarek, loewis, terry.reedy
Priority: normal
Keywords: patch

Created on 2009-04-24 10:39 by jarek, last changed 2009-04-25 14:46 by loewis. This issue is now closed.

Files
File name Uploaded Description
diff.txt doerwalter, 2009-04-24 12:57
diff2.txt doerwalter, 2009-04-24 14:15
diff3.txt doerwalter, 2009-04-25 09:16
mud.diff loewis, 2009-04-25 11:38
diff4.txt doerwalter, 2009-04-25 13:37
Messages (14)
msg86400 - (view) Author: Jarek Sobieszek (jarek) Date: 2009-04-24 10:39
u'\u1d79'.lower() returns u'\x00'

I think it should return u'\u1d79', at least according to my
understanding of UnicodeData.txt (the lowercase field is empty).
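
For reference, the expected behavior can be written as a small check. This is only a sketch for a current Python 3 interpreter (where string literals are already Unicode, so the `u` prefix is not needed); it is not part of the original report.

```python
import unicodedata

ch = "\u1d79"
print(unicodedata.name(ch))  # character name as recorded in the Unicode database

# UnicodeData.txt has an empty lowercase field for U+1D79, so lower()
# should hand back the character itself, not a NUL character.
assert ch.lower() == ch
assert ch.lower() != "\x00"
```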
msg86401 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-24 10:49
It *does* return u'\u1d79' for me on Python 2.5.2:

>>> u'\u1d79'.lower()
u'\u1d79'
>>> import sys
>>> sys.version
'2.5.2 (r252:60911, Apr  8 2008, 18:54:00) \n[GCC 3.3.5 (Debian 1:3.3.5-13)]'

However on 2.6.2 it's broken:

>>> u'\u1d79'.lower()
u'\x00'
>>> import sys
>>> sys.version
'2.6.2 (r262:71600, Apr 19 2009, 18:38:49) \n[GCC 4.0.1 (Apple Inc. build 5490)]'
msg86405 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-24 12:57
The following patch fixes the problem for me, however it breaks the test
suite. The change seems to have been introduced in r66362.

Assigning to Martin.
msg86406 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-04-24 13:05
The same change should be applied to _PyUnicode_ToTitlecase as well.
msg86411 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-24 14:15
Updated the patch (diff2.txt) as requested by Amaury.
msg86425 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2009-04-24 18:51
Py3.0.1
>>> '\u1d79'.lower()
'\x00'

I am guessing that this bug is in 2.7 and 3.1 as well.
msg86447 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-25 09:16
Here is a third version of the patch. AFAICT the logic of the unicode
database is as follows:

* If NODELTA_MASK is not set, delta is an offset.
* If NODELTA_MASK is set and delta != 0, delta is the
upper/lower/title case character.
* If NODELTA_MASK is set and delta == 0, there is no
upper/lower/title case variant (i.e. the method returns the original
character).

Is this the correct interpretation?
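
The three rules above can be sketched in Python as follows. This is an illustration of the interpretation being asked about, not the C code from the patch; the function name `map_case` and the numeric value given to `NODELTA_MASK` are assumptions made for this sketch.

```python
NODELTA_MASK = 0x800  # placeholder flag value, assumed for this sketch


def map_case(code_point, flags, delta):
    """Apply one case-mapping field (upper, lower or title) to a code point."""
    if not flags & NODELTA_MASK:
        # Rule 1: delta is an offset relative to the character.
        return code_point + delta
    if delta != 0:
        # Rule 2: delta is the mapped character itself.
        return delta
    # Rule 3: no case variant exists, so return the original character.
    return code_point


print(hex(map_case(0x41, 0, 32)))               # offset case
print(hex(map_case(0x1D79, NODELTA_MASK, 0)))   # "no variant" case
```

Under this reading, the reported bug corresponds to the third rule being skipped, so that a stored 0 is returned as the result.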

I've also updated the testsuite (changed the checksum and added a new test).

(BTW, the patch is against the py3k branch).
msg86476 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-25 11:38
I think the patch is incorrect; the bug is already in
makeunicodedata.py. For U+1d79, it should set the lowercase letter to
U+1d79.

If you look at makeunicodedata.py, you see that the entire logic is
bogus: when the column is absent, it should default it to the character
itself (except for titlecase, where it should default it to uppercase).
Then, if it finds that one of the characters can't be delta-encoded, it
should go back to changing the previous mappings as well.

I'm attaching an untested patch that should do that.

Also see issue4971, which is related.
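
The defaulting rule described in this message can be sketched roughly as below. This is not the attached mud.diff; the function name `case_mappings`, the return shape, and the assumption that a delta must fit in a signed 16-bit range are all made up for illustration.

```python
def case_mappings(code, upper_field, lower_field, title_field):
    """Return (nodelta, (upper, lower, title)) for one UnicodeData.txt record.

    Each *_field argument is the hex string from the record, or "" if absent.
    """
    # An absent column defaults to the character itself ...
    upper = int(upper_field, 16) if upper_field else code
    lower = int(lower_field, 16) if lower_field else code
    # ... except titlecase, which defaults to the uppercase mapping.
    title = int(title_field, 16) if title_field else upper

    deltas = (upper - code, lower - code, title - code)
    # Assumed limit: a delta has to fit in a signed 16-bit value.
    if all(-32768 <= d <= 32767 for d in deltas):
        return False, deltas
    # One mapping cannot be delta-encoded, so store all three as absolute
    # code points, including the ones that were only defaulted.
    return True, (upper, lower, title)
```

With illustrative inputs such as `case_mappings(0x1D79, "A77D", "", "A77D")`, the lowercase entry comes out as 0x1D79 rather than 0, which is the outcome the message asks for.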
msg86506 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-25 13:37
I've merged your version of the patch with my changes to the test suite
and regenerated the Unicode database. Attached is the resulting patch
(diff4.txt).
msg86507 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-25 13:47
Feel free to check it into trunk, and merge into the other three
branches from there. If you don't want to do that, assign it back to me.
msg86511 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-25 14:10
Checked in:
r71894 (trunk)
r71895 (release26-maint)
msg86512 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-25 14:17
Checked in:
r71896 (py3k)
r71897 (release30-maint)
msg86513 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-04-25 14:20
BTW, are the steps to regenerate the Unicode database documented
somewhere? What I did was:

cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/UnicodeData.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/CompositionExclusions.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/EastAsianWidth.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/DerivedCoreProperties.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/ucd/UnicodeData-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/CompositionExclusions-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/EastAsianWidth-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/DerivedCoreProperties-3.2.0.txt .
./python.exe Tools/unicode/makeunicodedata.py
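
Before running the generator, it may help to confirm that every input file copied above is actually present. The helper below is a hypothetical convenience, not part of makeunicodedata.py; the file list is taken from the `cp` commands in this message, and the names `REQUIRED_FILES` and `missing_inputs` are invented here.

```python
import os

# File names taken from the cp commands above.
REQUIRED_FILES = [
    "UnicodeData.txt",
    "CompositionExclusions.txt",
    "EastAsianWidth.txt",
    "DerivedCoreProperties.txt",
    "UnicodeData-3.2.0.txt",
    "CompositionExclusions-3.2.0.txt",
    "EastAsianWidth-3.2.0.txt",
    "DerivedCoreProperties-3.2.0.txt",
]


def missing_inputs(directory="."):
    """Return the required data files that are not present in directory."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    for name in missing_inputs():
        print("missing:", name)
```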
msg86514 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-25 14:46
> BTW, are the steps to regenerate the Unicode database documented
> somewhere? 

I don't think so - your procedure looks right, though.

Regenerating the database is often more difficult, though, in particular
when we upgrade to a new version. Often, the new version will add new
complications which have to be dealt with, so a deep understanding of
makeunicodedata.py is often needed to be able to use it. Welcome to the
club :-)
History
Date User Action Args
2009-04-25 14:46:47 loewis set messages: + msg86514
2009-04-25 14:21:04 doerwalter set assignee: doerwalter -> loewis
2009-04-25 14:20:53 doerwalter set messages: + msg86513
2009-04-25 14:17:33 doerwalter set status: open -> closed; resolution: fixed; messages: + msg86512
2009-04-25 14:10:37 doerwalter set messages: + msg86511
2009-04-25 13:47:12 loewis set assignee: loewis -> doerwalter; messages: + msg86507
2009-04-25 13:37:12 doerwalter set files: + diff4.txt; messages: + msg86506
2009-04-25 11:38:30 loewis set files: + mud.diff; keywords: + patch; messages: + msg86476
2009-04-25 09:16:31 doerwalter set files: + diff3.txt; messages: + msg86447
2009-04-24 18:51:31 terry.reedy set nosy: + terry.reedy; messages: + msg86425; versions: + Python 3.0, Python 3.1, Python 2.7
2009-04-24 14:15:39 doerwalter set files: + diff2.txt; messages: + msg86411
2009-04-24 13:05:49 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg86406
2009-04-24 12:57:23 doerwalter set files: + diff.txt; nosy: + loewis; messages: + msg86405; assignee: loewis; stage: patch review
2009-04-24 10:49:32 doerwalter set nosy: + doerwalter; messages: + msg86401
2009-04-24 10:39:58 jarek create