classification
Title: Latin Capital Letter I with Dot Above
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.2, Python 3.3
process
Status: closed
Resolution: works for me
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, christian.heimes, ezio.melotti, firatozgul, haypo, lemburg, pitrou, r.david.murray
Priority: normal
Keywords:

Created on 2013-02-20 09:14 by firatozgul, last changed 2013-02-20 15:21 by christian.heimes. This issue is now closed.

Messages (19)
msg182485 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 09:14
lower() method of strings gives different output for 'Latin Capital Letter I with Dot Above' on Python 3.2 and Python 3.3. 

On Python 3.2 (Windows XP):

>>> "\u0130".lower()
'i' #this is correct

On Python 3.3 (Windows XP):

>>> "\u0130".lower()
'i\u0307' #this is wrong

Why is there this difference? This breaks code, because 'i' and 'i\u0307' are different letters.
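To see exactly what lower() returns, it can help to list the code points of the result. The following is a minimal sketch, assuming Python 3.3 or later; the helper name describe() is hypothetical and not part of any library.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


# On Python 3.3+ the result is two code points: a plain 'i' followed by
# a combining dot above, rather than the single 'i' that 3.2 returned.
lowered = "\u0130".lower()
for code_point, name in describe(lowered):
    print(code_point, name)
```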
msg182486 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 10:57
I thought this would just be a difference in the unicode database, but that appears not to be the case.  Ezio, this is related to the infamous Turkic dotless lower case i problem (see, eg, http://mail.python.org/pipermail/python-bugs-list/2005-October/030686.html).

The SpecialCasing.txt file entries for these characters seem to be the same in 6.0.0 (3.2) and 6.1.0 (3.3).  So the question is, why did the Python behavior change, and is it indeed a bug?  What python3.3 is returning is the canonical version, which would seem to be correct.  Have we been buggy up to this point and something got fixed?

And, referencing that thread above, how does one do a locale dependent lower case?
msg182487 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 11:31
In Python, things like lowercasing-uppercasing and sorting were always problematic with regard to the Turkish language. For instance, whatever the locale is, you cannot lowercase the word 'KADIN' (woman) in Turkish correctly::

    >>> "KADIN".lower()
    'kadin'

... which is wrong. That should be 'kadın' ('kad\u0131n'). Likewise 'kitap' (book)::

    >>> "kitap".upper()
    'KITAP'

... which is wrong. That should be 'KİTAP' ('K\u0130TAP').

As for this thread, in 3.3, Python does a completely different thing::

    >>> "KİTAP".lower()
    'ki\u0307tap' #wrong

In Python 3.2, this was::

    >>> "KİTAP".lower()
    'kitap' #correct

'i' and 'i\u0307' are not the same. 

Turkish Python programmers define their own upper(), lower(), title(), swapcase() and casefold() methods and use their own sorting techniques.
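As an illustration of the kind of helper described here, the following is a minimal sketch, not a full locale-aware implementation. The names tr_lower() and tr_upper() are hypothetical. The sketch only handles the dotted/dotless i pairs (I/ı and İ/i) and assumes the input uses the precomposed U+0130 rather than 'I' followed by a combining dot.

```python
def tr_lower(text):
    """Lowercase text using the Turkish i rules (sketch only)."""
    # Map the two Turkish-specific capitals by hand before calling the
    # generic str.lower(): U+0130 (İ) -> 'i' and 'I' -> U+0131 (ı).
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def tr_upper(text):
    """Uppercase text using the Turkish i rules (sketch only)."""
    # Map 'i' -> U+0130 (İ) and U+0131 (ı) -> 'I' before the generic
    # str.upper(), which leaves both replacements unchanged.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


print(tr_lower("KADIN"))          # kadın
print(tr_upper("kitap"))          # KİTAP
print(tr_lower("K\u0130TAP"))     # kitap
```

Replacing characters before calling the built-in methods keeps the sketch short, but it would also rewrite any non-Turkish text in the same string, so it is only suitable where the whole string is known to be Turkish.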
msg182491 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 12:00
Right, and the Unicode Consortium says that that weird thing 3.3 is doing is the "canonical" lowercasing, and this is the case exactly because in 3.3 "\u0130".lower().upper() == "\u0130".  Which is why I asked Ezio if we ever came up with a way to do lower/upper in a locale specific manner.

The behavior change is an issue, but I'm thinking the 3.3 behavior is probably the "correct" behavior per the unicode standard.
msg182494 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 12:24
r.david.murray: '(...) because in 3.3 "\u0130".lower().upper() == "\u0130"'

Do you mean in Python 3.3 "\u0130".lower() returns "\u0130"?

If you are saying so, this is not the case, because in Python 3.3::

    >>> '\u0130'.lower()
    'i\u0307'
msg182495 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-20 12:28
Yes, I think 3.3 is correct here. I think it was Benjamin who fixed/improved the behaviour of casing methods. Compare 3.3:

>>> "ß".upper()
'SS'

with 3.2:

>>> "ß".upper()
'ß'

Also, 3.2 loses information:

>>> "KİTAP".lower().upper()
'KITAP'
>>> ascii("KİTAP".lower().upper())
"'KITAP'"

while 3.3 retains it:

>>> "KİTAP".lower().upper()
'KİTAP'
>>> ascii("KİTAP".lower().upper())
"'KI\\u0307TAP'"

You can get the combined form again with unicodedata.normalize:

>>> unicodedata.normalize("NFC", "KİTAP".lower().upper())
'KİTAP'
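The round trip above can be checked code point by code point. This is a small sketch assuming Python 3.3 or later. It also shows one limit of the approach: NFC can recompose the uppercase result, but it leaves the lowercase form as two code points, because Unicode has no precomposed small i with dot above.

```python
import unicodedata

original = "K\u0130TAP"

# lower() then upper() yields 'I' followed by a combining dot above.
round_trip = original.lower().upper()
print(ascii(round_trip))

# NFC recomposes 'I' + U+0307 into the single code point U+0130.
composed = unicodedata.normalize("NFC", round_trip)
print(ascii(composed), composed == original)

# The lowercase form stays as 'i' + U+0307 under NFC.
lowered_nfc = unicodedata.normalize("NFC", original.lower())
print(ascii(lowered_nfc))
```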
msg182497 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 12:36
Don't you think that there is a problem here?

>>> "KİTAP".lower().upper()
'KİTAP'
>>> ascii("KİTAP".lower().upper())
"'KI\\u0307TAP'"

"İ" is not "i\u0307". That's a different letter. "i\u0307" is 'i with combining dot above'. However, "İ" is "\u0130" (Latin Capital Letter I with Dot Above).
msg182498 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 12:44
ascii("KİTAP".lower().upper()) should return "K\u0130TAP".

Yes, Python 3.2 loses information, but Python 3.3 inserts faulty information, which, I think, is much worse than losing information.
msg182499 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 12:45
Ah, you are right, I did not decode it to see what the actual characters were.

That does contradict what I said, but I'm way out of my depth on unicode at this point, so we'll have to wait for someone more expert to weigh in.
msg182502 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 13:20
Excerpt from http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

So the code 0130 should be 0069 in lowercase; 0130 in titlecase; and again 0130 in uppercase.
msg182504 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-02-20 13:50
Notice the lines you pulled have "tr" and "az" at the end of them meaning they only apply for Turkish and Azeri. Since the lower() method has no idea whether the user intends to be in a Turkish or Azeri locale or not, we just have to use the generic lowering mapping which simply preserves canonical equivalence.
msg182505 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 13:59
Even if you set Turkish locale, the output is still "generic".

Furthermore, does "canonical equivalence" really dictate that 'Latin Capital Letter I with Dot Above' should be mapped to 'I With Combining Dot Above' in lowercase?

Note: 'Uppercase Dotted i' only exists in Turkish and Azeri.
msg182509 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 14:25
Whatever the behavior of Python is in 'generic' terms, I believe, we should be able to do locale-dependent uppercasing-lowercasing, which we cannot do at the moment.
msg182514 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 14:40
Yes, earlier in that file is the generic translation:

# Preserve canonical equivalence for I with dot. Turkic is handled below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

You see that Python is following the standard, here.

Agreed about the locale-aware upper/lower, etc, but that's a feature request.  There's been some discussion about this kind of thing, but I don't remember what the status is.  A search of the python-ideas and/or python-dev mailing lists might yield some clues.  It's a discussion for one of those mailing lists rather than the bug tracker, in any case.
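To make the difference between the generic rule and the Turkic rules concrete, here is a minimal sketch that parses the three SpecialCasing.txt lines quoted in this thread. The field order assumed is code; lower; title; upper; optional condition. The names EXCERPT, to_text() and parse_rules() are hypothetical, and this is not how CPython itself reads the file.

```python
# The three rule lines quoted in this thread for U+0130.
EXCERPT = """\
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def to_text(code_points):
    """Turn a field such as '0069 0307' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in code_points.split())


def parse_rules(text):
    """Parse 'code; lower; title; upper; [condition;] # comment' lines."""
    rules = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if fields[-1] == "":  # the final ';' leaves an empty last field
            fields.pop()
        code, lower, title, upper = fields[:4]
        rules.append({
            "char": to_text(code),
            "lower": to_text(lower),
            "title": to_text(title),
            "upper": to_text(upper),
            # A fifth field, when present, is the language condition.
            "condition": fields[4] if len(fields) > 4 else None,
        })
    return rules


rules = parse_rules(EXCERPT)
generic = [rule for rule in rules if rule["condition"] is None]
for rule in rules:
    print(rule["condition"], ascii(rule["lower"]))
```

Only the rule with no condition applies when the language is unknown, and its lowercase field matches what Python 3.3 returns from "\u0130".lower().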
msg182517 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 14:49
Apparently, what Python did wrong in the past was somewhat good for Turkish Python developers! This means Turkish developers now have one more problem to solve. Bad.
msg182518 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-20 14:52
> "İ" is not "i\u0307". That's a different letter. "i\u0307" is 'i with
> combining dot above'. However, "İ" is "\u0130" (Latin Capital Letter
> I with Dot Above).

Did you actually read my message? You can reconcile the two using
unicodedata.normalize().
msg182519 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-02-20 14:58
The "locale" module does not affect Unicode operations. That's the C locale; I'm talking about the concept of a Unicode locale, which Python doesn't currently know anything about.

I agree it would be useful to customize the locale of various unicode operations. That's a much broader language-level issue, though, in need of careful design.

As for the useless generic mapping of LATIN CAPITAL LETTER I WITH DOT ABOVE, the idea is that there is no LATIN SMALL LETTER I WITH DOT ABOVE, so the generic lower casing comes from decomposing the character and then lowering the Latin one.
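The reasoning about decomposition can be reproduced by hand with unicodedata. This is only an illustration of the idea, assuming Python 3.3 or later, and not a description of how CPython implements lower().

```python
import unicodedata

capital_i_dot = "\u0130"

# Canonical decomposition splits U+0130 into 'I' + U+0307.
decomposed = unicodedata.normalize("NFD", capital_i_dot)
print(ascii(decomposed))

# Lowering the Latin letter and keeping the combining dot gives the
# same string that the generic mapping produces directly.
generic_lower = decomposed.lower()
print(ascii(generic_lower), generic_lower == capital_i_dot.lower())
```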
msg182520 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-02-20 15:13
On 20.02.2013 15:58, Benjamin Peterson wrote:
> 
> Benjamin Peterson added the comment:
> 
> The "locale" module does not affect Unicode operations. That's the C locale; I'm talking about the concept of a Unicode locale, which Python doesn't currently know anything about.
> 
> I agree it would be useful to customize the locale of various unicode operations. That's a much broader language-level issue, though, in need of careful design.

We'd need to add the CLDR for locale aware operations and a Python
interface for it:

http://cldr.unicode.org/

The Babel project provides such an interface:

http://babel.edgewall.org/

The project appears to have stalled, though.
msg182521 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-02-20 15:21
In the meantime you can use PyICU https://pypi.python.org/pypi/PyICU for locale aware transformations:

>>> from icu import UnicodeString, Locale
>>> tr = Locale("TR")
>>> s = UnicodeString("KADIN")
>>> print(unicode(s.toLower(tr)))
kadın
>>> unicode(s.toLower(tr))
u'kad\u0131n'
History
Date User Action Args
2013-02-20 15:21:37 christian.heimes set nosy: + christian.heimes
messages: + msg182521
2013-02-20 15:13:47 lemburg set messages: + msg182520
2013-02-20 14:58:18 benjamin.peterson set messages: + msg182519
2013-02-20 14:52:14 pitrou set messages: + msg182518
2013-02-20 14:49:34 firatozgul set messages: + msg182517
2013-02-20 14:40:46 r.david.murray set messages: + msg182514
2013-02-20 14:25:33 firatozgul set messages: + msg182509
2013-02-20 13:59:24 firatozgul set messages: + msg182505
2013-02-20 13:50:17 benjamin.peterson set status: open -> closed
resolution: works for me
messages: + msg182504
2013-02-20 13:20:15 firatozgul set messages: + msg182502
2013-02-20 13:14:20 firatozgul set status: closed -> open
resolution: not a bug -> (no value)
2013-02-20 12:45:16 r.david.murray set messages: + msg182499
2013-02-20 12:44:43 firatozgul set messages: + msg182498
2013-02-20 12:36:42 firatozgul set messages: + msg182497
2013-02-20 12:28:22 pitrou set status: open -> closed
nosy: + lemburg, pitrou, haypo, benjamin.peterson
messages: + msg182495
resolution: not a bug
2013-02-20 12:24:09 firatozgul set messages: + msg182494
2013-02-20 12:00:48 r.david.murray set messages: + msg182491
2013-02-20 11:31:59 firatozgul set messages: + msg182487
2013-02-20 10:57:33 r.david.murray set nosy: + ezio.melotti, r.david.murray
messages: + msg182486
components: + Unicode
2013-02-20 09:14:15 firatozgul create