classification
Title: lower() on Turkish letter "İ" returns a 2-chars-long string
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, vstinner, zamsalak
Priority: normal
Keywords:

Created on 2018-09-18 14:02 by zamsalak, last changed 2018-09-18 14:15 by vstinner. This issue is now closed.

Messages (2)
msg325646 - Author: Dogan (zamsalak) Date: 2018-09-18 14:02
Hey there,

I believe I've come across a bug. It occurs when you try to lower() the Turkish uppercase letter "İ". Gonna explain it with example code since it's easier:

>>> len("Ş")
1
>>> len("Ş".lower())
1
>>> len("Ğ")
1
>>> len("Ğ".lower())
1
>>> len("Ö")
1
>>> len("Ö".lower())
1
>>> len("Ç")
1
>>> len("Ç".lower())
1
>>> len("İ")
1
>>> len("İ".lower())
2

When you lower() the Turkish uppercase letter “İ”, it returns a 2-character string, with the first character being “i” and the second being chr(775).

Should it not simply return “i”?
msg325649 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-09-18 14:15
> Should it not simply return “i”?

Python implements the Unicode standard.

>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']

>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the character up in Python's internal Unicode database.

The U+0130 character takes the EXTENDED_CASE_MASK branch, which uses the secondary _PyUnicode_ExtendedCase table for "extended case" mappings, that is, mappings that produce more than one code point.
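
The effect of these "extended case" mappings can be observed from pure Python without touching the C internals. The following is only a sketch: it scans every code point and collects those whose lower() result is longer than one code point. The helper name multi_char_lower is made up for this illustration, and the exact list depends on the Unicode version bundled with the interpreter.

```python
import sys


def multi_char_lower():
    """Return the code points whose str.lower() result has more than one code point."""
    return [cp for cp in range(sys.maxunicode + 1) if len(chr(cp).lower()) > 1]


expanding = multi_char_lower()
print(["U+%04X" % cp for cp in expanding])

# Uppercasing has expanding mappings as well, for example the German sharp s.
print("ß".upper())  # SS
```

U+0130 should show up in the printed list, which is the behaviour reported in this issue.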

Well, at the end, Python uses the following data file from the Unicode standard:

https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt

Extract:
"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""

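The "preserve canonical equivalence" comment in the extract can be checked directly. As a small sketch using only unicodedata from the standard library: "İ" decomposes canonically to "I" followed by COMBINING DOT ABOVE, and lowercasing either spelling gives the same two code points. Mapping U+0130 to a bare "i" would make the two spellings diverge.

```python
import unicodedata

composed = "\u0130"                                  # İ as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # I + COMBINING DOT ABOVE

assert decomposed == "I\u0307"

# Lowercasing either spelling yields the same sequence ...
assert composed.lower() == decomposed.lower() == "i\u0307"

# ... and there is no precomposed "i with dot above", so NFC keeps two code points.
assert len(unicodedata.normalize("NFC", composed.lower())) == 2
```
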

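If you want to convert strings differently for the special case of Turkish, you need something other than str.lower(): it applies only the language-independent mappings. The Turkic rule that lowercases "İ" to a plain "i" is listed further down in the same file, but it is conditional on the language being Turkish or Azerbaijani, and str.lower() takes no language argument.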
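
If a single "i" is what you need, one possible workaround is to handle the two Turkic letters by hand before calling lower(). This is a minimal sketch, not something provided by Python: turkic_lower is a hypothetical helper name, and it assumes that NFC-normalizing the whole input is acceptable, since that also recomposes any other decomposed characters in the text.

```python
import unicodedata


def turkic_lower(text):
    """Lowercase text with the Turkish/Azerbaijani rules for I and İ (hypothetical helper)."""
    # Compose "I" + U+0307 into U+0130 so that both spellings are handled alike.
    text = unicodedata.normalize("NFC", text)
    # Turkic rules: İ -> i and I -> ı; everything else uses the default mapping.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(turkic_lower("İSTANBUL"))  # istanbul
print(turkic_lower("IŞIK"))      # ışık
```

Note that this must only be applied to text known to be Turkish or Azerbaijani: on English text it would turn "I" into the dotless "ı".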

I close the issue as NOT A BUG.
History
Date User Action Args
2018-09-18 14:15:43 vstinner set status: open -> closed; resolution: not a bug; messages: + msg325649; stage: resolved
2018-09-18 14:02:16 zamsalak create