classification
Title: lower() on Turkish letter "İ" returns a 2-chars-long string
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, ezio.melotti, pombredanne, vstinner, zamsalak
Priority: normal Keywords:

Created on 2018-09-18 14:02 by zamsalak, last changed 2020-01-07 20:34 by pombredanne. This issue is now closed.

Messages (6)
msg325646 - (view) Author: Dogan (zamsalak) Date: 2018-09-18 14:02
Hey there,

I believe I've come across a bug. It occurs when you try to lower() the Turkish uppercase letter "İ". Gonna explain it with example code since it's easier:

>>> len("Ş")
1
>>> len("Ş".lower())
1
>>> len("Ğ")
1
>>> len("Ğ".lower())
1
>>> len("Ö")
1
>>> len("Ö".lower())
1
>>> len("Ç")
1
>>> len("Ç".lower())
1
>>> len("İ")
1
>>> len("İ".lower())
2

When you lower() the Turkish uppercase letter “İ”, it returns a 2 chars long string with the first character being “i”, and the second being chr(775).

Should it not simply return “i”?
msg325649 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-18 14:15
> Should it not simply return “i”?

Python implements the Unicode standard.

>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']

>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the character up in Python's internal Unicode database.

The U+0130 character takes the EXTENDED_CASE_MASK branch, which uses the secondary _PyUnicode_ExtendedCase database for "extended case" mappings.
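
The effect of that extended-case lookup can be observed from pure Python. A minimal sketch, assuming only the standard library (the helper name multi_char_lower is made up for illustration), that scans every code point for a lowercase mapping longer than one code point:

```python
import sys


def multi_char_lower():
    """Return the characters whose str.lower() result spans several code points."""
    expanding = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if len(ch.lower()) > 1:
            expanding.append(ch)
    return expanding


# Print each expanding character with the code points it lowercases to.
for ch in multi_char_lower():
    print("U+%04X -> %s" % (ord(ch), ["U+%04X" % ord(c) for c in ch.lower()]))
```

The exact list depends on the Unicode version bundled with the interpreter, so it is printed rather than hard-coded here.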

Well, at the end, Python uses the following data file from the Unicode standard:

https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt

Extract:
"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


If you want to convert strings differently for the special case of Turkish, you need something other than str.lower(), which applies only the language-independent Unicode mappings.
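
For Turkish text where a single "i" is wanted, one option is a small wrapper around str.lower(). A minimal sketch, assuming a hypothetical helper named turkish_lower (it is not part of Python or of any library) that covers only the dotted and dotless I pairs:

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless I convention.

    Hypothetical helper for illustration; not a standard-library function.
    """
    return (
        text.replace("I\u0307", "i")   # decomposed dotted capital I -> i
            .replace("\u0130", "i")    # precomposed dotted capital I -> i
            .replace("I", "\u0131")    # plain capital I -> dotless small i
            .lower()                   # everything else: default mapping
    )


print(turkish_lower("\u0130STANBUL"))
print(len(turkish_lower("\u0130")))
```

This is only appropriate when the input is known to be Turkish (or another Turkic language with the same convention), since it also turns every plain "I" into a dotless "ı".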

I close the issue as NOT A BUG.
msg359514 - (view) Author: Philippe Ombredanne (pombredanne) * Date: 2020-01-07 15:40
There is a weird thing though (using Python 3.6.8):

>>> [x.lower() for x in 'İ']
['i̇']
>>> [x for x in 'İ'.lower()]
['i', '̇']

I would expect that the results would be the same in both cases. (And this is a source of a bug for some code of mine)
msg359518 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-07 16:34
> I would expect that the results would be the same in both cases.

They are not. Read my previous comment again.

>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
msg359519 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-01-07 16:39
PS: The first entry of the result is a decomposed string, too:

>>> r = [x.lower() for x in 'İ']
>>> hex(ord(r[0][0]))
'0x69'
>>> hex(ord(r[0][1]))
'0x307'
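
Put differently, both comprehensions hold the same two code points and differ only in how they are grouped. A short sketch, using only built-ins, that makes the grouping explicit:

```python
dotted_capital_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Lowercase each input character: one input character, so one (two-code-point) string.
per_char = [ch.lower() for ch in dotted_capital_i]

# Lowercase first, then iterate: two code points, so two one-code-point strings.
whole = [ch for ch in dotted_capital_i.lower()]

print([len(s) for s in per_char])
print([len(s) for s in whole])
print("".join(per_char) == "".join(whole))
```

The first list only looks like a single "i" in a terminal because the combining dot is rendered on top of it.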
msg359538 - (view) Author: Philippe Ombredanne (pombredanne) * Date: 2020-01-07 20:34
Thanks for the (re-)explanation. Unicode is tough!
Basically, this is the issue I really have with the folding: what used to be a proper alphabetic string is no longer one after lower(), because the second code point is a combining mark rather than a letter. I use a regex split on the \W class, which then behaves differently once the string is lowercased, since there is an extra non-word character to break on. I will find a way around these (rare) cases alright!

Sorry for the noise.

```
>>> 'İ'.isalpha()
True
>>> 'İ'.lower().isalpha()
False
```
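
One possible way around it is to split before lowercasing, so the combining mark never reaches the regex on its own. A sketch under that assumption (the sample text is made up, and this may not fit every tokenizer):

```python
import re

text = "\u0130stanbul Ankara"  # "İstanbul Ankara"

# Lowercasing first exposes U+0307, which \W matches, so the word is cut in two.
lower_then_split = [t for t in re.split(r"\W+", text.lower()) if t]

# Splitting first keeps each word whole; the mark stays inside its token.
split_then_lower = [t.lower() for t in re.split(r"\W+", text) if t]

print(lower_then_split)
print(split_then_lower)
```

The tokens from the second approach still contain U+0307, so isalpha() on them remains False; only the split points are preserved.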
History
Date                 User              Action  Args
2020-01-07 20:34:35  pombredanne       set     messages: + msg359538
2020-01-07 16:39:34  christian.heimes  set     nosy: + christian.heimes; messages: + msg359519
2020-01-07 16:34:52  vstinner          set     messages: + msg359518
2020-01-07 15:40:25  pombredanne       set     nosy: + pombredanne; messages: + msg359514
2018-09-18 14:15:43  vstinner          set     status: open -> closed; resolution: not a bug; messages: + msg325649; stage: resolved
2018-09-18 14:02:16  zamsalak          create