classification
Title: lower() on Turkish letter "İ" returns a 2-chars-long string
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.6

process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: christian.heimes, ezio.melotti, pombredanne, qdinar, vstinner, zamsalak, Şahin Kureta
Priority: normal
Keywords:

Created on 2018-09-18 14:02 by zamsalak, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (10)
msg325646 - Author: Dogan (zamsalak) Date: 2018-09-18 14:02
Hey there,

I believe I've come across a bug. It occurs when you try to lower() the Turkish uppercase letter "İ". Gonna explain it with example code since it's easier:

>>> len("Ş")
1
>>> len("Ş".lower())
1
>>> len("Ğ")
1
>>> len("Ğ".lower())
1
>>> len("Ö")
1
>>> len("Ö".lower())
1
>>> len("Ç")
1
>>> len("Ç".lower())
1
>>> len("İ")
1
>>> len("İ".lower())
2

When you lower() the Turkish uppercase letter “İ”, it returns a 2-character string: the first character is “i” and the second is chr(775), that is, U+0307.

Should it not simply return “i”?
msg325649 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-18 14:15
> Should it not simply return “i”?

Python implements the Unicode standard.

>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']

>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the character up in Python's internal Unicode database.

The U+0130 character takes the EXTENDED_CASE_MASK branch, which uses the secondary _PyUnicode_ExtendedCase table for "extended case" mappings.

In the end, Python uses the following data file from the Unicode standard:

https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt

Extract:
"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


If you want to convert strings differently for the special case of Turkish, you need language-specific case mapping, which str.lower() does not apply: it only uses the default, language-independent mapping shown above.

I'm closing the issue as NOT A BUG.
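
For readers who do need the Turkish behavior, the following is a minimal sketch of one possible workaround, not something Python provides. The helper name `turkish_lower` is hypothetical. It hard-codes only the Turkic rules for "I" and "İ", and it assumes no other combining marks sit between "I" and the dot.

```python
def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase text using the Turkish rules for I and İ."""
    # Decomposed form first: "I" + COMBINING DOT ABOVE becomes a plain "i".
    text = text.replace("I\u0307", "i")
    # Precomposed U+0130 (İ) becomes "i" with no trailing combining mark.
    text = text.replace("\u0130", "i")
    # Any remaining ASCII "I" becomes U+0131 (dotless ı).
    text = text.replace("I", "\u0131")
    # Everything else goes through the default, language-independent mapping.
    return text.lower()


print(len(turkish_lower("\u0130")))      # 1
print(turkish_lower("I\u015eIK"))        # ışık
```

The replacements run before lower() so that the default mapping never gets a chance to turn U+0130 into the two-code-point sequence.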
msg359514 - Author: Philippe Ombredanne (pombredanne) * Date: 2020-01-07 15:40
There is a weird thing though (using Python 3.6.8):

>>> [x.lower() for x in 'İ']
['i̇']
>>> [x for x in 'İ'.lower()]
['i', '̇']

I would expect that the results would be the same in both cases. (And this is a source of a bug for some code of mine)
msg359518 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-07 16:34
> I would expect that the results would be the same in both cases.

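They are not the same: the first is a list holding one 2-character string, the second is a list of two 1-character strings. Read my previous comment again.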

>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
msg359519 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-01-07 16:39
PS: The first entry of the result is a decomposed string, too:

>>> r = [x.lower() for x in 'İ']
>>> hex(ord(r[0][0]))
'0x69'
>>> hex(ord(r[0][1]))
'0x307'
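
The difference between the two list comprehensions can be checked directly. This is only an illustrative sketch; the variable names `per_char` and `whole` are made up for the example.

```python
per_char = [x.lower() for x in "\u0130"]   # lowercase each character of "İ"
whole = [x for x in "\u0130".lower()]      # iterate over the lowercased string

print(len(per_char), len(whole))             # 1 2
print([len(s) for s in per_char])            # [2]
print("".join(per_char) == "".join(whole))   # True
```

Both expressions contain the same two code points; they differ only in how those code points are grouped into list items.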
msg359538 - Author: Philippe Ombredanne (pombredanne) * Date: 2020-01-07 20:34
Thanks for the (re)explanation. Unicode is tough!
Basically, this is the issue I ultimately have with the folding: what used to be a proper alphabetic string is no longer one after lower(), because the second code point is a combining mark rather than a letter. I use a regex split on the \W class, which then behaves differently on the lowercased string because there is an extra non-word character to break on. I will find a way around these (rare) cases all right!

Sorry for the noise.

```
>>> 'İ'.isalpha()
True
>>> 'İ'.lower().isalpha()
False
```
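
A small sketch of the splitting problem described above, together with one possible way around it (split first, lowercase afterwards). The sample word and variable names are assumptions chosen for illustration; whether this ordering suits a given tokenizer depends on the rest of the pipeline.

```python
import re

text = "\u0130stanbul"  # "İstanbul"

# Lowercasing first introduces U+0307, which \W matches, so the word is split.
split_after_lower = re.split(r"\W+", text.lower())

# Splitting first keeps the word intact; U+0307 only appears inside the token.
split_before_lower = [word.lower() for word in re.split(r"\W+", text)]

print(split_after_lower)        # ['i', 'stanbul']
print(len(split_before_lower))  # 1
```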
msg374323 - Author: Şahin Kureta (Şahin Kureta) Date: 2020-07-26 15:19
I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

[Here is the document](https://www.unicode.org/Public/14.0.0/ucd/NamesList-14.0.0d1.txt)

> 0049	LATIN CAPITAL LETTER I
	* Turkish and Azerbaijani use 0131 for lowercase

> 0069	LATIN SMALL LETTER I
	* Turkish and Azerbaijani use 0130 for uppercase
msg374367 - Author: Philippe Ombredanne (pombredanne) * Date: 2020-07-27 08:47
Şahin Kureta, you wrote:
> I know it is not finalized and released yet but are you going to
> implement Version 14.0.0 of the Unicode Standard? 
> It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

Thank you for the pointer!

I guess Python could consider adopting this spec once it becomes final (but probably not before?).
msg374370 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-07-27 09:26
We don't update the unicodedata database in patch releases because updates are backwards incompatible. Python 3.9 will ship with 13.0. Python 3.10 is going to ship with 14.0.
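
To see which Unicode database a particular interpreter was built with, and what its default lowercase mapping of U+0130 is, something along these lines should work. It is a quick check, not an official procedure; the variable name `lowered` is only for illustration.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

# The default (language-independent) lowercase mapping of U+0130.
lowered = "\u0130".lower()
print([f"U+{ord(ch):04X}" for ch in lowered])   # ['U+0069', 'U+0307']
```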
msg396779 - Author: (qdinar) Date: 2021-06-30 15:01
Şahin Kureta said: "I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'."

This makes it sound as if Unicode version 14 changes something here. It does not: it is the same as version 13. Compare https://www.unicode.org/Public/13.0.0/ucd/SpecialCasing.txt and https://www.unicode.org/Public/14.0.0/ucd/SpecialCasing-14.0.0d8.txt (if that returns a 404, try navigating from https://www.unicode.org/Public/14.0.0/ucd/ ).

History
Date  User  Action  Args
2022-04-11 14:59:06  admin  set  github: 78904
2021-06-30 15:01:06  qdinar  set  nosy: + qdinar; messages: + msg396779
2020-07-27 09:26:01  christian.heimes  set  messages: + msg374370
2020-07-27 08:47:23  pombredanne  set  messages: + msg374367
2020-07-26 15:19:57  Şahin Kureta  set  nosy: + Şahin Kureta; messages: + msg374323
2020-01-07 20:34:35  pombredanne  set  messages: + msg359538
2020-01-07 16:39:34  christian.heimes  set  nosy: + christian.heimes; messages: + msg359519
2020-01-07 16:34:52  vstinner  set  messages: + msg359518
2020-01-07 15:40:25  pombredanne  set  nosy: + pombredanne; messages: + msg359514
2018-09-18 14:15:43  vstinner  set  status: open -> closed; resolution: not a bug; messages: + msg325649; stage: resolved
2018-09-18 14:02:16  zamsalak  create