This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: 'I'.lower() should give non dotted i for LANG=tr_TR
Type:
Stage:
Components:
Versions: Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: christian.heimes, eric.araujo, fbacher, serhiy.storchaka
Priority: normal
Keywords:

Created on 2022-01-05 03:43 by fbacher, last changed 2022-04-11 14:59 by admin.

Files
File name  Uploaded                   Description
foo.py     fbacher, 2022-01-05 03:43  Simple test program
Messages (6)
msg409733 - (view) Author: Frank Feuerbacher (fbacher) Date: 2022-01-05 03:43
This blasted Turkish I will be the death of us all...

https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf has a lovely graphic on page 238 showing how the various I's are upper- and lowercased, depending on whether the locale is Turkish. It seems that Python 3.9.5 is broken, and I see no evidence that version 3.10 has fixed it.

Basically, U+0049 (I) should lowercase to U+0131 (ı), and vice versa, when the locale is tr_TR. The rules are different for other locales.
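
A minimal sketch of the reported behavior, assuming CPython 3.9 or later. The `describe` helper is hypothetical and only makes the code points visible; the attached foo.py is not reproduced here. `str.lower()` and `str.upper()` apply the same mappings whatever the process locale is, so none of the results below change under LANG=tr_TR.

```python
import unicodedata


def describe(text):
    """Return 'U+XXXX NAME' for every character in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Locale-independent mappings applied by Python's str methods.
lowered_capital_i = "I".lower()        # U+0049 -> U+0069, not U+0131
uppered_dotless_i = "\u0131".upper()   # U+0131 -> U+0049
lowered_dotted_i = "\u0130".lower()    # U+0130 -> U+0069 followed by U+0307
uppered_small_i = "i".upper()          # U+0069 -> U+0049, not U+0130

for label, value in [
    ("'I'.lower()", lowered_capital_i),
    ("dotless i .upper()", uppered_dotless_i),
    ("dotted I .lower()", lowered_dotted_i),
    ("'i'.upper()", uppered_small_i),
]:
    print(label, describe(value))
```

Under the Turkish rules described in the Unicode chapter, the first and last results would instead be U+0131 and U+0130.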
msg409743 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2022-01-05 09:53
Python's stdlib does not support locale-aware Unicode transformations. I recommend that you check out https://pypi.org/project/PyICU .
msg409750 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-01-05 11:16
If you are looking for case-insensitive string comparison, look at locale.strcoll() and locale.strxfrm(). They are locale-aware.
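
A minimal sketch of how these two functions are typically used, assuming the `tr_TR.UTF-8` locale may or may not be installed on the machine. The `locale_sorted` helper is hypothetical. Note that `locale.strxfrm()` and `locale.strcoll()` affect ordering and comparison only; they do not change what `str.lower()` returns.

```python
import locale


def locale_sorted(words, locale_name="tr_TR.UTF-8"):
    """Sort words using the collation rules of locale_name, if available.

    Hypothetical helper: falls back to the current collation when the
    requested locale is not installed, and restores the previous setting.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            pass  # requested locale missing; keep the current collation
        # strxfrm() turns each string into a key that sorts like strcoll().
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


print(locale_sorted(["izmir", "cam", "\u0131l\u0131k", "\u00e7ay"]))
```

The resulting order depends on which locales the platform provides, so no expected output is shown.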
msg409940 - (view) Author: Frank Feuerbacher (fbacher) Date: 2022-01-06 23:48
Oh joy. Kodi media server is having Unicode issues and this won't help. I'm trying to see how bad it is.

The main use for case transformations is for internal keyword lookup/monocasing. Settings, filenames on monocased filesystems, etc. are caseless. In the main, things work okay until you hit a language, such as Turkish, that does not obey the usual rules. So, ToLower('I') does not map to 'i'. There are ways to work around this, but it depends upon the robustness of the Unicode implementation.
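
One such workaround can be sketched in pure Python, assuming the only special cases that matter are the four Turkish I characters. The names `lower_tr` and `upper_tr` are hypothetical, not part of the standard library, and the sketch does not handle decomposed input such as "I" followed by U+0307.

```python
def lower_tr(text):
    """Lowercase text with the Turkish mapping for I (hypothetical helper)."""
    # Handle the two capitals explicitly, then let the locale-independent
    # str.lower() take care of every other character.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def upper_tr(text):
    """Uppercase text with the Turkish mapping for i (hypothetical helper)."""
    # Dotted i goes to U+0130 first, so the dotless-i replacement that
    # follows cannot be confused with it.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


print(lower_tr("ISPARTA"), lower_tr("\u0130STANBUL"))
print(upper_tr("\u0131sparta"), upper_tr("istanbul"))
```

For internal keys that must stay stable across locales, the plain `str.lower()` result is usually the safer choice; helpers like these are meant for text that is shown to Turkish-speaking users.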

I've spent the past several days looking into C++ behavior. It seemed to be similarly broken until I discovered that writing to both cout and wcout tends to break things, including Unicode encoding.


It will take a few days to investigate further. Thanks for the info.
msg409996 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2022-01-07 18:36
I suppose the casefold method does not help?
https://docs.python.org/3.10/library/stdtypes.html#str.casefold
msg410149 - (view) Author: Frank Feuerbacher (fbacher) Date: 2022-01-09 13:53
Using casefold did not help

Ubuntu, LANG is en_US.UTF-8
[GCC 9.3.0] on linux
>>> folded_1: str = "Turkish I: İı".casefold()
>>> folded_2: str = "tUrkİsh i: iI".casefold()
>>> print(folded_1)
turkish i: i̇ı
>>> print(folded_2)
turki̇sh i: ii
>>> print(folded_1==folded_2)
False

It exhibits the same shortcoming as toLower.
Multi-language support ain't easy, especially when everything you learned about strings ain't true.
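
A minimal sketch of why the comparison in the session above returns False, and of one possible lookup key that makes it True. `casefold()` turns U+0130 into "i" plus U+0307 (combining dot above) and leaves U+0131 unchanged, so neither matches a plain "i". The `loose_i_key` helper is hypothetical and deliberately lossy: it treats I, İ, i and ı as one letter, which is wrong for Turkish text but may be acceptable for caseless internal keys.

```python
def loose_i_key(text):
    """Build a lookup key that treats I, İ, i and ı as the same letter.

    Hypothetical helper, lossy by design: dotted and dotless i are
    distinct letters in Turkish.
    """
    folded = text.casefold()
    # Undo the "i" + U+0307 pair that casefold() produces for U+0130.
    folded = folded.replace("i\u0307", "i")
    # casefold() leaves U+0131 (dotless i) alone; fold it to plain "i".
    return folded.replace("\u0131", "i")


text_1 = "Turkish I: \u0130\u0131"
text_2 = "tUrk\u0130sh i: iI"

# Show the code points that make the folded strings differ.
print([f"U+{ord(ch):04X}" for ch in text_1.casefold()[-3:]])
print([f"U+{ord(ch):04X}" for ch in text_2.casefold()[-2:]])

print(text_1.casefold() == text_2.casefold())
print(loose_i_key(text_1) == loose_i_key(text_2))
```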
History
Date                 User              Action  Args
2022-04-11 14:59:54  admin             set     github: 90422
2022-01-09 13:53:54  fbacher           set     messages: + msg410149
2022-01-07 18:36:38  eric.araujo       set     nosy: + eric.araujo; messages: + msg409996
2022-01-06 23:48:31  fbacher           set     messages: + msg409940
2022-01-05 11:16:13  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg409750
2022-01-05 09:53:21  christian.heimes  set     nosy: + christian.heimes; messages: + msg409743
2022-01-05 03:43:54  fbacher           create