Issue 46264: 'I'.lower() should give non dotted i for LANG=tr_TR

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90422

classification

Title:	'I'.lower() should give non dotted i for LANG=tr_TR
Type:		Stage:
Components:		Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	christian.heimes, eric.araujo, fbacher, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2022-01-05 03:43 by fbacher, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
foo.py	fbacher, 2022-01-05 03:43	Simple test program

Messages (6)
msg409733 - (view)	Author: Frank Feuerbacher (fbacher)	Date: 2022-01-05 03:43
This blasted Turkish I will be the death of us all... https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf has a lovely graphic on page 238 showing how the various I's are upper- and lowercased depending on whether the locale is Turkish. It seems that Python 3.9.5 is broken, and I see no evidence that 3.10 has fixed it. Basically, U+0049 (I) should lowercase to U+0131 (ı), and vice versa, when the locale is tr_TR. The rules are different for other locales.
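For reference, a minimal sketch of what the built-in str methods return for the four I's, whatever LANG is set to. The helper name `default_case_mappings` is made up for illustration; the values follow from Python applying Unicode's default (locale-independent) case mappings.

```python
# str.lower()/str.upper() use Unicode's default case mappings and do not
# consult LANG or LC_ALL, so the Turkish-specific rules are never applied.
DOTLESS_SMALL_I = "\u0131"   # ı LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE


def default_case_mappings():
    """Return the default (non-Turkish) mappings for the four I's."""
    return {
        "lower(I)": "I".lower(),               # plain 'i', not dotless ı
        "upper(i)": "i".upper(),               # plain 'I', not dotted İ
        "upper(ı)": DOTLESS_SMALL_I.upper(),   # plain 'I'
        "lower(İ)": DOTTED_CAPITAL_I.lower(),  # 'i' + U+0307 COMBINING DOT ABOVE
    }


if __name__ == "__main__":
    for label, value in default_case_mappings().items():
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in value)
        print(f"{label}: {code_points}")
```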
msg409743 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2022-01-05 09:53
Python's stdlib does not support locale-aware Unicode transformations. I recommend that you check out https://pypi.org/project/PyICU.
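A rough sketch of what the PyICU route might look like, assuming the third-party PyICU package is installed and that its `UnicodeString.toLower()` accepts a `Locale`, mirroring the underlying ICU C++ API. The wrapper name `icu_lower` is hypothetical, and it returns None when PyICU is not importable.

```python
try:
    # Third-party dependency (pip install PyICU); not part of the stdlib.
    from icu import Locale, UnicodeString
except ImportError:
    Locale = UnicodeString = None


def icu_lower(text, locale_name="tr_TR"):
    """Lowercase text with ICU's locale-sensitive rules.

    Returns None if PyICU is not available in this environment.
    """
    if UnicodeString is None:
        return None
    return str(UnicodeString(text).toLower(Locale(locale_name)))


if __name__ == "__main__":
    # With PyICU and the Turkish locale, 'I' is expected to become dotless ı.
    print(icu_lower("I"))
```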
msg409750 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-01-05 11:16
If you are looking for case-insensitive string comparison, look at locale.strcoll() and locale.strxfrm(). They are locale-aware.
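A small sketch of the locale module approach, for ordering strings rather than changing their case. The function name `sort_with_locale` and the default locale name are assumptions; the tr_TR.UTF-8 locale may not be installed on a given system, in which case the sketch falls back to whatever collation is already active. How case is weighted in the comparison is up to the platform's collation tables.

```python
import locale


def sort_with_locale(words, locale_name="tr_TR.UTF-8"):
    """Sort words using the collation rules of locale_name, if available."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        pass  # locale not installed; keep the currently active collation
    try:
        # strxfrm() turns each string into a key that compares like strcoll().
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore global state


if __name__ == "__main__":
    print(sort_with_locale(["izmir", "Istanbul", "ırmak", "İzmit"]))
```

Note that this only affects comparison and sorting; 'I'.lower() still returns 'i' afterwards.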
msg409940 - (view)	Author: Frank Feuerbacher (fbacher)	Date: 2022-01-06 23:48
Oh joy. Kodi media server is having Unicode issues and this won't help. I'm trying to see how bad it is. The main use for case transformations is for internal keyword lookup/monocasing. Settings, filenames on monocased filesystems, etc. are caseless. In the main, things work okay until you hit a language, such as Turkish, that does not obey the usual rules. So, ToLower('I') does not map to 'i'. There are ways to work around this, but it depends upon the robustness of the Unicode implementation. I've spent the past several days looking into C++ behavior. It seemed to be similarly broken until I discovered that writing to both cout and wcout tends to break things, including Unicode encoding. It will take a few days to investigate further. Thanks for the info.
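One possible stdlib-only workaround, sketched under the assumption that the caller already knows the text is Turkish: remap the four I's explicitly before applying the default case methods. The names `tr_lower` and `tr_upper` are hypothetical helpers, not anything Python or Kodi provides, and the sketch does not handle the decomposed form I + U+0307.

```python
# Explicit Turkish pairs: I <-> ı (dotless) and İ <-> i (dotted).
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def tr_lower(text):
    """Lowercase text using the Turkish I rules, then the default rules."""
    return text.translate(_TR_LOWER).lower()


def tr_upper(text):
    """Uppercase text using the Turkish I rules, then the default rules."""
    return text.translate(_TR_UPPER).upper()


if __name__ == "__main__":
    print(tr_lower("ISTANBUL"))  # dotless ı at the start
    print(tr_upper("izmir"))     # dotted İ in both positions
```

For internal keyword lookup, where keys are not natural-language text, it is probably safer to keep using the plain locale-independent methods so that the same key folds identically on every machine.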
msg409996 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2022-01-07 18:36
I suppose the casefold method does not help? https://docs.python.org/3.10/library/stdtypes.html#str.casefold
msg410149 - (view)	Author: Frank Feuerbacher (fbacher)	Date: 2022-01-09 13:53
Using casefold did not help.

ubuntu Lang is en_US.UTF-8
[GCC 9.3.0] on linux
>>> folded_1: str = "Turkish I: İı".casefold()
>>> folded_2: str = "tUrkİsh i: iI".casefold()
>>> print(folded_1)
turkish i: i̇ı
>>> print(folded_2)
turki̇sh i: ii
>>> print(folded_1==folded_2)
False

It exhibits the same shortcoming as toLower. multi-language support ain't easy, especially when everything you learned about strings ain't true.
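To see why the two folded strings differ, it can help to list the code points that casefold() produces. This is only an inspection sketch; `describe` is a made-up helper name. The output reflects Unicode's default full case folding, which Python uses without the Turkic-specific folding entries.

```python
import unicodedata


def describe(text):
    """Return 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# İ (U+0130) folds to 'i' followed by a combining dot above, while
# ı (U+0131) is left unchanged, so neither folds to a plain 'i'.
folded_dotted_capital = "\u0130".casefold()
folded_dotless_small = "\u0131".casefold()

if __name__ == "__main__":
    print(describe(folded_dotted_capital))
    print(describe(folded_dotless_small))
    print(describe("I".casefold()))
```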

History
Date	User	Action	Args
2022-04-11 14:59:54	admin	set	github: 90422
2022-01-09 13:53:54	fbacher	set	messages: + msg410149
2022-01-07 18:36:38	eric.araujo	set	nosy: + eric.araujo messages: + msg409996
2022-01-06 23:48:31	fbacher	set	messages: + msg409940
2022-01-05 11:16:13	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg409750
2022-01-05 09:53:21	christian.heimes	set	nosy: + christian.heimes messages: + msg409743
2022-01-05 03:43:54	fbacher	create