Issue 33108: Unicode char 304 in lowercase has len = 2

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77289

classification

Title:	Unicode char 304 in lowercase has len = 2
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.6

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Latin Capital Letter I with Dot Above (issue 17252)
Assigned To:		Nosy List:	Kiril Dimitrov, ezio.melotti, malin, methane, serhiy.storchaka, steven.daprano, vstinner
Priority:	normal	Keywords:

Created on 2018-03-20 13:22 by Kiril Dimitrov, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg314142	Author: Kiril Dimitrov (Kiril Dimitrov)	Date: 2018-03-20 13:22
>>> chr(304)
'İ'
>>> chr(304).lower()
'i̇'
>>> len(chr(304).lower())
2

This breaks Unicode text matching. There is no other Unicode character with the same behaviour (in 3.6.2 and 3.6.4).
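
The claim about which characters change length under lower() can be checked directly. The following is a minimal sketch, not part of the original report: the helper name length_changing_lowercase is hypothetical, and the result depends on the Unicode database bundled with the interpreter that runs it.

```python
import sys


def length_changing_lowercase():
    """Return (code point, lowered code points) pairs where str.lower()
    does not yield exactly one code point."""
    found = []
    for cp in range(sys.maxunicode + 1):
        lowered = chr(cp).lower()
        if len(lowered) != 1:
            # Record the individual code points of the lowercase result.
            found.append((cp, [hex(ord(c)) for c in lowered]))
    return found


results = length_changing_lowercase()
for cp, parts in results:
    print(f"U+{cp:04X} -> {parts}")
```

The same loop with upper() in place of lower() lists the characters that expand when uppercased, such as "ß".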
msg314143	Author: Inada Naoki (methane) *	Date: 2018-03-20 13:28
Another example:

>>> s = "ß"
>>> len(s)
1
>>> len(s.upper())
2
>>> s.upper()
'SS'
>>> ord(s)
223

> This breaks unicode text matching.

What are you talking about? The re module?
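
For the text-matching concern, one common approach is to compare case-folded strings rather than lowercased or uppercased ones. The sketch below is an illustration under assumptions, not something proposed in this thread: caseless_equal is a hypothetical helper, and it wraps str.casefold() in NFD normalization so that precomposed and decomposed spellings compare equal.

```python
import unicodedata


def caseless_equal(a, b):
    """Compare two strings without regard to case, using full case folding."""

    def fold(s):
        # Decompose, case-fold, then decompose again, because folding
        # can produce characters that are not in NFD form.
        decomposed = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", decomposed.casefold())

    return fold(a) == fold(b)


print(caseless_equal("Straße", "STRASSE"))
print(caseless_equal("İstanbul", "i\u0307stanbul"))
print(caseless_equal("İ", "i"))
```

Comparing folded strings avoids any assumption that case conversion preserves length, since both sides go through the same mapping.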
msg314146	Author: Kiril Dimitrov (Kiril Dimitrov)	Date: 2018-03-20 14:18
This is roughly my use case:

list(zip("ßx", [0.5, 0.3])) is [('ß', 0.5), ('x', 0.3)]
list(zip("ßx".upper(), [0.5, 0.3])) will be [('S', 0.5), ('S', 0.3)]

In the latter case you never get to see the value for 'x'. At least my expectation was that lower and upper should preserve text length. At least this seemed to be the case in Python 2.7.
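
One way to keep per-character values aligned in a use case like this is to pair the values with the original characters first and change case afterwards, one character at a time. This is a minimal sketch; upper_with_weights is a hypothetical helper name, not an existing API.

```python
def upper_with_weights(text, weights):
    """Pair each original character's uppercase form with its weight.

    Zipping happens before case conversion, so a character that expands
    (for example "ß" to "SS") still consumes exactly one weight.
    """
    return [(ch.upper(), w) for ch, w in zip(text, weights)]


pairs = upper_with_weights("ßx", [0.5, 0.3])
print(pairs)  # [('SS', 0.5), ('X', 0.3)]
```

Note that converting characters one at a time ignores context-sensitive rules, such as the final-sigma handling that lower() applies to Greek text, so the joined result may differ from converting the whole string.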
msg314150	Author: Steven D'Aprano (steven.daprano) *	Date: 2018-03-20 16:12
It has never been the case that upper() or lower() are guaranteed to preserve string length in Unicode. For example, some characters decompose into a base plus combining characters. Ligatures are another example. See here for more details:

https://unicode.org/faq/casemap_charprop.html

However, this example surprises me. In Python 2, I get the result I expected:

py> import unicodedata
py> c = unichr(304)
py> unicodedata.name(c)
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
py> unicodedata.name(c.lower())
'LATIN SMALL LETTER I'

If I am reading the UnicodeData.txt file correctly, I think that the right behaviour is for LATIN CAPITAL LETTER I WITH DOT ABOVE to lowercase to LATIN SMALL LETTER I, as it did in Python 2.

ftp://ftp.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
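
For comparison with the Python 2 session, the Python 3 result can be inspected one code point at a time. This is a sketch with a hypothetical helper, describe. As far as I can tell, the difference arises because Python 3's str.lower() applies the full case mappings, which for U+0130 are listed in SpecialCasing.txt rather than in the single-character lowercase field of UnicodeData.txt.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


lowered = chr(304).lower()
print(describe(chr(304)))
print(describe(lowered))
```

unicodedata.name() accepts only a single character, which is why the helper iterates over the string instead of passing the two-character result directly.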
msg314230	Author: Inada Naoki (methane) *	Date: 2018-03-22 00:43
Maybe we should update UnicodeData?
msg314232	Author: Ma Lin (malin) *	Date: 2018-03-22 01:59
There was a discussion about "Latin Capital Letter I with Dot Above": https://bugs.python.org/issue17252
msg314234	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-03-22 06:09
Thank you, Ma Lin. Closed as a duplicate of issue17252.

History
Date	User	Action	Args
2022-04-11 14:58:58	admin	set	github: 77289
2018-03-22 06:09:55	serhiy.storchaka	set	status: open -> closed superseder: Latin Capital Letter I with Dot Above nosy: + serhiy.storchaka messages: + msg314234 resolution: duplicate stage: resolved
2018-03-22 01:59:09	malin	set	nosy: + malin messages: + msg314232
2018-03-22 00:43:38	methane	set	messages: + msg314230
2018-03-20 16:12:41	steven.daprano	set	nosy: + steven.daprano messages: + msg314150
2018-03-20 14:18:08	Kiril Dimitrov	set	messages: + msg314146
2018-03-20 13:28:48	methane	set	nosy: + methane messages: + msg314143
2018-03-20 13:22:23	Kiril Dimitrov	set	title: Unicode char 304 in lowercase has len 2 -> Unicode char 304 in lowercase has len = 2
2018-03-20 13:22:06	Kiril Dimitrov	create