classification
Title: Unicode char 304 in lowercase has len = 2
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.6
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Latin Capital Letter I with Dot Above (issue 17252)
Assigned To:
Nosy List: Kiril Dimitrov, ezio.melotti, malin, methane, serhiy.storchaka, steven.daprano, vstinner
Priority: normal
Keywords:

Created on 2018-03-20 13:22 by Kiril Dimitrov, last changed 2018-03-22 06:09 by serhiy.storchaka. This issue is now closed.

Messages (7)
msg314142 - (view) Author: Kiril Dimitrov (Kiril Dimitrov) Date: 2018-03-20 13:22
>>> chr(304)
'İ'
>>> chr(304).lower()
'i̇'
>>> len(chr(304).lower())
2

This breaks Unicode text matching. I have not found any other Unicode character with the same behaviour (tested on 3.6.2 and 3.6.4).
msg314143 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-03-20 13:28
Another example:

>>> s = "ß"
>>> len(s)
1
>>> len(s.upper())
2
>>> s.upper()
'SS'
>>> ord(s)
223


> This breaks unicode text matching.

What are you talking about? The re module?
msg314146 - (view) Author: Kiril Dimitrov (Kiril Dimitrov) Date: 2018-03-20 14:18
This is roughly my use case:
list(zip("ßx", [0.5, 0.3])) is [('ß', 0.5), ('x', 0.3)]
list(zip("ßx".upper(), [0.5, 0.3])) is [('S', 0.5), ('S', 0.3)]
In the latter case you never get to see the value for 'x'.

My expectation was that lower() and upper() preserve text length.
At least this seemed to be the case in Python 2.7.
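
One possible workaround for this use case, sketched under the assumption that each value belongs to one source character: case-map the characters one at a time, so an expanding mapping stays attached to a single value. The helper name upper_with_weights is made up for illustration and is not part of any library.

```python
def upper_with_weights(text, weights):
    """Pair each original character's weight with its uppercased form."""
    # Uppercase one source character at a time, so a mapping that
    # expands (for example 'ß' -> 'SS') still lines up with one weight.
    return [(ch.upper(), w) for ch, w in zip(text, weights)]


pairs = upper_with_weights("ßx", [0.5, 0.3])
print(pairs)  # [('SS', 0.5), ('X', 0.3)]
```

Note that mapping character by character ignores context-sensitive rules, such as the final-sigma handling in str.lower(), so the pieces may not always join up to the same string as case-mapping the whole text at once.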

msg314150 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-03-20 16:12
It has never been the case that upper() or lower() are guaranteed to preserve string length in Unicode. For example, some characters decompose into a base plus combining characters. Ligatures are another example. See here for more details:

https://unicode.org/faq/casemap_charprop.html
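
To see which single characters change length under case mapping on a given interpreter, a brute-force scan over all code points is enough. This is only a sketch for Python 3; the helper name length_changing is invented here, and the results depend on the Unicode database version bundled with the interpreter.

```python
import sys


def length_changing(mapping):
    """Return code points whose case mapping is not exactly one character long."""
    result = []
    for cp in range(sys.maxunicode + 1):
        if len(mapping(chr(cp))) != 1:
            result.append(cp)
    return result


lower_expanders = length_changing(str.lower)
upper_expanders = length_changing(str.upper)

# Print the code points in hex so they can be looked up in the Unicode charts.
print([hex(cp) for cp in lower_expanders])
print([hex(cp) for cp in upper_expanders])
```

U+0130 (decimal 304) should show up in the first list and U+00DF ('ß') in the second.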


However, this example surprises me. In Python 2, I get the result I expected:

py> import unicodedata
py> c = unichr(304)
py> unicodedata.name(c)
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
py> unicodedata.name(c.lower())
'LATIN SMALL LETTER I'


If I am reading the UnicodeData.txt file correctly, I think that the right behaviour is for LATIN CAPITAL LETTER I WITH DOT ABOVE to lowercase to LATIN SMALL LETTER I, as it did in Python 2.

ftp://ftp.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
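
For comparison, here is a small Python 3 sketch that names the characters in the lowercased result. It only inspects what str.lower() returns; which Unicode data file the mapping comes from is not shown by this code.

```python
import unicodedata

lowered = chr(304).lower()

# Look up the Unicode name of each character in the lowercased string.
names = [unicodedata.name(ch) for ch in lowered]
print(names)  # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```

So in Python 3 the result is the plain small letter i followed by a combining dot above, which is where the length of 2 comes from.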
msg314230 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-03-22 00:43
Maybe we should update UnicodeData?
msg314232 - (view) Author: Ma Lin (malin) * Date: 2018-03-22 01:59
There was a discussion about "Latin Capital Letter I with Dot Above":
https://bugs.python.org/issue17252
msg314234 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-03-22 06:09
Thank you Ma Lin.

Closed as a duplicate of issue17252.
History
Date                 User              Action  Args
2018-03-22 06:09:55  serhiy.storchaka  set     status: open -> closed; superseder: Latin Capital Letter I with Dot Above; nosy: + serhiy.storchaka; messages: + msg314234; resolution: duplicate; stage: resolved
2018-03-22 01:59:09  malin             set     nosy: + malin; messages: + msg314232
2018-03-22 00:43:38  methane           set     messages: + msg314230
2018-03-20 16:12:41  steven.daprano    set     nosy: + steven.daprano; messages: + msg314150
2018-03-20 14:18:08  Kiril Dimitrov    set     messages: + msg314146
2018-03-20 13:28:48  methane           set     nosy: + methane; messages: + msg314143
2018-03-20 13:22:23  Kiril Dimitrov    set     title: Unicode char 304 in lowercase has len 2 -> Unicode char 304 in lowercase has len = 2
2018-03-20 13:22:06  Kiril Dimitrov    create