Message314150
Neither upper() nor lower() has ever been guaranteed to preserve string length in Unicode. For example, some characters decompose into a base plus combining characters. Ligatures are another example. See here for more details:
https://unicode.org/faq/casemap_charprop.html
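As a minimal sketch of the point above, the following Python 3 snippet compares string length before and after upper() for a few characters. The sample characters and the helper name length_change are chosen here for illustration and are not taken from the original report.

```python
# Case conversions that change the number of code points (Python 3).
# The sample characters are illustrative choices, not from the report.
examples = ["\u00df", "\ufb01", "\u0149"]  # sharp s, fi ligature, n preceded by apostrophe

def length_change(s):
    """Return (len(s), len(s.upper())) so the two lengths can be compared."""
    return len(s), len(s.upper())

for ch in examples:
    before, after = length_change(ch)
    print(f"U+{ord(ch):04X} {ch!r}: length {before} -> {after} ({ch.upper()!r})")
```

Each of these single-character strings becomes two characters when uppercased, so code that assumes len(s) == len(s.upper()) can break on such input.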
However, this example surprises me. In Python 2, I get the result I expected:
py> import unicodedata
py> c = unichr(304)
py> unicodedata.name(c)
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
py> unicodedata.name(c.lower())
'LATIN SMALL LETTER I'
If I am reading the UnicodeData.txt file correctly, I think that the right behaviour is for LATIN CAPITAL LETTER I WITH DOT ABOVE to lowercase to LATIN SMALL LETTER I, as it did in Python 2.
ftp://ftp.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
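For comparison, here is a rough sketch of the same check on Python 3, where chr() replaces unichr(). The helper names() is introduced here for illustration. On Python 3, c.lower() returns a two-character string, so calling unicodedata.name() on it directly raises TypeError; the sketch names each code point separately instead.

```python
import unicodedata

# Python 3 equivalent of unichr(304) from the session above.
c = chr(0x130)  # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = c.lower()

def names(s):
    """Return the Unicode name of every code point in s (illustrative helper)."""
    return [unicodedata.name(ch) for ch in s]

print(len(lowered))    # number of code points after lowercasing
print(names(lowered))  # name of each resulting code point
```

On Python 3 this shows a result of length 2: LATIN SMALL LETTER I followed by COMBINING DOT ABOVE. The lowercase field in UnicodeData.txt holds only a single-code-point (simple) mapping, and multi-character mappings appear to come from the separate SpecialCasing.txt file, which may account for the difference from Python 2.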
Date | User | Action | Args
2018-03-20 16:12:41 | steven.daprano | set | recipients: + steven.daprano, vstinner, ezio.melotti, methane, Kiril Dimitrov
2018-03-20 16:12:41 | steven.daprano | set | messageid: <1521562361.54.0.467229070634.issue33108@psf.upfronthosting.co.za>
2018-03-20 16:12:41 | steven.daprano | link | issue33108 messages
2018-03-20 16:12:41 | steven.daprano | create |