Message325649
> Should it not simply return “i”?
Python implements the Unicode standard.
>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the character up in Python's internal Unicode database.
The U+0130 character takes the EXTENDED_CASE_MASK branch, which uses the secondary _PyUnicode_ExtendedCase table for "extended case" mappings.
In the end, Python uses the following data file from the Unicode standard:
https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt
Extract:
"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""
If you want to convert strings differently for the special case of Turkish, you need something other than str.lower(): it applies only the language-independent mappings from that file, not the Turkic ones.
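One possible workaround for Turkish text is to apply the Turkic mappings by hand before calling lower(). The following is a minimal sketch, not an API provided by Python: turkish_lower() is a hypothetical helper name, and it only handles the dotted and dotless I pairs, assuming the input is known to be Turkish.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkic rules for I (hypothetical helper)."""
    # Decomposed dotted capital (I + COMBINING DOT ABOVE) becomes plain "i".
    text = text.replace("I\u0307", "i")
    # Precomposed dotted capital U+0130 becomes plain "i".
    text = text.replace("\u0130", "i")
    # Any remaining capital I is the dotless one: map it to U+0131.
    text = text.replace("I", "\u0131")
    # Everything else follows the default Unicode lowercase mappings.
    return text.lower()


print(turkish_lower("\u0130STANBUL"))
print(turkish_lower("DIYARBAKIR"))
```

The replacements run before lower() so that the default mapping never sees the characters it would handle differently. For anything beyond this simple case, a locale-aware library is probably the better choice.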
I am closing the issue as NOT A BUG.

Date | User | Action | Args
2018-09-18 14:15:43 | vstinner | set | recipients: + vstinner, ezio.melotti, zamsalak
2018-09-18 14:15:43 | vstinner | set | messageid: <1537280143.02.0.956365154283.issue34723@psf.upfronthosting.co.za>
2018-09-18 14:15:43 | vstinner | link | issue34723 messages
2018-09-18 14:15:42 | vstinner | create |