Message 325649 - Python tracker



Author	vstinner
Recipients	ezio.melotti, vstinner, zamsalak
Date	2018-09-18.14:15:42
Message-id	<1537280143.02.0.956365154283.issue34723@psf.upfronthosting.co.za>

Content

> Should it not simply return “i”?

Python implements the Unicode standard.

>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']

>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the character up in Python's internal Unicode database.

The U+0130 character takes the EXTENDED_CASE_MASK branch, which reads its mapping from the secondary _PyUnicode_ExtendedCase table used for "extended case".
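
The effect of that branch can be observed from pure Python, without reading the C code. The sketch below scans every code point and keeps those whose lowercase form is longer than one character. It assumes that a multi-character result is what the extended-case table produces; the name multi_char_lower is only illustrative.

```python
import sys

# Collect every code point whose lowercase form is longer than one character.
# These are the characters that a simple one-to-one mapping cannot handle.
multi_char_lower = [
    "U+%04X" % cp
    for cp in range(sys.maxunicode + 1)
    if len(chr(cp).lower()) > 1
]

print(multi_char_lower)
```

U+0130 should appear in the printed list, since its lowercase form is the two code points U+0069 U+0307.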

In the end, Python uses the following data file from the Unicode standard:

https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt

Extract:
"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""

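The "preserve canonical equivalence" comment can be checked directly. A small sketch, assuming only the standard unicodedata module: the NFD form of "İ" is "I" followed by U+0307, and lowercasing either form gives the same string, which would not hold if "İ".lower() returned a plain "i".

```python
import unicodedata

composed = "\u0130"                                   # İ as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # "I" followed by U+0307

# Both spellings of the same character lowercase to the same result.
same_lowercase = composed.lower() == decomposed.lower()

print(["U+%04X" % ord(ch) for ch in decomposed])
print(same_lowercase)
```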

If you want to convert strings differently for the special case of Turkish, you need locale-aware case conversion: str.lower() only applies the language-independent Unicode mappings...
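
For code that really needs Turkish rules, one option is to handle the dotted and dotless I explicitly before calling lower(). This is a minimal sketch, not something Python provides: turkish_lower is a hypothetical helper, and it only covers the simple case where the combining dot directly follows "I".

```python
def turkish_lower(text):
    """Lowercase text using Turkish rules for I (hypothetical helper)."""
    return (
        text.replace("I\u0307", "i")   # decomposed İ (I + COMBINING DOT ABOVE) -> i
            .replace("\u0130", "i")    # precomposed İ -> i
            .replace("I", "\u0131")    # remaining plain I -> dotless ı
            .lower()                   # everything else: default Unicode mapping
    )

print(turkish_lower("\u0130STANBUL"))
print(turkish_lower("I\u015eIK"))
```

The replacements run before lower() so that the default mapping never sees "İ" or "I"; all other characters still go through the standard Unicode conversion.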

I close the issue as NOT A BUG.
