Issue 12846: unicodedata.normalize turkish letter problem



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57055

classification

Title:	unicodedata.normalize turkish letter problem
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, fizymania, terry.reedy
Priority:	normal	Keywords:

Created on 2011-08-26 13:53 by fizymania, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (3)
msg143008 - (view)	Author: Cem YILDIZ (fizymania)	Date: 2011-08-26 13:53
unicodedata.normalize cannot convert turkish letter "ı" into "i":

import unicodedata
s = u"üfürükçü ağaç ve ıslıkçı çeşme"
print(shoehorn_unicode_into_ascii(s))
print unicodedata.normalize('NFKD', s).encode('ascii','ignore')

>> ufurukcu agac ve slkc cesme

but the result should be

>> ufurukcu agac ve islikci cesme
msg143009 - (view)	Author: Cem YILDIZ (fizymania)	Date: 2011-08-26 13:54
unicodedata.normalize cannot convert turkish letter "ı" into "i":

import unicodedata
s = u"üfürükçü ağaç ve ıslıkçı çeşme"
print unicodedata.normalize('NFKD', s).encode('ascii','ignore')

>> ufurukcu agac ve slkc cesme

but the result should be

>> ufurukcu agac ve islikci cesme
msg143122 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-08-28 20:37
You are doing two different things to the original string: normalizing and encoding to ascii with errors ignored. Each should be tested separately. On 3.2:

import unicodedata
s1 = "üfürükçü ağaç ve ıslıkçı çeşme"
s2 = unicodedata.normalize('NFKD', s1)
print(s2)
print(s2.encode('ascii','ignore'))
#prints
üfürükçü ağaç ve ıslıkçı çeşme
b'ufurukcu agac ve slkc cesme'

The dotless i (== '\u0131') in s2 does not encode to ascii and is properly dropped when the error is ignored.

I believe you are mistaken to think that unicodedata.normalize should turn the Turkish letter "ı" == "\u0131" into "i". unicodedata.decomposition("ı") returns an empty string, as it should (see below), because that character has no decomposition mapping in Unicode 6. So I am closing this issue as invalid.

Here is the entry from
http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt

0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

That is explained here
http://www.unicode.org/reports/tr44/tr44-6.html#UnicodeData.txt
The blank after 'L' (bidi class - left to right) is for decomposition type and mapping. There is none, so unicodedata.decomposition is correct. The last three entries are for uppercase, lowercase, and titlecase conversions. Those are different from normalizations.

To reinforce this,
http://www.unicode.org/Public/6.0.0/ucd/NormalizationTest.txt
says explicitly
"@Part1 # Character by character test
# All characters not explicitly occurring in c1 of Part 1 have identical NFC, D, KC, KD forms."
'c1' is column 1, starting from 1. In this list, 0130 is followed by 0132, omitting 0131, so the line above applies.

After writing this, I discovered that Lib/test/test_normalization.py runs the complete test specified in NormalizationTest.txt for code points that have and do not have normalization forms.

Side note: Python 2.6 is in security-fix-only mode.
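
The distinction drawn in the message above (normalization is one step, lossy ASCII encoding is another) can be checked directly on Python 3. The sketch below is illustrative only: the explicit translation table and the `to_ascii` helper are hypothetical names that do not appear anywhere in this issue, and mapping "ı" to "i" is an application-level choice, not something Unicode normalization provides.

```python
import unicodedata

s1 = "üfürükçü ağaç ve ıslıkçı çeşme"

# Step 1: normalization alone. NFKD splits "ü" into "u" + U+0308,
# but leaves U+0131 (dotless i) untouched because it has no decomposition.
s2 = unicodedata.normalize("NFKD", s1)
assert unicodedata.decomposition("\u00fc") == "0075 0308"
assert unicodedata.decomposition("\u0131") == ""
assert "\u0131" in s2

# Step 2: encoding with errors ignored. Combining marks and the dotless i
# are not ASCII, so they are silently dropped.
ascii_lossy = s2.encode("ascii", "ignore")
print(ascii_lossy)  # b'ufurukcu agac ve slkc cesme'

# Hypothetical workaround (an assumption, not part of unicodedata):
# map the dotless i to "i" explicitly before normalizing.
DOTLESS_I_TO_ASCII = {0x0131: "i"}

def to_ascii(text):
    """Replace U+0131 by hand, then strip marks via NFKD + lossy encode."""
    mapped = text.translate(DOTLESS_I_TO_ASCII)
    decomposed = unicodedata.normalize("NFKD", mapped)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii(s1))  # ufurukcu agac ve islikci cesme
```

Whether such a mapping is appropriate depends on the application, since "ı" and "i" are distinct letters in Turkish.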

History
Date	User	Action	Args
2022-04-11 14:57:21	admin	set	github: 57055
2011-08-29 00:47:17	ezio.melotti	set	nosy: + ezio.melotti stage: resolved
2011-08-28 20:37:43	terry.reedy	set	status: open -> closed versions: + Python 2.7, Python 3.2, - Python 2.6 nosy: + terry.reedy messages: + msg143122 resolution: not a bug
2011-08-26 13:54:43	fizymania	set	messages: + msg143009
2011-08-26 13:53:37	fizymania	set	type: behavior
2011-08-26 13:53:26	fizymania	create