Issue 5200: unicode.normalize gives wrong result for some characters

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49450

classification

Title:	unicode.normalize gives wrong result for some characters
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 2.5

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	PeterL, loewis
Priority:	normal	Keywords:

Created on 2009-02-10 10:45 by PeterL, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unnamed	PeterL, 2009-02-10 20:03
unnamed	PeterL, 2009-02-10 20:50
unnamed	PeterL, 2009-02-11 08:24
unnamed	PeterL, 2009-02-11 19:26

Messages (10)
msg81536 - (view)	Author: Peter Landgren (PeterL)	Date: 2009-02-10 10:45
If any of the Swedish characters "åäöÅÄÖ" are passed to unicode.normalize(form, ustr) with form = "NFD" or "NFKD", the result is "aaoAAO". "åäöÅÄÖ" are normal characters and should be unchanged by normalization. They are connected to "aaoAAO" only for historical reasons, not in the modern languages. It is a common misinterpretation that the dots and the ring above them are diacritical marks; these letters should behave like the (Danish) "Ø", which is normalized correctly.

From Wikipedia: Å is often perceived as an A with a ring, interpreting the ring as a diacritical mark. However, in the languages that use it, the ring is not considered a diacritic but part of the letter. The letter Ö in the Swedish and Icelandic alphabets historically arises from the Germanic umlaut, but it is considered a separate letter from O. See http://en.wikipedia.org/wiki/%C3%85

I think this is probably impossible to solve, as it will be mixed up with the umlaut and you don't know which language a specific word belongs to.
msg81580 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-02-10 18:59
It is not true that normalize produces "aaoAAO". Instead, it produces

    u'a\u030aa\u0308o\u0308A\u030aA\u0308O\u0308'

This is the correct result, according to the Unicode specification. It would be incorrect to leave them unchanged under the Unicode Normal Form D (for decomposed); the decomposed form of 'LATIN SMALL LETTER A WITH RING ABOVE' (for example) is 'LATIN SMALL LETTER A' + 'COMBINING RING ABOVE'.

The Wikipedia article is irrelevant; refer to the Unicode specification for a normative reference.

Closing as invalid.
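
The result quoted above can be checked directly. The following is a minimal sketch in Python 3 syntax (the thread itself used Python 2.5 and `u'...'` literals); the variable names are illustrative only.

```python
import unicodedata

# "åäöÅÄÖ" written with escapes so the example does not depend on file encoding
text = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"
decomposed = unicodedata.normalize("NFD", text)

# Each precomposed letter becomes a base letter followed by a combining mark,
# so the string doubles in length rather than collapsing to "aaoAAO".
print(len(text), len(decomposed))
print(ascii(decomposed))
for ch in decomposed[:2]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Recomposing with NFC restores the original string.
roundtrip = unicodedata.normalize("NFC", decomposed)
print(roundtrip == text)
```

A terminal or editor that renders combining marks poorly may display the decomposed string so that it looks like plain "aaoAAO", although the combining code points are still present in the data.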
msg81595 - (view)	Author: Peter Landgren (PeterL)	Date: 2009-02-10 20:03
Thanks for the fast response. I understand that Python follows the Unicode specification. I think the Unicode standard is not correct in this case for the Swedish letters. I have asked unicode.org for an explanation.

Should not the Danish letter "Ø" be normalized as "O"? I get "Ø" for all of the NFC/NFD/NFKC/NFKD normalizations.

Regards,
Peter Landgren
msg81596 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-02-10 20:15
> Should not the Danish letter "Ø" be normalized as "O"? I get "Ø" for all NFC/NFD/NFKC/NFKD
> normalizations?

I think you have a fundamental misunderstanding of what a "decomposition" is. "Ø" should not be decomposed as "O", because clearly, "Ø" and "O" are different letters. If anything, it would be decomposed as "O" + SOME COMBINING MARK.

Now, in the specific case of

    00D8;LATIN CAPITAL LETTER O WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER O SLASH;;;00F8;

no canonical decomposition is specified. Compare this to

    00D5;LATIN CAPITAL LETTER O WITH TILDE;Lu;0;L;004F 0303;;;;N;LATIN CAPITAL LETTER O TILDE;;;00F5;

which decomposes to U+004F followed by U+0303, i.e. LATIN CAPITAL LETTER O followed by COMBINING TILDE.

If "Ø" were to be decomposed, it should use a mark COMBINING STROKE, but no such combining mark exists in Unicode. I don't know why that is; you would have to ask the Unicode consortium. In any case, Unicode guarantees stability wrt. decompositions, so even if some combining mark gets added later on, the existing decompositions remain stable.
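
The decomposition field of the two UnicodeData records quoted above can also be read from Python. This is a small sketch in Python 3 syntax using `unicodedata.decomposition()`; the `mappings` dictionary and `unchanged` flag are names made up for this illustration.

```python
import unicodedata

mappings = {}
for ch in ("\u00d8", "\u00d5", "\u00c5"):  # Ø, Õ, Å
    # decomposition() returns the raw mapping field from the Unicode
    # Character Database, or an empty string if the character has none.
    mappings[ch] = unicodedata.decomposition(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mappings[ch] or '(none)'}")

# With no decomposition mapping, NFD leaves "Ø" as it is.
unchanged = unicodedata.normalize("NFD", "\u00d8") == "\u00d8"
print(unchanged)
```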
msg81598 - (view)	Author: Peter Landgren (PeterL)	Date: 2009-02-10 20:50
The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O", which are also different letters, just as "Ø" and "O" are. ("Ø" is the Danish version of "Ö".) Maybe not in the Unicode world, but in real life. That's why I'm a little confused.

I will wait and see what, if anything, the Unicode people say. In any case, thanks for the discussion.

Regards,
/Peter
msg81603 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-02-10 21:32
> The same applies "Å" and "A", "Ä" and "A" and "Ö" and "O"
> which also are also different letters as "Ø" and "O" are.

Sure. And rightfully, "Å" is not (I repeat: not) normalized as "A", under NFD:

    py> unicodedata.normalize("NFD", u"Å")
    u'A\u030a'

> Maybe not in the unicode world but in real life.

They are different letters also in the Unicode world.

> That's why I'm a little confused.

I think the confusion comes from your assumption that normalizing "Å" produces "A". It does not. Really not.
msg81632 - (view)	Author: Peter Landgren (PeterL)	Date: 2009-02-11 08:24
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> > The same applies "Å" and "A", "Ä" and "A" and "Ö" and "O"
> > which also are also different letters as "Ø" and "O" are.
>
> Sure. And rightfully, "Å" is not (I repeat: not)
> normalized as "A", under NFD:
>
> py> unicodedata.normalize("NFD", u"Å")
> u'A\u030a'
>
> > Maybe not in the unicode world but in real life.
>
> They are different letters also in the Unicode world.
>
> > That's why I'm a little confused.
>
> I think the confusion comes from your assumption that
> normalizing "Å" produces "A". It does not. Really not.

Yes, you are right. However, the confusion/problem shows up when normalize is used in the application to build an alphabet and to group, for example, all versions of E (E, É, È, Ë, Ê) together under E. The first character in the result of normalize is used to build alphabet labels for surnames:

    letter = normalize("NFD", surname)[0].upper()
    if letter != last_letter:
        last_letter = letter
        ....

and this is why I get "A" when the surname begins with "Å". This works for grouping all variations of E under "E", but fails for "Å": it is shown under the label "A", and that label appears not at the beginning of the alphabet but after "Z", where "ÅÄÖ" come, because the preceding sort of the surnames works correctly. (The Swedish alphabet has 29 letters: A,B,C... X,Y,Z,Å,Ä,Ö)

Can you think of any solution to this conflict?

    u'\xd8'
    u'A\u030a'
    u'\xc5'

This is obviously the result of how the Unicode spec is written, interpreting "Å" as a variation of "A", which it is not. I have asked the Unicode people, but have not got any answer yet.

The application is GRAMPS: http://gramps-project.org/

Once again, thanks for making some of the Unicode stuff clear!

Regards,
Peter Landgren
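
The labelling behaviour described in this message can be reproduced with a few lines. This is a sketch in Python 3 syntax; `first_letter_label` and the sample surnames are hypothetical and are not taken from the GRAMPS source.

```python
import unicodedata

def first_letter_label(surname):
    """Return the first code point of the NFD form, upper-cased."""
    return unicodedata.normalize("NFD", surname)[0].upper()

# Eriksson, Émond, Åberg, Østby (escapes avoid any file-encoding dependence)
surnames = ["Eriksson", "\u00c9mond", "\u00c5berg", "\u00d8stby"]
labels = [first_letter_label(s) for s in surnames]
print([ascii(label) for label in labels])
```

"É" and "Å" both lose their combining mark because only the first code point is kept, while "Ø" has no decomposition and stays a label of its own.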
msg81654 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-02-11 18:32
> Can you think of any solution to this conflict?

I don't quite understand why you want to place É, È, Ë, Ê all along with E, yet Å, Ä, Ö after Z. Because that's what the Swedish alphabet says?

Please understand that collation varies across languages. For example, in German, we also have Ä, but it does not come after Z. Instead, there are two ways to collate Ä (telephone book vs. dictionary):
1. Ä sorts exactly like A
2. Ä sorts as if it was transcribed as Ae

So there is no one true collation of Ä, but you have to take into account what language rules you want to follow. If you want to implement Swedish rules, why then do you also want to support É, È, Ë, Ê? Do you have these letters in Swedish at all?

If you want to use obscure collation rules, you might have to implement the collation algorithm yourself. For example, assign each letter a unique number (different from the Unicode ordinal), and then sort by these numbers.

Take a look at ICU, which already includes collation algorithms for many locales.
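
The "assign each letter a unique number" suggestion can be sketched as a sort key. The following assumes the 29-letter Swedish alphabet given earlier in the thread; `SWEDISH_ALPHABET`, `RANK` and `swedish_key` are names invented for this illustration, and the handling of letters outside the table is a simplification rather than real Swedish collation.

```python
import unicodedata

# A..Z followed by Å, Ä, Ö
SWEDISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c5\u00c4\u00d6"
RANK = {letter: number for number, letter in enumerate(SWEDISH_ALPHABET)}

def swedish_key(name):
    """Map each character to (rank, character) using the table above."""
    composed = unicodedata.normalize("NFC", name).upper()
    # Characters not in the table get a rank past the end of the alphabet
    # and are then ordered among themselves by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in composed]

# Öberg, Andersson, Åberg, Zetterlund, Ärlig
names = ["\u00d6berg", "Andersson", "\u00c5berg", "Zetterlund", "\u00c4rlig"]
ordered = sorted(names, key=swedish_key)
print([ascii(name) for name in ordered])
```

A table like this covers a single language only. For multi-locale software the ICU collators mentioned above are the more complete route.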
msg81656 - (view)	Author: Peter Landgren (PeterL)	Date: 2009-02-11 19:26
The È... comes from French surnames, and our French developer wants to group all versions of E together. The É... can be found in French surnames in Sweden as well as in Germany. The program, GRAMPS, is a genealogy program used in about 20 languages, so there is no preferred language.

I know. However, Swedish telephone books and dictionaries are sorted the same: A,B,C... X,Y,Z,Å,Ä,Ö.

True. I agree. GRAMPS runs in the locale of the user, but must be able to handle information coming from many other languages/countries. That's why it's hard to be universal.

We can have them in names. See above.

I think we have found a solution that can handle most cases. We treat surnames beginning with "ÅÄÖ" specially. I don't think that there are many surnames outside the Nordic countries that start with any of these three letters.

Vielen Dank!
/Peter
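
The thread does not show how GRAMPS implemented this special case. One way such a rule could look is sketched below in Python 3 syntax; `KEEP_AS_IS` and `label` are hypothetical names, and the sketch assumes a non-empty surname.

```python
import unicodedata

KEEP_AS_IS = "\u00c5\u00c4\u00d6"  # Å, Ä, Ö keep their own label

def label(surname):
    """Alphabet label for a non-empty surname."""
    # Compose first so that "A" + COMBINING RING ABOVE is recognised as "Å".
    first = unicodedata.normalize("NFC", surname)[0].upper()
    if first in KEEP_AS_IS:
        return first
    # Every other letter is reduced to its base letter, as before.
    return unicodedata.normalize("NFD", first)[0]

# åberg, Émond, Østby, Ölcüm
for name in ("\u00e5berg", "\u00c9mond", "\u00d8stby", "\u00d6lc\u00fcm"):
    print(ascii(name), ascii(label(name)))
```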
msg81661 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-02-11 19:54
> The È... comes from French surnames and our French developer wants to group all versions
> of E together. The É... can be found in French surnames in Sweden as well as in Germany.
> The program, GRAMPS is a genealogy program used in about 20 languages, so there is no
> preferred language.

I think you'll find that you have to think much harder about collation, then. If you assume that the Unicode ordinal order will give the right collation, it will be wrong many times, I predict. For example, it appears that Croatian puts Dž as a single letter between D and Đ.

> I think we have found a solution that can handle most cases.
> We treat surnames beginning with "ÅÄÖ" special. I don't think that there are many surnames
> outside the Nordic countries that starts with any of these three letters.

It seems they are also common in Turkish (Öksüz, Ölcüm, Önal, ..., taken from the Berlin phonebook), and Turkish puts Ö after O. Hungarian also uses Ö and Ü (as well as Ó, Ú, Ő, Ű), but I don't know how common they are as first letters of surnames.
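
The point about Unicode ordinal order can be seen with a plain `sorted()` call, which compares strings code point by code point. This is a short sketch in Python 3 syntax with made-up sample names.

```python
# Zorn, Önal, Åberg, Ärlig, Olsson
names = ["Zorn", "\u00d6nal", "\u00c5berg", "\u00c4rlig", "Olsson"]

# Default string comparison uses code point order, not any language's alphabet.
by_code_point = sorted(names)
initials = [f"U+{ord(name[0]):04X}" for name in by_code_point]
print([ascii(name) for name in by_code_point])
print(initials)
```

Here Ä (U+00C4) comes before Å (U+00C5), the reverse of the Swedish order Å, Ä, Ö, and Ö ends up after Z, which does not match the Turkish placement of Ö directly after O.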

History
Date	User	Action	Args
2022-04-11 14:56:45	admin	set	github: 49450
2009-02-11 19:54:24	loewis	set	messages: + msg81661
2009-02-11 19:26:20	PeterL	set	files: + unnamed messages: + msg81656
2009-02-11 18:32:32	loewis	set	messages: + msg81654
2009-02-11 08:24:05	PeterL	set	files: + unnamed messages: + msg81632
2009-02-10 21:32:17	loewis	set	messages: + msg81603
2009-02-10 20:50:09	PeterL	set	files: + unnamed messages: + msg81598
2009-02-10 20:15:00	loewis	set	messages: + msg81596
2009-02-10 20:03:36	PeterL	set	files: + unnamed messages: + msg81595
2009-02-10 18:59:22	loewis	set	status: open -> closed resolution: not a bug messages: + msg81580 nosy: + loewis
2009-02-10 10:45:56	PeterL	create