
classification
Title: unicode.normalize gives wrong result for some characters
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 2.5
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: PeterL, loewis
Priority: normal
Keywords:

Created on 2009-02-10 10:45 by PeterL, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unnamed PeterL, 2009-02-10 20:03
unnamed PeterL, 2009-02-10 20:50
unnamed PeterL, 2009-02-11 08:24
unnamed PeterL, 2009-02-11 19:26
Messages (10)
msg81536 - Author: Peter Landgren (PeterL) Date: 2009-02-10 10:45
If any of the Swedish characters "åäöÅÄÖ" are passed to
unicodedata.normalize(form, ustr) with form = "NFD" or "NFKD", the result
will be "aaoAAO". "åäöÅÄÖ" are normal characters and should be unchanged
after normalization. They are related to "aaoAAO" only for historical
reasons, not in the modern languages. It's a common
misinterpretation that the dots and circle above them are diacritic
signs, but those letters should behave like the (Danish)
"Ø", which is normalized correctly.

From Wikipedia:
Å is often perceived as an A with a ring, interpreting the ring as a
diacritical mark. However, in the languages that use it, the ring is not
considered a diacritic but part of the letter.
The letter Ö in the Swedish and Icelandic alphabets historically arises
from the Germanic umlaut, but it is considered a separate letter from O.
See http://en.wikipedia.org/wiki/%C3%85

I think this is probably impossible to solve, as it will be mixed up with
"umlaut" and you don't know what language the specific word is connected to.
msg81580 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-10 18:59
It is not true that normalize produces "aaoAAO". Instead, it produces

u'a\u030aa\u0308o\u0308A\u030aA\u0308O\u0308'

This is the correct result, according to the Unicode specification. It
would be incorrect to normalize them unchanged under the Unicode Normal
Form D (for decomposed); the decomposed character for 'LATIN SMALL
LETTER A WITH RING ABOVE' (for example) is 'LATIN SMALL LETTER A' +
'COMBINING RING ABOVE'.

The wikipedia article is irrelevant; refer to the Unicode specification
for a normative reference.

Closing as invalid.
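
The point above can be checked directly. The following is a minimal sketch, assuming Python 3 (where every str is Unicode) rather than the Python 2.5 the report was filed against; the variable names are only illustrative.

```python
import unicodedata

# "åäöÅÄÖ" written with escapes so the input is certainly precomposed.
text = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"
decomposed = unicodedata.normalize("NFD", text)

# Each letter becomes a base letter followed by a combining mark, so the
# string doubles in length instead of collapsing to "aaoAAO".
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Many terminals and fonts render the combining marks poorly, which can
# make the decomposed string look like plain "aaoAAO" when printed.
looks_like_ascii = decomposed == "aaoAAO"
```

Printing repr(decomposed) or the code point names, as done here, avoids being misled by how the combining marks are displayed.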
msg81595 - Author: Peter Landgren (PeterL) Date: 2009-02-10 20:03
Thanks for the fast response.

I understand that Python follows the Unicode specification. I think the Unicode standard
is not correct in this case for the Swedish letters. I have asked unicode.org for an
explanation.

Should not the Danish letter "Ø" be normalized as "O"? I get "Ø" for all NFC/NFD/NFKC/NFKD
normalizations.

Regards,
Peter Landgren
msg81596 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-10 20:15
> Should not the Danish letter "Ø" be normalized as "O"? I get "Ø" for all NFC/NFD/NFKC/NFKD
> normalizations.

I think you have a fundamental misunderstanding of what a "decomposition"
is. "Ø" should *not* be decomposed as "O", because clearly, "Ø" and "O"
are different letters. If anything, it would be decomposed as
"O" + SOME COMBINING MARK

Now, in the specific case of

00D8;LATIN CAPITAL LETTER O WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL
LETTER O SLASH;;;00F8;

no canonical decomposition is specified. Compare this to

00D5;LATIN CAPITAL LETTER O WITH TILDE;Lu;0;L;004F 0303;;;;N;LATIN
CAPITAL LETTER O TILDE;;;00F5;

which decomposes to U+004F followed by U+0303, i.e.
LATIN CAPITAL LETTER O followed by COMBINING TILDE.

If "Ø" were to be decomposed, it should use a mark COMBINING STROKE,
but no such combining mark exists in Unicode. I don't know why that
is; you would have to ask the Unicode Consortium. In any case, Unicode
guarantees stability with respect to decompositions, so even if some
combining mark gets added later on, the existing decompositions remain stable.
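
The two UnicodeData.txt records quoted above can also be inspected from Python. This is a small sketch, assuming Python 3; unicodedata.decomposition() returns the decomposition mapping field as a string of hex code points, or an empty string when the character has none.

```python
import unicodedata

o_stroke = "\u00d8"  # Ø, LATIN CAPITAL LETTER O WITH STROKE
o_tilde = "\u00d5"   # Õ, LATIN CAPITAL LETTER O WITH TILDE

# Empty mapping for Ø, "004F 0303" for Õ, matching the records above.
stroke_mapping = unicodedata.decomposition(o_stroke)
tilde_mapping = unicodedata.decomposition(o_tilde)
print(repr(stroke_mapping), repr(tilde_mapping))

# With no mapping to apply, NFD leaves Ø as a single code point,
# while Õ is split into a base letter and a combining mark.
stroke_nfd = unicodedata.normalize("NFD", o_stroke)
tilde_nfd = unicodedata.normalize("NFD", o_tilde)
```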
msg81598 - Author: Peter Landgren (PeterL) Date: 2009-02-10 20:50
The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
which are also different letters, just as "Ø" and "O" are. ("Ø" is the Danish version of
"Ö".)
Maybe not in the Unicode world, but in real life.

That's why I'm a little confused.
I will wait and see what, if anything, the Unicode people say.
In any case, thanks for the discussion.

Regards,
/Peter
msg81603 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-10 21:32
> The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
> which are also different letters, just as "Ø" and "O" are.

Sure. And rightfully, "Å" is *not* (I repeat: not)
normalized as "A" under NFD:

py> unicodedata.normalize("NFD", u"Å")
u'A\u030a'

> Maybe not in the Unicode world, but in real life.

They are different letters also in the Unicode world.

> That's why I'm a little confused.

I think the confusion comes from your assumption that
normalizing "Å" produces "A". It does not. Really not.
msg81632 - Author: Peter Landgren (PeterL) Date: 2009-02-11 08:24
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> > The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
> > which are also different letters, just as "Ø" and "O" are.
>
> Sure. And rightfully, "Å" is *not* (I repeat: not)
> normalized as "A" under NFD:
>
> py> unicodedata.normalize("NFD", u"Å")
> u'A\u030a'
>
> > Maybe not in the Unicode world, but in real life.
>
> They are different letters also in the Unicode world.
>
> > That's why I'm a little confused.
>
> I think the confusion comes from your assumption that
> normalizing "Å" produces "A". It does not. Really not.

Yes, you are right.

However, the confusion/problem shows up when it is used in the application to
build an alphabet and group, for example, all versions of E (E, É, È, Ë, Ê)
together under E. The first character of the result of normalize is
used to build alphabet labels for surnames:

letter = normalize("NFD", surname)[0].upper()
if letter != last_letter:
    last_letter = letter
....
and this is why I get "A" when the surname begins with "Å".

This works for grouping all variations of E under "E",
but fails for "Å": it is shown under a label "A", and not the "A" at the
beginning of the alphabet but one after "Z", where "ÅÄÖ" come,
because the preceding sort of the surnames works correctly.
(The Swedish alphabet has 29 letters: A,B,C... X,Y,Z,Å,Ä,Ö)
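
The effect described here can be reproduced in isolation. The sketch below assumes Python 3; first_letter_label is a hypothetical helper that wraps the labelling expression from the snippet above, and the surnames are made-up examples.

```python
import unicodedata

def first_letter_label(surname):
    """First code point of the NFD form, upper-cased (as in the snippet above)."""
    return unicodedata.normalize("NFD", surname)[0].upper()

# Émond, Ek, Åberg, Andersson (escapes keep the input precomposed).
surnames = ["\u00c9mond", "Ek", "\u00c5berg", "Andersson"]
labels = {name: first_letter_label(name) for name in surnames}

for name, label in labels.items():
    print(label, name)
```

Taking index [0] of the decomposed string keeps only the base letter and silently drops the combining mark, which is what the grouping of É under "E" relies on and also what puts Å under "A".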

Can you think of any solution to this conflict? 

u'\xd8'

u'A\u030a'

u'\xc5'

This is obviously the result of how the Unicode spec is written,
interpreting "Å" as a variation of "A", which it is not.

I have asked the unicode people, but not got any answer yet.

The application is GRAMPS: http://gramps-project.org/

Once again, thanks for making some of the Unicode stuff clear!
Regards,
Peter Landgren
msg81654 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-11 18:32
> Can you think of any solution to this conflict? 

I don't quite understand why you want to place É, È, Ë, Ê all along
with E, yet Å,Ä,Ö after Z. Because that's what the Swedish alphabet
says?

Please understand that collation varies across languages. For example
in German, we also have Ä, but it does *not* come after Z. Instead,
there are two ways to collate Ä (telephone book vs. dictionary):
1. Ä sorts exactly like A
2. Ä sorts as if it was transcribed as Ae

So there is no one true collation of Ä, but you have to take into
account what language rules you want to follow.

If you want to implement Swedish rules, why then do you also want
to support É, È, Ë, Ê? Do you have these letters in Swedish at all?

If you want to use obscure collation rules, you might have to
implement the collation algorithm yourself. For example, assign
each letter a unique number (different from the Unicode ordinal),
and then sort by these numbers.
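
A rough sketch of that suggestion, assuming Python 3 and Swedish ordering. SWEDISH_ALPHABET, base_letter and collation_key are illustrative names, and folding other accented letters onto their base letter is an assumption made here, not something the thread prescribes.

```python
import unicodedata

# Assumed alphabet order for illustration: Swedish, with Å, Ä, Ö after Z.
SWEDISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c5\u00c4\u00d6"
RANK = {letter: i for i, letter in enumerate(SWEDISH_ALPHABET)}

def base_letter(ch):
    """Return ch upper-cased if it is in the alphabet, else its NFD base letter."""
    upper = ch.upper()
    if upper in RANK:
        return upper
    return unicodedata.normalize("NFD", upper)[0]

def collation_key(word):
    """Map each character to its rank; characters outside the alphabet sort last."""
    composed = unicodedata.normalize("NFC", word)
    return [RANK.get(base_letter(ch), len(RANK)) for ch in composed]

# Öberg, Andersson, Zetterlund, Åberg, Ek, Äng, Émond
names = ["\u00d6berg", "Andersson", "Zetterlund", "\u00c5berg", "Ek",
         "\u00c4ng", "\u00c9mond"]
ordered = sorted(names, key=collation_key)
print(ordered)
```

This handles only one alphabet and ignores tie-breaking between accented and unaccented forms, so it is far from a full collation algorithm.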

Take a look at ICU, which already includes collation algorithms
for many locales.
msg81656 - Author: Peter Landgren (PeterL) Date: 2009-02-11 19:26
The È... comes from French surnames, and our French developer wants to group all versions
of E together. The É... can be found in French surnames in Sweden as well as in Germany.
The program, GRAMPS, is a genealogy program used in about 20 languages, so there is no
preferred language.

I know. However, Swedish telephone books and dictionaries are sorted the same:
A,B,C... X,Y,Z,Å,Ä,Ö.

True. I agree. 
GRAMPS runs in the locale of the user, but must be able to handle information coming from 
many other languages/countries. That's why it's hard to be universal.

We can have them in names. See above.

I think we have found a solution that can handle most cases.
We treat surnames beginning with "ÅÄÖ" specially. I don't think that there are many surnames
outside the Nordic countries that start with any of these three letters.
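
One way such a special case might look is sketched below. This is not the actual GRAMPS code; it assumes Python 3, and KEEP_AS_LETTERS and index_label are hypothetical names.

```python
import unicodedata

# Letters that keep their own index label (assumption: exactly Å, Ä, Ö).
KEEP_AS_LETTERS = "\u00c5\u00c4\u00d6"

def index_label(surname):
    """Label for a non-empty surname: Å/Ä/Ö stay as they are, other
    accented initials fall back to their NFD base letter."""
    # Compose first so that "A" + COMBINING RING ABOVE is seen as Å.
    first = unicodedata.normalize("NFC", surname)[0].upper()
    if first in KEEP_AS_LETTERS:
        return first
    return unicodedata.normalize("NFD", first)[0]

print(index_label("\u00c5berg"), index_label("\u00c9mond"), index_label("Ek"))
```

An empty surname would raise IndexError here, so a real application would need to guard against that.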

Many thanks!

/Peter
msg81661 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-11 19:54
> The È... comes from French surnames and our French developer wants to group all versions 
> of E together. The É... can be found in French surnames in Sweden as well as in Germany.
> The program, GRAMPS is a genealogy program used in about 20 languages, so there is no 
> preferred language.

I think you'll find that you have to think much harder about collation,
then. If you assume that the Unicode ordinal order will give the right
collation, it will be wrong many times, I predict.

For example, it appears that Croatian puts Dž as a single letter between
D and Đ.
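
The Swedish letters from earlier in this thread already show the problem with ordinal order. A minimal sketch, assuming Python 3 and made-up surnames:

```python
# Åberg, Äng, Öberg, Zetterlund
names = ["\u00c5berg", "\u00c4ng", "\u00d6berg", "Zetterlund"]

# Plain sorted() compares code points: Ä is U+00C4 and Å is U+00C5,
# so Ä sorts before Å, whereas the Swedish alphabet ends ..., Å, Ä, Ö.
by_code_point = sorted(names)

for name in by_code_point:
    print(f"U+{ord(name[0]):04X} {name}")
```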

> I think we have found a solution that can handle most cases.
> We treat surnames beginning with "ÅÄÖ" special. I don't think that there are many surnames 
> outside the Nordic countries that starts with any of these three letters.

It seems they are also common in Turkish (Öksüz, Ölcüm, Önal, ..., taken
from the Berlin phonebook), and Turkish puts Ö after O. Hungarian also
uses Ö and Ü (as well as Ó, Ú, Ő, Ű), but I don't know how common they
are as first letters of surnames.
History
Date                 User    Action  Args
2022-04-11 14:56:45  admin   set     github: 49450
2009-02-11 19:54:24  loewis  set     messages: + msg81661
2009-02-11 19:26:20  PeterL  set     files: + unnamed; messages: + msg81656
2009-02-11 18:32:32  loewis  set     messages: + msg81654
2009-02-11 08:24:05  PeterL  set     files: + unnamed; messages: + msg81632
2009-02-10 21:32:17  loewis  set     messages: + msg81603
2009-02-10 20:50:09  PeterL  set     files: + unnamed; messages: + msg81598
2009-02-10 20:15:00  loewis  set     messages: + msg81596
2009-02-10 20:03:36  PeterL  set     files: + unnamed; messages: + msg81595
2009-02-10 18:59:22  loewis  set     status: open -> closed; resolution: not a bug; messages: + msg81580; nosy: + loewis
2009-02-10 10:45:56  PeterL  create