> Martin v. Löwis <martin@v.loewis.de> added the comment:
> > The same applies to "Å" and "A", "Ä" and "A" and "Ö" and "O",
> > which are also different letters, just as "Ø" and "O" are.
>
> Sure. And rightfully, "Å" is *not* (I repeat: not)
> normalized as "A", under NFD:
>
> py> unicodedata.normalize("NFD", u"Å")
> u'A\u030a'
>
> > Maybe not in the Unicode world but in real life.
>
> They are different letters also in the Unicode world.
>
> > That's why I'm a little confused.
>
> I think the confusion comes from your assumption that
> normalizing "Å" produces "A". It does not. Really not.
Yes, you are right.
However, the problem shows up when normalize is used in the application to
build an alphabet index and to group, for example, all variants of E
(E, É, È, Ë, Ê) together under E. The first character of the result of
normalize is used to build the alphabet labels for surnames:
    letter = normalize("NFD", surname)[0].upper()
    if letter != last_letter:
        last_letter = letter
        ....
This is why I get "A" when the surname begins with "Å".
This approach works for grouping all variants of E under "E",
but it fails for "Å": those surnames are shown under the label "A", and that
label appears not at the beginning of the alphabet but after "Z", where
"Å", "Ä" and "Ö" belong.
So the sorting of the surnames, which is done beforehand, works correctly.
(The Swedish alphabet has 29 letters: A, B, C, ... X, Y, Z, Å, Ä, Ö.)
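To make the behaviour concrete, here is a small self-contained sketch of the labelling step described above. The function name first_letter_label and the sample surnames are made up for illustration and are not taken from GRAMPS; the sketch is written for Python 3, where the u"" prefix is not needed.

```python
import unicodedata

def first_letter_label(surname):
    # Same idea as the snippet above: take the first code point of the
    # NFD (decomposed) form and upper-case it.
    return unicodedata.normalize("NFD", surname)[0].upper()

# Hypothetical sample surnames, for illustration only.
surnames = ["Eriksson", "Édouard", "Åberg", "Öberg", "Østergaard"]
labels = [first_letter_label(s) for s in surnames]

for surname, label in zip(surnames, labels):
    print(label, surname)
```

The accented E is grouped under "E" as wanted, but "Å" and "Ö" also lose their marks and end up as "A" and "O", while "Ø" keeps its own label because it has no decomposition.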
Can you think of any solution to this conflict?
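One workaround I can imagine, purely as a sketch, is to keep an explicit set of letters that must not be reduced to their base letter and to check it before decomposing. The names INDEPENDENT_LETTERS and alphabet_label are hypothetical, the set below assumes a Swedish alphabet, and a real application would presumably have to choose it per locale.

```python
import unicodedata

# Assumption: letters that are independent letters of the alphabet in use.
# Here the Swedish Å, Ä, Ö, written as escapes so the set always holds the
# precomposed code points. The name is hypothetical, not a GRAMPS identifier.
INDEPENDENT_LETTERS = frozenset("\u00c5\u00c4\u00d6")

def alphabet_label(surname):
    # Compose first, so that "A" + combining ring is seen as one "Å".
    composed = unicodedata.normalize("NFC", surname)
    first = composed[0].upper()
    if first in INDEPENDENT_LETTERS:
        # Keep the letter as its own label.
        return first
    # Otherwise strip combining marks by taking the base of the NFD form.
    return unicodedata.normalize("NFD", first)[0]

print(alphabet_label("\u00c5berg"))    # Åberg
print(alphabet_label("\u00c9douard"))  # Édouard
```

This keeps the grouping of É, È, Ë, Ê under "E" while leaving "Å", "Ä" and "Ö" as separate labels, at the cost of maintaining the set by hand.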
I still think "Å", "Ä" and "Ö" should behave like "Ø":
>>> unicodedata.normalize("NFD",u"Ø")
u'\xd8'
Now, as you said:
>>> unicodedata.normalize("NFD",u"Å")
u'A\u030a'
But it should be (in my opinion):
>>> unicodedata.normalize("NFD",u"Å")
u'\xc5'
This is obviously a result of how the Unicode spec is written: it
treats "Å" as a variation of "A", which it is not.
I have asked the Unicode people, but have not received an answer yet.
The application is GRAMPS: http://gramps-project.org/
Once again, thanks for making some of the Unicode stuff clear!
Regards,
Peter Landgren