Message 81596 - Python tracker

Author	loewis
Recipients	PeterL, loewis
Date	2009-02-10.20:15:00
SpamBayes Score	9.698746e-08
Marked as misclassified	No
Message-id	<4991E042.8050103@v.loewis.de>
In-reply-to	<200902102104.52469.peter.talken@telia.com>

Content
> Should not the Danish letter "Ø" be normalized as "O"? I get "Ø" for all NFC/NFD/NFKC/NFKD 
> normalizations?

I think you have a fundamental misunderstanding of what a "decomposition"
is. "Ø" should *not* be decomposed as "O", because clearly, "Ø" and "O"
are different letters. If anything, it would be decomposed as
"O" plus some combining mark.

Now, in the specific case of

00D8;LATIN CAPITAL LETTER O WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER O SLASH;;;00F8;

no canonical decomposition is specified. Compare this to

00D5;LATIN CAPITAL LETTER O WITH TILDE;Lu;0;L;004F 0303;;;;N;LATIN CAPITAL LETTER O TILDE;;;00F5;

which decomposes to U+004F followed by U+0303, i.e.
LATIN CAPITAL LETTER O followed by COMBINING TILDE.
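
The difference between the two entries can be checked from Python itself. The following is a minimal sketch, assuming Python 3 and the standard unicodedata module; the variable names are only illustrative. It reads the decomposition field of each character and then applies all four normalization forms to "Ø".

```python
import unicodedata

o_stroke = "\u00d8"  # LATIN CAPITAL LETTER O WITH STROKE
o_tilde = "\u00d5"   # LATIN CAPITAL LETTER O WITH TILDE

# decomposition() returns the decomposition mapping field of the
# character's database entry, or an empty string if none is defined.
stroke_mapping = unicodedata.decomposition(o_stroke)
tilde_mapping = unicodedata.decomposition(o_tilde)

# NFD applies canonical decompositions; a character without one
# is left untouched.
stroke_nfd = unicodedata.normalize("NFD", o_stroke)
tilde_nfd = unicodedata.normalize("NFD", o_tilde)

# Apply every normalization form to "Ø" and collect the results.
forms = ("NFC", "NFD", "NFKC", "NFKD")
stroke_results = {form: unicodedata.normalize(form, o_stroke) for form in forms}

print(repr(stroke_mapping))                         # ''
print(repr(tilde_mapping))                          # '004F 0303'
print([f"U+{ord(c):04X}" for c in tilde_nfd])       # ['U+004F', 'U+0303']
print(all(result == o_stroke for result in stroke_results.values()))  # True
```

Since the decomposition field for U+00D8 is empty, none of the four forms has anything to expand, which is why "Ø" comes back unchanged.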

If "Ø" were to be decomposed, it should use a mark COMBINING STROKE,
but no such combining mark exists in Unicode. I don't know why that
is; you would have to ask the Unicode Consortium. In any case, Unicode
guarantees stability with respect to decompositions, so even if some
combining mark gets added later on, the existing decompositions remain
stable.

History
Date	User	Action	Args
2009-02-10 20:15:01	loewis	set	recipients: + loewis, PeterL
2009-02-10 20:15:00	loewis	link	issue5200 messages
2009-02-10 20:15:00	loewis	create