Message 137181 - Python tracker

Author	lemburg
Recipients	belopolsky, ezio.melotti, lemburg, py.user
Date	2011-05-29.11:56:24
Message-id	<4DE23466.5050503@egenix.com>
In-reply-to	<1306656314.88.0.590844584798.issue12204@psf.upfronthosting.co.za>

Content

Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> '\u1ff3'.upper() returns '\u1ffc', so we have:
>   U+1FF3 (ῳ - GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI)
>   U+1FFC (ῼ - GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI)
> The first belongs to the Ll (Letter, lowercase) category, whereas the second belongs to the Lt (Letter, titlecase) category.
> 
> The entries for these two chars in the UnicodeData.txt[0] file are:
> 1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 0345;;;;N;;;1FFC;;1FFC
> 1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 0345;;;;N;;;;1FF3;
> 
> U+1FF3 has U+1FFC in both the third-to-last and last fields (Simple_Uppercase_Mapping and Simple_Titlecase_Mapping respectively -- see [1]), so .upper() is doing the right thing here.
> U+1FFC has U+1FF3 in the second-to-last field (Simple_Lowercase_Mapping), but since its category is not Lu, but Lt, .isupper() returns False.
> 
> The Unicode Standard Annex #44[2] defines the Lt category as:
>   Lt  Titlecase_Letter  a digraphic character, with first part uppercase
> 
> I'm not sure there's anything to fix here: both functions behave as documented, and it might indeed be the case that .upper() returns chars with category Lt, which then return False with .isupper().
> 
> [0]: http://unicode.org/Public/UNIDATA/UnicodeData.txt
> [1]: http://www.unicode.org/reports/tr44/#UnicodeData.txt
> [2]: http://www.unicode.org/reports/tr44/#GC_Values_Table
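
The field positions discussed in the quoted comment can be checked
mechanically. This is a minimal sketch, assuming the two UnicodeData.txt
records quoted above and the field order given in UAX #44 (fields 12, 13
and 14 hold the simple uppercase, lowercase and titlecase mappings). The
helper name simple_case_mappings is hypothetical.

```python
# The two records are copied verbatim from the quoted comment.
RECORDS = [
    "1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 0345;;;;N;;;1FFC;;1FFC",
    "1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 0345;;;;N;;;;1FF3;",
]


def simple_case_mappings(record):
    """Hypothetical helper: extract the category and simple case mappings."""
    fields = record.split(";")
    return {
        "code": fields[0],
        "category": fields[2],   # General_Category
        "upper": fields[12],     # Simple_Uppercase_Mapping
        "lower": fields[13],     # Simple_Lowercase_Mapping
        "title": fields[14],     # Simple_Titlecase_Mapping
    }


for record in RECORDS:
    print(simple_case_mappings(record))
```

An empty string in a mapping field means the record defines no simple
mapping of that kind for the character.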

I think there's a misunderstanding here: title cased characters
are ones typically used in titles of a document. They don't
necessarily have to be upper case, though, since some characters
are never used as first letters of a word.

Note that .upper() is also not guaranteed to return an upper
case character. It just applies the mapping defined in the
Unicode standard; if there is no such mapping, or Python
does not support the mapping, the method returns the
original character.
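
This can be checked with the standard unicodedata module. The following is
a minimal sketch, assuming a Python 3 interpreter. U+0138 (LATIN SMALL
LETTER KRA) is chosen here only as an illustration of a lowercase letter
without an uppercase mapping, and the helper name describe is hypothetical.

```python
import unicodedata


def describe(ch):
    """Hypothetical helper: report the category of ch and of ch.upper()."""
    upper = ch.upper()
    return {
        "char": "U+%04X" % ord(ch),
        "category": unicodedata.category(ch),
        "upper": ["U+%04X" % ord(c) for c in upper],
        "upper_categories": [unicodedata.category(c) for c in upper],
        "upper_isupper": upper.isupper(),
    }


# No uppercase mapping: .upper() hands back the original character.
print(describe("\u0138"))

# The character from this issue; the result depends on the Python version.
print(describe("\u1ff3"))

# U+1FFC is a titlecase letter (Lt), so .isupper() is False for it.
print(unicodedata.category("\u1ffc"), "\u1ffc".isupper(), "\u1ffc".istitle())
```

The output for U+1FF3 is version-dependent: interpreters that apply only
the simple mappings return U+1FFC, as in the quoted comment, while Python
3.3 and later apply the full case mappings and may return more than one
character.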

The German ß is such a character (U+00DF). It doesn't have
a single-character uppercase mapping in actual use; Unicode
5.1 added a capital form (U+1E9E) on rather controversial
grounds (see http://en.wikipedia.org/wiki/ẞ), but did not
make it the uppercase mapping of U+00DF.

The character is normally mapped to 'SS' when converting it
to upper case or title case. This multi-character mapping
is not supported by Python, so .upper() just returns U+00DF.
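
A minimal sketch of the ß case, assuming Python 3. The result of .upper()
is version-dependent: interpreters that apply only the simple one-to-one
mappings return U+00DF unchanged, as described above, whereas Python 3.3
and later apply the full mappings and return 'SS'. The lowercase mapping
of U+1E9E is included for comparison.

```python
import unicodedata

sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

print(unicodedata.name(sharp_s), unicodedata.category(sharp_s))
print(unicodedata.name(capital_sharp_s), unicodedata.category(capital_sharp_s))

# Either U+00DF (simple mappings only) or 'SS' (full mappings),
# depending on the interpreter version.
print(ascii(sharp_s.upper()))

# The capital form lowercases to U+00DF ...
print(capital_sharp_s.lower() == sharp_s)

# ... but U+00DF does not uppercase to the capital form.
print(sharp_s.upper() == capital_sharp_s)
```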

I suggest closing this ticket as invalid or adding a note
to the documentation explaining how the mapping is applied
(and when it is not).

History
Date	User	Action	Args
2011-05-29 11:56:25	lemburg	set	recipients: + lemburg, belopolsky, ezio.melotti, py.user
2011-05-29 11:56:24	lemburg	link	issue12204 messages
2011-05-29 11:56:24	lemburg	create