Message 80029 - Python tracker


Author	loewis
Recipients	loewis, mrabarnett
Date	2009-01-17.17:51:51
Message-id	<1232214793.88.0.130017892579.issue4971@psf.upfronthosting.co.za>

Content

I do think this is a bug in the Unicode database. The current approach
(of falling back to uppercase if there is no title case in the Unicode
database) goes back to r17708. However, even the prior version explicitly
contained only the cases where a titlecase was specified and differed
from the uppercase.

I think part of the motivation is this note from

http://www.unicode.org/Public/UNIDATA/UCD.html

Note: The simple titlecase may be omitted in the data file if the
titlecase is the same as the uppercase.

(Note that for uppercase, it says instead "The simple uppercase is
omitted in the data file if the uppercase is the same as the code point
itself", and likewise for lowercase.)

Considering this note, the simple titlecase of U+01C5 *is* U+01C4: the
titlecase value is omitted, hence it is the same as uppercase, hence it
is U+01C4.
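
To make that reading concrete, here is a minimal sketch of the fallback
rule as the note describes it. It is not the code from r17708. The
function name simple_case_mappings and the sample record are made up for
illustration: the record is not copied from UnicodeData.txt, it only has
the usual 15 semicolon-separated fields with the titlecase field left
empty, which is the situation described above.

```python
def simple_case_mappings(record: str) -> tuple[int, int, int]:
    """Return (upper, lower, title) for one UnicodeData.txt-style record.

    Fields 12, 13 and 14 hold the simple uppercase, lowercase and
    titlecase mappings. Omitted fields are filled in as the UCD.html
    note describes.
    """
    fields = record.split(";")
    code = int(fields[0], 16)
    # Omitted uppercase/lowercase: same as the code point itself.
    upper = int(fields[12], 16) if fields[12] else code
    lower = int(fields[13], 16) if fields[13] else code
    # Omitted titlecase: same as the uppercase (not the code point).
    title = int(fields[14], 16) if fields[14] else upper
    return upper, lower, title


# Hypothetical record: name shortened, titlecase field deliberately empty.
record = "01C5;EXAMPLE NAME;Lt;0;L;;;;;N;;;01C4;01C6;"
upper, lower, title = simple_case_mappings(record)
print(f"U+{title:04X}")
```

Under these assumptions the titlecase comes out as the uppercase value
rather than the code point itself, which is the behaviour in question.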

Most likely, the algorithm used to produce the database differs from
the documented algorithm, in which case the bug is in UCD.html. However,
if UCD.html is correct, the bug is likely in UnicodeData.txt.
