Issue4971
Created on 2009-01-17 16:13 by mrabarnett, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (5) | |||
---|---|---|---|
msg80020 - (view) | Author: Matthew Barnett (mrabarnett) * | Date: 2009-01-17 16:13 | |
I've found that the following 4 Unicode characters/codepoints don't behave as I'd expect: Dž (U+01C5), Lj (U+01C8), Nj (U+01CB), Dz (U+01F2).

For example, u'\u01C5'.istitle() returns True and unicodedata.category(u'\u01C5') returns 'Lt', but u'\u01C5'.title() returns u'\u01C4' (DŽ), which is the uppercase equivalent.

I believe that these 4 codepoints are the only ones where the titlecase differs from uppercase.

I thought it might be a mistake in the Unicode database. However John Machin says:

Doesn't look like it. AFAICT it's a mistake in Objects/unicodectype.c, function _PyUnicode_ToTitlecase. See http://svn.python.org/view/python/trunk/Objects/unicodectype.c?rev=66362&view=markup

The code that says: if (ctype->title) delta = ctype->title; else delta = ctype->upper; should IMHO merely be: delta = ctype->title;

A value of zero for ctype->title should be interpreted simply as the offset to add to the ordinal, as it is in the sibling _PyUnicode_To(Upper|Lower)case functions. See also Tools/unicode/makeunicodedata.py which treats upper, lower and title identically when preparing the tables used by those 3 functions.

AFAICT making that change will fix the problem for those four characters and not ruin any others.

The error that you noticed occurs as far back as I've looked (2.1) and also occurs in 3.0. |
|||
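For anyone wanting to check the four code points on their own interpreter, here is a minimal sketch. It assumes a Python 3 build that already contains the fix discussed in this issue; on the affected versions the title column would show the uppercase code point instead. The helper name case_forms is made up for illustration.

```python
import unicodedata

# The four digraph code points named in the report.
DIGRAPHS = ["\u01C5", "\u01C8", "\u01CB", "\u01F2"]


def case_forms(ch):
    """Return (category, upper, title, lower) for a single character."""
    return (unicodedata.category(ch), ch.upper(), ch.title(), ch.lower())


for ch in DIGRAPHS:
    category, upper, title, lower = case_forms(ch)
    # Print code points rather than glyphs so the result is unambiguous.
    print(
        f"U+{ord(ch):04X} {category} "
        f"upper=U+{ord(upper):04X} title=U+{ord(title):04X} lower=U+{ord(lower):04X}"
    )
```

On a fixed interpreter each of the four characters should be its own titlecase while still having a distinct uppercase form.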
msg80029 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2009-01-17 17:51 | |
I do think this is a bug in the Unicode database.

The current approach (of falling back to uppercase if there is no title case in the Unicode database) goes back to r17708. However, even the prior version only contained explicitly the cases where a titlecase was specified and different from the uppercase.

I think part of the motivation is this note from http://www.unicode.org/Public/UNIDATA/UCD.html

Note: The simple titlecase may be omitted in the data file if the titlecase is the same as the uppercase.

(notice that for uppercase, it says instead "The simple uppercase is omitted in the data file if the uppercase is the same as the code point itself", likewise for lowercase)

Considering this note, the simple titlecase of U+01C5 *is* U+01C4: the titlecase value is omitted, hence it is the same as uppercase, hence it is U+01C4.

Most likely, the algorithm to produce the database was different from the documented algorithm, and it is a bug in UCD.html. However, if UCD.html is correct, it is likely a bug in UnicodeData.txt. |
|||
msg80061 - (view) | Author: John Machin (sjmachin) | Date: 2009-01-17 23:46 | |
Martin:"""Considering this note, the simple titlecase of U+01C5 *is* U+01C4: the titlecase value is omitted, hence it is the same as uppercase, hence it is U+01C4."""

Perhaps we are looking at different files; in the Unicode 5.1 UnicodeData.txt that I downloaded (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), the title field for U+01C5 is *NOT* omitted, it is set to 01C5. AFAICT the intention is that the four characters in question are their own titlecase, which is not altogether unexpected given their visual representation.

Here's the record for U+01C5:

01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5

The note (which I hadn't noticed and explains the mention of ctype->upper in the _PyUnicode_ToTitlecase function) says that the titlecase value may be omitted if it is the same as the uppercase. FWIW there are *no* examples in the current (5.1) file where the title field is empty and the upper field is not empty.

ISTM the problem is that implementing the default-to-uppercase was not done in Tools/unicode/makeunicodedata.py where full information is available. This left no way in _PyUnicode_ToTitlecase of resolving the ambiguity of a zero value for ctype->title -- is it "no titlecase supplied so use uppercase" or is it "titlecase supplied, delta == 0, means ch.title() -> ch"? |
|||
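To make the ambiguity described in the message above concrete, the following sketch splits the quoted U+01C5 record and computes per-field deltas. The field positions (12 = upper, 13 = lower, 14 = title) follow the UnicodeData.txt layout. The helper names case_fields and naive_delta are hypothetical, and naive_delta only approximates a delta encoding that cannot tell an omitted field from a self-mapping; it is not the actual makeunicodedata.py code.

```python
# The U+01C5 record exactly as quoted in the message above.
RECORD = (
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;"
    "<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;"
    "01C4;01C6;01C5"
)


def case_fields(record):
    """Return (code, upper, lower, title); an empty field becomes None."""
    fields = record.split(";")

    def parse(value):
        return int(value, 16) if value else None

    code = int(fields[0], 16)
    return code, parse(fields[12]), parse(fields[13]), parse(fields[14])


def naive_delta(code, mapping):
    """Hypothetical encoding: an omitted mapping and a self-mapping both give 0."""
    return 0 if mapping is None else mapping - code


code, upper, lower, title = case_fields(RECORD)
print("upper delta:", naive_delta(code, upper))
print("lower delta:", naive_delta(code, lower))
print("title delta:", naive_delta(code, title))
# A record with an empty title field would yield the same title delta.
print("omitted title delta:", naive_delta(code, None))
```

Once only the deltas are stored, a title delta of zero no longer says whether the field was present, which is the information the run-time function would need.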
msg80063 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2009-01-18 00:09 | |
> Perhaps we are looking at different files;

Indeed, I was looking at the 3.2.0 database (assuming that it would be the same in subsequent versions).

> ISTM the problem is that implementing the default-to-uppercase was not
> done in Tools/unicode/makeunicodedata.py where full information is
> available. This left no way in _PyUnicode_ToTitlecase of resolving the
> ambiguity of a zero value for ctype->title -- is it "no titlecase
> supplied so use uppercase" or is it "titlecase supplied, delta == 0,
> means ch.title() -> ch"?

Correct. So it seems this needs to be fixed in makeunicodedata.py already. This was not the case with earlier versions of Unicode (which never had a mapping to the same code point).

The logic for using deltas is also incorrect, so makeunicodedata.py needs to be fixed anyway. |
|||
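A rough sketch of the direction agreed on above: apply the UCD defaulting rules while the raw fields are still available, so the run-time lookup never has to guess. All three function names are hypothetical, and this is only an illustration of the idea, not the change that went into makeunicodedata.py or unicodectype.c.

```python
def resolve_title(code, upper_field, title_field):
    """Resolve the simple titlecase at table-generation time.

    An omitted uppercase defaults to the code point itself; an omitted
    titlecase defaults to the (resolved) uppercase, per the UCD note.
    """
    upper = int(upper_field, 16) if upper_field else code
    return int(title_field, 16) if title_field else upper


def to_titlecase_old(code, upper_delta, title_delta):
    """Run-time fallback as described in the thread: zero means 'use upper'."""
    delta = title_delta if title_delta else upper_delta
    return code + delta


def to_titlecase_fixed(code, title_delta):
    """With the default already resolved, zero simply means 'unchanged'."""
    return code + title_delta


code = 0x01C5
title = resolve_title(code, "01C4", "01C5")
print(f"old:   U+{to_titlecase_old(code, 0x01C4 - code, title - code):04X}")
print(f"fixed: U+{to_titlecase_fixed(code, title - code):04X}")
```

Resolving the default once in the generator keeps the run-time path a plain offset addition, the same shape as the upper and lower lookups.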
msg86582 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2009-04-26 01:05 | |
In r71894, makeunicodedata.py was fixed to correctly encode titlecase in the unicodectype database (see issue5828) In r71947, r71948, r71949, r71950, this issue is fixed by not having titlecase fall back to uppercase at run-time anymore. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:44 | admin | set | github: 49221 |
2009-04-26 01:05:02 | loewis | set | status: open -> closed resolution: fixed messages: + msg86582 |
2009-04-19 11:05:08 | loewis | link | issue5791 superseder |
2009-01-18 00:09:13 | loewis | set | messages: + msg80063 |
2009-01-17 23:46:23 | sjmachin | set | nosy: + sjmachin messages: + msg80061 |
2009-01-17 17:51:52 | loewis | set | nosy: + loewis messages: + msg80029 |
2009-01-17 16:55:41 | loewis | set | versions: - Python 2.5, Python 2.4 |
2009-01-17 16:13:42 | mrabarnett | create |