Message 80020 - Python tracker


Author	mrabarnett
Recipients	mrabarnett
Date	2009-01-17.16:13:40
Message-id	<1232208823.98.0.877639749298.issue4971@psf.upfronthosting.co.za>
In-reply-to

Content

I've found that the following 4 Unicode characters/codepoints don't
behave as I'd expect: ǅ (U+01C5), ǈ (U+01C8), ǋ (U+01CB), ǲ (U+01F2).

For example, u'\u01C5'.istitle() returns True and
unicodedata.category(u'\u01C5') returns 'Lt', but u'\u01C5'.title()
returns u'\u01C4' (Ǆ), which is the uppercase equivalent.
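
The observations above can be checked with a short script. This is only a sketch, written in Python 3 syntax (so the u'' prefix is dropped); the helper name describe is made up for illustration. What title() returns depends on the interpreter version, since the report concerns 2.x and 3.0, so the script prints that value rather than assuming it.

```python
import unicodedata

# U+01C5: LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
ch = "\u01C5"

def describe(c):
    """Collect the case-related properties discussed in the report.

    'describe' is a hypothetical helper, not part of any library.
    """
    return {
        "category": unicodedata.category(c),
        "istitle": c.istitle(),
        "upper": c.upper(),
        "lower": c.lower(),
        "title": c.title(),  # the value in question; version dependent
    }

info = describe(ch)
for key, value in info.items():
    # ascii() keeps the output readable on terminals without these glyphs
    print(key, ascii(value))
```

If title() prints '\u01c4' the interpreter shows the behaviour reported here; if it prints '\u01c5' the character maps to itself, as its 'Lt' category suggests it should.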

I believe that these 4 codepoints are the only ones where the titlecase
differs from uppercase.

I thought it might be a mistake in the Unicode database. However John
Machin says:

Doesn't look like it. AFAICT it's a mistake in Objects/unicodectype.c,
function _PyUnicode_ToTitlecase.

See
http://svn.python.org/view/python/trunk/Objects/unicodectype.c?rev=66362&view=markup

The code that says:
    if (ctype->title)
        delta = ctype->title;
    else
        delta = ctype->upper;
should IMHO merely be:
    delta = ctype->title;

A value of zero for ctype->title should be interpreted simply as the
offset to add to the ordinal, as it is in the sibling
_PyUnicode_To(Upper|Lower)case functions. See also
Tools/unicode/makeunicodedata.py which treats upper, lower and title
identically when preparing the tables used by those 3 functions.

AFAICT making that change will fix the problem for those four
characters and not ruin any others.
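
To make the difference between the two versions of the code concrete, here is a toy model in Python rather than C. It is a sketch under assumptions, not the real implementation: TypeRecord, RECORDS and both function names are invented for illustration, and the deltas for U+01C5 are simply derived from the mappings mentioned above (uppercase U+01C4, titlecase the character itself) plus its lowercase U+01C6.

```python
from collections import namedtuple

# Hypothetical stand-in for the C type record: each field is an offset
# (delta) added to the code point to obtain the mapped code point.
TypeRecord = namedtuple("TypeRecord", ["upper", "lower", "title"])

# Assumed deltas for U+01C5: upper is U+01C4 (-1), lower is U+01C6 (+1),
# title is the character itself (0).
RECORDS = {0x01C5: TypeRecord(upper=-1, lower=+1, title=0)}

def to_title_reported(cp):
    """Model of the code as quoted: a zero title delta falls back to upper."""
    rec = RECORDS[cp]
    delta = rec.title if rec.title else rec.upper
    return cp + delta

def to_title_proposed(cp):
    """Model of the suggested change: always use the title delta."""
    return cp + RECORDS[cp].title

print(hex(to_title_reported(0x01C5)))  # 0x1c4, the uppercase form
print(hex(to_title_proposed(0x01C5)))  # 0x1c5, the titlecase form
```

In this model the fallback treats a title delta of zero as "no titlecase mapping", which is what sends a character that is already titlecase to its uppercase form.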

The error that you noticed occurs as far back as I've looked (2.1) and
also occurs in 3.0.

History
Date	User	Action	Args
2009-01-17 16:13:44	mrabarnett	set	recipients: + mrabarnett
2009-01-17 16:13:43	mrabarnett	set	messageid: <1232208823.98.0.877639749298.issue4971@psf.upfronthosting.co.za>
2009-01-17 16:13:42	mrabarnett	link	issue4971 messages
2009-01-17 16:13:40	mrabarnett	create