Message 94011 - Python tracker

Author	senn
Recipients	alexs, ezio.melotti, lemburg, loewis, senn
Date	2009-10-14.19:00:09
Message-id	<1255546815.92.0.62328909151.issue4610@psf.upfronthosting.co.za>

Content
> Feel free to upload it here. I'm fairly skeptical that it is
> possible to implement casing "correctly" in a locale-independent
> way.

Ok. I will try to find time to complete it enough to be readable.

Unicode (see sec 3.13) specifies the casing of Unicode strings pretty 
completely -- i.e. it gives "Default Casing" rules to be used when no 
locale-specific "tailoring" is available.  The only dependencies on 
locale in the special casing rules are for Turkish, Azeri, and 
Lithuanian.  And you only need to know which language it is, no 
other details.  So I'm sure that a complete implementation is possible 
without resorting to a lot of locale munging -- at least for .lower(), 
.upper() and .title().
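
To illustrate default casing versus tailoring, here is a minimal sketch. It assumes Python 3, whereas this thread is about Python 2. The helper name upper_for_language and its lang parameter are hypothetical, and only the Turkish/Azeri dotted-i rule is handled; the Lithuanian rules are left out.

```python
def upper_for_language(s, lang=None):
    """Upper-case s, applying the Turkish/Azeri dotted-i tailoring if asked.

    Hypothetical helper for illustration; `lang` is a bare language code.
    """
    if lang in ("tr", "az"):
        # Tailoring: small dotted i maps to capital I with dot above (U+0130).
        s = s.replace("i", "\u0130")
    # Everything else follows the default (untailored) mappings built into str.
    return s.upper()


print(upper_for_language("istanbul"))        # default casing
print(upper_for_language("istanbul", "tr"))  # Turkish tailoring
print("stra\u00dfe".upper())                 # default casing can change length
```

The only input besides the string is the language code, which is the point made above: no other locale details are needed.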

.swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

However .capitalize() is a bit weird; and I'm not sure it isn't 
incorrectly implemented now:

It UPPERCASES the first character, rather than TITLECASING, which is 
probably wrong in the very few cases where it makes a difference:
e.g. (using Croatian ligatures)

>>> u'\u01c5amonjna'.title()
u'\u01c5amonjna'
>>> u'\u01c5amonjna'.capitalize()
u'\u01c4amonjna'

"Capitalization" is not precisely defined (by the Unicode standard) -- 
the current Python implementation doesn't even do what the docs say: 
"makes the first character have upper case" (it also lower-cases all 
other characters!).  I might argue, however, that a more useful 
implementation "makes the first character have titlecase..."
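
A rough sketch of what I mean by a titlecasing capitalize, again assuming Python 3. The function name capitalize_titlecase is made up for illustration and is not a proposed API. It keeps the existing behaviour of lower-casing the rest of the string.

```python
def capitalize_titlecase(s):
    """Titlecase the first character and lower-case the remainder.

    Hypothetical helper; relies only on str.title() and str.lower().
    """
    # s[:1] is the empty string for empty input, so no special case is needed.
    return s[:1].title() + s[1:].lower()


# U+01C6 is the lowercase digraph; its titlecase form is U+01C5, not U+01C4.
print(ascii(capitalize_titlecase("\u01c6amonjna")))
print(ascii(capitalize_titlecase("\u01c4AMONJNA")))
```

For ordinary letters the titlecase and uppercase forms are the same, so this only differs from upper-casing for the handful of digraph characters.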

History
Date	User	Action	Args
2009-10-14 19:00:16	senn	set	recipients: + senn, lemburg, loewis, ezio.melotti, alexs
2009-10-14 19:00:15	senn	set	messageid: <1255546815.92.0.62328909151.issue4610@psf.upfronthosting.co.za>
2009-10-14 19:00:09	senn	link	issue4610 messages
2009-10-14 19:00:09	senn	create