classification
Title: Unicode case mappings are incorrect
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.4

process
Status: closed
Resolution: out of date
Dependencies:
Superseder: Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation (issue 12736)
Assigned To:
Nosy List: alexs, belopolsky, ezio.melotti, lemburg, loewis, rhettinger, senn
Priority: normal
Keywords:

Created on 2008-12-09 14:50 by alexs, last changed 2013-06-24 08:23 by lemburg. This issue is now closed.

Messages (18)
msg77417 - (view) Author: Alex Stapleton (alexs) Date: 2008-12-09 14:50
Following a discussion on reddit it seems that the unicode case
conversion algorithms are not being followed.

$ python3.0
Python 3.0rc1 (r30rc1:66499, Oct 10 2008, 02:33:36) 
[GCC 4.0.1 (Apple Inc. build 5488)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x='ß'
>>> print(x, x.upper())
ß ß

This conversion is correct as defined in UnicodeData.txt however
http://unicode.org/Public/UNIDATA/SpecialCasing.txt defines a more
complete set of case conversions.

According to this file "ß".upper() should be "SS". Presumably Python
simply isn't using this file to create its mapping database.
msg77461 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-12-09 22:14
I have known this problem for years, and decided not to act; I don't
consider it an important problem. Implementing it properly is
complicated by the fact that some of the case mappings are conditional
on the locale.

If you consider it important, please submit a patch.

I'd rather see efforts put into an integration of ICU, which should
solve this problem and many others with Python's locale support.
msg77526 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-12-10 09:44
Python uses the Unicode database for the mapping and this only contains
1-1 mappings. The special cases (mostly 1-2 mappings) are not included.

It would be nice to have them available as well, but I guess we'd have
to write them in code rather than invent a new mapping table for them.

Furthermore, there are a few cases like e.g. the Turkish i where case
mappings depend on external context such as the language the code point
is used in - those cases are difficult to get right.

We may need to extend the .lower()/.upper()/.title() methods with an
optional parameter that allows passing this extra context information
to the methods.

BTW: 'ß' is being phased out in German. The new writing rules encourage
using 'ss' or 'SS' instead (which is not entirely correct, since 'ß'
originated from 'sz' used some hundred or so years ago, but those are
just details ;-).
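
To make the optional-parameter idea concrete, here is a minimal sketch of language-aware casing written as plain functions rather than as changes to the str methods. The names `upper_for_language` and `lower_for_language` and the `lang` parameter are hypothetical. Only the dotted/dotless i rule for Turkish ("tr") and Azeri ("az") is handled; the other conditional rules in SpecialCasing.txt (Lithuanian, or I followed by U+0307) are not.

```python
# Hypothetical helpers for illustration; not part of Python's str API.
DOTTED_I_LANGS = {"tr", "az"}


def upper_for_language(text, lang=None):
    """Uppercase text, applying the Turkish/Azeri rule i -> U+0130 on request."""
    if lang in DOTTED_I_LANGS:
        # Dotless U+0131 already uppercases to "I" under the default rules,
        # so only the dotted "i" needs special handling here.
        text = text.replace("i", "\u0130")
    return text.upper()


def lower_for_language(text, lang=None):
    """Lowercase text, applying U+0130 -> i and I -> U+0131 on request."""
    if lang in DOTTED_I_LANGS:
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(ascii(upper_for_language("istanbul", "tr")))
print(ascii(upper_for_language("istanbul")))
print(ascii(lower_for_language("ISTANBUL", "tr")))
```

Keeping the language as an explicit argument means the default, language-neutral behavior is unchanged when the argument is omitted.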
msg77572 - (view) Author: Alex Stapleton (alexs) Date: 2008-12-10 22:28
I agree with loewis that ICU is probably the best way to get this 
functionality into Python.

lemburg, yes, it seems like extending those methods would be required at
the very least. We would probably also need to support ICU's collators,
I think.
msg78112 - (view) Author: Alex Stapleton (alexs) Date: 2008-12-20 16:19
I am trying to get a PEP together for this. Does anyone have any thoughts 
on how to handle comparison between unicode strings in a locale aware 
situation?

Should __lt__ and __gt__ be specified as ignoring locale? In which case do 
we need to add a new method for doing locale aware comparisons?

Should locale be a property of the string, an argument passed to 
upper/lower/isupper/islower/swapcase/capitalize/sort or global state 
(locale module...)?

Should doing a locale aware comparison of two strings with different 
locales throw an exception?

Should locales be represented as objects or just a string like "en_GB"?
msg78116 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-12-20 18:41
> I am trying to get a PEP together for this. Does anyone have any thoughts 
> on how to handle comparison between unicode strings in a locale aware 
> situation?

Implementation-wise, or specification-wise? Implementation-wise, you can
either try to use the C library, or ICU. For portability, ICU is better;
for maintenance, the C library. Specification-wise: it should just
Do The Right Thing, and probably be exposed either through the locale
module, or through locale objects (in case you want to operate on
multiple different locales in a single program) - see other OO languages
on how they provide locales.

> Should __lt__ and __gt__ be specified as ignoring locale?

Yes.

> In which case do 
> we need to add a new method for doing locale aware comparisons?

No. Collation is a feature of the locale, not of the strings.

> Should locale be a property of the string, an argument passed to 
> upper/lower/isupper/islower/swapcase/capitalize/sort or global state 
> (locale module...)?

Either global state, or the object *that gets the strings passed to it*.

> Should doing a locale aware comparison of two strings with different 
> locales throw an exception?

Strings should not be tied into locales.

> Should locales be represented as objects or just a string like "en_GB"?

If you want to have multiple of them simultaneously, you need objects.
You still need to identify them by name.
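
For reference, the locale module already exposes collation along these lines: `locale.strxfrm` turns a string into a sort key that follows the process-wide locale, so the strings themselves carry no locale. A small sketch follows; the function name is hypothetical, and since no locale is set here the default "C" locale applies unless the program has called `locale.setlocale`.

```python
import locale


def sort_with_current_locale(words):
    """Sort words using the collation of the current (global) locale."""
    # strxfrm maps each string to a key whose ordering matches strcoll.
    return sorted(words, key=locale.strxfrm)


print(sort_with_current_locale(["pear", "apple", "fig"]))
```

Because the locale is global state, this approach cannot use two different collations at the same time, which is the case where locale objects would be needed.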
msg78122 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-12-20 19:52
On 2008-12-20 17:19, Alex Stapleton wrote:
> Alex Stapleton <alexs@prol.etari.at> added the comment:
> 
> I am trying to get a PEP together for this. Does anyone have any thoughts 
> on how to handle comparison between unicode strings in a locale aware 
> situation?

Some thoughts:

 * the Unicode implementation *must* stay locale independent

 * we should implement the Unicode collation algorithm
   (TR#10, http://unicode.org/reports/tr10/)

 * which collation to use should be a parameter of a function
   or object initializer and it should be possible to use
   multiple collations in the same application (without switching
   the locale)

 * the terms "locale" and "collation" should not be mixed;
   a (default) collation is a property of a locale and there can
   also be more than one collation per locale

The Unicode collation algorithm defines collation in terms of a
key function for each collation, so that already fits nicely with
the key function parameter of list.sort().

> Should __lt__ and __gt__ be specified as ignoring locale? In which case do 
> we need to add a new method for doing locale aware comparisons?

Unicode strings should not get any locale or collation specific
methods. Instead this feature should be implemented elsewhere
and the strings in question passed to this new function or
object.

> Should locale be a property of the string, an argument passed to 
> upper/lower/isupper/islower/swapcase/capitalize/sort or global state 
> (locale module...)?

No. See above.

> Should doing a locale aware comparison of two strings with different 
> locales throw an exception?

No, assigning locales to strings is not going to work and
we should not go down that road.

It's better to have locale aware functions for certain operations,
so that you can pass your Unicode strings to these function
instead of binding additional context information to the Unicode
strings themselves.

> Should locales be represented as objects or just a string like "en_GB"?

I think the easiest way to get the collation algorithm implemented
is by using a similar scheme as for codecs: you pass a collation
name to a central function and get back a collation object that
implements the collation in form of a key method and a compare
method.
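
A codec-style lookup of this kind could be sketched as follows. Everything here is hypothetical (`Collation`, `register_collation`, `lookup_collation`, and the two collation names), and the registered collations are toys for illustration only: neither implements the Unicode collation algorithm.

```python
import unicodedata


class Collation:
    """A named collation offering a key method and a compare method."""

    def __init__(self, name, key_func):
        self.name = name
        self._key_func = key_func

    def key(self, text):
        """Return a sort key, suitable for list.sort(key=...)."""
        return self._key_func(text)

    def compare(self, a, b):
        """Return -1, 0 or 1, in the style of an old cmp function."""
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)


_COLLATIONS = {}


def register_collation(name, key_func):
    _COLLATIONS[name] = Collation(name, key_func)


def lookup_collation(name):
    """Central lookup by name, analogous to codecs.lookup()."""
    return _COLLATIONS[name]


# Toy collations: raw code point order, and a caseless order that falls
# back to the original string to break ties deterministically.
register_collation("codepoint", lambda text: text)
register_collation(
    "caseless",
    lambda text: (unicodedata.normalize("NFKD", text).casefold(), text),
)

words = ["b", "A", "a", "B"]
by_codepoint = sorted(words, key=lookup_collation("codepoint").key)
by_caseless = sorted(words, key=lookup_collation("caseless").key)
print(by_codepoint, by_caseless)
```

Several collations can be used side by side in one program this way, and the strings themselves never carry any locale or collation state.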
msg93936 - (view) Author: Jeff Senn (senn) (Python committer) Date: 2009-10-13 19:57
Has there been any action on this? a PEP?

I disagree that using ICU is a good way to simply get proper
Unicode casing. (A heavy hammer for a small task...)

I agree locales are a different issue (and would prefer
optional arguments to the unicode object casing methods -- 
that could then be used within any future sort of locale object 
to handle correct casing -- but don't rely on such.)

Most of the special casing rules can be accomplished by 
a decomposition (or recursive decomposition) on the character
followed by casing the result -- so NO new table is necessary
-- only marking up the characters so implicated (there are
extra unused bits in the char type table that could be used 
for this purpose -- so no additional space needed there either).  

What remains are a tiny handful of cases that need to be handled
in code.

I have a half finished implementation of this, in case anyone
is interested.
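
The decomposition idea can be tried out with the unicodedata module. This is only a sketch of the principle, not the half-finished implementation mentioned above: the helper name is hypothetical, and it is meant for individual characters that have been flagged as special, since NFKD applied to whole strings also rewrites unrelated characters. After decomposition, each remaining code point needs only a 1-to-1 mapping.

```python
import unicodedata


def upper_via_decomposition(ch):
    """Decompose a character (NFKD), then uppercase the resulting pieces."""
    return unicodedata.normalize("NFKD", ch).upper()


# U+FB01 (fi ligature), U+0149 (n preceded by apostrophe), U+01F0 (j with caron)
for ch in ("\ufb01", "\u0149", "\u01f0"):
    print(ascii(ch), "->", ascii(upper_via_decomposition(ch)))

# U+00DF (sharp s) has no decomposition, so it belongs to the handful of
# cases that would still have to be handled in code.
print(repr(unicodedata.decomposition("\u00df")))
```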
msg93944 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-10-13 22:19
> I have a half finished implementation of this, in case anyone
> is interested.

Feel free to upload it here. I'm fairly skeptical that it is
possible to implement casing "correctly" in a locale-independent
way.
msg94011 - (view) Author: Jeff Senn (senn) (Python committer) Date: 2009-10-14 19:00
> Feel free to upload it here. I'm fairly skeptical that it is
> possible to implement casing "correctly" in a locale-independent
> way.

Ok. I will try to find time to complete it enough to be readable.

Unicode (see sec 3.13) specifies the casing of unicode strings pretty 
completely -- i.e. it gives "Default Casing" rules to be used when no 
locale specific "tailoring" is available.  The only dependencies on 
locale for the special casing rules are for Turkish, Azeri, and 
Lithuanian.  And you only need to know that that is the language, no 
other details.  So I'm sure that a complete implementation is possible 
without resort to a lot of locale munging -- at least for .lower() 
.upper() and .title().

.swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

However .capitalize() is a bit weird; and I'm not sure it isn't 
incorrectly implemented now:

It UPPERCASES the first character, rather than TITLECASING, which is 
probably wrong in the very few cases where it makes a difference:
e.g. (using Croatian ligatures)

>>> u'\u01c5amonjna'.title()
u'\u01c4amonjna'
>>> u'\u01c5amonjna'.capitalize()
u'\u01c5amonjna'

"Capitalization" is not precisely defined (by the Unicode standard) -- 
the current Python implementation doesn't even do what the docs say: 
"makes the first character have upper case" (it also lower-cases all 
other characters!), however I might argue that a more useful 
implementation "makes the first character have titlecase..."
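
The difference between uppercasing and titlecasing a first character shows up with the digraph U+01C6 (small dž), whose uppercase form is U+01C4 and whose titlecase form is U+01C5. The helper below is a hypothetical sketch of the "first character gets titlecase" variant suggested above; it does not describe what str.capitalize() does in any particular Python version.

```python
DZ_SMALL = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON


def capitalize_titlecase(text):
    """Titlecase the first character and lowercase the rest."""
    return text[:1].title() + text[1:].lower()


print(ascii(DZ_SMALL.upper()))   # uppercase form of the digraph
print(ascii(DZ_SMALL.title()))   # titlecase form of the digraph
print(ascii(capitalize_titlecase("\u01c6amonjna")))
```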
msg94017 - (view) Author: Jeff Senn (senn) (Python committer) Date: 2009-10-14 19:25
Yikes! I just noticed that u''.title() is really broken! 

It doesn't really pay attention to word breaks -- 
only characters that "have case".  
Therefore when there are (caseless)
combining characters in a word it's really broken e.g.

>>> u'n\u0303on\u0303e'.title()
u'N\u0303On\u0303E'

That is (where '~' is combining-tilde-over)
n~on~e -title-cases-to-> N~On~E
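
One way around this, sketched under simplifying assumptions, is to treat combining marks as part of the word they follow. The function name is hypothetical, word boundaries are approximated with str.isalpha() rather than the Unicode word-break rules, and marks with combining class 0 (such as enclosing marks) are not recognized.

```python
import unicodedata


def title_keeping_marks(text):
    """Titlecase words, letting combining marks stay inside the word."""
    out = []
    in_word = False
    for ch in text:
        if unicodedata.combining(ch):
            # A combining mark neither starts nor ends a word.
            out.append(ch)
        elif ch.isalpha():
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


print(ascii(title_keeping_marks("n\u0303on\u0303e")))
```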
msg94023 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-10-14 20:16
Jeff Senn wrote:
> 
> Jeff Senn <senn@users.sourceforge.net> added the comment:
> 
> Yikes! I just noticed that u''.title() is really broken! 
> 
> It doesn't really pay attention to word breaks -- 
> only characters that "have case".  
> Therefore when there are (caseless)
> combining characters in a word it's really broken e.g.
> 
> >>> u'n\u0303on\u0303e'.title()
> u'N\u0303On\u0303E'
> 
> That is (where '~' is combining-tilde-over)
> n~on~e -title-cases-to-> N~On~E

Please have a look at http://bugs.python.org/issue6412 - that patch
addresses many casing issues, at least to the extent that we can
actually fix them without breaking code relying on:

len(s.upper()) == len(s)

for upper/lower/title.

If we add support for 1-n code point mappings, then we can only
enable this support by using an option to the casing methods (perhaps
not a bad idea: the parameter could be used to signal the locale
to assume).
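
Whether a given interpreter preserves that length invariant can be checked directly. The sketch below assumes Python 3.3 or later, where str.upper() applies the full (1-n) mappings; the helper name is hypothetical.

```python
def length_changing_upper(chars):
    """Return the characters whose uppercase form is not one code point."""
    return [ch for ch in chars if len(ch.upper()) != 1]


# "a" and "z" map 1-to-1; U+00DF (sharp s) and U+FB01 (fi ligature) do not.
sample = "a\u00df\ufb01z"
print([ascii(ch) for ch in length_changing_upper(sample)])
```

Code that indexes into s.upper() using positions computed on s is the kind of code that depends on this invariant.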
msg94024 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-10-14 20:26
Jeff Senn wrote:
> However .capitalize() is a bit weird; and I'm not sure it isn't 
> incorrectly implemented now:
> 
> It UPPERCASES the first character, rather than TITLECASING, which is 
> probably wrong in the very few cases where it makes a difference:
> e.g. (using Croatian ligatures)
> 
> >>> u'\u01c5amonjna'.title()
> u'\u01c4amonjna'
> >>> u'\u01c5amonjna'.capitalize()
> u'\u01c5amonjna'
> 
> "Capitalization" is not precisely defined (by the Unicode standard) -- 
> the currently python implementation doesn't even do what the docs say: 
> "makes the first character have upper case" (it also lower-cases all 
> other characters!), however I might argue that a more useful 
> implementation "makes the first character have titlecase..."

You don't have to worry about .capitalize() and .swapcase() :-)

Those methods are defined by their implementation and don't resemble
anything defined in Unicode.

I agree that they are, well, not that useful.
msg94026 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-10-14 20:40
> .swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

FWIW, it appears that the original use case (as an Emacs macro) was to
correct blocks of text where touch typists had accidentally left the
Caps Lock key turned on:  tHE qUICK bROWN fOX jUMPED oVER tHE lAZY dOG.

I agree with the rest of you that Python would be better-off without
swapcase().
msg123488 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-06 18:42
>> .swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

> I agree with the rest of you that Python would be better-off
> without swapcase().

As long as str.upper/lower are based only on UnicodeData.txt 1-to-1 mappings, existence of str.swapcase() indicates to the users that they should not expect many-to-1 mappings.  Also it does seem to be occasionally used for testing.  -0 on removing it.
msg191738 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 22:52
There has been a relatively recent discussion of case mappings under #12753 (msg144836).

I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields.  More sophisticated case mapping algorithms belong in a specialized library module, not in the Python core.

The behavior of .title() and .capitalize() is harder to defend, so if someone can point to a Python library (PyICU?) that gets it right, we can reference it in the documentation.
msg191740 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 23:56
It looks like at least the OP issue has been fixed in #12736:

>>> 'ß'.upper()
'SS'
msg191750 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-24 08:23
On 24.06.2013 00:52, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> There has been a relatively recent discussion of case mappings under #12753 (msg144836).
> 
> I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields.  More sophisticated case mapping algorithms belong to a specialized library module not python core.
> 
> The behavior of .title() and .capitalize() is harder to defend, so if someone can point out to a python library (PyICU?) that gets it right we can reference it in the documentation.

.title() and .capitalize() are 1-1 mappings as well. Python only supports
"Simple Case Operations" and does not support "Full Case Operations"
which require parsing context (SpecialCasing.txt).

ICU does provide support for both:
http://userguide.icu-project.org/transforms/casemappings

PyICU wraps ICU, but it is not clear to me how you'd access those
mappings (the package doesn't provide documentation on the API, instead
it just gives a description of how to map the C++ API to a Python one):
https://pypi.python.org/pypi/PyICU
History
Date User Action Args
2013-06-24 08:23:50 lemburg set messages: + msg191750
2013-06-23 23:56:10 belopolsky set status: open -> closed; superseder: Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation; resolution: out of date; messages: + msg191740
2013-06-23 22:52:54 belopolsky set messages: + msg191738; versions: + Python 3.4, - Python 2.6, Python 3.0
2013-06-23 22:32:03 belopolsky link issue12753 superseder
2010-12-06 18:42:16 belopolsky set nosy: + belopolsky; messages: + msg123488
2009-10-14 20:40:29 rhettinger set nosy: + rhettinger; messages: + msg94026
2009-10-14 20:26:09 lemburg set messages: + msg94024
2009-10-14 20:16:27 lemburg set messages: + msg94023
2009-10-14 19:25:28 senn set messages: + msg94017
2009-10-14 19:00:09 senn set messages: + msg94011
2009-10-13 22:19:29 loewis set messages: + msg93944
2009-10-13 19:57:02 senn set nosy: + senn; messages: + msg93936
2008-12-20 19:52:50 lemburg set messages: + msg78122
2008-12-20 18:41:27 loewis set messages: + msg78116
2008-12-20 16:24:30 ezio.melotti set nosy: + ezio.melotti
2008-12-20 16:19:13 alexs set messages: + msg78112
2008-12-10 22:28:53 alexs set messages: + msg77572
2008-12-10 09:44:10 lemburg set nosy: + lemburg; messages: + msg77526
2008-12-09 22:14:44 loewis set nosy: + loewis; messages: + msg77461
2008-12-09 14:50:29 alexs create