Unicode case mappings are incorrect #48860

alexs · 2008-12-09T14:50:30Z

BPO	4610
Nosy	@malemburg, @loewis, @rhettinger, @abalkin, @ezio-melotti
Superseder	bpo-12736: Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2013-06-23.23:56:10.803>
created_at = <Date 2008-12-09.14:50:29.656>
labels = ['type-bug', 'expert-unicode']
title = 'Unicode case mappings are incorrect'
updated_at = <Date 2013-06-24.08:23:50.233>
user = 'https://bugs.python.org/alexs'

bugs.python.org fields:

activity = <Date 2013-06-24.08:23:50.233>
actor = 'lemburg'
assignee = 'none'
closed = True
closed_date = <Date 2013-06-23.23:56:10.803>
closer = 'belopolsky'
components = ['Unicode']
creation = <Date 2008-12-09.14:50:29.656>
creator = 'alexs'
dependencies = []
files = []
hgrepos = []
issue_num = 4610
keywords = []
message_count = 18.0
messages = ['77417', '77461', '77526', '77572', '78112', '78116', '78122', '93936', '93944', '94011', '94017', '94023', '94024', '94026', '123488', '191738', '191740', '191750']
nosy_count = 7.0
nosy_names = ['lemburg', 'loewis', 'rhettinger', 'belopolsky', 'senn', 'ezio.melotti', 'alexs']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = None
status = 'closed'
superseder = '12736'
type = 'behavior'
url = 'https://bugs.python.org/issue4610'
versions = ['Python 3.4']

alexs · 2008-12-09T14:50:27Z

Following a discussion on reddit it seems that the unicode case
conversion algorithms are not being followed.

$ python3.0
Python 3.0rc1 (r30rc1:66499, Oct 10 2008, 02:33:36) 
[GCC 4.0.1 (Apple Inc. build 5488)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x='ß'
>>> print(x, x.upper())
ß ß

This conversion is correct as defined in UnicodeData.txt; however,
http://unicode.org/Public/UNIDATA/SpecialCasing.txt defines a more
complete set of case conversions.

According to this file "ß".upper() should be "SS". Presumably Python
simply isn't using this file to create its mapping database.
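
To make the report concrete, here is a minimal sketch of how unconditional entries in the SpecialCasing.txt layout (`code; lower; title; upper; # comment`) could be read into a lookup table. The three sample lines are embedded rather than downloaded, and the names `parse_special_casing` and `full_upper` are hypothetical helpers for illustration, not part of Python.

```python
# Three unconditional entries in the SpecialCasing.txt layout:
# <code>; <lower>; <title>; <upper>; # <comment>
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
0149; 0149; 02BC 004E; 02BC 004E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
"""


def _decode(hexes):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(h, 16)) for h in hexes.split())


def parse_special_casing(text):
    """Return {char: (lower, title, upper)} for unconditional entries."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        # A non-empty fifth field is a condition list (language or context);
        # this sketch skips those entries entirely.
        if len(fields) > 4 and fields[4]:
            continue
        code, lower, title, upper = fields[:4]
        table[_decode(code)] = (_decode(lower), _decode(title), _decode(upper))
    return table


FULL = parse_special_casing(SAMPLE)


def full_upper(s):
    """Uppercase s, preferring the special entry when one exists."""
    return "".join(FULL[c][2] if c in FULL else c.upper() for c in s)


print(full_upper("stra\u00dfe"))
```

Conditional entries (the ones that depend on language or context) are deliberately ignored here; they are the hard part discussed in the replies.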

loewis · 2008-12-09T22:14:44Z

I have known this problem for years, and decided not to act; I don't
consider it an important problem. Implementing it properly is
complicated by the fact that some of the case mappings are conditional
on the locale.

If you consider it important, please submit a patch.

I'd rather see efforts put into an integration of ICU, which should
solve this problem and many others with Python's locale support.

malemburg · 2008-12-10T09:44:10Z

Python uses the Unicode database for the mapping and this only contains
1-1 mappings. The special cases (mostly 1-2 mappings) are not included.

It would be nice to have them available as well, but I guess we'd have
to write them in code rather than invent a new mapping table for them.

Furthermore, there are a few cases like e.g. the Turkish i where case
mappings depend on external context such as the language the code point
is used in - those cases are difficult to get right.

We may need to extend the .lower()/.upper()/.title() methods with an
optional parameter that allows providing this extra context information
to the methods.
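
As a rough illustration of such an optional parameter, the sketch below wraps the built-in methods in two free functions that accept a language tag. The functions `upper` and `lower` and the `lang` argument are hypothetical; only the Turkish/Azeri dotted and dotless i are handled, and the additional SpecialCasing.txt rules involving COMBINING DOT ABOVE are left out.

```python
def upper(s, lang=None):
    """Uppercase s; lang is a hypothetical language tag such as 'tr'."""
    if lang in ("tr", "az"):
        # Turkish/Azeri: dotted i uppercases to I WITH DOT ABOVE (U+0130).
        s = s.replace("i", "\u0130")
    return s.upper()


def lower(s, lang=None):
    """Lowercase s; lang is a hypothetical language tag such as 'tr'."""
    if lang in ("tr", "az"):
        # Map both capital forms before the default mapping sees them:
        # U+0130 -> i, and plain I -> dotless i (U+0131).
        s = s.replace("\u0130", "i").replace("I", "\u0131")
    return s.lower()


print(upper("istanbul", "tr"), upper("istanbul"))
```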

BTW: 'ß' is being phased out in German. The new writing rules encourage
using 'ss' or 'SS' instead (which is not entirely correct, since 'ß'
originated from 'sz' used some hundred or so years ago, but those are
just details ;-).

alexs · 2008-12-10T22:28:52Z

I agree with loewis that ICU is probably the best way to get this
functionality into Python.

lemburg, yes, it seems like extending those methods would be required at
the very least. We would probably also need to support ICU's collators,
I think.

alexs · 2008-12-20T16:19:13Z

I am trying to get a PEP together for this. Does anyone have any thoughts
on how to handle comparison between unicode strings in a locale aware
situation?

Should __lt__ and __gt__ be specified as ignoring locale? In which case do
we need to add a new method for doing locale aware comparisons?

Should locale be a property of the string, an argument passed to
upper/lower/isupper/islower/swapcase/capitalize/sort or global state
(locale module...)?

Should doing a locale aware comparison of two strings with different
locales throw an exception?

Should locales be represented as objects or just a string like "en_GB"?

loewis · 2008-12-20T18:41:27Z

> I am trying to get a PEP together for this. Does anyone have any thoughts
> on how to handle comparison between unicode strings in a locale aware
> situation?

Implementation-wise, or specification-wise? Implementation-wise, you can
either try to use the C library, or ICU. For portability, ICU is better;
for maintenance, the C library. Specification-wise: it should just
Do The Right Thing, and probably be exposed either through the locale
module, or through locale objects (in case you want to operate on
multiple different locales in a single program) - see other OO languages
on how they provide locales.

> Should __lt__ and __gt__ be specified as ignoring locale?

Yes.

> In which case do
> we need to add a new method for doing locale aware comparisons?

No. Collation is a feature of the locale, not of the strings.

> Should locale be a property of the string, an argument passed to
> upper/lower/isupper/islower/swapcase/capitalize/sort or global state
> (locale module...)?

Either global state, or the object *that gets the strings passed to it*.

> Should doing a locale aware comparison of two strings with different
> locales throw an exception?

Strings should not be tied into locales.

> Should locales be represented as objects or just a string like "en_GB"?

If you want to have multiple of them simultaneously, you need objects.
You still need to identify them by name.
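
The "global state" option already exists in the standard library's locale module. A minimal sketch, assuming only the always-available "C" locale (a name such as "de_DE.UTF-8" would work only where that locale is installed):

```python
import locale

# Select the collation rules process-wide. This is global state: every
# later call to strxfrm/strcoll in the process is affected.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["banana", "Cherry", "apple"]

# strxfrm turns a string into a key that sorts according to LC_COLLATE.
ordered = sorted(words, key=locale.strxfrm)
print(ordered)
```

Because the setting is process-wide, this approach cannot use two collations at once, which is the limitation that motivates locale objects.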

malemburg · 2008-12-20T19:52:50Z

On 2008-12-20 17:19, Alex Stapleton wrote:

> Alex Stapleton <alexs@prol.etari.at> added the comment:
>
> I am trying to get a PEP together for this. Does anyone have any thoughts
> on how to handle comparison between unicode strings in a locale aware
> situation?

Some thoughts:

* the Unicode implementation *must* stay locale independent
* we should implement the Unicode collation algorithm
  (TR#10, http://unicode.org/reports/tr10/)
* which collation to use should be a parameter of a function
  or object initializer and it should be possible to use
  multiple collations in the same application (without switching
  the locale)
* the terms "locale" and "collation" should not be mixed;
  a (default) collation is a property of a locale and there can
  also be more than one collation per locale

The Unicode collation algorithm defines collation in terms of a
key function for each collation, so that already fits nicely with
the key function parameter of list.sort().
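
A small sketch of how a collation expressed as a key function plugs into list.sort(). The key below is a toy stand-in, not the Unicode Collation Algorithm: it compares base letters caselessly and uses the original string only as a tie-breaker. The name `toy_collation_key` is hypothetical.

```python
import unicodedata


def toy_collation_key(s):
    """Toy key: base letters compared caselessly, original string as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks so that accented letters sort with their base letter.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), s)


words = ["zebra", "\u00e9clair", "apple", "Eclair"]
words.sort(key=toy_collation_key)
print(words)
```

A plain `words.sort()` would order by code point instead, putting "Eclair" first and the accented word last.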

> Should __lt__ and __gt__ be specified as ignoring locale? In which case do
> we need to add a new method for doing locale aware comparisons?

Unicode strings should not get any locale or collation specific
methods. Instead this feature should be implemented elsewhere
and the strings in question passed to this new function or
object.

> Should locale be a property of the string, an argument passed to
> upper/lower/isupper/islower/swapcase/capitalize/sort or global state
> (locale module...)?

No. See above.

> Should doing a locale aware comparison of two strings with different
> locales throw an exception?

No, assigning locales to strings is not going to work and
we should not go down that road.

It's better to have locale aware functions for certain operations,
so that you can pass your Unicode strings to these functions
instead of binding additional context information to the Unicode
strings themselves.

> Should locales be represented as objects or just a string like "en_GB"?

I think the easiest way to get the collation algorithm implemented
is by using a similar scheme as for codecs: you pass a collation
name to a central function and get back a collation object that
implements the collation in form of a key method and a compare
method.
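
The codec-like scheme described here might look roughly as follows. Everything in this sketch is hypothetical (the `Collation` class, `register_collation`, `lookup_collation`, and the two registered names); the registered keys are trivial placeholders rather than real collations.

```python
class Collation:
    """Hypothetical collation object exposing key() and compare()."""

    def __init__(self, name, key_func):
        self.name = name
        self._key_func = key_func

    def key(self, s):
        """Return a sort key, usable as sorted(..., key=collation.key)."""
        return self._key_func(s)

    def compare(self, a, b):
        """Return -1, 0 or 1, derived from the sort keys."""
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)


_REGISTRY = {}


def register_collation(name, key_func):
    _REGISTRY[name] = Collation(name, key_func)


def lookup_collation(name):
    """Central lookup by name, analogous in spirit to codecs.lookup()."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise LookupError("unknown collation: %s" % name) from None


# Placeholder collations for the demonstration only.
register_collation("codepoint", lambda s: s)
register_collation("caseless", str.casefold)

caseless = lookup_collation("caseless")
print(sorted(["b", "A", "a"], key=caseless.key))
```

Several collation objects can coexist in one program without touching the process locale, which is the property asked for in the list above.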

senn · 2009-10-13T19:57:01Z

Has there been any action on this? A PEP?

I disagree that using ICU is a good way to simply get proper
Unicode casing. (A heavy hammer for a small task...)

I agree locales are a different issue (and would prefer
optional arguments to the unicode object casing methods --
that could then be used within any future sort of locale object
to handle correct casing -- but don't rely on such.)

Most of the special casing rules can be accomplished by
a decomposition (or recursive decomposition) on the character
followed by casing the result -- so NO new table is necessary
-- only marking up the characters so implicated (there are
extra unused bits in the char type table that could be used
for this purpose -- so no additional space needed there either).

What remains are a tiny handful of cases that need to be handled
in code.
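
The decomposition idea can be sketched with the standard unicodedata module. This is an illustration of the approach, not the half-finished implementation mentioned below: the helper name `upper_via_decomposition` is hypothetical, and applying NFKD blindly would also rewrite characters that have nothing to do with casing, so a real implementation would only decompose the marked characters.

```python
import unicodedata


def upper_via_decomposition(ch):
    """Uppercase a character by decomposing it first, then casing each part."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c.upper() for c in decomposed)


# Ligature fi (U+FB01) decomposes to "f" + "i", which then case one-to-one.
print(upper_via_decomposition("\ufb01"))

# U+00DF (sharp s) has no decomposition at all, so it is one of the
# handful of cases that would need to be handled in code.
print(repr(unicodedata.decomposition("\u00df")))
```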

I have a half finished implementation of this, in case anyone
is interested.

loewis · 2009-10-13T22:19:29Z

> I have a half finished implementation of this, in case anyone
> is interested.

Feel free to upload it here. I'm fairly skeptical that it is
possible to implement casing "correctly" in a locale-independent
way.

senn · 2009-10-14T19:00:09Z

> Feel free to upload it here. I'm fairly skeptical that it is
> possible to implement casing "correctly" in a locale-independent
> way.

Ok. I will try to find time to complete it enough to be readable.

Unicode (see sec 3.13) specifies the casing of unicode strings pretty
completely -- i.e. it gives "Default Casing" rules to be used when no
locale specific "tailoring" is available. The only dependencies on
locale for the special casing rules are for Turkish, Azeri, and
Lithuanian. And you only need to know that that is the language, no
other details. So I'm sure that a complete implementation is possible
without resort to a lot of locale munging -- at least for .lower()
.upper() and .title().

.swapcase() is just ...err... dumb^h^h^h^h questionably useful.

However .capitalize() is a bit weird; and I'm not sure it isn't
incorrectly implemented now:

It UPPERCASES the first character, rather than TITLECASING, which is
probably wrong in the very few cases where it makes a difference:
e.g. (using Croatian ligatures)

>>> u'\u01c5amonjna'.title()
u'\u01c4amonjna'
>>> u'\u01c5amonjna'.capitalize()
u'\u01c5amonjna'

"Capitalization" is not precisely defined (by the Unicode standard) --
the current Python implementation doesn't even do what the docs say:
"makes the first character have upper case" (it also lower-cases all
other characters!), however I might argue that a more useful
implementation "makes the first character have titlecase..."
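
The difference between uppercasing and titlecasing the first character can be shown on the lowercase digraph U+01C6. The helper `titlecase_first` below is a hypothetical sketch of the "more useful" behaviour argued for above; it says nothing about what str.capitalize() itself does, which has varied between Python versions.

```python
def titlecase_first(s):
    """Titlecase the first character and lowercase the rest."""
    return s[:1].title() + s[1:].lower()


dz = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON

# The uppercase and titlecase mappings of this character differ:
# U+01C4 (both letters capital) versus U+01C5 (only the D capital).
print(hex(ord(dz.upper())), hex(ord(dz.title())))

print(titlecase_first("\u01c6amonjna"))
```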

senn · 2009-10-14T19:25:29Z

Yikes! I just noticed that u''.title() is really broken!

It doesn't really pay attention to word breaks --
only characters that "have case".
Therefore when there are (caseless)
combining characters in a word it's really broken e.g.

>>> u'n\u0303on\u0303e'.title()
u'N\u0303On\u0303E'

That is (where '~' is combining-tilde-over)
n~on~e -title-cases-to-> N~On~E
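
One possible workaround, sketched under the assumption that every base letter plus combining mark in the input has a precomposed form: normalize to NFC before calling .title(), so that no combining mark is left to be mistaken for a word break. The helper name `title_after_nfc` is hypothetical, and sequences without a precomposed form are not helped by this.

```python
import unicodedata


def title_after_nfc(s):
    """Compose base letters with their combining marks, then titlecase."""
    return unicodedata.normalize("NFC", s).title()


# n + COMBINING TILDE becomes the single code point U+00F1 before casing.
print(title_after_nfc("n\u0303on\u0303e"))
```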

malemburg · 2009-10-14T20:16:27Z

Jeff Senn wrote:
>
> Jeff Senn <senn@users.sourceforge.net> added the comment:
>
> Yikes! I just noticed that u''.title() is really broken!
>
> It doesn't really pay attention to word breaks --
> only characters that "have case".
> Therefore when there are (caseless)
> combining characters in a word it's really broken e.g.
>
> >>> u'n\u0303on\u0303e'.title()
> u'N\u0303On\u0303E'
>
> That is (where '~' is combining-tilde-over)
> n~on~e -title-cases-to-> N~On~E

Please have a look at http://bugs.python.org/issue6412 - that patch
addresses many casing issues, at least up to the extent that we can
actually fix them without breaking code relying on:

len(s.upper()) == len(s)

for upper/lower/title.

If we add support for 1-n code point mappings, then we can only
enable this support by using an option to the casing methods (perhaps
not a bad idea: the parameter could be used to signal the locale
to assume).
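
Code that relies on the length invariant can be audited with a small check. The sketch below assumes an interpreter where str.upper() already applies 1-n mappings (as in the 'ß'.upper() result shown at the end of this thread); `length_changing_chars` is a hypothetical helper name.

```python
def length_changing_chars(s):
    """Return the characters of s whose uppercase form is not one code point."""
    return [c for c in s if len(c.upper()) != 1]


word = "Stra\u00dfe"
print(len(word), len(word.upper()), length_changing_chars(word))
```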

malemburg · 2009-10-14T20:26:10Z

Jeff Senn wrote:
> However .capitalize() is a bit weird; and I'm not sure it isn't
> incorrectly implemented now:
>
> It UPPERCASES the first character, rather than TITLECASING, which is
> probably wrong in the very few cases where it makes a difference:
> e.g. (using Croatian ligatures)
>
> >>> u'\u01c5amonjna'.title()
> u'\u01c4amonjna'
> >>> u'\u01c5amonjna'.capitalize()
> u'\u01c5amonjna'
>
> "Capitalization" is not precisely defined (by the Unicode standard) --
> the current Python implementation doesn't even do what the docs say:
> "makes the first character have upper case" (it also lower-cases all
> other characters!), however I might argue that a more useful
> implementation "makes the first character have titlecase..."

You don't have to worry about .capitalize() and .swapcase() :-)

Those methods are defined by their implementation and don't resemble
anything defined in Unicode.

I agree that they are, well, not that useful.

rhettinger · 2009-10-14T20:40:29Z

> .swapcase() is just ...err... dumb^h^h^h^h questionably useful.

FWIW, it appears that the original use case (as an Emacs macro) was to
correct blocks of text where touch typists had accidentally left the
Caps Lock key turned on: tHE qUICK bROWN fOX jUMPED oVER tHE lAZY dOG.

I agree with the rest of you that Python would be better-off without
swapcase().

abalkin · 2010-12-06T18:42:16Z

> .swapcase() is just ...err... dumb^h^h^h^h questionably useful.

> I agree with the rest of you that Python would be better-off
> without swapcase().

As long as str.upper/lower are based only on UnicodeData.txt 1-to-1 mappings, existence of str.swapcase() indicates to the users that they should not expect many-to-1 mappings. Also it does seem to be occasionally used for testing. -0 on removing it.
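
The connection between swapcase() and 1-to-1 mappings can be checked directly: swapping case twice returns the original string only when every mapping involved is 1-to-1. The sketch assumes an interpreter with full case mappings in str.swapcase(); `swapcase_round_trips` is a hypothetical helper name.

```python
def swapcase_round_trips(s):
    """True if swapping case twice gives back the original string."""
    return s.swapcase().swapcase() == s


print(swapcase_round_trips("tHE qUICK bROWN fOX"))
print(swapcase_round_trips("\u00df"), "\u00df".swapcase())
```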

abalkin · 2013-06-23T22:52:54Z

There has been a relatively recent discussion of case mappings under bpo-12753 (msg144836).

I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields. More sophisticated case mapping algorithms belong in a specialized library module, not the Python core.

The behavior of .title() and .capitalize() is harder to defend, so if someone can point to a Python library (PyICU?) that gets it right we can reference it in the documentation.

abalkin · 2013-06-23T23:56:11Z

It looks like at least the OP issue has been fixed in bpo-12736:

>>> 'ß'.upper()
'SS'

malemburg · 2013-06-24T08:23:50Z

On 24.06.2013 00:52, Alexander Belopolsky wrote:

> Alexander Belopolsky added the comment:
>
> There has been a relatively recent discussion of case mappings under bpo-12753 (msg144836).
>
> I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields. More sophisticated case mapping algorithms belong in a specialized library module, not the Python core.
>
> The behavior of .title() and .capitalize() is harder to defend, so if someone can point to a Python library (PyICU?) that gets it right we can reference it in the documentation.

.title() and .capitalize() are 1-1 mappings as well. Python only supports
"Simple Case Operations" and does not support "Full Case Operations"
which require parsing context (SpecialCasing.txt).

ICU does provide support for both:
http://userguide.icu-project.org/transforms/casemappings

PyICU wraps ICU, but it is not clear to me how you'd access those
mappings (the package doesn't provide documentation on the API, instead
it just gives a description of how to map the C++ API to a Python one):
https://pypi.python.org/pypi/PyICU

alexs mannequin added topic-unicode type-bug labels Dec 9, 2008

abalkin closed this as completed Jun 23, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
