Unicode case mappings are incorrect #48860

Closed
alexs mannequin opened this issue Dec 9, 2008 · 18 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@alexs
Mannequin

alexs mannequin commented Dec 9, 2008

BPO 4610
Nosy @malemburg, @loewis, @rhettinger, @abalkin, @ezio-melotti
Superseder
  • bpo-12736: Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-06-23.23:56:10.803>
    created_at = <Date 2008-12-09.14:50:29.656>
    labels = ['type-bug', 'expert-unicode']
    title = 'Unicode case mappings are incorrect'
    updated_at = <Date 2013-06-24.08:23:50.233>
    user = 'https://bugs.python.org/alexs'

    bugs.python.org fields:

    activity = <Date 2013-06-24.08:23:50.233>
    actor = 'lemburg'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-06-23.23:56:10.803>
    closer = 'belopolsky'
    components = ['Unicode']
    creation = <Date 2008-12-09.14:50:29.656>
    creator = 'alexs'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 4610
    keywords = []
    message_count = 18.0
    messages = ['77417', '77461', '77526', '77572', '78112', '78116', '78122', '93936', '93944', '94011', '94017', '94023', '94024', '94026', '123488', '191738', '191740', '191750']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'loewis', 'rhettinger', 'belopolsky', 'senn', 'ezio.melotti', 'alexs']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = None
    status = 'closed'
    superseder = '12736'
    type = 'behavior'
    url = 'https://bugs.python.org/issue4610'
    versions = ['Python 3.4']

    @alexs
    Mannequin Author

    alexs mannequin commented Dec 9, 2008

    Following a discussion on reddit it seems that the unicode case
    conversion algorithms are not being followed.

    $ python3.0
    Python 3.0rc1 (r30rc1:66499, Oct 10 2008, 02:33:36) 
    [GCC 4.0.1 (Apple Inc. build 5488)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> x='ß'
    >>> print(x, x.upper())
    ß ß

    This conversion is correct as defined in UnicodeData.txt however
    http://unicode.org/Public/UNIDATA/SpecialCasing.txt defines a more
    complete set of case conversions.

    According to this file "ß".upper() should be "SS". Presumably Python
    simply isn't using this file to create its mapping database.
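A minimal sketch of the difference between the 1-to-1 mappings in UnicodeData.txt and the extra entries in SpecialCasing.txt. The `SPECIAL_UPPER` table and the `full_upper` helper are hypothetical names used only for illustration; the table hand-copies three unconditional entries and is not how CPython stores or applies its case data.

```python
# Hand-copied subset of the unconditional uppercase entries in
# SpecialCasing.txt (illustrative only, not a complete table).
SPECIAL_UPPER = {
    "\u00df": "SS",       # LATIN SMALL LETTER SHARP S
    "\ufb01": "FI",       # LATIN SMALL LIGATURE FI
    "\u0149": "\u02bcN",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}


def full_upper(text):
    """Uppercase text, preferring a special 1-to-n mapping when one exists."""
    return "".join(SPECIAL_UPPER.get(ch, ch.upper()) for ch in text)


print(full_upper("stra\u00dfe"))
```

Because the special mapping can expand one code point into several, the result may be longer than the input.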

    @alexs alexs mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Dec 9, 2008
    @loewis
    Mannequin

    loewis mannequin commented Dec 9, 2008

    I have known this problem for years, and decided not to act; I don't
    consider it an important problem. Implementing it properly is
    complicated by the fact that some of the case mappings are conditional
    on the locale.

    If you consider it important, please submit a patch.

    I'd rather see efforts put into an integration of ICU, which should
    solve this problem and many others with Python's locale support.

    @malemburg
    Member

    Python uses the Unicode database for the mapping and this only contains
    1-1 mappings. The special cases (mostly 1-2 mappings) are not included.

    It would be nice to have them available as well, but I guess we'd have
    to write them in code rather than invent a new mapping table for them.

    Furthermore, there are a few cases like e.g. the Turkish i where case
    mappings depend on external context such as the language the code point
    is used in - those cases are difficult to get right.

    We may need to extend the .lower()/.upper()/.title() methods with an
    optional parameter that allows providing this extra context information
    to the methods.
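A rough sketch of what such an optional parameter could look like, written as free functions rather than as changes to `str`. The names `upper`, `lower`, `lang`, and `TURKIC` are assumptions for illustration; only the dotted/dotless i pairs are handled, and the conditional rules for combining dot above in SpecialCasing.txt are deliberately left out.

```python
# Language codes whose casing pairs i/İ and ı/I (assumption: only these two).
TURKIC = {"tr", "az"}


def upper(text, lang=None):
    """Uppercase text; in Turkic languages 'i' maps to U+0130 (dotted capital I)."""
    if lang in TURKIC:
        text = text.replace("i", "\u0130")
    return text.upper()


def lower(text, lang=None):
    """Lowercase text; in Turkic languages 'I' maps to U+0131 (dotless small i)."""
    if lang in TURKIC:
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(upper("istanbul"), upper("istanbul", lang="tr"))
```

Passing the language as an argument keeps the strings themselves free of locale state, at the cost of the caller having to know which language the text is in.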

    BTW: 'ß' is being phased out in German. The new writing rules encourage
    using 'ss' or 'SS' instead (which is not entirely correct, since 'ß'
    originated from 'sz' used some hundred or so years ago, but those are
    just details ;-).

    @alexs
    Mannequin Author

    alexs mannequin commented Dec 10, 2008

    I agree with loewis that ICU is probably the best way to get this
    functionality into Python.

    lemburg, yes, it seems like extending those methods would be required at
    the very least. I think we would probably also need to support ICU's
    collators.

    @alexs
    Mannequin Author

    alexs mannequin commented Dec 20, 2008

    I am trying to get a PEP together for this. Does anyone have any thoughts
    on how to handle comparison between unicode strings in a locale aware
    situation?

    Should __lt__ and __gt__ be specified as ignoring locale? In which case do
    we need to add a new method for doing locale aware comparisons?

    Should locale be a property of the string, an argument passed to
    upper/lower/isupper/islower/swapcase/capitalize/sort or global state
    (locale module...)?

    Should doing a locale aware comparison of two strings with different
    locales throw an exception?

    Should locales be represented as objects or just a string like "en_GB"?

    @loewis
    Mannequin

    loewis mannequin commented Dec 20, 2008

    > I am trying to get a PEP together for this. Does anyone have any thoughts
    > on how to handle comparison between unicode strings in a locale aware
    > situation?

    Implementation-wise, or specification-wise? Implementation-wise, you can
    either try to use the C library, or ICU. For portability, ICU is better;
    for maintenance, the C library. Specification-wise: it should just
    Do The Right Thing, and probably be exposed either through the locale
    module, or through locale objects (in case you want to operate on
    multiple different locales in a single program) - see other OO languages
    on how they provide locales.

    > Should __lt__ and __gt__ be specified as ignoring locale?

    Yes.

    > In which case do
    > we need to add a new method for doing locale aware comparisons?

    No. Collation is a feature of the locale, not of the strings.

    > Should locale be a property of the string, an argument passed to
    > upper/lower/isupper/islower/swapcase/capitalize/sort or global state
    > (locale module...)?

    Either global state, or the object *that gets the strings passed to it*.

    > Should doing a locale aware comparison of two strings with different
    > locales throw an exception?

    Strings should not be tied into locales.

    > Should locales be represented as objects or just a string like "en_GB"?

    If you want to have multiple of them simultaneously, you need objects.
    You still need to identify them by name.

    @malemburg
    Member

    On 2008-12-20 17:19, Alex Stapleton wrote:

    > Alex Stapleton <alexs@prol.etari.at> added the comment:
    >
    > I am trying to get a PEP together for this. Does anyone have any thoughts
    > on how to handle comparison between unicode strings in a locale aware
    > situation?

    Some thoughts:

    • the Unicode implementation *must* stay locale independent

    • we should implement the Unicode collation algorithm
      (TR#10, http://unicode.org/reports/tr10/)

    • which collation to use should be a parameter of a function
      or object initializer and it should be possible to use
      multiple collations in the same application (without switching
      the locale)

    • the terms "locale" and "collation" should not be mixed;
      a (default) collation is a property of a locale and there can
      also be more than one collation per locale

    The Unicode collation algorithm defines collation in terms of a
    key function for each collation, so that already fits nicely with
    the key function parameter of list.sort().

    > Should __lt__ and __gt__ be specified as ignoring locale? In which case do
    > we need to add a new method for doing locale aware comparisons?

    Unicode strings should not get any locale or collation specific
    methods. Instead this feature should be implemented elsewhere
    and the strings in question passed to this new function or
    object.

    > Should locale be a property of the string, an argument passed to
    > upper/lower/isupper/islower/swapcase/capitalize/sort or global state
    > (locale module...)?

    No. See above.

    > Should doing a locale aware comparison of two strings with different
    > locales throw an exception?

    No, assigning locales to strings is not going to work and
    we should not go down that road.

    It's better to have locale aware functions for certain operations,
    so that you can pass your Unicode strings to these functions
    instead of binding additional context information to the Unicode
    strings themselves.

    > Should locales be represented as objects or just a string like "en_GB"?

    I think the easiest way to get the collation algorithm implemented
    is by using a similar scheme as for codecs: you pass a collation
    name to a central function and get back a collation object that
    implements the collation in form of a key method and a compare
    method.
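The codec-style lookup described above might be sketched as follows. Everything here is hypothetical: `lookup_collation`, the `Collation` class, and the two registered collation names are invented for illustration, and the "toy-accent-insensitive" key is a crude stand-in, not the Unicode Collation Algorithm from TR#10.

```python
import unicodedata


class Collation:
    """A named collation exposing a key method and a compare method."""

    def __init__(self, name, key):
        self.name = name
        self.key = key  # callable: str -> sortable key

    def compare(self, a, b):
        """Return -1, 0 or 1, like a classic cmp() function."""
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)


def _strip_marks_key(s):
    # Toy key: drop combining marks and case, then break ties on the original.
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), s)


# Hypothetical registry, analogous in spirit to codecs.lookup().
_REGISTRY = {
    "codepoint": Collation("codepoint", lambda s: s),
    "toy-accent-insensitive": Collation("toy-accent-insensitive", _strip_marks_key),
}


def lookup_collation(name):
    """Return the Collation registered under name (KeyError if unknown)."""
    return _REGISTRY[name]


words = ["zebra", "\u00e9clair", "apple", "Eagle"]
coll = lookup_collation("toy-accent-insensitive")
print(sorted(words, key=coll.key))
```

The key method plugs directly into `sorted()` and `list.sort()`, and several collations can be used side by side without touching the process locale.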

    @senn
    Mannequin

    senn mannequin commented Oct 13, 2009

    Has there been any action on this? a PEP?

    I disagree that using ICU is a good way to simply get proper
    unicode casing. (A heavy hammer for a small task...)

    I agree locales are a different issue (and would prefer
    optional arguments to the unicode object casing methods --
    that could then be used within any future sort of locale object
    to handle correct casing -- but don't rely on such.)

    Most of the special casing rules can be accomplished by
    a decomposition (or recursive decomposition) on the character
    followed by casing the result -- so NO new table is necessary
    -- only marking up the characters so implicated (there are
    extra unused bits in the char type table that could be used
    for this purpose -- so no additional space needed there either).

    What remains are a tiny handful of cases that need to be handled
    in code.

    I have a half finished implementation of this, in case anyone
    is interested.
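The decomposition idea in this comment can be illustrated with the standard `unicodedata` module. This is not the half-finished implementation mentioned above, which is not shown in the thread; `NEEDS_DECOMPOSITION`, `HANDLED_IN_CODE`, and `upper_text` are hypothetical names, and the sets are small hand-picked examples standing in for the "marked up" characters and the "tiny handful" of exceptions.

```python
import unicodedata

# Characters flagged as "case via decomposition" (hand-picked examples).
NEEDS_DECOMPOSITION = {"\ufb01", "\u0149", "\u01f0"}

# Characters with no useful decomposition that must be special-cased in code.
HANDLED_IN_CODE = {"\u00df": "SS"}


def upper_char(ch):
    """Uppercase one character, decomposing it first if it is flagged."""
    if ch in HANDLED_IN_CODE:
        return HANDLED_IN_CODE[ch]
    if ch in NEEDS_DECOMPOSITION:
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0].startswith("<"):
            parts = parts[1:]  # drop a tag such as <compat>
        return "".join(chr(int(p, 16)).upper() for p in parts)
    return ch.upper()


def upper_text(text):
    return "".join(upper_char(ch) for ch in text)


print(ascii(upper_text("\ufb01sh")))
```

Restricting decomposition to flagged characters matters: decomposing every character would also change the normalization form of ordinary accented letters.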

    @loewis
    Mannequin

    loewis mannequin commented Oct 13, 2009

    > I have a half finished implementation of this, in case anyone
    > is interested.

    Feel free to upload it here. I'm fairly skeptical that it is
    possible to implement casing "correctly" in a locale-independent
    way.

    @senn
    Mannequin

    senn mannequin commented Oct 14, 2009

    > Feel free to upload it here. I'm fairly skeptical that it is
    > possible to implement casing "correctly" in a locale-independent
    > way.

    Ok. I will try to find time to complete it enough to be readable.

    Unicode (see sec 3.13) specifies the casing of unicode strings pretty
    completely -- i.e. it gives "Default Casing" rules to be used when no
    locale specific "tailoring" is available. The only dependencies on
    locale for the special casing rules are for Turkish, Azeri, and
    Lithuanian. And you only need to know that that is the language, no
    other details. So I'm sure that a complete implementation is possible
    without resort to a lot of locale munging -- at least for .lower()
    .upper() and .title().

    .swapcase() is just ...err... dumb^h^h^h^h questionably useful.

    However .capitalize() is a bit weird; and I'm not sure it isn't
    incorrectly implemented now:

    It UPPERCASES the first character, rather than TITLECASING, which is
    probably wrong in the very few cases where it makes a difference:
    e.g. (using Croatian ligatures)

    >>> u'\u01c5amonjna'.title()
    u'\u01c4amonjna'
    >>> u'\u01c5amonjna'.capitalize()
    u'\u01c5amonjna'

    "Capitalization" is not precisely defined (by the Unicode standard) --
    the current Python implementation doesn't even do what the docs say:
    "makes the first character have upper case" (it also lower-cases all
    other characters!), however I might argue that a more useful
    implementation "makes the first character have titlecase..."
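To make the uppercase versus titlecase distinction concrete, here is a small sketch using the same digraph family (U+01C4 uppercase, U+01C5 titlecase, U+01C6 lowercase). The helper `capitalize_titlecase` is a hypothetical name for the "more useful" behaviour argued for above, not an existing method.

```python
def capitalize_titlecase(text):
    """Titlecase the first character and lowercase the rest."""
    return text[:1].title() + text[1:].lower()


word = "\u01c6amonjna"  # starts with the lowercase digraph U+01C6

# Uppercasing the first character yields U+01C4, the all-caps digraph.
print(ascii(word[:1].upper()))

# Titlecasing the first character yields U+01C5, capital D with small z.
print(ascii(capitalize_titlecase(word)))
```

For almost all letters the uppercase and titlecase forms coincide, which is why the difference only shows up for a few digraphs like this one.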

    @senn
    Mannequin

    senn mannequin commented Oct 14, 2009

    Yikes! I just noticed that u''.title() is really broken!

    It doesn't really pay attention to word breaks --
    only characters that "have case".
    Therefore when there are (caseless)
    combining characters in a word it's really broken e.g.

    >>> u'n\u0303on\u0303e'.title()
    u'N\u0303On\u0303E'

    That is (where '~' is combining-tilde-over)
    n~on~e -title-cases-to-> N~On~E
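One way to avoid this, sketched under the assumption that a "word" is simply a run of alphabetic characters and any combining marks attached to them. The function `title_skip_marks` is hypothetical and does not implement the Unicode word-break rules; it only shows that ignoring combining marks when tracking word state removes the problem in the example above.

```python
import unicodedata


def title_skip_marks(text):
    """Titlecase each word, treating combining marks as part of the word."""
    out = []
    in_word = False
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            out.append(ch)  # combining mark: copy it, keep the word state
        elif ch.isalpha():
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            out.append(ch)  # anything else ends the current word
            in_word = False
    return "".join(out)


print(ascii(title_skip_marks("n\u0303on\u0303e")))
```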

    @malemburg
    Member

    Jeff Senn wrote:
    > 
    > Jeff Senn <senn@users.sourceforge.net> added the comment:
    > 
    > Yikes! I just noticed that u''.title() is really broken! 
    > 
    > It doesn't really pay attention to word breaks -- 
    > only characters that "have case".  
    > Therefore when there are (caseless)
    > combining characters in a word it's really broken e.g.
    > 
    > >>> u'n\u0303on\u0303e'.title()
    > u'N\u0303On\u0303E'
    > 
    > That is (where '~' is combining-tilde-over)
    > n~on~e -title-cases-to-> N~On~E

    Please have a look at http://bugs.python.org/issue6412 - that patch
    addresses many casing issues, at least up to the extent that we can
    actually fix them without breaking code relying on:

    len(s.upper()) == len(s)

    for upper/lower/title.

    If we add support for 1-n code point mappings, then we can only
    enable this support by using an option to the casing methods (perhaps
    not a bad idea: the parameter could be used to signal the locale
    to assume).
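The length invariant mentioned above is easy to probe. The sketch below assumes a Python version in which bpo-12736 has been applied, where, as noted at the end of this thread, 'ß'.upper() returns 'SS'; `length_changing` is a hypothetical helper name.

```python
def length_changing(text):
    """Return the characters of text whose uppercase form changes length."""
    return [ch for ch in text if len(ch.upper()) != len(ch)]


s = "stra\u00dfe"
print(len(s), len(s.upper()), length_changing(s))
```

Code that indexes into `s.upper()` using offsets computed on `s` is exactly what breaks once 1-n mappings are enabled.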

    @malemburg
    Member

    Jeff Senn wrote:
    > However .capitalize() is a bit weird; and I'm not sure it isn't 
    > incorrectly implemented now:
    > 
    > It UPPERCASES the first character, rather than TITLECASING, which is 
    > probably wrong in the very few cases where it makes a difference:
    > e.g. (using Croatian ligatures)
    > 
    > >>> u'\u01c5amonjna'.title()
    > u'\u01c4amonjna'
    > >>> u'\u01c5amonjna'.capitalize()
    > u'\u01c5amonjna'
    > 
    > "Capitalization" is not precisely defined (by the Unicode standard) -- 
    > the current Python implementation doesn't even do what the docs say: 
    > "makes the first character have upper case" (it also lower-cases all 
    > other characters!), however I might argue that a more useful 
    > implementation "makes the first character have titlecase..."

    You don't have to worry about .capitalize() and .swapcase() :-)

    Those methods are defined by their implementation and don't resemble
    anything defined in Unicode.

    I agree that they are, well, not that useful.

    @rhettinger
    Contributor

    > .swapcase() is just ...err... dumb^h^h^h^h questionably useful.

    FWIW, it appears that the original use case (as an Emacs macro) was to
    correct blocks of text where touch typists had accidentally left the
    CapsLocks key turned on: tHE qUICK bROWN fOX jUMPED oVER tHE lAZY dOG.

    I agree with the rest of you that Python would be better-off without
    swapcase().

    @abalkin
    Member

    abalkin commented Dec 6, 2010

    >> .swapcase() is just ...err... dumb^h^h^h^h questionably useful.

    > I agree with the rest of you that Python would be better-off
    > without swapcase().

    As long as str.upper/lower are based only on UnicodeData.txt 1-to-1 mappings, existence of str.swapcase() indicates to the users that they should not expect many-to-1 mappings. Also it does seem to be occasionally used for testing. -0 on removing it.

    @abalkin
    Member

    abalkin commented Jun 23, 2013

    There has been a relatively recent discussion of case mappings under bpo-12753 (msg144836).

    I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields. More sophisticated case mapping algorithms belong in a specialized library module, not the Python core.

    The behavior of .title() and .capitalize() is harder to defend, so if someone can point to a Python library (PyICU?) that gets it right we can reference it in the documentation.

    @abalkin
    Member

    abalkin commented Jun 23, 2013

    It looks like at least the OP issue has been fixed in bpo-12736:

    >>> 'ß'.upper()
    'SS'

    @abalkin abalkin closed this as completed Jun 23, 2013
    @malemburg
    Member

    On 24.06.2013 00:52, Alexander Belopolsky wrote:

    > Alexander Belopolsky added the comment:
    >
    > There has been a relatively recent discussion of case mappings under bpo-12753 (msg144836).
    >
    > I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields. More sophisticated case mapping algorithms belong in a specialized library module, not the Python core.
    >
    > The behavior of .title() and .capitalize() is harder to defend, so if someone can point to a Python library (PyICU?) that gets it right we can reference it in the documentation.

    .title() and .capitalize() are 1-1 mappings as well. Python only supports
    "Simple Case Operations" and does not support "Full Case Operations"
    which require parsing context (SpecialCasing.txt).

    ICU does provide support for both:
    http://userguide.icu-project.org/transforms/casemappings

    PyICU wraps ICU, but it is not clear to me how you'd access those
    mappings (the package doesn't provide documentation on the API, instead
    it just gives a description of how to map the C++ API to a Python one):
    https://pypi.python.org/pypi/PyICU

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022