Titlecase as defined in Unicode Case Mappings not followed #50661

Open
christoph mannequin opened this issue Jul 3, 2009 · 14 comments
Labels
stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@christoph
Mannequin

christoph mannequin commented Jul 3, 2009

BPO 6412
Nosy @malemburg, @rhettinger, @terryjreedy, @pitrou, @ezio-melotti, @bitdancer, @int-ua
Files
  • test_unicode.titlecase.diff: Patch adding a test case for istitle()
  • unicodeobject.titlecase.diff: Incomplete patch fixing title() and istitle()
  • unicodeobject.titlecase.2.diff: Patch fixing title() and istitle()
  • unicodeobject.titlecase.3.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2009-07-03.23:14:47.639>
    labels = ['type-bug', 'library']
    title = 'Titlecase as defined in Unicode Case Mappings not followed'
    updated_at = <Date 2017-11-08.20:08:20.093>
    user = 'https://bugs.python.org/christoph'

    bugs.python.org fields:

    activity = <Date 2017-11-08.20:08:20.093>
    actor = 'Serhiy Int'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2009-07-03.23:14:47.639>
    creator = 'christoph'
    dependencies = []
    files = ['14443', '14444', '14890', '14994']
    hgrepos = []
    issue_num = 6412
    keywords = ['patch']
    message_count = 14.0
    messages = ['90086', '90087', '90563', '92635', '92636', '93263', '93265', '93267', '93273', '94036', '94037', '94039', '112791', '112840']
    nosy_count = 10.0
    nosy_names = ['lemburg', 'rhettinger', 'terry.reedy', 'ggenellina', 'pitrou', 'senn', 'christoph', 'ezio.melotti', 'r.david.murray', 'Serhiy Int']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6412'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

    @christoph
    Mannequin Author

    christoph mannequin commented Jul 3, 2009

    Titlecase, i.e. istitle() and title(), is buggy when the string
    includes combining diacritical marks.

    >>> u'H\u0301ngh'.istitle()
    False
    >>> u'H\u0301ngh'.title()
    u'H\u0301Ngh'
    >>>
    
    The string given already is in titlecase so that the following result
    is expected:
    >>> u'H\u0301ngh'.istitle()
    True
    >>> u'H\u0301ngh'.title()
    u'H\u0301ngh'
    >>>

    UTR#21 Case Mappings defines the following algorithm for titlecase
    mapping [1]:

    For each character C, find the preceding character B.
        Ignore any intervening case-ignorable characters when finding B.
    If B exists, and is cased
        map C to UCD_lower(C)
    Otherwise,
        map C to UCD_title(C)

    The class of 'case-ignorable' is defined under [2] and includes
    Nonspacing Marks (Mn) as listed in [3]. This includes diacritical marks
    and others. These should not be treated like spaces, as they currently
    are, because that splits words.
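
The UTR#21 rule quoted above can be sketched in Python 3 roughly as follows. This is only an illustration, not the attached patch: the names `utr21_title`, `is_cased`, and `is_case_ignorable` are hypothetical, "cased" is approximated by the general categories Lu, Ll, and Lt, and "case-ignorable" by Mn, Me, Cf, Lm, and Sk.

```python
import unicodedata

# Assumption: general categories standing in for "case-ignorable".
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
# Assumption: general categories standing in for "cased".
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def is_case_ignorable(ch):
    """Hypothetical helper: True for characters skipped when looking for B."""
    return unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES


def is_cased(ch):
    """Hypothetical helper: True for upper-, lower-, or titlecase letters."""
    return unicodedata.category(ch) in CASED_CATEGORIES


def utr21_title(text):
    """Titlecase text, ignoring case-ignorable characters when finding B."""
    out = []
    prev_cased = False  # whether B exists and is cased
    for ch in text:
        # Map C to lowercase after a cased B, otherwise to titlecase.
        out.append(ch.lower() if prev_cased else ch.title())
        # Case-ignorable characters never become B for later characters.
        if not is_case_ignorable(ch):
            prev_cased = is_cased(ch)
    return "".join(out)


print(ascii(utr21_title("H\u0301ngh")))
```

With this sketch the combining acute accent no longer starts a new word, so the string from the report is left unchanged.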

    A patch including the above test case is attached.

    [1]
    http://unicode.org/reports/tr21/tr21-5.html#Case_Conversion_of_Strings
    [2] http://unicode.org/reports/tr21/tr21-5.html#Definitions
    [3] http://www.fileformat.info/info/unicode/category/Mn/list.htm

    @christoph christoph mannequin added the stdlib Python modules in the Lib dir label Jul 3, 2009
    @ezio-melotti ezio-melotti added the type-bug An unexpected behavior, bug, or error label Jul 3, 2009
    @christoph
    Mannequin Author

    christoph mannequin commented Jul 3, 2009

    Adding an incomplete patch, which still needs a function
    Py_UNICODE_ISCASEIGNORABLE defining the case-ignorable class.

    I don't want to touch capitalize(), as I don't fully understand its
    semantics and where it differs from title(). Following UTR#21, though,
    it seems that it is not the first character that should be uppercased,
    but the first character with casing.
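
One possible reading of "the first character with casing" is sketched below in Python 3. The function name `capitalize_first_cased` is hypothetical, and "cased" is again approximated by the general categories Lu, Ll, and Lt; this is not what capitalize() does today, nor part of the patch.

```python
import unicodedata

# Assumption: general categories standing in for "cased".
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def capitalize_first_cased(text):
    """Titlecase the first cased character and lowercase everything after it."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in CASED_CATEGORIES:
            # Leading uncased characters are kept untouched.
            return text[:i] + ch.title() + text[i + 1:].lower()
    return text  # no cased character at all


print(capitalize_first_cased("'hELLO"))  # 'Hello
print("'hELLO".capitalize())             # 'hello
```

The built-in capitalize() only looks at the very first character, which is why a leading apostrophe leaves the whole word lowercase.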

    @christoph
    Mannequin Author

    christoph mannequin commented Jul 16, 2009

    Casing algorithms should follow Section 3.13 "Default Case Algorithms"
    in the standard itself, not UTR#21.

    See
    http://www.unicode.org/Public/5.2.0/ucd/DerivedCoreProperties-5.2.0d11.
    Unicode 5.2. A nice mail on the Unicode mailing list explains this a
    bit: http://www.unicode.org/mail-arch/unicode-ml/y2009-

    @christoph
    Mannequin Author

    christoph mannequin commented Sep 14, 2009

    Attaching a full patch that solves it the old way (UTR#21).

    The correct way for the latest Unicode version would be to implement
    the word breaking algorithm described in (UAX#29) [1] first.

    [1] http://www.unicode.org/reports/tr29/#Word_Boundaries

    @christoph
    Mannequin Author

    christoph mannequin commented Sep 14, 2009

    I should add that I didn't include the two header files generated by
    Tools/unicode/makeunicodedata.py

    @malemburg
    Member

    The patch looks good, but it doesn't include the few extra characters
    that are also considered case-ignorable:

    • U+0027 APOSTROPHE
    • U+00AD SOFT HYPHEN (SHY)
    • U+2019 RIGHT SINGLE QUOTATION MARK

    Could you add those as well ?

    Thanks.

    @christoph
    Mannequin Author

    christoph mannequin commented Sep 29, 2009

    • U+0027 APOSTROPHE
      hardcoded (see below)
    • U+00AD SOFT HYPHEN (SHY)
      has the "Format (Cf)" property and thus is included automatically
    • U+2019 RIGHT SINGLE QUOTATION MARK
      hardcoded (see below)
    I hardcoded some characters into Tools/unicode/makeunicodedata.py:
    >>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
    u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
    u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
    : · · ״ ‧ ︓ ﹕ : ' . ‘ ’ ․ ﹒ ' .

    Those cannot currently be extracted automatically, as neither
    DerivedCoreProperties.txt nor the source file for property
    "Word_Break(C) = MidLetter or MidNumLet" are provided in the script.

    As I said, the patch is only a second best solution, as the correct
    path would be implementing the word breaking algorithm as described in
    the newest standard. This patch is just an improvement over the current
    situation.
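
As a rough Python 3 illustration of the classification described above (not the code generated by Tools/unicode/makeunicodedata.py), a case-ignorable check could combine the hardcoded characters with a general-category test. The names `HARDCODED_CASE_IGNORABLE`, `CASE_IGNORABLE_CATEGORIES`, and `is_case_ignorable` are hypothetical, and the category set Mn, Me, Cf, Lm, Sk is an assumption.

```python
import unicodedata

# Characters copied from the two hardcoded lists in the comment above.
HARDCODED_CASE_IGNORABLE = frozenset(
    [":", "\xb7", "\u0387", "\u05f4", "\u2027", "\ufe13", "\ufe55", "\uff1a"]
    + ["'", ".", "\u2018", "\u2019", "\u2024", "\ufe52", "\uff07", "\uff0e"]
)

# Assumption: general categories treated as case-ignorable.
CASE_IGNORABLE_CATEGORIES = frozenset(["Mn", "Me", "Cf", "Lm", "Sk"])


def is_case_ignorable(ch):
    """Hypothetical check: hardcoded character or matching general category."""
    return (ch in HARDCODED_CASE_IGNORABLE
            or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES)


for ch in ["'", "\u00ad", "\u2019", "\u0301", "a", " "]:
    print(ascii(ch), unicodedata.category(ch), is_case_ignorable(ch))
```

SOFT HYPHEN needs no hardcoding here because its general category is Cf, which matches the explanation given for U+00AD.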

    @malemburg
    Member

    Christoph Burgmer wrote:
    > 
    > Christoph Burgmer <cburgmer@ira.uka.de> added the comment:
    > 
    >> * U+0027 APOSTROPHE
    > hardcoded (see below)
    >> * U+00AD SOFT HYPHEN (SHY)
    > has the "Format (Cf)" property and thus is included automatically
    >> * U+2019 RIGHT SINGLE QUOTATION MARK
    > hardcoded (see below)
    > 
    > I hardcoded some characters into Tools/unicode/makeunicodedata.py:
    >>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
    > u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
    > u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
    > : · · ״ ‧ ︓ ﹕ : ' . ‘ ’ ․ ﹒ ' .
    > 
    > Those cannot currently be extracted automatically, as neither
    > DerivedCoreProperties.txt nor the source file for property
    > "Word_Break(C) = MidLetter or MidNumLet" are provided in the script.

    As long as those code points are defined somewhere in the Unicode
    standard files, that's ok.

    It would be good to add a comment explaining the above in the code.

    BTW: It's better to use "if (....)" instead of \-line joining. The
    parens will automatically have Python do the line joining for you
    and it looks better.

    > As I said, the patch is only a second best solution, as the correct
    > path would be implementing the word breaking algorithm as described in
    > the newest standard. This patch is just an improvement over the current
    > situation.

    We could handle the word-breaking in a separate new method.

    For .title(), I think your patch is an improvement and it will
    fix most of the cases that bpo-7008 mentions.

    @christoph
    Mannequin Author

    christoph mannequin commented Sep 29, 2009

    New patch

    • updated comments to reflect needed integration of
      DerivedCoreProperties.txt
    • cleaned up if(...) construct
    • updated (from bpo-7008) and integrated testcase

    When applying this patch, run Tools/unicode/makeunicodedata.py to
    regenerate the header files.

    Note though, that with this patch str and unicode objects will not
    behave equally:
    >>> s = "This isn't right"
    >>> s.title() == unicode(s).title()
    False

    @senn
    Mannequin

    senn mannequin commented Oct 14, 2009

    Referred to this from bpo-4610... anyone following this might want to
    look there as well.

    @senn
    Mannequin

    senn mannequin commented Oct 14, 2009

    So, is it not considered a bug that:

    >>> "This isn't right".title()
    "This Isn'T Right"

    !?!?!?

    @malemburg
    Member

    Jeff Senn wrote:
    > 
    > Jeff Senn <senn@users.sourceforge.net> added the comment:
    > 
    > So, is it not considered a bug that:
    > 
    >>>> "This isn't right".title()
    > "This Isn'T Right"
    > 
    > !?!?!?

    That's http://bugs.python.org/issue7008 and is fixed as part of
    http://bugs.python.org/issue6412

    @christoph
    Mannequin Author

    christoph mannequin commented Aug 4, 2010

    @terry

    How is the behavior changed? To me it seems the same as initially reported.
    The results are consistent but nonetheless wrong. It's not about whether you agree with the result, but rather about following the Unicode standard.

    @terryjreedy
    Member

    Christoph is responding above to a previous version of this message with an erroneous conclusion based on a misreading of his original message.

    The proposed patch makes this issue overlap bpo-7008, which had some contentious discussion, so I am adding some people from that to this nosy list so they may opine here. Otherwise starting over:

    3.1 has the same bug.

    3.1.2
    >>> 'H\u0301ngh'.istitle()
    False
    >>> 'H\u0301ngh'=='H\u0301ngh'.title()
    False
    >>> 'H\u0301ngh'.title()
    'H́Ngh' # in IDLE, the accent is over the H

    The problem is that .title() treats the accent that looks like an apostrophe '\u0301' as if it were an apostrophe "'". The latter are documented as forming word boundaries, as in

    >>> "De'souza".title()
    "De'Souza"
    >>> "O'brian".title()
    "O'Brian"

    Here is the beginning of the 3.1.2 title() doc:
    "str.title()
    Return a titlecased version of the string where words start with an uppercase character and the remaining characters are lowercase.

    The algorithm uses a simple language-independent definition of a word as groups of consecutive letters. The definition works in many contexts but it means that apostrophes in contractions and possessives form word boundaries, which may not be the desired result:"

    That means that

    >>> "This Isn'T Right".istitle()
    True

    is correct as documented.
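
For code that wants apostrophes kept inside words without changing title() itself, a regular-expression workaround is one option. The sketch below is Python 3 and only an illustration: the function name `titlecase_keep_apostrophes` is hypothetical, and the pattern handles ASCII letters only.

```python
import re


def titlecase_keep_apostrophes(s):
    """Capitalize each word, treating an inner apostrophe as part of the word."""
    return re.sub(
        r"[A-Za-z]+('[A-Za-z]+)?",            # a word, optionally with 'suffix
        lambda mo: mo.group(0).capitalize(),  # uppercase first letter only
        s,
    )


print("this isn't right".title())                      # This Isn'T Right
print(titlecase_keep_apostrophes("this isn't right"))  # This Isn't Right
```

The trade-off is the one raised in this message: names such as "De'souza" then come out as "De'souza" rather than "De'Souza".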

    I interpret the conclusion of bpo-7008, based on Guido's msg93242, as saying that this should be left alone, but I interpret previous messages and the test in unicodeobject.titlecase.3.diff as saying this would become False. Such a change would badly affect the prior examples where the capital after the ' *is* wanted. That is why that change was rejected in bpo-7008. So I think ' should be removed from the current patch. I do not know about the other chars that are hard-coded.

    With or without that, there is the issue of whether the current behavior really contradicts the somewhat vague doc and whether a change would break enough code that this issue should be treated as a feature change for 3.2 only.

    Reading this from msg93265
    "As I said, the patch is only a second best solution, as the correct
    path would be implementing the word breaking algorithm as described in
    the newest standard. This patch is just an improvement over the current
    situation."
    makes me wonder whether .title and .istitle should be left alone until the right solution is implemented.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022

    4 participants