Titlecase as defined in Unicode Case Mappings not followed #50661
Comments
Titlecase, i.e. istitle() and title(), is buggy when the string
>>> u'H\u0301ngh'.istitle()
False
>>> u'H\u0301ngh'.title()
u'H\u0301Ngh'
>>>
The string given already is in titlecase so that the following result
is expected:
>>> u'H\u0301ngh'.istitle()
True
>>> u'H\u0301ngh'.title()
u'H\u0301ngh'
>>>
UTR#21 Case Mappings defines the following algorithm for titlecase
For each character C, find the preceding character B.
The class of 'case-ignorable' is defined under [2] and includes
A patch including the above test case is attached.
[1] |
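The rule quoted above can be sketched roughly as follows in Python 3. This is only an illustration under stated assumptions, not the attached patch: as I read UTR#21, B is found by skipping case-ignorable characters, and C is lowercased when B is cased and titlecased otherwise. The helper names (`is_case_ignorable`, `is_cased`, `utr21_title`, `utr21_istitle`) are hypothetical, and the case-ignorable test is an approximation built from general categories plus a few hard-coded marks rather than the exact Unicode definition.

```python
import unicodedata

# Assumption: approximate "case-ignorable" by general category plus a few
# hard-coded marks; the authoritative set comes from the Unicode data files.
IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}
IGNORABLE_EXTRA = {"'", ".", ":", "\u2018", "\u2019"}


def is_case_ignorable(ch):
    return unicodedata.category(ch) in IGNORABLE_CATEGORIES or ch in IGNORABLE_EXTRA


def is_cased(ch):
    # Assumption: a character counts as cased if it is upper-, lower- or titlecase.
    return ch.isupper() or ch.islower() or ch.istitle()


def utr21_title(s):
    out = []
    prev_cased = False  # is the last non-ignorable character (B) cased?
    for ch in s:
        if is_case_ignorable(ch):
            out.append(ch)  # simplification: ignorable characters pass through unchanged
            continue
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = is_cased(ch)
    return "".join(out)


def utr21_istitle(s):
    # Hypothetical counterpart to istitle(): a string is in titlecase
    # if titlecasing it changes nothing.
    return utr21_title(s) == s
```

Under these assumptions the combining accent in `'H\u0301ngh'` is skipped when looking for B, so `utr21_title` leaves the string unchanged and `utr21_istitle` accepts it, which is the result the report expects.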
Adding an incomplete patch in need of a function
I don't want to touch capitalize() as I don't fully understand the |
Casing algorithms should follow Section 3.13 "Default Case Algorithms" See |
Implementing full patch solving it the old way (UTR#21). The correct way for the latest Unicode version would be to implement |
I should add that I didn't include the two header files generated by |
The patch looks good, but it doesn't include the few extra characters
Could you add those as well? Thanks. |
I hardcoded some characters into Tools/unicode/makeunicodedata.py:
>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
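: · · ״ ‧ ︓ ﹕ : ' . ‘ ’ ․ ﹒ ' .
Those cannot currently be extracted automatically, as neither DerivedCoreProperties.txt nor the source file for property "Word_Break(C) = MidLetter or MidNumLet" are provided in the script.
As I said, the patch is only a second best solution, as the correct |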
Christoph Burgmer wrote:
>
> Christoph Burgmer <cburgmer@ira.uka.de> added the comment:
>
>> * U+0027 APOSTROPHE
> hardcoded (see below)
>> * U+00AD SOFT HYPHEN (SHY)
> has the "Format (Cf)" property and thus is included automatically
>> * U+2019 RIGHT SINGLE QUOTATION MARK
> hardcoded (see below)
>
> I hardcoded some characters into Tools/unicode/makeunicodedata.py:
>>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
> u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
> u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
> : · · ״ ‧ ︓ ﹕ : ' . ‘ ’ ․ ﹒ ' .
>
> Those cannot currently be extracted automatically, as neither
> DerivedCoreProperties.txt nor the source file for property
> "Word_Break(C) = MidLetter or MidNumLet" are provided in the script.
As long as those code points are defined somewhere in the Unicode
It would be good to add a comment explaining the above in the code.
BTW: It's better to use "if (....)" instead of \-line joining. The
We could handle the word-breaking in a separate new method.
For .title(), I think your patch is an improvement and it will |
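For readers who want to see which characters the hard-coded lists quoted above actually contain, a small Python 3 sketch using the standard `unicodedata` module can print each code point with its general category and name. The function name `describe` and the list names are hypothetical; U+00AD is appended only to show the "Format (Cf)" category mentioned in the exchange.

```python
import unicodedata

# The hard-coded code points quoted above, kept as the two lists from the message.
first_list = [":", "\xb7", "\u0387", "\u05f4", "\u2027", "\ufe13", "\ufe55", "\uff1a"]
second_list = ["'", ".", "\u2018", "\u2019", "\u2024", "\ufe52", "\uff07", "\uff0e"]


def describe(ch):
    """Return (code point, general category, name) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))


# U+00AD SOFT HYPHEN is added for comparison; it is not in the hard-coded lists.
for ch in first_list + second_list + ["\u00ad"]:
    print(*describe(ch))
```

The categories and names printed depend on the Unicode database version bundled with the interpreter running the script.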
New patch
When applying this patch, run Tools/unicode/makeunicodedata.py to
Note though, that with this patch str and unicode objects will not
behave equally:
>>> s = "This isn't right"
>>> s.title() == unicode(s).title()
False |
Referred to this from bpo-4610... anyone following this might want to |
So, is it not considered a bug that:
>>> "This isn't right".title()
"This Isn'T Right"
!?!?!? |
Jeff Senn wrote:
>
> Jeff Senn <senn@users.sourceforge.net> added the comment:
>
> So, is it not considered a bug that:
>
>>>> "This isn't right".title()
> "This Isn'T Right"
>
> !?!?!?
That's http://bugs.python.org/issue7008 and is fixed as part of |
How is the behavior changed? To me it seems the same as initially reported. |
Christoph is responding above to a previous version of this message with an erroneous conclusion based on a misreading of his original message.
The proposed patch makes this issue overlap bpo-7008, which had some contentious discussion, so I am adding some people from that to this nosy list so they may opine here.
Otherwise starting over: 3.1 has the same bug.
3.1.2
>>> 'H\u0301ngh'.istitle()
False
>>> 'H\u0301ngh'=='H\u0301ngh'.title()
False
>>> 'H\u0301ngh'.title()
'H́Ngh' # in IDLE, the accent is over the H
The problem is that .title() treats the accent that looks like an apostrophe '\u0301' as if it were an apostrophe "'". The latter are documented as forming word boundaries, as in
>>> "De'souza".title()
"De'Souza"
>>> "O'brian".title()
"O'Brian"
Here is the beginning of the 3.1.2 title() doc: "The algorithm uses a simple language-independent definition of a word as groups of consecutive letters. The definition works in many contexts but it means that apostrophes in contractions and possessives form word boundaries, which may not be the desired result:"
That means that
>>> "This Isn'T Right".istitle()
True
is correct as documented. I interpret the conclusion of bpo-7008, based on Guido's msg93242, as saying that that should be left alone, but I interpret previous messages and the test in unicodeobject.titlecase.3.diff as saying this would become False. Such a change would badly affect the prior examples where the post-' capital *is* wanted. That is why that change was rejected in bpo-7008. So I think ' should be removed from the current patch. I do not know about the other chars that are hard-coded.
With or without that, there is the issue of whether the current behavior really contradicts the somewhat vague doc and whether change would break enough code that this issue should be treated as a feature change for 3.2 only.
Reading this from msg93265 |
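The behaviour described in the documentation excerpt above can be modelled roughly as follows in Python 3. This is a simplified sketch, not CPython's actual implementation: the names `is_cased` and `simple_title` are hypothetical, and the only assumption is that a character is titlecased whenever the character immediately before it is not cased.

```python
def is_cased(ch):
    # Assumption: a character counts as cased if it is upper-, lower- or titlecase.
    return ch.isupper() or ch.islower() or ch.istitle()


def simple_title(s):
    out = []
    prev_cased = False  # was the immediately preceding character cased?
    for ch in s:
        # Any uncased character, including "'" and U+0301, ends the current word.
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = is_cased(ch)
    return "".join(out)
```

In this model the apostrophe and the combining accent are treated alike, since neither is cased: `simple_title("De'souza")` gives `"De'Souza"` and `simple_title('H\u0301ngh')` gives `'H\u0301Ngh'`. That is why a fix aimed at the accent has to decide separately what to do about the apostrophe.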