str.upper converts to title #56413

Closed
py-user mannequin opened this issue May 29, 2011 · 14 comments
Labels
docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error

Comments

@py-user
Mannequin

py-user mannequin commented May 29, 2011

BPO 12204
Nosy @malemburg, @rhettinger, @abalkin, @ezio-melotti, @merwok, @py-user
Files
  • issue12204.diff: Patch to add a note in the doc.
  • issue12204-2.diff: Patch that factors out definition of cased chars.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2011-08-15.11:29:57.255>
    created_at = <Date 2011-05-29.04:49:20.018>
    labels = ['type-bug', 'docs']
    title = 'str.upper converts to title'
    updated_at = <Date 2011-08-15.12:50:10.679>
    user = 'https://github.com/py-user'

    bugs.python.org fields:

    activity = <Date 2011-08-15.12:50:10.679>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2011-08-15.11:29:57.255>
    closer = 'ezio.melotti'
    components = ['Documentation']
    creation = <Date 2011-05-29.04:49:20.018>
    creator = 'py.user'
    dependencies = []
    files = ['22708', '22709']
    hgrepos = []
    issue_num = 12204
    keywords = ['patch']
    message_count = 14.0
    messages = ['137167', '137171', '137181', '137554', '140778', '140779', '140853', '140855', '142119', '142120', '142124', '142126', '142127', '142128']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'rhettinger', 'belopolsky', 'ezio.melotti', 'eric.araujo', 'docs@python', 'py.user', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12204'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @py-user
    Mannequin Author

    py-user mannequin commented May 29, 2011

    specification

    str.upper()¶

    Return a copy of the string converted to uppercase.
    

    str.isupper()¶

    Return true if all cased characters in the string are uppercase and there is at least one cased character, false otherwise. Cased characters are those with general category property being one of “Lu”, “Ll”, or “Lt” and uppercase characters are those with general category property “Lu”.
    
    >>> '\u1ff3'
    'ῳ'
    >>> '\u1ff3'.islower()
    True
    >>> '\u1ff3'.upper()
    'ῼ'
    >>> '\u1ff3'.upper().isupper()
    False
    >>>
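The category-based definition quoted above can be checked with the standard `unicodedata` module. This is a minimal sketch: `is_cased` and `isupper_by_category` are hypothetical helper names that follow the wording of the quoted documentation, and they are not necessarily how any particular Python release implements `str.isupper()`.

```python
import unicodedata

# General categories that the quoted documentation treats as "cased".
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def is_cased(ch):
    """Return True if ch is a cased character per the quoted definition."""
    return unicodedata.category(ch) in CASED_CATEGORIES


def isupper_by_category(s):
    """Mimic the documented rule: at least one cased character, all of them Lu."""
    cased = [ch for ch in s if is_cased(ch)]
    return bool(cased) and all(unicodedata.category(ch) == "Lu" for ch in cased)


# Inspect the two characters from the report.
for ch in ("\u1ff3", "\u1ffc"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# U+1FFC is cased (Lt) but not Lu, so the documented rule rejects it.
print(isupper_by_category("\u1ffc"))
```

Under this reading, a titlecase letter counts as cased but never as uppercase, which is why the result of `.upper()` in the report fails `.isupper()`.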

    @py-user py-user mannequin added the type-bug An unexpected behavior, bug, or error label May 29, 2011
    @ezio-melotti
    Member

    '\u1ff3'.upper() returns '\u1ffc', so we have:
    U+1FF3 (ῳ - GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI)
    U+1FFC (ῼ - GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI)
    The first belongs to the Ll (Letter, lowercase) category, whereas the second belongs to the Lt (Letter, titlecase) category.

    The entries for these two chars in the UnicodeData.txt [0] file are:
    1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 0345;;;;N;;;1FFC;;1FFC
    1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 0345;;;;N;;;;1FF3;

    U+1FF3 has U+1FFC in both the third-to-last and last fields (Simple_Uppercase_Mapping and Simple_Titlecase_Mapping respectively; see [1]), so .upper() is doing the right thing here.
    U+1FFC has U+1FF3 in the second-to-last field (Simple_Lowercase_Mapping), but since its category is Lt rather than Lu, .isupper() returns False.

    The Unicode Standard Annex #44 [2] defines the Lt category as:
    Lt Titlecase_Letter a digraphic character, with first part uppercase

    I'm not sure there's anything to fix here: both functions behave as documented, and .upper() may indeed return characters with category Lt, for which .isupper() then returns False.
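The field positions described in this comment can be read straight from the two quoted entries. The sketch below parses only those two lines rather than a real copy of UnicodeData.txt; `parse_entry` and the dictionary keys are hypothetical names chosen for illustration.

```python
# The two UnicodeData.txt entries quoted in the comment above.
ENTRIES = [
    "1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 0345;;;;N;;;1FFC;;1FFC",
    "1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 0345;;;;N;;;;1FF3;",
]


def parse_entry(line):
    """Split one semicolon-separated entry and pick out the case-related fields."""
    fields = line.split(";")
    return {
        "code": fields[0],
        "category": fields[2],
        "upper": fields[-3],  # Simple_Uppercase_Mapping (third-to-last field)
        "lower": fields[-2],  # Simple_Lowercase_Mapping (second-to-last field)
        "title": fields[-1],  # Simple_Titlecase_Mapping (last field)
    }


for line in ENTRIES:
    print(parse_entry(line))
```

An empty string in a mapping field means the entry defines no simple mapping of that kind, which is the case for the uppercase and titlecase fields of U+1FFC.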

    @ezio-melotti ezio-melotti added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels May 29, 2011
    @malemburg
    Member

    I think there's a misunderstanding here: title cased characters
    are ones typically used in titles of a document. They don't
    necessarily have to be upper case, though, since some characters
    are never used as first letters of a word.

    Note that .upper() also does not guarantee to return an upper
    case character. It just applies the mapping defined in the
    Unicode standard and if there is no such mapping, or Python
    does not support the mapping, the method returns the
    original character.

    The German ß is such a character (U+00DF). It doesn't have
    an uppercase mapping in actual use and only received such
    a mapping in Unicode 5.1 based on rather controversial
    grounds (see http://en.wikipedia.org/wiki/ẞ).

    The character is normally mapped to 'SS' when converting it
    to upper case or title case. This multi-character mapping
    is not supported by Python, so .upper() just returns U+00DF.

    I suggest closing this ticket as invalid, or adding a note
    to the documentation explaining how the mapping is applied
    (and when it is not).
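The point that `.upper()` does not guarantee an uppercase result can be probed with a small helper. This is a sketch only: `upper_report` is a hypothetical name, and the output depends on the Python version and the Unicode database it ships with. As far as I know, Python 3.3 and later apply the multi-character mappings, so there `'ß'.upper()` gives `'SS'` rather than returning U+00DF unchanged as described above.

```python
def upper_report(s):
    """Return the uppercased string and whether it passes isupper()."""
    up = s.upper()
    return up, up.isupper()


# Plain letters, digits (no cased characters), U+00DF, and the
# character from the original report.
for sample in ("abc", "123", "\u00df", "\u1ff3"):
    up, ok = upper_report(sample)
    print(repr(sample), "->", repr(up), "isupper:", ok)
```

Strings without any cased character, such as `"123"`, come back unchanged from `.upper()` and still fail `.isupper()`, so `s.upper().isupper()` is not a safe invariant to rely on.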

    @merwok
    Member

    merwok commented Jun 3, 2011

    A note sounds good.

    @merwok merwok added docs Documentation in the Doc dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Jun 3, 2011
    @ezio-melotti
    Member

    Here's a patch.
    I don't think it's necessary to update the docstring.

    @ezio-melotti
    Member

    New patch that factors out the definition of cased characters adding it to a footnote.

    @merwok
    Member

    merwok commented Jul 22, 2011

    Patch looks good, with one issue: I’ve never encountered “cased character” before, is it an accepted term or an invention in our docs?

    @ezio-melotti
    Member

    I think it's an invention, but its meaning is quite clear to me.

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 15, 2011

    New changeset 16edc5cf4a79 by Ezio Melotti in branch '3.2':
    bpo-12204: document that str.upper().isupper() might be False and add a note about cased characters.
    http://hg.python.org/cpython/rev/16edc5cf4a79

    New changeset fb49394f75ed by Ezio Melotti in branch '2.7':
    bpo-12204: document that str.upper().isupper() might be False and add a note about cased characters.
    http://hg.python.org/cpython/rev/fb49394f75ed

    New changeset c821e3a54930 by Ezio Melotti in branch 'default':
    bpo-12204: merge with 3.2.
    http://hg.python.org/cpython/rev/c821e3a54930

    @ezio-melotti
    Member

    Fixed, thanks for the report!

    @rhettinger
    Contributor

    Are you sure this should have been backported? Are there any apps that may be working now but won't be after the next point release?

    @ezio-melotti
    Member

    This is only a doc patch, maybe you are confusing this issue with bpo-12266?

    @rhettinger
    Contributor

    Right. I was looking at the other patches that went in in the last 24 hours.

    @ezio-melotti
    Member

    It's unlikely that the bpo-12266 fix will break apps. The behavior changed only for fairly unusual characters, and the old behavior was clearly wrong.
    FWIW the str.capitalize() implementation of PyPy doesn't have the bug, and after the fix both CPython and PyPy have the same behavior.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022