str.capitalize contradicts oneself #56475

py-user · 2011-06-05T05:54:59Z

BPO	12266
Nosy	@malemburg, @abalkin, @ezio-melotti, @merwok, @bitdancer, @py-user
Superseder	bpo-12204: str.upper converts to title
Files	issue12266.diff: Patch against 3.2 + tests.

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = <Date 2011-08-15.06:31:57.022>
created_at = <Date 2011-06-05.05:54:59.103>
labels = ['interpreter-core', 'type-bug']
title = 'str.capitalize contradicts oneself'
updated_at = <Date 2011-08-15.07:04:58.815>
user = 'https://github.com/py-user'

bugs.python.org fields:

activity = <Date 2011-08-15.07:04:58.815>
actor = 'python-dev'
assignee = 'ezio.melotti'
closed = True
closed_date = <Date 2011-08-15.06:31:57.022>
closer = 'ezio.melotti'
components = ['Interpreter Core']
creation = <Date 2011-06-05.05:54:59.103>
creator = 'py.user'
dependencies = []
files = ['22898']
hgrepos = []
issue_num = 12266
keywords = ['patch']
message_count = 15.0
messages = ['137682', '137694', '137720', '140780', '140796', '140798', '140799', '140857', '140858', '140859', '142071', '142099', '142100', '142101', '142103']
nosy_count = 7.0
nosy_names = ['lemburg', 'belopolsky', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'py.user', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = '12204'
type = 'behavior'
url = 'https://bugs.python.org/issue12266'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

py-user · 2011-06-05T05:54:59Z

specification

str.capitalize()¶

Return a copy of the string with its first character capitalized and the rest lowercased.

>>> '\u1ffc', '\u1ff3'
('ῼ', 'ῳ')
>>> '\u1ffc'.isupper()
False
>>> '\u1ff3'.islower()
True
>>> s = '\u1ff3\u1ff3\u1ffc\u1ffc'
>>> s
'ῳῳῼῼ'
>>> s.capitalize()
'ῼῳῼῼ'
>>>

A: lower
B: title

A -> B & !B -> A
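
For anyone reproducing the report, here is a minimal check, assuming a Python 3 interpreter, of why `'\u1ffc'.isupper()` is False: U+1FFC is in the general category Lt (titlecase letter), not Lu. The variable names are illustrative and not part of the original report.

```python
import unicodedata

title_omega = "\u1ffc"   # the character that isupper() rejects in the report
small_omega = "\u1ff3"

# Lt = "Letter, titlecase"; Ll = "Letter, lowercase"
categories = {
    f"U+{ord(ch):04X}": unicodedata.category(ch)
    for ch in (title_omega, small_omega)
}
print(categories)

# The documented behaviour asks for everything after the first character
# to be lowercased, so a conforming capitalize() should end in three
# small omegas for the string used in the report.
s = small_omega * 2 + title_omega * 2
rest_is_lowercased = s.capitalize().endswith(small_omega * 3)
print(rest_is_lowercased)
```

Only the tail of the result is checked, because the mapping applied to the first character may differ between Python versions.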

bitdancer · 2011-06-05T13:37:42Z

This looks like a duplicate of bpo-12204.

py-user · 2011-06-06T00:18:58Z

in http://bugs.python.org/issue12204
Marc-Andre Lemburg wrote:
I suggest to close this ticket as invalid or to add a note
to the documentation explaining how the mapping is applied
(and when not).

This is a different problem:
str.capitalize uppercases the first character, but it doesn't lowercase the rest.
Clarifying the documentation is not enough.

Lowering works:
>>> '\u1ffc'
'ῼ'
>>> '\u1ffc'.lower()
'ῳ'
>>>

ezio-melotti · 2011-07-21T04:59:51Z

Indeed this seems to be a different issue, and it might be worth fixing.
Given this definition:
  str.capitalize()¶
      Return a copy of the string with its first character capitalized and the rest lowercased.
we might implement capitalize like:
>>> def mycapitalize(s):
...     return s[0].upper() + s[1:].lower()
... 
>>> 'fOoBaR'.capitalize()
'Foobar'
>>> mycapitalize('fOoBaR')
'Foobar'

And this would yield the correct result:
>>> s = u'\u1ff3\u1ff3\u1ffc\u1ffc'
>>> print s
ῳῳῼῼ
>>> print s.capitalize()
ῼῳῼῼ
>>> print mycapitalize(s)
ῼῳῳῳ
>>> s.capitalize().istitle()
False
>>> mycapitalize(s).istitle()
True
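
The transcript above is Python 2. A possible Python 3 port of the same sketch, assuming nothing beyond the built-in `str` methods, uses `s[:1]` instead of `s[0]` so that the empty string does not raise `IndexError`:

```python
def mycapitalize(s):
    # Uppercase the first character (if any) and lowercase everything else.
    # s[:1] is '' for an empty string, whereas s[0] would raise IndexError.
    return s[:1].upper() + s[1:].lower()

print(mycapitalize("fOoBaR"))        # Foobar
print(repr(mycapitalize("")))        # ''

s = "\u1ff3\u1ff3\u1ffc\u1ffc"
# Only the tail is compared: what upper() does to the first character
# can depend on the Python version.
print(mycapitalize(s)[-3:] == "\u1ff3" * 3)
```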

This doesn't happen because the actual implementation of str.capitalize checks whether a char is uppercase (and not whether it's titlecase too) before converting it to lowercase.  This can be fixed by doing:
diff -r cb44fef5ea1d Objects/unicodeobject.c
--- a/Objects/unicodeobject.c   Thu Jul 21 01:11:30 2011 +0200
+++ b/Objects/unicodeobject.c   Thu Jul 21 07:57:21 2011 +0300
@@ -6739,7 +6739,7 @@
     }
     s++;
     while (--len > 0) {
-        if (Py_UNICODE_ISUPPER(*s)) {
+        if (Py_UNICODE_ISUPPER(*s) || Py_UNICODE_ISTITLE(*s)) {
             *s = Py_UNICODE_TOLOWER(*s);
             status = 1;
         }
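
The effect of the one-line change can be modelled in Python. This is only a rough sketch: it approximates `Py_UNICODE_ISUPPER` and `Py_UNICODE_ISTITLE` with the general categories Lu and Lt, which is an assumption and not how the C macros are defined, and the function names are made up for illustration.

```python
import unicodedata

def lower_rest_old(rest):
    # Model of the pre-patch loop: only characters classified as uppercase
    # (approximated here by category Lu) are lowercased.
    return "".join(
        c.lower() if unicodedata.category(c) == "Lu" else c
        for c in rest
    )

def lower_rest_patched(rest):
    # Model of the patched loop: uppercase or titlecase (Lu or Lt).
    return "".join(
        c.lower() if unicodedata.category(c) in ("Lu", "Lt") else c
        for c in rest
    )

rest = "\u1ff3\u1ffc\u1ffc"   # everything after the first character of s
print(lower_rest_old(rest) == rest)                # titlecase chars untouched
print(lower_rest_patched(rest) == "\u1ff3" * 3)    # titlecase chars lowered
```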

malemburg · 2011-07-21T08:34:22Z

I think it would be better to use this code:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

This actually implements what the docstring says.

Note that title case is not the same as upper case. Title case is
a special case that gets applied when using a string as the title
of a text, and may well include characters that are lower case
but which are only used in titles.

ezio-melotti · 2011-07-21T08:52:55Z

Do you mean "if (!Py_UNICODE_ISLOWER(*s)) {" (with the '!')?

This sounds fine to me, but with this approach all the uncased characters will go through a Py_UNICODE_TO* macro, whereas with the current code only the cased ones are converted. I'm not sure this matters too much though.

OTOH if the non-lowercase cased chars are always either upper or titlecased, checking for both should be equivalent.

malemburg · 2011-07-21T09:02:34Z

Ezio Melotti wrote:

> Do you mean "if (!Py_UNICODE_ISLOWER(*s)) {" (with the '!')?

Sorry, here's the correct version:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (!Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

> This sounds fine to me, but with this approach all the uncased characters will go through a Py_UNICODE_TO* macro, whereas with the current code only the cased ones are converted. I'm not sure this matters too much though.
>
> OTOH if the non-lowercase cased chars are always either upper or titlecased, checking for both should be equivalent.

AFAIK, there are characters that don't have a case mapping at all.
It may also be the case, that a non-cased character still has a
lower/upper case mapping, e.g. for typographical reasons.

Someone would have to check this against the current Unicode database.
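
A Python-level model of the corrected loop may help show what "not lowercase, so convert" means for uncased and non-letter characters. This is a sketch only: `capitalize_model` is a hypothetical name, and `str.isupper()` / `str.islower()` stand in for the C macros, which is an approximation.

```python
def capitalize_model(s):
    out = []
    for i, c in enumerate(s):
        if i == 0:
            # First character: convert unless it is already uppercase.
            out.append(c if c.isupper() else c.upper())
        else:
            # Remaining characters: convert unless already lowercase.
            # Uncased characters go through lower() too and come back unchanged.
            out.append(c if c.islower() else c.lower())
    return "".join(out)

# 'a', CIRCLED LATIN CAPITAL LETTER A (So), ROMAN NUMERAL FOUR (Nl), '!'
sample = "a\u24b6\u2163!"
print(capitalize_model(sample) == "A\u24d0\u2173!")
```

The circled letter and the Roman numeral are not letters by general category, yet both have lowercase mappings, which is the situation described above.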

ezio-melotti · 2011-07-22T03:30:13Z

>>> import sys; hex(sys.maxunicode)
'0x10ffff'
>>> import unicodedata; unicodedata.unidata_version
'6.0.0'

import unicodedata
all_chars = list(map(chr, range(0x110000)))
Ll = [c for c in all_chars if unicodedata.category(c) == 'Ll']
Lu = [c for c in all_chars if unicodedata.category(c) == 'Lu']
Lt = [c for c in all_chars if unicodedata.category(c) == 'Lt']
Lo = [c for c in all_chars if unicodedata.category(c) == 'Lo']
Lm = [c for c in all_chars if unicodedata.category(c) == 'Lm']

>>> [len(x) for x in [Ll, Lu, Lt, Lo, Lm]]
[1759, 1436, 31, 97084, 210]
>>> sum(1 for c in Lu if c.lower() == c)
471  # uppercase chars with no lower
>>> sum(1 for c in Lt if c.lower() == c)
0    # titlecase chars with no lower
>>> sum(1 for c in Ll if c.upper() == c)
760  # lowercase chars with no upper
>>> sum(1 for c in Lo if c.upper() != c or c.title() != c or c.lower() != c)
0    # "Letter, other" chars with a different upper/title/lower case
>>> sum(1 for c in Lm if c.upper() != c or c.title() != c or c.lower() != c)
0    # "Letter, modifier" chars with a different upper/title/lower case
>>> sum(1 for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c))
85   # non-letter chars with a different upper/title/lower case
>>> [c for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c)]
['', 'Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ', 'Ⅷ', 'Ⅸ', 'Ⅹ', 'Ⅺ', 'Ⅻ', 'Ⅼ', 'Ⅽ', 'Ⅾ', 'Ⅿ', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'Ⓐ', 'Ⓑ', 'Ⓒ', 'Ⓓ', 'Ⓔ', 'Ⓕ', 'Ⓖ', 'Ⓗ', 'Ⓘ', 'Ⓙ', 'Ⓚ', 'Ⓛ', 'Ⓜ', 'Ⓝ', 'Ⓞ', 'Ⓟ', 'Ⓠ', 'Ⓡ', 'Ⓢ', 'Ⓣ', 'Ⓤ', 'Ⓥ', 'Ⓦ', 'Ⓧ', 'Ⓨ', 'Ⓩ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ']
>>> list(c.lower() for c in _)
['', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ']
>>> len(_)
85
>>> {unicodedata.category(c) for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c)}
{'So', 'Mn', 'Nl'}

So == Symbol, Other
Mn == Mark, Nonspacing
Nl == Number, Letter

py-user · 2011-07-22T04:26:16Z

>> [c for c in all_chars if c not in L and ...

L ?

ezio-melotti · 2011-07-22T04:34:09Z

L = set(sum([Ll, Lu, Lt, Lo, Lm], []))
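
With `L` defined, the survey of non-letter characters that still have case mappings can be rerun as a single self-contained script. This is a sketch assuming Python 3; the function name is illustrative, and the counts depend on the Unicode version bundled with the interpreter, so none are asserted here.

```python
import unicodedata

LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm"}

def cased_non_letters():
    """Return non-letter characters whose upper/title/lower form differs."""
    found = []
    for cp in range(0x110000):
        c = chr(cp)
        if unicodedata.category(c) in LETTER_CATEGORIES:
            continue
        if c.upper() != c or c.title() != c or c.lower() != c:
            found.append(c)
    return found

non_letters = cased_non_letters()
categories = {unicodedata.category(c) for c in non_letters}
print(unicodedata.unidata_version, len(non_letters), sorted(categories))
```

A single pass over the code points avoids building the five per-category lists and the `L` set, at the cost of not keeping the per-category counts.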

ezio-melotti · 2011-08-14T17:58:37Z

Attached patch + tests.

python-dev · 2011-08-15T06:22:46Z

New changeset c34772013c53 by Ezio Melotti in branch '3.2':
bpo-12266: Fix str.capitalize() to correctly uppercase/lowercase titlecased and cased non-letter characters.
http://hg.python.org/cpython/rev/c34772013c53

New changeset eab17979a586 by Ezio Melotti in branch '2.7':
bpo-12266: Fix str.capitalize() to correctly uppercase/lowercase titlecased and cased non-letter characters.
http://hg.python.org/cpython/rev/eab17979a586

python-dev · 2011-08-15T06:26:43Z

New changeset 1ea72da11724 by Ezio Melotti in branch 'default':
bpo-12266: merge with 3.2.
http://hg.python.org/cpython/rev/1ea72da11724

ezio-melotti · 2011-08-15T06:31:57Z

Fixed, thanks for the report!

python-dev · 2011-08-15T07:04:59Z

New changeset d3816fa1bcdf by Ezio Melotti in branch '2.7':
bpo-12266: move the tests in test_unicode.
http://hg.python.org/cpython/rev/d3816fa1bcdf

py-user mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jun 5, 2011

bitdancer closed this as completed Jun 5, 2011

py-user mannequin added the type-bug An unexpected behavior, bug, or error label Jun 6, 2011

py-user mannequin changed the title ~~str.capitalize contradicts~~ str.capitalize contradicts oneself Jun 6, 2011

ezio-melotti reopened this Jul 21, 2011

ezio-melotti self-assigned this Jul 21, 2011

ezio-melotti closed this as completed Aug 15, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022
