classification
Title: str.capitalize contradicts oneself
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder: str.upper converts to title
View: 12204
Assigned To: ezio.melotti Nosy List: belopolsky, eric.araujo, ezio.melotti, lemburg, py.user, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2011-06-05 05:54 by py.user, last changed 2011-08-15 07:04 by python-dev. This issue is now closed.

Files
File name Uploaded Description
issue12266.diff ezio.melotti, 2011-08-14 17:58 Patch against 3.2 + tests.
Messages (15)
msg137682 - (view) Author: py.user (py.user) * Date: 2011-06-05 05:54
specification

str.capitalize()¶

    Return a copy of the string with its first character capitalized and the rest lowercased.


>>> '\u1ffc', '\u1ff3'
('ῼ', 'ῳ')
>>> '\u1ffc'.isupper()
False
>>> '\u1ff3'.islower()
True
>>> s = '\u1ff3\u1ff3\u1ffc\u1ffc'
>>> s
'ῳῳῼῼ'
>>> s.capitalize()
'ῼῳῼῼ'
>>>

A: lower
B: title

A -> B & !B -> A
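The documented contract ("the rest lowercased") can be checked with a small helper. This is only a sketch: `rest_is_lowercased` is a hypothetical name, not part of the report, and it ignores context-sensitive mappings (for example Greek final sigma), for which lowercasing the tail on its own may give a different result than lowercasing it in place.

```python
def rest_is_lowercased(s):
    """Check that everything after the first character of s comes out lowercased.

    Hypothetical helper: compares the tail of s.capitalize() with the
    lowercased tail of s. Context-sensitive case mappings are ignored.
    """
    expected_tail = s[1:].lower()
    return s.capitalize().endswith(expected_tail)


if __name__ == "__main__":
    # The string from the report: two lowercase and two titlecase characters.
    s = '\u1ff3\u1ff3\u1ffc\u1ffc'
    print(rest_is_lowercased('fOoBaR'))
    print(rest_is_lowercased(s))
```

On an interpreter with the behaviour reported above, the second check would fail, because the trailing titlecase characters are left unchanged.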
msg137694 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-05 13:37
This looks like a duplicate of #12204.
msg137720 - (view) Author: py.user (py.user) * Date: 2011-06-06 00:18
in http://bugs.python.org/issue12204
Marc-Andre Lemburg wrote:
I suggest to close this ticket as invalid or to add a note
to the documentation explaining how the mapping is applied
(and when not).

This is a different problem:
str.capitalize uppercases the first character, but it doesn't lowercase the rest.
Clarifying the documentation is not enough.

Lowercasing works:
>>> '\u1ffc'
'ῼ'
>>> '\u1ffc'.lower()
'ῳ'
>>>
msg140780 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-07-21 04:59
Indeed, this seems to be a different issue, and it might be worth fixing.
Given this definition:
  str.capitalize()¶
      Return a copy of the string with its first character capitalized and the rest lowercased.
we might implement capitalize like:
>>> def mycapitalize(s):
...     return s[0].upper() + s[1:].lower()
... 
>>> 'fOoBaR'.capitalize()
'Foobar'
>>> mycapitalize('fOoBaR')
'Foobar'

And this would yield the correct result:
>>> s = u'\u1ff3\u1ff3\u1ffc\u1ffc'
>>> print s
ῳῳῼῼ
>>> print s.capitalize()
ῼῳῼῼ
>>> print mycapitalize(s)
ῼῳῳῳ
>>> s.capitalize().istitle()
False
>>> mycapitalize(s).istitle()
True

This doesn't happen because the actual implementation of str.capitalize checks if a char is uppercase (and not if it's titlecase too) before converting it to lowercase.  This can be fixed by doing:
diff -r cb44fef5ea1d Objects/unicodeobject.c
--- a/Objects/unicodeobject.c   Thu Jul 21 01:11:30 2011 +0200
+++ b/Objects/unicodeobject.c   Thu Jul 21 07:57:21 2011 +0300
@@ -6739,7 +6739,7 @@
     }
     s++;
     while (--len > 0) {
-        if (Py_UNICODE_ISUPPER(*s)) {
+        if (Py_UNICODE_ISUPPER(*s) || Py_UNICODE_ISTITLE(*s)) {
             *s = Py_UNICODE_TOLOWER(*s);
             status = 1;
         }
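The effect of the one-line change in the diff above can be modelled in Python. This is a rough sketch, not the C code: the function names are made up for illustration, only the loop over the characters after the first one is modelled, and `Py_UNICODE_ISTITLE` is approximated by the general category `Lt`, which is an assumption.

```python
import unicodedata


def is_titlecase(ch):
    # Assumption: approximate Py_UNICODE_ISTITLE with category "Lt".
    return unicodedata.category(ch) == 'Lt'


def lower_rest_old(rest):
    """Model of the loop before the patch: only uppercase chars are lowered."""
    return ''.join(c.lower() if c.isupper() else c for c in rest)


def lower_rest_patched(rest):
    """Model of the loop after the patch: uppercase or titlecase chars are lowered."""
    return ''.join(
        c.lower() if c.isupper() or is_titlecase(c) else c
        for c in rest
    )


if __name__ == "__main__":
    rest = '\u1ff3\u1ffc\u1ffc'  # everything after the first char of the example
    # The old check skips U+1FFC because it is titlecase, not uppercase.
    print(lower_rest_old(rest) == rest)
    # The patched check lowers it to U+1FF3.
    print(lower_rest_patched(rest) == '\u1ff3\u1ff3\u1ff3')
```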
msg140796 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-07-21 08:34
I think it would be better to use this code:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

Since this actually implements what the doc-string says.

Note that title case is not the same as upper case. Title case is
a special case that gets applied when using a string as a title
of a text and may well include characters that are lower case
but which are only used in titles.
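To illustrate the distinction between title case and upper case, here is a small sketch using the digraph U+01C5, which has three distinct case forms. The helper name `case_forms` is hypothetical and only meant for illustration.

```python
import unicodedata


def case_forms(ch):
    """Return the general category and the three case forms of one character."""
    return {
        'category': unicodedata.category(ch),
        'upper': ch.upper(),
        'title': ch.title(),
        'lower': ch.lower(),
    }


if __name__ == "__main__":
    # U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON is a
    # titlecase letter (category "Lt"): its uppercase form is U+01C4 and
    # its lowercase form is U+01C6, so title case and upper case differ.
    for name, value in case_forms('\u01c5').items():
        print(name, ascii(value))
```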
msg140798 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-07-21 08:52
Do you mean  "if (!Py_UNICODE_ISLOWER(*s)) {"  (with the '!')?

This sounds fine to me, but with this approach all the uncased characters will go through a Py_UNICODE_TO* macro, whereas with the current code only the cased ones are converted.  I'm not sure this matters too much though.

OTOH if the non-lowercase cased chars are always either upper or titlecased, checking for both should be equivalent.
msg140799 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-07-21 09:02
Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> Do you mean  "if (!Py_UNICODE_ISLOWER(*s)) {"  (with the '!')?

Sorry, here's the correct version:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (!Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

> This sounds fine to me, but with this approach all the uncased characters will go through a Py_UNICODE_TO* macro, whereas with the current code only the cased ones are converted.  I'm not sure this matters too much though.
> 
> OTOH if the non-lowercase cased chars are always either upper or titlecased, checking for both should be equivalent.

AFAIK, there are characters that don't have a case mapping at all.
It may also be the case that a non-cased character still has a
lower/upper case mapping, e.g. for typographical reasons.

Someone would have to check this against the current Unicode database.
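For reference, the corrected loop in this message can be modelled in Python roughly as follows. This is a sketch under assumptions, not the actual implementation: `capitalize_model` is a hypothetical name, and `str.upper()`/`str.lower()` stand in for the `Py_UNICODE_TOUPPER`/`Py_UNICODE_TOLOWER` macros (on recent Python versions they may return more than one character).

```python
def capitalize_model(s):
    """Python model of the proposed loop: uppercase the first char unless it
    is already uppercase, lowercase every other char unless it is already
    lowercase. Uncased characters go through the conversion too, and come
    back unchanged when they have no case mapping.
    """
    if not s:
        return s
    first, rest = s[0], s[1:]
    if not first.isupper():
        first = first.upper()
    rest = ''.join(c if c.islower() else c.lower() for c in rest)
    return first + rest


if __name__ == "__main__":
    print(capitalize_model('fOoBaR'))
    # The titlecase characters at the end are lowered, because the check is
    # "not lowercase" rather than "uppercase".
    print(ascii(capitalize_model('\u1ff3\u1ff3\u1ffc\u1ffc')))
```

Testing for "not lowercase" avoids having to enumerate the other cases (upper, title, cased non-letters), at the cost of passing uncased characters through the conversion as well.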
msg140857 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-07-22 03:30
>>> import sys; hex(sys.maxunicode)
'0x10ffff'
>>> import unicodedata; unicodedata.unidata_version
'6.0.0'

import unicodedata
all_chars = list(map(chr, range(0x110000)))
Ll = [c for c in all_chars if unicodedata.category(c) == 'Ll']
Lu = [c for c in all_chars if unicodedata.category(c) == 'Lu']
Lt = [c for c in all_chars if unicodedata.category(c) == 'Lt']
Lo = [c for c in all_chars if unicodedata.category(c) == 'Lo']
Lm = [c for c in all_chars if unicodedata.category(c) == 'Lm']

>>> [len(x) for x in [Ll, Lu, Lt, Lo, Lm]]
[1759, 1436, 31, 97084, 210]
>>> sum(1 for c in Lu if c.lower() == c)
471  # uppercase chars with no lower
>>> sum(1 for c in Lt if c.lower() == c)
0    # titlecase chars with no lower
>>> sum(1 for c in Ll if c.upper() == c)
760  # lowercase chars with no upper
>>> sum(1 for c in Lo if c.upper() != c or c.title() != c or c.lower() != c)
0    # "Letter, other" chars with a different upper/title/lower case
>>> sum(1 for c in Lm if c.upper() != c or c.title() != c or c.lower() != c)
0    # "Letter, modifier" chars with a different upper/title/lower case
>>> sum(1 for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c))
85   # non-letter chars with a different upper/title/lower case
>>> [c for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c)]
['', 'Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ', 'Ⅷ', 'Ⅸ', 'Ⅹ', 'Ⅺ', 'Ⅻ', 'Ⅼ', 'Ⅽ', 'Ⅾ', 'Ⅿ', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'Ⓐ', 'Ⓑ', 'Ⓒ', 'Ⓓ', 'Ⓔ', 'Ⓕ', 'Ⓖ', 'Ⓗ', 'Ⓘ', 'Ⓙ', 'Ⓚ', 'Ⓛ', 'Ⓜ', 'Ⓝ', 'Ⓞ', 'Ⓟ', 'Ⓠ', 'Ⓡ', 'Ⓢ', 'Ⓣ', 'Ⓤ', 'Ⓥ', 'Ⓦ', 'Ⓧ', 'Ⓨ', 'Ⓩ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ']
>>> list(c.lower() for c in _)
['', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 'ⅺ', 'ⅻ', 'ⅼ', 'ⅽ', 'ⅾ', 'ⅿ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ', 'ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ']
>>> len(_)
85
>>> {unicodedata.category(c) for c in all_chars if c not in L and (c.upper() != c or c.title() != c or c.lower() != c)}
{'So', 'Mn', 'Nl'}

So == Symbol, Other
Mn == Mark, Nonspacing
Nl == Number, Letter
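The survey above can also be written as a self-contained script. This is a sketch along the same lines, with a hypothetical function name; the counts it produces depend on the Unicode database shipped with the interpreter, so they are not expected to match the 6.0.0 numbers quoted above.

```python
import sys
import unicodedata
from collections import Counter


def cased_non_letters():
    """Count, per general category, the non-letter code points whose
    upper, title or lower case form differs from the character itself."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        category = unicodedata.category(c)
        if category.startswith('L'):
            # Skip letters (Ll, Lu, Lt, Lo, Lm).
            continue
        if c.upper() != c or c.title() != c or c.lower() != c:
            counts[category] += 1
    return counts


if __name__ == "__main__":
    print(unicodedata.unidata_version)
    print(dict(cased_non_letters()))
```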
msg140858 - (view) Author: py.user (py.user) * Date: 2011-07-22 04:26
>>> [c for c in all_chars if c not in L and ...

L ?
msg140859 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-07-22 04:34
L = set(sum([Ll, Lu, Lt, Lo, Lm], []))
msg142071 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-14 17:58
Attached patch + tests.
msg142099 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-15 06:22
New changeset c34772013c53 by Ezio Melotti in branch '3.2':
#12266: Fix str.capitalize() to correctly uppercase/lowercase titlecased and cased non-letter characters.
http://hg.python.org/cpython/rev/c34772013c53

New changeset eab17979a586 by Ezio Melotti in branch '2.7':
#12266: Fix str.capitalize() to correctly uppercase/lowercase titlecased and cased non-letter characters.
http://hg.python.org/cpython/rev/eab17979a586
msg142100 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-15 06:26
New changeset 1ea72da11724 by Ezio Melotti in branch 'default':
#12266: merge with 3.2.
http://hg.python.org/cpython/rev/1ea72da11724
msg142101 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 06:31
Fixed, thanks for the report!
msg142103 - (view) Author: Roundup Robot (python-dev) Date: 2011-08-15 07:04
New changeset d3816fa1bcdf by Ezio Melotti in branch '2.7':
#12266: move the tests in test_unicode.
http://hg.python.org/cpython/rev/d3816fa1bcdf
History
Date User Action Args
2011-08-15 07:04:58  python-dev  set  messages: + msg142103
2011-08-15 06:31:57  ezio.melotti  set  status: open -> closed; resolution: duplicate -> fixed; messages: + msg142101
2011-08-15 06:26:43  python-dev  set  messages: + msg142100
2011-08-15 06:22:45  python-dev  set  nosy: + python-dev; messages: + msg142099
2011-08-14 17:58:37  ezio.melotti  set  files: + issue12266.diff; keywords: + patch; messages: + msg142071
2011-07-22 04:34:09  ezio.melotti  set  messages: + msg140859
2011-07-22 04:26:16  py.user  set  messages: + msg140858
2011-07-22 03:30:13  ezio.melotti  set  messages: + msg140857
2011-07-21 09:02:34  lemburg  set  messages: + msg140799
2011-07-21 08:52:55  ezio.melotti  set  messages: + msg140798
2011-07-21 08:34:22  lemburg  set  messages: + msg140796
2011-07-21 04:59:52  ezio.melotti  set  status: closed -> open; assignee: ezio.melotti; messages: + msg140780; versions: + Python 2.7, Python 3.2, Python 3.3, - Python 3.1
2011-07-21 04:20:56  ezio.melotti  set  nosy: + lemburg, belopolsky, ezio.melotti, eric.araujo
2011-06-06 00:22:44  py.user  set  title: str.capitalize contradicts -> str.capitalize contradicts oneself
2011-06-06 00:21:55  py.user  set  type: behavior
2011-06-06 00:18:57  py.user  set  messages: + msg137720
2011-06-05 13:37:41  r.david.murray  set  status: open -> closed; superseder: str.upper converts to title; nosy: + r.david.murray; messages: + msg137694; resolution: duplicate; stage: resolved
2011-06-05 05:54:59  py.user  create