classification
Title: str.capitalize should titlecase the first character not uppercase
Type: enhancement
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.8
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: ZackerySpytz, ezio.melotti, kingsley, serhiy.storchaka, steve.dower, steven.daprano
Priority: normal
Keywords: easy (C), patch

Created on 2019-04-07 10:40 by steven.daprano, last changed 2019-04-12 18:43 by steve.dower. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12804 merged kingsley, 2019-04-12 14:07
Messages (10)
msg339568 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-04-07 10:40
str.capitalize appears to uppercase the first character of the string, which is okay for ASCII but not for non-English letters.

For example, the letter NJ in Croatian appears as Nj at the start of words when the first character is capitalized:

Njemačka ('Germany'), not NJemačka.

(In ASCII, that's Njemacka not NJemacka.)

https://en.wikipedia.org/wiki/Gaj's_Latin_alphabet#Digraphs

But using any of:

U+01CA LATIN CAPITAL LETTER NJ
U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J
U+01CC LATIN SMALL LETTER NJ 

we get the wrong result with capitalize:


py> 'Ǌemačka'.capitalize()
'Ǌemačka'
py> 'ǋemačka'.capitalize()
'Ǌemačka'
py> 'ǌemačka'.capitalize()
'Ǌemačka'


I believe that the correct behaviour is to titlecase the first code point and lowercase the rest, which is what the Apache library here does:

https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#capitalize-java.lang.String-
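
For illustration, the proposed behaviour can be approximated in pure Python. This is only a sketch: the helper name `capitalize_titlecase` is made up here, and it relies on `str.title()` applied to a single character to obtain the titlecase mapping. It is not the C implementation, and it may differ from the interpreter in context-sensitive cases such as Greek final sigma, because the tail is lowercased in isolation.

```python
def capitalize_titlecase(s: str) -> str:
    """Titlecase the first code point and lowercase the rest (sketch only)."""
    if not s:
        return s
    # str.title() on a one-character string applies the titlecase mapping,
    # e.g. U+01CC (nj) -> U+01CB (Nj), whereas str.upper() gives U+01CA (NJ).
    return s[0].title() + s[1:].lower()


# Try all three forms of the digraph as the first code point.
for first in ("\u01ca", "\u01cb", "\u01cc"):
    word = first + "emačka"
    print(ascii(word), "->", ascii(capitalize_titlecase(word)))
```

All three inputs should come out starting with U+01CB, the titlecase form.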
msg339570 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-04-07 10:57
I think this is a reasonable change.

Also the docs for str.title() should be fixed.
msg339804 - Author: Kingsley McDonald (kingsley) * Date: 2019-04-09 20:34
Hello there,

I'm an absolute beginner here and this whole thing is a little overwhelming, so please bear with me. I think this would be a suitable first task for me to take on because it appears to be a simple one-line change (correct me if I'm mistaken, though).
msg339878 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-04-10 18:15
This issue is easy if you know C.

* Find the implementation of str.capitalize in unicodeobject.c and make it use title case. See the implementation of str.title for an example.

* Find the tests for str.capitalize and add new cases. Finding the proper place for the tests may be the hardest part.

* Update the documentation for str.capitalize. Add the versionchanged directive.

* Fix the documentation for str.title. Use str.capitalize in the example.

* Add the news and What's New entries.
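
The second bullet could look something like the following. This is a hypothetical, self-contained test case, not the tests from the actual PR: the class and method names are invented, and the expected values assume an interpreter that already has the new behaviour (Python 3.8 or later).

```python
import unittest


class CapitalizeTitlecaseTest(unittest.TestCase):
    """Hypothetical test cases for the titlecasing behaviour of str.capitalize."""

    def test_digraph_is_titlecased(self):
        # U+01CA (NJ), U+01CB (Nj) and U+01CC (nj) should all become U+01CB.
        for first in ("\u01ca", "\u01cb", "\u01cc"):
            with self.subTest(first=ascii(first)):
                self.assertEqual((first + "emačka").capitalize(),
                                 "\u01cbemačka")

    def test_rest_is_lowercased(self):
        self.assertEqual("\u01ccEMAČKA".capitalize(), "\u01cbemačka")

    def test_ascii_unchanged(self):
        # ASCII input behaves the same before and after the change.
        self.assertEqual("hello WORLD".capitalize(), "Hello world")


if __name__ == "__main__":
    unittest.main()
```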
msg339890 - Author: Kingsley McDonald (kingsley) * Date: 2019-04-10 20:49
Thanks for clarifying all of that! I now have the patch and tests working locally. However, I'm not too sure what documentation needs to be changed for str.title. Should it specify that only the first letter of a digraph is capitalised, rather than the full character?
I sure hope I get the hang of this soon :-D
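
One possible reading of the str.title item in the list above: a helper that titlecases words with a regular expression can call str.capitalize on each match rather than combining upper() and lower() by hand. The sketch below is an assumption about what was meant; the function name `titlecase` and the pattern are illustrative only.

```python
import re


def titlecase(s: str) -> str:
    """Capitalize each word, treating an apostrophe as part of the word."""
    return re.sub(
        r"[A-Za-z]+('[A-Za-z]+)?",
        # str.capitalize handles the first character and lowercases the rest.
        lambda mo: mo.group(0).capitalize(),
        s,
    )


print(titlecase("they're bill's friends."))  # They're Bill's Friends.
```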
msg340066 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-04-12 15:35
New changeset b015fc86f7b1f35283804bfee788cce0a5495df7 by Steve Dower (Kingsley M) in branch 'master':
bpo-36549: str.capitalize now titlecases the first character instead of uppercasing it (GH-12804)
https://github.com/python/cpython/commit/b015fc86f7b1f35283804bfee788cce0a5495df7
msg340067 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-04-12 15:36
Thanks! I'm a big fan of this change :)
msg340076 - Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2019-04-12 16:14
I think that the PR may have been merged too quickly. Serhiy had made a list, and I think that the PR was missing some necessary changes.
msg340095 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-04-12 18:42
What is missing? It looks like everything on Serhiy's list was done.
msg340096 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-04-12 18:43
Oh, apart from the What's New section. But this looks enough like a bugfix (previous behaviour "wasn't capitalizing my name correctly" - new behaviour "now capitalizes my name correctly") that it's hardly critical to advertise it on that page.
History
Date User Action Args
2019-04-12 18:43:26 steve.dower set messages: + msg340096
2019-04-12 18:42:11 steve.dower set messages: + msg340095
2019-04-12 16:14:06 ZackerySpytz set nosy: + ZackerySpytz; messages: + msg340076
2019-04-12 15:36:11 steve.dower set status: open -> closed; resolution: fixed; messages: + msg340067; stage: patch review -> resolved
2019-04-12 15:35:48 steve.dower set nosy: + steve.dower; messages: + msg340066
2019-04-12 14:07:20 kingsley set keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request12731
2019-04-10 20:49:41 kingsley set messages: + msg339890
2019-04-10 18:15:03 serhiy.storchaka set messages: + msg339878
2019-04-10 12:50:50 vstinner set nosy: - vstinner
2019-04-09 20:34:54 kingsley set nosy: + kingsley; messages: + msg339804
2019-04-07 10:57:18 serhiy.storchaka set type: enhancement; components: + Interpreter Core, Unicode; versions: + Python 3.8; keywords: + easy (C); nosy: + serhiy.storchaka, ezio.melotti, vstinner; messages: + msg339570; stage: needs patch
2019-04-07 10:40:51 steven.daprano create