Issue 12737: str.title() is overzealous by upcasing combining marks inappropriately

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56946

classification

Title:	str.title() is overzealous by upcasing combining marks inappropriately
Type:	behavior	Stage:	needs patch
Components:	Library (Lib)	Versions:	Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Arfrever, ezio.melotti, flox, gvanrossum, iritkatriel, loewis, tchrist, terry.reedy, vishvas.vasuki
Priority:	normal	Keywords:

Created on 2011-08-11 22:37 by tchrist, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
titletest.python	tchrist, 2011-08-11 22:37	demo showing python uses incorrect sense of \w for words in titlecasing
titletest.py	ezio.melotti, 2011-09-30 03:36	slightly improved demo

Messages (20)
msg141929 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-11 22:37
Python's string.title() function claims it titlecases the first letter in each word and lowercases the rest. However, this is not true. It is not using either of the two word detection algorithms that Unicode provides. One allows you to use a legacy \w+, where \w means any Alphabetic, Mark, Decimal Number, or Connector Punctuation (see UTS#18 Annex C: Compatibility Properties), and the other uses the more sophisticated word-break provided by the Word_Break properties such as Word_Break=MidNumLet.

Python is using neither of these, so gets the wrong answer.

titlecase of déme un café should be Déme Un Café not DéMe Un Café
titlecase of i̇stanbul should be İstanbul not İStanbul
titlecase of ᾲ στο διάολο should be Ὰͅ Στο Διάολο not ᾺΙ Στο ΔιάΟλο

Because those are in NFD form, you get different answers than if they are in NFC. That is not right. You should get the same answer. The bug is that you aren't using the right definition for \w, and so get the wrong result. This is likely related to issue 12731.

In the enclosed tester file, which fails 4 out of its 6 tests, there is also a bug shown with this failed result:

titlecase of 𐐼𐐯𐑅𐐨𐑉𐐯𐐻 should be 𐐔𐐯𐑅𐐨𐑉𐐯𐐻 not 𐐼𐐯𐑅𐐨𐑉𐐯𐐻

That one is related to issue 12730. See the attached tester, which was run under Python 3.2. As far as I can tell, these bugs exist in all Python versions.
msg141998 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-08-12 23:30
I changed the title because 'string' is a module that once contained the functions that are now attached to the str class as methods. So 'string.title' is an obsolete attribute reference.
msg142106 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-08-15 08:29
See also #12746.
msg142110 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-15 10:20
So the issue here is that when combining chars are used, str.title() fails to titlecase the string properly.

The algorithm implemented by str.title() [0] is quite simple: it loops through the code units, and uppercases all the chars that follow a char that is not lower/upper/titlecased. This means that if Déme doesn't use combining accents, the char before the 'm' is 'é', 'é' is a lowercase char, so 'm' is not capitalized. If the 'é' is represented as 'e' + '´', the char before the 'm' is '´', '´' is not a lower/upper/titlecase char, so the 'm' is capitalized.

I guess we could normalize the string before doing the title casing, and then normalize it back. Also the str methods don't claim to follow Unicode afaik, so unless we decide that they should, we could implement whatever algorithm we want.

[0]: Objects/unicodeobject.c:6752
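The loop described in this message, and the normalize-first workaround it floats, can be sketched roughly as follows. This is an illustration only, not the code in Objects/unicodeobject.c: `is_cased`, `naive_title`, and `title_via_nfc` are hypothetical names, "cased" is approximated by the general categories Lu, Ll, and Lt, and normalizing back to NFD is an assumption (the message does not say which form to return).

```python
import unicodedata


def is_cased(ch):
    # Assumption: "cased" is approximated by general categories Lu, Ll, Lt.
    # CPython's internal check is broader than this.
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def naive_title(s):
    """Hypothetical re-creation of the loop described above."""
    out = []
    previous_is_cased = False
    for ch in s:
        # Titlecase a character that follows an uncased one, else lowercase it.
        out.append(ch.lower() if previous_is_cased else ch.title())
        previous_is_cased = is_cased(ch)
    return "".join(out)


def title_via_nfc(s, back_to="NFD"):
    """Normalize to NFC, titlecase, then normalize back (back_to is assumed)."""
    titled = unicodedata.normalize("NFC", s).title()
    return unicodedata.normalize(back_to, titled)


nfd = "de\u0301me un cafe\u0301"           # 'e' followed by U+0301
nfc = unicodedata.normalize("NFC", nfd)    # precomposed U+00E9

print(ascii(naive_title(nfc)))     # the 'm' stays lowercase
print(ascii(naive_title(nfd)))     # the 'm' after U+0301 is capitalized
print(ascii(title_via_nfc(nfd)))   # workaround: compose first, decompose after
```

The workaround only helps where a precomposed form exists; a base letter plus a mark with no precomposed equivalent would still split the word.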
msg143038 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2011-08-26 21:16
Yeah, this should be fixed in 3.3 and probably backported to 3.2 and 2.7. (There is already no guarantee that len(s) == len(s.title()), right?)
msg143046 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-26 22:00
Guido van Rossum <report@bugs.python.org> wrote on Fri, 26 Aug 2011 21:16:57 -0000:

> Yeah, this should be fixed in 3.3 and probably backported to 3.2
> and 2.7. (There is already no guarantee that len(s) ==
> len(s.title()), right?)

Well, I don't know of any such guarantee, but I don't know Python very well.

In general, Unicode makes very few guarantees about casing. Under full casemapping, which is the only way to do the silly Turkish stuff amongst quite a bit else, any of the three casemappings can change the length of the string.

Other things you can't rely on are round tripping and "single paths". By roundtripping, just look at the two lowercase sigmas and think about how you can't get back to one of them if you uppercase them both. By single paths, I mean that code that does some sort of conversion where it first lowercases everything and then titlecases the first letter can produce something different from titlecasing just the original first letter and then lowercasing the rest of them. That's because tc(x) and tc(lc(x)) can be different.

--tom
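The three non-guarantees mentioned in this message can be checked from Python. A small sketch, assuming Python 3.3 or later (where the str methods apply full case mappings); the particular characters are illustrative choices, not examples taken from the message.

```python
# 1. Length can change under full case mapping.
sharp_s = "\u00df"                       # LATIN SMALL LETTER SHARP S
print(len(sharp_s), len(sharp_s.upper()))

# 2. Round-tripping fails: both lowercase sigmas share one uppercase form,
#    so lowercasing again cannot recover the final sigma.
sigma, final_sigma = "\u03c3", "\u03c2"
print(sigma.upper() == final_sigma.upper())
print(final_sigma.upper().lower() == final_sigma)

# 3. "Single paths" fail: tc(x) can differ from tc(lc(x)).
dotted_capital_i = "\u0130"              # LATIN CAPITAL LETTER I WITH DOT ABOVE
tc_x = dotted_capital_i.title()
tc_lc_x = dotted_capital_i.lower().title()
print(ascii(tc_x), ascii(tc_lc_x))
```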
msg144207 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-17 16:31
I think string methods (and other parts of the stdlib) assume NFC and leave normalization to NFC up to the user. Before fixing str.title() we should make a more general decision about handling strings that use other normalization forms.
msg144233 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-09-18 08:45
Tom: it's intentional that .title() doesn't use traditional word break algorithms. In 2.x, "foo3bar".title() is "Foo3Bar", i.e. the 3 counts as a word end. So neither UTS#18 \w nor UAX#29 apply.

So in UTS#18 terminology, .title() matches more closely \alpha+, despite UTS#18 saying that this shouldn't be used for word-breaking.

It's not clear to me how UTS#18 defines \alpha. On the one hand, they say that marks should be included, OTOH they refer to the Alphabetic derived category which doesn't include marks, except for the few that have been included in Other_Alphabetic.
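A quick check of the "foo3bar" point, as a sketch: str.title() restarts a word after the digit, whereas a \w+ match (here Python's own re flavour of \w, which is not identical to the UTS#18 definition) treats the whole string as one word.

```python
import re

text = "foo3bar"

# str.title() treats the digit as the end of a word.
print(text.title())

# A \w+ scan sees a single word, because digits are word characters.
print(re.findall(r"\w+", text))
```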
msg144661 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-30 03:36
After PEP 393 the result is still the same (I attached a slightly improved version of the script):

titlecase of 'deme un cafe' should be 'Deme Un Cafe' not 'DeMe Un Cafe'
titlecase of 'istanbul' should be 'Istanbul' not 'IStanbul'
titlecase of 'α στο διαολο' should be 'Α Στο Διαολο' not 'ΑΙ Στο ΔιαΟλο'
failed 3 out of 6 tests

Martin, do you think that str.title() should follow the Unicode standard? Should string methods work with all the normalizations or just with NFC?
msg144683 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-09-30 10:36
> Martin, do you think that str.title() should follow the Unicode standard?

I don't think that "follow the Unicode standard" has any meaning in this context: the Unicode standard doesn't specify (AFAIK) what a .title() method in a programming language should do.

> Should string methods work with all the normalizations or just with NFC?

When we know what .title() should do, it should do so correctly for all strings.

Let me try to propose a definition for .title():

"Split S into words. Change the first letter in a word to upper-case, and all subsequent letters to lower case. A word is a sequence that starts with a letter, followed by letter-related characters."

Letters are all characters from the "Alphabetic" category, i.e. Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.

"letter-related" characters are letters + marks (Mn, Mc, Me).
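The proposed definition can be sketched in a few lines. This is a rough illustration under stated assumptions, not a patch: `proposed_title`, `is_letter`, and `is_letter_related` are hypothetical names; the unicodedata module does not expose the Alphabetic or Other_Alphabetic properties, so "letter" is approximated by the L* categories plus Nl; and the first character of each word gets its titlecase mapping rather than its uppercase one, which is also an assumption.

```python
import unicodedata


def is_letter(ch):
    # Approximation of "Alphabetic": L* categories (str.isalpha) plus Nl.
    # Other_Alphabetic is NOT covered here.
    return ch.isalpha() or unicodedata.category(ch) == "Nl"


def is_letter_related(ch):
    # Letters plus marks (Mn, Mc, Me), per the proposed definition.
    return is_letter(ch) or unicodedata.category(ch) in ("Mn", "Mc", "Me")


def proposed_title(s):
    """Hypothetical sketch: a word starts with a letter and continues
    through letter-related characters."""
    out = []
    in_word = False
    for ch in s:
        if in_word and is_letter_related(ch):
            out.append(ch.lower())      # inside a word: lowercase
        elif is_letter(ch):
            out.append(ch.title())      # first letter of a new word
            in_word = True
        else:
            out.append(ch)              # anything else ends the word
            in_word = False
    return "".join(out)


print(ascii(proposed_title("de\u0301me un cafe\u0301")))
print(ascii(proposed_title("i\u0307stanbul")))
print(ascii(proposed_title("foo3bar")))
```

Because marks continue a word but never start one, a combining accent no longer causes the following letter to be capitalized, while a digit still ends the word as it does today.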
msg144688 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-09-30 12:37
> Martin v. Löwis <martin@v.loewis.de> added the comment:

> "Split S into words. Change the first letter in a word to upper-case,

Except that I think you actually mean that the first "letter" is changed into titlecase not uppercase.

One might also say try to change for all these, in that not all cased code points in Unicode have casemaps that are different from themselves. For example, a superscript lowercase a or b has no distinct uppercase mapping, the way the non-superscript versions do:

    % (echo xyz; echo ab AB | unisupers) | uc
    XYZ
    ᵃᵇ ᴬᴮ

> and all subsequent letters to lower case. A word is a sequence that
> starts with a letter, followed by letter-related characters."

I don't like the way you have defined letters and letter-related characters. The first already has a definition, which is not the one you are using. Word characters also has a definition in Unicode, and it is not the one you are using.

I strongly advise against redefining standard Unicode properties. Choose other, unused terms if you must. It is very confusing otherwise.

> Letters are all characters from the "Alphabetic" category, i.e.
> Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.

Except that is exactly the definition of the Unicode Alphabetic property, not the Unicode Letter property. It is a mistake to equate Letter=Alphabetic, and very confusing too.

I agree that this is probably what you want, though. I just don't think you should use "letter-related characters" when there is an existing formal definition that works, or that you should redefine Letter.

> "letter-related" characters are letters + marks (Mn, Mc, Me).

That isn't quite right.

* Letters are Lu+Ll+Lt+Lm+Lo.
* Alphabetic is Letters + Other_Alphabetic.
* Other_Alphabetic is certain marks (like the iota subscript) and the letter numbers (Nl), as well as a few symbols.
* Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

I think what you are looking for here are Word characters without Nd + Pc, so just Alphabetic + Mn+Mc+Me.

Is that right?
--tom

PS: You can do union/intersection stuff with properties to see what the resulting sets look like using the unichars command-line tool.

This is everything that is both alphabetic and also a mark:

    % unichars -gs '\p{Alphabetic}' '\pM'
    ○ͅ U+0345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
    ○ְ U+05B0 GC=Mn SC=Hebrew HEBREW POINT SHEVA
    ○ֱ U+05B1 GC=Mn SC=Hebrew HEBREW POINT HATAF SEGOL
    ○ֲ U+05B2 GC=Mn SC=Hebrew HEBREW POINT HATAF PATAH
    ○ֳ U+05B3 GC=Mn SC=Hebrew HEBREW POINT HATAF QAMATS
    ...
    ○ं U+0902 GC=Mn SC=Devanagari DEVANAGARI SIGN ANUSVARA
    ः U+0903 GC=Mc SC=Devanagari DEVANAGARI SIGN VISARGA
    ा U+093E GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN AA
    ि U+093F GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN I
    ी U+0940 GC=Mc SC=Devanagari DEVANAGARI VOWEL SIGN II
    ○ु U+0941 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN U
    ○ू U+0942 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN UU
    ○ृ U+0943 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC R
    ○ॄ U+0944 GC=Mn SC=Devanagari DEVANAGARI VOWEL SIGN VOCALIC RR
    ...

While these are the NON-alphabetic marks, which are still Word characters though of course:

    % unichars -gs '\P{Alphabetic}' '\pM'
    ○̀ U+0300 GC=Mn SC=Inherited COMBINING GRAVE ACCENT
    ○́ U+0301 GC=Mn SC=Inherited COMBINING ACUTE ACCENT
    ○̂ U+0302 GC=Mn SC=Inherited COMBINING CIRCUMFLEX ACCENT
    ○̃ U+0303 GC=Mn SC=Inherited COMBINING TILDE
    ○̄ U+0304 GC=Mn SC=Inherited COMBINING MACRON
    ○̅ U+0305 GC=Mn SC=Inherited COMBINING OVERLINE
    ○̆ U+0306 GC=Mn SC=Inherited COMBINING BREVE
    ○̇ U+0307 GC=Mn SC=Inherited COMBINING DOT ABOVE
    ○̈ U+0308 GC=Mn SC=Inherited COMBINING DIAERESIS
    ○̉ U+0309 GC=Mn SC=Inherited COMBINING HOOK ABOVE
    ○̊ U+030A GC=Mn SC=Inherited COMBINING RING ABOVE
    ○̋ U+030B GC=Mn SC=Inherited COMBINING DOUBLE ACUTE ACCENT
    ○̌ U+030C GC=Mn SC=Inherited COMBINING CARON
    ...

And here are the Cased code points that do not change when upper-, title-, or lowercased:

    % unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
    ª U+00AA GC=Ll SC=Latin FEMININE ORDINAL INDICATOR
    º U+00BA GC=Ll SC=Latin MASCULINE ORDINAL INDICATOR
    ĸ U+0138 GC=Ll SC=Latin LATIN SMALL LETTER KRA
    ƍ U+018D GC=Ll SC=Latin LATIN SMALL LETTER TURNED DELTA
    ƛ U+019B GC=Ll SC=Latin LATIN SMALL LETTER LAMBDA WITH STROKE
    ƪ U+01AA GC=Ll SC=Latin LATIN LETTER REVERSED ESH LOOP
    ƫ U+01AB GC=Ll SC=Latin LATIN SMALL LETTER T WITH PALATAL HOOK
    ƺ U+01BA GC=Ll SC=Latin LATIN SMALL LETTER EZH WITH TAIL
    ƾ U+01BE GC=Ll SC=Latin LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
    ȡ U+0221 GC=Ll SC=Latin LATIN SMALL LETTER D WITH CURL
    ȴ U+0234 GC=Ll SC=Latin LATIN SMALL LETTER L WITH CURL
    ȵ U+0235 GC=Ll SC=Latin LATIN SMALL LETTER N WITH CURL
    ȶ U+0236 GC=Ll SC=Latin LATIN SMALL LETTER T WITH CURL
    ȷ U+0237 GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J
    ȸ U+0238 GC=Ll SC=Latin LATIN SMALL LETTER DB DIGRAPH
    ȹ U+0239 GC=Ll SC=Latin LATIN SMALL LETTER QP DIGRAPH
    ɕ U+0255 GC=Ll SC=Latin LATIN SMALL LETTER C WITH CURL
    ɘ U+0258 GC=Ll SC=Latin LATIN SMALL LETTER REVERSED E
    ɚ U+025A GC=Ll SC=Latin LATIN SMALL LETTER SCHWA WITH HOOK
    ɜ U+025C GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E
    ɝ U+025D GC=Ll SC=Latin LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
    ɞ U+025E GC=Ll SC=Latin LATIN SMALL LETTER CLOSED REVERSED OPEN E
    ɟ U+025F GC=Ll SC=Latin LATIN SMALL LETTER DOTLESS J WITH STROKE
    ɡ U+0261 GC=Ll SC=Latin LATIN SMALL LETTER SCRIPT G
    ɢ U+0262 GC=Ll SC=Latin LATIN LETTER SMALL CAPITAL G
    ɤ U+0264 GC=Ll SC=Latin LATIN SMALL LETTER RAMS HORN
    ɥ U+0265 GC=Ll SC=Latin LATIN SMALL LETTER TURNED H
    ɦ U+0266 GC=Ll SC=Latin LATIN SMALL LETTER H WITH HOOK
    ...

You can get unichars from http://training.perl.com/scripts/unichars where you might also care to get uniprops and perhaps uninames to go with it. There are other Unicode tools there (the directory is 100% Unicode tools, not general scripts as its name suggests), but those are the important ones, I reckon.
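Some of the facts in these listings can be spot-checked from Python with the standard unicodedata module. A small sketch: it only reports general categories and names, because unicodedata does not expose derived properties such as Alphabetic, Cased, or CWU, so it is not a replacement for unichars. The sample code points are picked from the listings and the superscript example.

```python
import unicodedata

# General category and name for a few of the marks listed above.
for cp in (0x0345, 0x0301, 0x0903):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))

# Cased characters whose case mappings all map to themselves.
kra = "\u0138"            # LATIN SMALL LETTER KRA
superscript_a = "\u1d43"  # MODIFIER LETTER SMALL A
for ch in (kra, superscript_a):
    unchanged = ch.upper() == ch.title() == ch.lower() == ch
    print(ascii(ch), unicodedata.category(ch), unchanged)
```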
msg144690 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2011-09-30 15:02
I like how we're actually converging on an implementable and maximally-useful algorithm.
msg144722 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-10-01 10:59
> * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

Where did you get that definition from? UTS#18 defines "<word_character>", which is Alphabetic + U+200C + U+200D (i.e. not including marks, but including those

> I think you are looking for here are Word characters without
> Nd + Pc, so just Alphabetic + Mn+Mc+Me.
>
> Is that right?

With your definition of "Word character" above, yes, that's right. Marks won't start a word, though.

As for terminology: I think the documentation should continue to speak about "words" and "letters", and then define what is meant in this context. It's not that the Unicode consortium invented the term "letter", so we should use it more liberally than just referring to the L* categories.
msg144723 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-10-01 11:07
Martin v. Löwis <report@bugs.python.org> wrote on Sat, 01 Oct 2011 10:59:48 -0000:

>> * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

> Where did you get that definition from? UTS#18 defines
> "<word_character>", which is Alphabetic + U+200C + U+200D
> (i.e. not including marks, but including those

From UTS#18 RL1.2A in Annex C, where a \p{word} or \w character is defined to be

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}

>> I think you are looking for here are Word characters without
>> Nd + Pc, so just Alphabetic + Mn+Mc+Me.
>>
>> Is that right?
>
> With your definition of "Word character" above, yes, that's right.

It's not mine. It's tr18's.

> Marks won't start a word, though.

That's the smarter boundary thing they talk about. I'm not myself familiar with \pM

> As for terminology: I think the documentation should continue to
> speak about "words" and "letters", and then define what is meant
> in this context. It's not that the Unicode consortium invented
> the term "letter", so we should use it more liberally than just
> referring to the L* categories.

I really don't think it wise to have private definitions of these.

If Letter doesn't mean L?, things get too weird. That's why there are separate definitions of alphabetic, word, etc.

--tom
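The four-part \w definition quoted in this message can be approximated with general categories. A sketch only: `is_uts18_word_char` is a hypothetical helper, \p{alpha} is stood in for by the L* categories plus Nl (Other_Alphabetic is not available from unicodedata), and \p{digit} is taken to mean Nd. The last line contrasts it with the \w of Python's re module, which is a different definition.

```python
import re
import unicodedata


def is_uts18_word_char(ch):
    """Hypothetical approximation of the \\w set quoted above."""
    cat = unicodedata.category(ch)
    return (
        ch.isalpha() or cat == "Nl"      # rough stand-in for \p{alpha}
        or cat in ("Mn", "Mc", "Me")     # \p{gc=Mark}
        or cat == "Nd"                   # \p{digit}
        or cat == "Pc"                   # \p{gc=Connector_Punctuation}
    )


acute = "\u0301"  # COMBINING ACUTE ACCENT
print(is_uts18_word_char(acute))               # marks count as word characters
print(re.fullmatch(r"\w", acute) is not None)  # what Python's re \w says
```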
msg144735 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-10-01 14:42
>> As for terminology: I think the documentation should continue to
>> speak about "words" and "letters", and then define what is meant
>> in this context. It's not that the Unicode consortium invented
>> the term "letter", so we should use it more liberally than just
>> referring to the L* categories.
>
> I really don't think it wise to have private definitions of these.
>
> If Letter doesn't mean L?, things get too weird. That's why
> there are separate definitions of alphabetic, word, etc.

But I won't be using the word "Letter", but "letter" (lower case). Nobody will assume that this refers to the Unicode standard; people would rather expect that this is [A-Za-z] (i.e. not expect non-ASCII characters to be considered at all). So elaboration is necessary, anyway. I take the risk of confusing the 10 people that ever read UTS#18 :-)
msg379614 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2020-10-25 22:43
Of the examples given two seem ok now, but the Istanbul one is still wrong:

>>> "déme un café".title()
'Déme Un Café'
>>> "ᾲ στο διάολο".title()
'Ὰͅ Στο Διάολο'
>>>
>>> "i̇stanbul".title()
'İStanbul'
msg379629 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-10-26 02:08
Are you sure? Running Ezio's titletest.py, I get this output (note that the UCD major version is in the double digits so the test for that misfires :-).

titletest.py: Please set your PYTHONIOENCODING envariable to utf8
WARNING: Your old UCD is out of date, expected 6.0.0 but got 13.0.0
titlecase of 'déme un café' should be 'Déme Un Café' not 'DéMe Un Café'
titlecase of 'i̇stanbul' should be 'İstanbul' not 'İStanbul'
titlecase of 'ᾲ στο διάολο' should be 'Ὰͅ Στο Διάολο' not 'ᾺΙ Στο ΔιάΟλο'
failed 3 out of 6 tests

Note that the test program specifically uses combining marks, which are alternate ways to spell some characters. It seems what's failing is the second deme un cafe, the first istanbul, and the (only) greek phrase.
msg379656 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2020-10-26 09:04
You're right, I see that too when I don't tamper with the test.
msg398165 - (view)	Author: Vishvas Vasuki (vishvas.vasuki)	Date: 2021-07-24 16:03
This case still fails with 3.9 - 'Tr̥tīyā'.title()
msg398173 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-07-24 18:29
Which comes out 'Tr̥Tīyā'. The mark under the 'r', '̥', is U+0325 (COMBINING RING BELOW).
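This case also shows why normalizing to NFC first would not be a general fix: there is no precomposed character for 'r' plus U+0325, so the mark survives composition and still sits between the two letters. A small sketch; the escape-based spelling of the word is an assumption about how the reported string is encoded.

```python
import unicodedata

ring_below = "\u0325"
print(unicodedata.name(ring_below), unicodedata.category(ring_below))

# Assumed spelling: 'r' + U+0325, with precomposed U+012B and U+0101.
word = "Tr\u0325t\u012by\u0101"
composed = unicodedata.normalize("NFC", word)

# NFC leaves the string unchanged, so the combining mark is still present.
print(composed == word, ring_below in composed)
```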

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56946
2021-07-24 18:29:50	terry.reedy	set	messages: + msg398173
2021-07-24 16:03:52	vishvas.vasuki	set	nosy: + vishvas.vasuki messages: + msg398165
2020-10-27 03:00:34	vstinner	set	nosy: - vstinner
2020-10-26 09:04:48	iritkatriel	set	messages: + msg379656 components: + Library (Lib)
2020-10-26 02:08:51	gvanrossum	set	messages: + msg379629
2020-10-25 22:43:06	iritkatriel	set	nosy: + iritkatriel messages: + msg379614 versions: + Python 3.9, Python 3.10, - Python 3.2, Python 3.3
2011-10-18 13:23:43	flox	set	nosy: + flox
2011-10-01 14:42:35	loewis	set	messages: + msg144735 title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately
2011-10-01 11:07:49	tchrist	set	messages: + msg144723 title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately
2011-10-01 10:59:47	loewis	set	messages: + msg144722 title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately
2011-09-30 15:02:08	gvanrossum	set	messages: + msg144690
2011-09-30 12:37:57	tchrist	set	messages: + msg144688 title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately
2011-09-30 10:36:38	loewis	set	messages: + msg144683 title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately
2011-09-30 03:36:43	ezio.melotti	set	files: + titletest.py messages: + msg144661
2011-09-18 08:45:52	loewis	set	messages: + msg144233
2011-09-17 16:31:45	ezio.melotti	set	messages: + msg144207
2011-08-26 22:00:18	tchrist	set	messages: + msg143046 title: str.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately
2011-08-26 21:16:56	gvanrossum	set	nosy: + gvanrossum messages: + msg143038
2011-08-15 10:20:25	ezio.melotti	set	messages: + msg142110
2011-08-15 08:29:28	vstinner	set	messages: + msg142106
2011-08-13 11:53:44	pitrou	set	nosy: + loewis, vstinner stage: needs patch versions: + Python 3.3
2011-08-12 23:30:41	terry.reedy	set	nosy: + terry.reedy messages: + msg141998 title: string.title() is overzealous by upcasing combining marks inappropriately -> str.title() is overzealous by upcasing combining marks inappropriately
2011-08-12 18:06:18	Arfrever	set	nosy: + Arfrever
2011-08-11 22:53:23	ezio.melotti	set	nosy: + ezio.melotti
2011-08-11 22:37:33	tchrist	create