Issue 7008: str.title() misbehaves with apostrophes


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51257

classification

Title:	str.title() misbehaves with apostrophes
Type:	behavior	Stage:	test needed
Components:		Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	christoph, ezio.melotti, lemburg, markon, nickd, nnorwitz, pitrou, r.david.murray, rhettinger, twb
Priority:	normal	Keywords:

Created on 2009-09-27 17:23 by nickd, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (28)
msg93180 - (view)	Author: Nick Devenish (nickd)	Date: 2009-09-27 17:23
str.title() capitalizes the first letter after an apostrophe:

>>> "This isn't right".title()
"This Isn'T Right"

The library function string.capwords, which appears to have exactly the same responsibility, doesn't exhibit this behavior:

>>> string.capwords("This isn't right")
"This Isn't Right"

Tested on 2.6.2 on Mac OS X
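The report can be reproduced on a current interpreter. This is a minimal sketch, assuming Python 3 (the original report was made against 2.6.2):

```python
import string

text = "This isn't right"

# str.title() treats the apostrophe as a word boundary,
# so the letter after it is capitalized.
print(text.title())           # This Isn'T Right

# string.capwords() splits on whitespace only,
# so the contraction is left alone.
print(string.capwords(text))  # This Isn't Right
```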
msg93212 - (view)	Author: Marco Buccini (markon)	Date: 2009-09-28 14:59
This was already asked some years ago. http://mail.python.org/pipermail/python-list/2006-April/549340.html
msg93220 - (view)	Author: Thomas W. Barr (twb)	Date: 2009-09-28 17:51
The string module, however, fails to properly capitalize anything in quotes:

>>> string.capwords("i pity the 'foo'.")
"I Pity The 'foo'."

The string module could be easily made to work like the object. The object could be made to work more like the module, only capitalizing things after a space and the start of the string, but I'm not really sure that it's any better. (The s.istitle() should also be updated if s.title() is changed.)

The inconsistency is pretty nasty, though, and the documentation should probably be more specific about what's going on.
msg93223 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-09-28 18:29
I agree with the OP that str.title should be made smarter. As it stands, it is a likely bug factory that would pass unittests, then generate unpleasant results with real user inputs.

Extending Thomas's comment, I think string.capwords() needs to be deprecated and eliminated. It is an egregious hack that has unfortunate effects such as collapsing runs of repeated spaces and incorrectly handling strings in quotes.

As it stands, we have two methods that both don't quite do what we would really want in a title casing method (correct handling of apostrophes and quotation marks, keeping the string length unchanged, and only changing desired letters from lower to uppercase with no other side-effects).
msg93226 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2009-09-28 19:01
I believe capwords was supposed to be removed in 3.0, but this did not happen.
msg93227 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-09-28 19:05
If you can find a link to the discussion for removing capwords, we can go ahead and deprecate it now.
msg93229 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2009-09-28 19:27
I haven't been able to find any discussion of deprecating capwords other than a mention in this thread: http://mail.python.org/pipermail/python-3000/2007-April/006642.html Later in the thread Barry says he is neutral on removing capwords, and it is not mentioned further. I think Ezio found some other information somewhere.
msg93232 - (view)	Author: Thomas W. Barr (twb)	Date: 2009-09-28 20:45
If "correct handling of apostrophes and quotation marks, keeping the string length unchanged, and only changing desired letters from lower to uppercase with no other side-effects" is the criterion we want, then what I suggested (toupper() the first character, and any character that follows a space or punctuation character) should work. (Unless I'm missing something.)

Do we want to tolower() all other characters, like the interpreter does now? I can make a test and patch for this if this is what we decide.
msg93235 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-09-28 21:02
I'm still researching what other languages do. MS-Excel matches what Python currently does. Django uses the Python version and then fixes up apostrophe errors:

title=lambda value: re.sub("([a-z])'([A-Z])", lambda m: m.group(0).lower(), value.title())

It would also be nice to handle hyphenates like "x-ray" --> "X-ray". I am thinking that it would be nice if the user could pass in an optional argument listing all characters that should not trigger a transition (such as apostrophes and hyphens).

A broader solution would be to replace string.capwords() with a more sophisticated set of rules that generally match what people are really trying to accomplish with title casing:

http://aitech.ac.jp/~ckelly/midi/help/caps.html
http://search.cpan.org/dist/Text-Capitalize/Capitalize.pm
"Headline Style" in the Chicago Manual of Style or Associated Press Stylebook: http://grammar.about.com/b/2008/04/11/rules-for-capitalizing-the-words-in-a-title.htm

Any such attempt at a broad solution needs to provide ways for users to modify the list of exception words and options for quoted text.
msg93236 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-09-28 21:39
Thomas, if you write up an initial patch, aim for the most conservative version that leaves all of the behavior unchanged except for embedded single apostrophes (to handle contractions and possessives). That will assure that we don't muck up any existing uses for title case:

i'm       I'm
you're    You're
he's      He's
david's   David's
'bad'     'Bad'
f''t      f''t
'x        'x

Given letters-apostrophe-letter, capitalize only the first letter and lowercase the rest.
msg93237 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-09-28 21:41
We shouldn't change the current default behaviour, people are probably relying on it. Besides, doing the right thing is both (natural) language-dependent and context-dependent. It would be (very) hard to come up with an implementation catering to all needs. Perhaps a dedicated typography module, but str.title() is certainly not the answer.

However, adding an optional argument to str.title() so as to change the list of recognized separators could be a useful addition for those people who aren't too perfectionist about the result.
msg93238 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-09-28 22:21
Guido, do you have an opinion on whether to have str.title() handle embedded apostrophes, "you're" --> "You're" instead of "You'Re"?

IMO, the problem comes up often enough that people are looking for workarounds (i.e. string.capwords() was a failed hack created to handle the problem and django.titlecase() is a successful attempt at a workaround).

I'm not worried about Antoine's comment that we can't change anything ever. I am concerned about his point (mentioned on IRC) that there are no context-free solutions (the absolute right answer is hard). While the change would seem to always be helpful in an English context, in French the proper title casing of "l'argent" is "L'Argent". Then again, there are cases in French that don't work under either method (i.e. title casing Amaury Forgeot d'Arc ends up capitalizing the D no matter what we do).

Options:

1. Leave everything the same (rejecting requests for apostrophe handling and forever live with the likes of You'Re).

2. Handle embedded single apostrophes, fixing most cases in English, and wreaking havoc on the French (who are going to be ill-served under any scenario).

3. Add an optional argument to str.title() with a list of characters that will not trigger a transition. This lets people add apostrophes and hyphens and other characters of interest. Hyphens are hard because cases like mother-in-law should properly be converted to Mother-in-Law and hyphens get used in many odd ways.

4. Add a new string method for handling title case with embedded apostrophes but leaving the old version unchanged.

My order of preferences is 2,4,3,1.
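Option 3 can be sketched in pure Python to see how it behaves on the examples from this thread. Everything here is hypothetical: no such argument exists on str.title(), and the function name title_with_joiners and its joiners parameter are invented for illustration. The sketch also uses str.isalpha() as a rough stand-in for "cased letter".

```python
def title_with_joiners(text, joiners="'"):
    """Title-case text, treating characters in `joiners` as part of a word.

    Hypothetical sketch of option 3; not an existing str.title() feature.
    """
    out = []
    in_word = False
    for ch in text:
        if ch.isalpha():
            # First letter of a word is uppercased, the rest lowercased.
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        elif in_word and ch in joiners:
            # A joiner inside a word does not start a new word.
            out.append(ch)
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


print(title_with_joiners("you're right"))         # You're Right
print(title_with_joiners("mother-in-law", "'-"))  # Mother-in-law
print(title_with_joiners("l'argent"))             # L'argent
print(title_with_joiners("o'brien"))              # O'brien
```

The last three results illustrate the objections raised in the thread: hyphenated compounds, French elisions, and names like O'Brien all come out wrong under a simple character list.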
msg93239 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-09-28 22:32
> I think Ezio found some other information somewhere.

While I was fixing #7000 I found that the tests for capwords had been removed in r54854 but since the function was already there I added them back in r75072. The commit message of r54854 says "Also remove all calls to functions in the string module (except maketrans)". I'm adding Neal to the nosy list, maybe he remembers if maketrans really was the only function that was supposed to survive.

In #6412 other problems of .title() are discussed, and there are also a couple of links to Technical Reports of the Unicode Consortium about casing algorithms and similar issues (I didn't have time to read them yet though).
msg93240 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-09-28 22:33
> While
> the change would seem to always be helpful in an English context, in
> French the proper title casing of "l'argent" is "L'Argent".

Well I think even in English it doesn't work right. For example someone named O'Brien would end up as "O'brien".

My point is that capitalization is both language-sensitive and context-sensitive, and it's a hard problem for a computer to solve. Since str.title() can only be a very crude approximation of the right thing, there's no good reason to break backwards compatibility, IMO.

> 1. Leave everything the same (rejecting requests for apostrophe handling
> and forever live with the likes of You'Re).
>
> 2. Handle embedded single apostrophes, fixing most cases in English, and
> wreaking havoc on the French (who are going to be ill-served under any
> scenario).
>
> 3. Add an optional argument to str.title() with a list of characters
> that will not trigger a transition. This lets people add apostrophes
> and hyphens and other characters of interest. Hyphens are hard because
> cases like mother-in-law should properly be converted to Mother-in-Law
> and hyphens get used in many odd ways.
>
> 4. Add a new string method for handling title case with embedded
> apostrophes but leaving the old version unchanged.
>
> My order of preferences is 2,4,3,1.

I really think the only reasonable options are 3 and 1. 2 breaks compatibility with no real benefit. 4 is too specific a variation (especially in the unicode case, where you might want to take into account the different variants of apostrophes and other characters), and adding a new method for such a subtle difference is not warranted.
msg93241 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-09-28 22:54
By the way, we might want to mention in the documentation that the title() method only gives imperfect results when trying to titlecase natural language. So that people don't get fooled thinking things are simple :-) What do you think?
msg93242 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2009-09-28 23:02
Raymond, please refrain from emotional terms like "bug factory".

I have nothing to say about whether string.capwords() should be removed, but I want to note that it does a split on whitespace and then rejoins using a single space, so that string.capwords('A B\tC\r\nD') returns 'A B C D'.

The title() method exists primarily because the Unicode standard has a definition of "title case". I wouldn't want to change its default behavior because there is no reasonable behavior that isn't locale-dependent, and Unicode methods shouldn't depend on locale; and even then it won't be perfect, as the O'Brien example shows.

Also note that .title() matches .istitle() in the sense that x.title().istitle() is supposed to be true (except in end cases like a string containing no letters).

I worry that providing an API that adds a way to specify a set of characters to be treated as letters (for the purpose of deciding where words start) will just make the bugs in apps harder to find because the examples are rarer (like "l'Aperitif" or "O'Brien" -- or "RSVP" for that matter). With the current behavior at least app authors will easily notice the problem, decide whether it matters to them, and implement their own algorithm if they do. And they are free to be as elaborate or simplistic as they care.

What's a realistic use case for .title() anyway?

(Proposal: close as won't fix.)
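The two properties mentioned in this message can be checked directly. This is a small sketch, assuming Python 3; the sample strings are taken from the thread.

```python
import string

s = "A B\tC\r\nD"

# capwords() splits on whitespace and rejoins with single spaces,
# so the original separators (and the string length) are lost.
print(repr(string.capwords(s)))  # 'A B C D'

# title() only changes letter case and leaves separators alone.
print(repr(s.title()))           # 'A B\tC\r\nD'

# x.title().istitle() holds for strings that contain letters.
for sample in ["this isn't right", "l'aperitif", "rsvp"]:
    assert sample.title().istitle()

# The "natural" English form is not title case by the istitle() definition,
# because the lowercase "t" follows an uncased character.
print("This Isn't Right".istitle())  # False
```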
msg93243 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2009-09-28 23:03
A doc fix sounds like a great idea.
msg93244 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-09-28 23:12
I will add a comment to the docs.
msg93250 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2009-09-29 03:44
I don't recall anything specifically wrt removing capwords. Most likely it was something that struck me as not widely used or really necessary--a good candidate to be removed. Applications could then write the function however they chose which would avoid the problem of Python needing to figure out if it should be Isn'T or Isn't and all the other variations mentioned here.
msg93258 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-09-29 07:57
Guido van Rossum wrote:
> What's a realistic use case for .title() anyway?

The primary use is when converting a string to be used as title or sub-title of text - mostly inspired by the way English treats titles.

The implementation follows the rules laid out in UTR#21: http://unicode.org/reports/tr21/tr21-3.html

The Python version only implements the basic set of rules, i.e. "If the preceding letter is cased, choose the lowercase mapping; otherwise choose the titlecase mapping (in most cases, this will be the same as the uppercase, but not always)."

It doesn't implement the special casing rules, since these would require locale and language dependent context information which we don't implement/use in Python. It also doesn't implement mappings that would result in a change of length (ligatures) or require look-ahead strategies (e.g. if the casing depends on the code point following the converted code point). Patches to enhance the code to support those additional rules are welcome.

Regarding the apostrophe: the Unicode standard doesn't appear to include any rule regarding that character and its use in titles or upper-case versions of text. The apostrophe itself is a non-cased code point.

It's likely that the special use of the apostrophe in English is actually a language-specific use case. For those, it's (currently) better to implement your own versions of the conversion functions, based on the existing methods.

Regarding the idea to add an option to define which characters to regard as cased/non-cased: This would cause the algorithm to no longer adhere to the Unicode standard and most probably cause more problems than it solves.
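The basic rule quoted from UTR#21 is short enough to write out. This is an illustrative sketch only, not CPython's implementation: the name basic_title is made up, and "cased" is approximated with the isupper/islower/istitle string methods.

```python
def basic_title(text):
    """Apply the basic UTR#21 rule one character at a time (sketch only)."""
    out = []
    previous_is_cased = False
    for ch in text:
        # Cased predecessor: lowercase mapping; otherwise titlecase mapping.
        out.append(ch.lower() if previous_is_cased else ch.title())
        # Approximation of "cased": the character is upper, lower, or title case.
        previous_is_cased = ch.isupper() or ch.islower() or ch.istitle()
    return "".join(out)


# The apostrophe is not cased, so the letter after it is titlecased.
print(basic_title("This isn't right"))  # This Isn'T Right
```

Because the apostrophe is uncased, the rule itself produces "Isn'T"; nothing in it distinguishes an apostrophe from a space or a hyphen.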
msg93260 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-09-29 08:37
Marc-Andre Lemburg wrote:
>
> Regarding the apostrophe: the Unicode standard doesn't appear to
> include any rule regarding that character and its use in titles
> or upper-case versions of text. The apostrophe itself is a
> non-cased code point.
>
> It's likely that the special use of the apostrophe in English
> is actually a language-specific use case. For those, it's (currently)
> better to implement your own versions of the conversion functions,
> based on the existing methods.

Looking at the many different uses in various languages, this appears to be the better option:

http://en.wikipedia.org/wiki/Apostrophe

To make things even more complicated, the usual typewriter apostrophe that you find in ASCII is not the only one in Unicode:

http://en.wikipedia.org/wiki/Apostrophe#Unicode
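The point about multiple apostrophe characters can be seen with the standard unicodedata module. This is a minimal sketch, assuming Python 3, comparing the ASCII apostrophe with U+2019, the character that word processors commonly substitute for it.

```python
import unicodedata

# Both characters are punctuation, not letters, so neither is cased.
for ch in ("'", "\u2019"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+0027 APOSTROPHE Po
# U+2019 RIGHT SINGLE QUOTATION MARK Pf

# str.title() therefore starts a new word after either one.
ascii_form = "isn't".title()
curly_form = "isn\u2019t".title()
print(ascii_form)  # Isn'T
print(curly_form == "Isn\u2019T")  # True
```

A workaround that only looks for the ASCII apostrophe, such as the regex quoted earlier in the thread, would miss the second form.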
msg93261 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-09-29 08:42
> Patches to enhance the code to support those additional rules are welcome.

#6412 has a patch.
msg93262 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-09-29 08:57
> To make things even more complicated, the usual typewriter apostrophe
> that you find in ASCII is not the only one in Unicode:
>
> http://en.wikipedia.org/wiki/Apostrophe#Unicode

Yup, and the right one typographically isn't necessarily the ASCII one :-) That's why Microsoft Word automatically inserts a non-ASCII apostrophe when you type « ' », at least in certain languages (apparently OpenOffice doesn't).
msg93264 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-09-29 09:06
Ezio Melotti wrote:
>
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
>> Patches to enhance the code to support those additional rules
>> are welcome.
>
> #6412 has a patch.

That patch looks promising.
msg93271 - (view)	Author: Christoph Burgmer (christoph)	Date: 2009-09-29 10:20
I admit I don't fully understand the semantics of capwords(). But from what I believe it should do, this function could be happily replaced by the word-breaking algorithm as defined in http://www.unicode.org/reports/tr29/.

This algorithm should be implemented anyway, to properly solve issue6412.
msg93272 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-09-29 10:34
> This algorithm should be implemented anyway, to properly solve
> issue6412.

Sure, but it should be another function, which might have its place in the wordwrap module. capwords() itself could be deprecated, since it's an obvious one-liner. Replacing it with another method, however, will just confuse and annoy existing users.
msg93274 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-09-29 10:40
Christoph Burgmer wrote:
>
> Christoph Burgmer <cburgmer@ira.uka.de> added the comment:
>
> I admit I don't fully understand the semantics of capwords().

string.capwords() is an old function from the days before Unicode. The function is basically defined by its implementation.

> But from
> what I believe it should do, this function could be happily
> replaced by the word-breaking algorithm as defined in
> http://www.unicode.org/reports/tr29/.
>
> This algorithm should be implemented anyway, to properly solve
> issue6412.

Simple word breaking would be nice to have in Python as a new Unicode method, e.g. .splitwords().

Note however, that word boundaries are just as complicated as casing: there are lots of special cases in different languages or locales (see the notes after the word boundary rules in the TR29).
msg93277 - (view)	Author: Christoph Burgmer (christoph)	Date: 2009-09-29 11:01
Antoine Pitrou wrote:
> capwords() itself could be deprecated, since it's an obvious one-liner.
> Replacing it with another method, however, will just confuse and annoy
> existing users.

Yes, sorry, I meant the semantics, whereas you are right for the specific function.

Marc-Andre Lemburg wrote:
> Note however, that word boundaries are just as complicated as casing:
> there are lots of special cases in different languages or locales
> (see the notes after the word boundary rules in the TR29).

ICU already has the full implementation, so Python could get away with just supporting the default implementation (as seen with other case mappings).

>>> from PyICU import UnicodeString, Locale, BreakIterator
>>> en_US_locale = Locale('en_US')
>>> breakIter = BreakIterator.createWordInstance(en_US_locale)
>>> s = UnicodeString("There's a hole in the bucket.")
>>> print s.toTitle(breakIter, en_US_locale)
There's A Hole In The Bucket.
>>> breakIter.setText("There's a hole in the bucket.")
>>> last = 0
>>> for i in breakIter:
...     print s[last:i]
...     last = i
...
There's A Hole In The Bucket .

History
Date	User	Action	Args
2022-04-11 14:56:53	admin	set	github: 51257
2012-08-24 01:30:01	r.david.murray	link	issue15774 superseder
2009-09-29 14:49:21	gvanrossum	set	assignee: gvanrossum -> nosy: - gvanrossum
2009-09-29 11:01:31	christoph	set	messages: + msg93277
2009-09-29 10:40:54	lemburg	set	messages: + msg93274
2009-09-29 10:34:25	pitrou	set	messages: + msg93272
2009-09-29 10:20:45	christoph	set	nosy: + christoph messages: + msg93271
2009-09-29 09:06:49	lemburg	set	messages: + msg93264
2009-09-29 08:57:50	pitrou	set	messages: + msg93262
2009-09-29 08:42:31	ezio.melotti	set	assignee: gvanrossum messages: + msg93261
2009-09-29 08:37:13	lemburg	set	messages: + msg93260
2009-09-29 07:57:39	lemburg	set	nosy: + lemburg messages: + msg93258
2009-09-29 03:44:48	nnorwitz	set	messages: + msg93250
2009-09-28 23:12:59	rhettinger	set	status: open -> closed resolution: wont fix messages: + msg93244
2009-09-28 23:03:59	gvanrossum	set	assignee: gvanrossum -> (no value)
2009-09-28 23:03:51	gvanrossum	set	messages: + msg93243
2009-09-28 23:02:52	gvanrossum	set	messages: + msg93242
2009-09-28 22:54:01	pitrou	set	messages: + msg93241
2009-09-28 22:33:12	pitrou	set	messages: + msg93240
2009-09-28 22:32:55	ezio.melotti	set	nosy: + nnorwitz messages: + msg93239
2009-09-28 22:21:35	rhettinger	set	assignee: rhettinger -> gvanrossum messages: + msg93238 nosy: + gvanrossum
2009-09-28 21:41:01	pitrou	set	nosy: + pitrou messages: + msg93237
2009-09-28 21:39:08	rhettinger	set	messages: + msg93236
2009-09-28 21:02:23	rhettinger	set	messages: + msg93235
2009-09-28 20:45:34	twb	set	messages: + msg93232
2009-09-28 19:27:18	r.david.murray	set	nosy: + ezio.melotti messages: + msg93229
2009-09-28 19:08:20	rhettinger	set	assignee: rhettinger
2009-09-28 19:05:20	rhettinger	set	messages: + msg93227
2009-09-28 19:01:59	r.david.murray	set	priority: normal nosy: + r.david.murray messages: + msg93226 stage: test needed
2009-09-28 18:29:30	rhettinger	set	nosy: + rhettinger messages: + msg93223 versions: + Python 2.7, Python 3.2, - Python 2.6
2009-09-28 17:51:17	twb	set	nosy: + twb messages: + msg93220
2009-09-28 14:59:28	markon	set	nosy: + markon messages: + msg93212
2009-09-27 17:23:25	nickd	create