Message 141929 - Python tracker

Author	tchrist
Recipients	tchrist
Date	2011-08-11.22:37:33
Message-id	<1313102254.23.0.17412568785.issue12737@psf.upfronthosting.co.za>

Content
Python's str.title() method claims that it titlecases the first letter in each word and lowercases the rest.  However, this is not true.  It is not using either of the two word detection algorithms that Unicode provides.  One allows you to use a legacy \w+, where \w means any Alphabetic, Mark, Decimal Number, or Connector Punctuation (see UTS#18 Annex C: Compatibility Properties), and the other uses the more sophisticated word-break algorithm driven by the Word_Break property, with values such as Word_Break=MidNumLet.

Python is using neither of these, so gets the wrong answer.
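
A minimal sketch of the legacy \w+ approach, assuming \w can be approximated with general categories from the standard unicodedata module (letters, marks, decimal and letter numbers, connector punctuation). The real Alphabetic property is broader than this approximation, and is_word_char and titlecase_words are hypothetical helper names, not part of Python's API:

```python
import unicodedata


def is_word_char(ch):
    """Approximate the UTS#18 compatibility \\w using general categories.

    Assumption: Alphabetic is approximated by L* and Nl; the real
    Alphabetic property also covers some other code points.
    """
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat in ("Nd", "Nl", "Pc")


def titlecase_words(text):
    """Titlecase the first character of each \\w+ run, lowercase the rest."""
    out = []
    at_word_start = True
    for ch in text:
        if is_word_char(ch):
            # Combining marks count as word characters, so they do not
            # restart the word the way they do in the reported behavior.
            out.append(ch.title() if at_word_start else ch.lower())
            at_word_start = False
        else:
            out.append(ch)
            at_word_start = True
    return "".join(out)


if __name__ == "__main__":
    for sample in ("déme un café", "i̇stanbul", "ᾲ στο διάολο"):
        nfd = unicodedata.normalize("NFD", sample)
        print(titlecase_words(nfd))
```

Because marks are treated as part of the word, a decomposed accent no longer causes the following letter to be capitalized. This handles one character at a time, so it does not attempt context-sensitive case mappings.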

titlecase of déme un café should be Déme Un Café not DéMe Un Café
titlecase of i̇stanbul should be İstanbul not İStanbul
titlecase of ᾲ στο διάολο should be Ὰͅ Στο Διάολο not ᾺΙ Στο ΔιάΟλο

Because those inputs are in NFD form, you get different answers than you would if they were in NFC.  That is not right: you should get the same answer either way.  The bug is that you aren't using the right definition for \w, and so the results come out wrong.  This is likely related to issue 12731.
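
One way to check the NFC/NFD discrepancy on a given interpreter is to titlecase both forms and compare the results after normalizing them back to NFC. This is only a sketch; title_in_both_forms is a hypothetical helper name, and what it prints depends on the Python version being run:

```python
import unicodedata


def title_in_both_forms(text):
    """Return str.title() of the NFC and NFD forms, both re-normalized to NFC."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    # Re-normalizing lets the two results be compared code point by code point.
    return (
        unicodedata.normalize("NFC", nfc.title()),
        unicodedata.normalize("NFC", nfd.title()),
    )


if __name__ == "__main__":
    for sample in ("d\u00e9me un caf\u00e9", "i\u0307stanbul"):
        from_nfc, from_nfd = title_in_both_forms(sample)
        # If titlecasing respected canonical equivalence, this would be True.
        print(repr(from_nfc), repr(from_nfd), from_nfc == from_nfd)
```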

In the enclosed tester file, which fails 4 out of its 6 tests, there is also a bug shown with this failed result:

  titlecase of 𐐼𐐯𐑅𐐨𐑉𐐯𐐻 should be 𐐔𐐯𐑅𐐨𐑉𐐯𐐻 not 𐐼𐐯𐑅𐐨𐑉𐐯𐐻

That one is related to issue 12730. 
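
The Deseret letters all lie outside the Basic Multilingual Plane, which can be confirmed with a short sketch like the one below. It assumes an interpreter that stores strings as full code points; on a narrow build each letter would show up as two surrogates instead. The helper name describe is hypothetical:

```python
def describe(text):
    """Return (code point label, is outside the BMP) for each string element."""
    return [(f"U+{ord(ch):04X}", ord(ch) > 0xFFFF) for ch in text]


word = "𐐼𐐯𐑅𐐨𐑉𐐯𐐻"

if __name__ == "__main__":
    for label, outside_bmp in describe(word):
        print(label, "non-BMP" if outside_bmp else "BMP or surrogate")
    # Compare the first element before and after titlecasing.
    print(describe(word[:1]), describe(word.title()[:1]))
```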

See the attached tester, which was run under Python 3.2.  As far as I can tell, these bugs exist in all Python versions.
