
Author tchrist
Recipients tchrist
Date 2011-08-11.22:37:33
Message-id <1313102254.23.0.17412568785.issue12737@psf.upfronthosting.co.za>
In-reply-to
Content
Python's str.title() method claims to titlecase the first letter of each word and lowercase the rest.  However, that is not what it does.  It does not use either of the two word-detection algorithms that Unicode provides.  One is the legacy \w+, where \w means any Alphabetic, Mark, Decimal Number, or Connector Punctuation character (see UTS#18 Annex C: Compatibility Properties).  The other is the more sophisticated word-break algorithm driven by the Word_Break property, with values such as Word_Break=MidNumLet.

Python uses neither of these, so it gets the wrong answer.

titlecase of déme un café should be Déme Un Café not DéMe Un Café
titlecase of i̇stanbul should be İstanbul not İStanbul
titlecase of ᾲ στο διάολο should be Ὰͅ Στο Διάολο not ᾺΙ Στο ΔιάΟλο

Because those strings are in NFD form, you get different answers than you would if they were in NFC.  That is not right: you should get the same answer either way.  The bug is that Python isn't using the right definition of \w, so it treats a combining mark as the end of a word and titlecases the letter that follows it.  This is likely related to issue 12731.
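
To illustrate the legacy \w+ approach, here is a minimal sketch, not a proposed patch.  The helper names is_word_char and title_words are hypothetical.  Since unicodedata does not expose the Alphabetic property, the sketch approximates it with the L* general categories, and it ignores context-sensitive mappings such as final sigma.

```python
import unicodedata

def is_word_char(ch):
    # Approximation of the UTS#18 compatibility \w (an assumption of this
    # sketch): any Letter or Mark, plus Decimal Number and Connector Punctuation.
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat in ("Nd", "Pc")

def title_words(text):
    # Titlecase the first character of each run of word characters and
    # lowercase the rest; combining marks stay inside the word.
    out = []
    at_word_start = True
    for ch in text:
        if is_word_char(ch):
            out.append(ch.title() if at_word_start else ch.lower())
            at_word_start = False
        else:
            out.append(ch)
            at_word_start = True
    return "".join(out)

nfd = unicodedata.normalize("NFD", "déme un café")
nfc = unicodedata.normalize("NFC", nfd)

print(nfd.title())       # built-in behaviour on the NFD input, for comparison
print(title_words(nfd))  # mark-aware sketch on the NFD input
print(title_words(nfc))  # mark-aware sketch on the NFC input
```

With this definition the NFD and NFC inputs produce canonically equivalent results, which is the property the report asks for.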

In the enclosed tester file, which fails 4 out of its 6 tests, there is also a bug shown with this failed result:

  titlecase of 𐐼𐐯𐑅𐐨𐑉𐐯𐐻 should be 𐐔𐐯𐑅𐐨𐑉𐐯𐐻 not 𐐼𐐯𐑅𐐨𐑉𐐯𐐻

That one is related to issue 12730. 
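
A quick way to check whether a given build handles this Deseret case is sketched below.  It assumes strings are indexed by code point (as in CPython 3.3 and later); on other builds the results may differ.

```python
import unicodedata

word = "𐐼𐐯𐑅𐐨𐑉𐐯𐐻"      # lowercase Deseret input from the report
expected = "𐐔𐐯𐑅𐐨𐑉𐐯𐐻"  # titlecase form the report expects

# Assumption: one string element per code point, so the word has length 7.
print(len(word))

# Name of the first element; falls back to a placeholder if it has no name
# (for example, if the element is a lone surrogate).
print(unicodedata.name(word[0], "<unnamed>"))

# True when the built-in titlecasing matches the form expected by the report.
print(word.title() == expected)
```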

See the attached tester, which was run under Python 3.2.  As far as I can tell, these bugs exist in all Python versions.
History
Date                 User     Action  Args
2011-08-11 22:37:34  tchrist  set     recipients: + tchrist
2011-08-11 22:37:34  tchrist  set     messageid: <1313102254.23.0.17412568785.issue12737@psf.upfronthosting.co.za>
2011-08-11 22:37:33  tchrist  link    issue12737 messages
2011-08-11 22:37:33  tchrist  create