
Author tchrist
Recipients belopolsky, ezio.melotti, georg.brandl, lemburg, moese, phr, tchrist, vstinner
Date 2011-08-12.02:41:15
Message-id <1313116876.94.0.050147310014.issue2857@psf.upfronthosting.co.za>
Content
Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:

  http://unicode.org/reports/tr26/

CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode and mis-encode UCS-2 into UTF-8: each half of a UTF-16 surrogate pair gets encoded as its own three-byte sequence, which screws it up. I understand the need to be able to read it, but call it what it is, please.
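
To make the difference concrete, here is a minimal Python 3 sketch. The helper name cesu8_encode is hypothetical (it is not a stdlib codec), and it assumes the input string contains no lone surrogates. It uses U+10400, the example character from the report linked above, and relies only on the standard "surrogatepass" error handler.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, not a stdlib codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Supplementary character: split into a UTF-16 surrogate pair,
            # then encode each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010400"
print(ch.encode("utf-8").hex())   # f0909080     (real UTF-8, 4 bytes)
print(cesu8_encode(ch).hex())     # eda081edb080 (CESU-8, 6 bytes)

# A strict UTF-8 decoder refuses the CESU-8 bytes.
try:
    cesu8_encode(ch).decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

For text limited to the BMP the two encodings produce identical bytes, which is why the mix-up tends to go unnoticed until a supplementary character shows up.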

Despite the talk about Lucene, I note that the Perl port of Lucene uses real UTF-8, not CESU-8.
History
Date                 User     Action  Args
2011-08-12 02:41:17  tchrist  set     recipients: + tchrist, lemburg, georg.brandl, phr, belopolsky, moese, vstinner, ezio.melotti
2011-08-12 02:41:16  tchrist  set     messageid: <1313116876.94.0.050147310014.issue2857@psf.upfronthosting.co.za>
2011-08-12 02:41:16  tchrist  link    issue2857 messages
2011-08-12 02:41:16  tchrist  create