Message 141949 - Python tracker

Author	lemburg
Recipients	belopolsky, ezio.melotti, georg.brandl, lemburg, moese, phr, tchrist, vstinner
Date	2011-08-12.10:26:32
Message-id	<4E44FFD0.2020303@egenix.com>
In-reply-to	<1313116876.94.0.050147310014.issue2857@psf.upfronthosting.co.za>

Content
Tom Christiansen wrote:
> 
> Tom Christiansen <tchrist@perl.com> added the comment:
> 
> Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:
> 
>   http://unicode.org/reports/tr26/
> 
> CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need to be able to read it, but call it what it is, please.
> 
> Despite the talk about Lucene, I note that the Perl port of Lucene uses real UTF-8, not CESU-8.

CESU-8 is a different encoding from the one we are talking about.

The only difference between UTF-8 and the modified variant is that the
U+0000 code point is encoded differently, so that the output does not
contain any NUL bytes.
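
To make the U+0000 point concrete, here is a minimal sketch in Python. It
assumes the modified form writes U+0000 as the two-byte sequence C0 80,
which is how Java's modified UTF-8 represents it. The function names
encode_modified and decode_modified are made up for illustration and are
not an existing codec API. The sketch covers only the U+0000 handling
described above; supplementary characters are outside its scope.

```python
def encode_modified(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two bytes C0 80."""
    # In standard UTF-8 a 0x00 byte can only come from U+0000, so a
    # byte-level replacement after encoding is safe.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_modified(data: bytes) -> str:
    """Reverse encode_modified: map the C0 80 pair back to U+0000."""
    # 0xC0 never occurs in well-formed standard UTF-8, so the pair
    # cannot collide with any other encoded character.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")
```

With these helpers, encode_modified("a\x00b") yields b"a\xc0\x80b", which
contains no NUL byte, and decode_modified turns it back into the original
string. Text without U+0000 encodes to the same bytes as plain UTF-8.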

History
Date	User	Action	Args
2011-08-12 10:26:33	lemburg	set	recipients: + lemburg, georg.brandl, phr, belopolsky, moese, vstinner, ezio.melotti, tchrist
2011-08-12 10:26:32	lemburg	link	issue2857 messages
2011-08-12 10:26:32	lemburg	create