Message 102239 - Python tracker


Author	lemburg
Recipients	dangra, ezio.melotti, lemburg, sjmachin
Date	2010-04-03.11:41:33
Message-id	<4BB7296C.3050601@egenix.com>
In-reply-to	<1270247238.98.0.791005996157.issue8271@psf.upfronthosting.co.za>

Content

Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> Here's a new patch. Should be complete but I want to test it some more before committing.
> I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). This led to some simplification in the code.

Ok.
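
For readers following along, the first-byte rule described in the quoted paragraph can be sketched in Python. This is only an illustration of the RFC 3629 ranges, not the C code from the patch; the function name `utf8_sequence_length` is hypothetical.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total length of a UTF-8 sequence starting with
    first_byte, or 0 if the byte cannot start a valid sequence
    under RFC 3629. Hypothetical helper, for illustration only."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must be in range(256)")
    if first_byte <= 0x7F:
        return 1          # ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2          # C0-C1 are excluded: they only form overlong encodings
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4          # F5 and above would encode values past U+10FFFF
    return 0              # continuation bytes 80-BF, C0-C1, F5-FF


# Every byte the table marks as invalid is also rejected by the decoder.
for b in range(256):
    if utf8_sequence_length(b) == 0:
        try:
            bytes([b, 0x80, 0x80, 0x80]).decode("utf-8")
        except UnicodeDecodeError:
            pass
        else:
            raise AssertionError(f"byte {b:#04x} unexpectedly decoded")
```

With 0 in those slots, a single check on the table value is enough to reject the byte, which is presumably where the simplification mentioned above comes from.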

> I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. I included tests and a fix but I left them commented out because this is out of the scope of this patch, and it probably needs a discussion on python-dev.

Right, but that idea is controversial. In Python we need to be able to
put those surrogate code points into source code (encoded as UTF-8) as
well as into pickle and marshal dumps of Unicode objects, so we can't
consider them invalid UTF-8.
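
As a small illustration of the round-trip requirement, here is how a lone surrogate behaves in current Python 3 releases, which postdate this message: the strict UTF-8 codec rejects it, while the standard `surrogatepass` error handler lets it through. The variable names are only for the example, and this is not a description of the codec as it stood when the patch was written.

```python
lone = "\ud800"  # a lone high surrogate code point

# Strict UTF-8 encoding follows RFC 3629 and refuses surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler allows the round trip that
# serialization of arbitrary str objects depends on.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

print(strict_ok)   # False
print(encoded)     # b'\xed\xa0\x80'
print(decoded == lone)  # True
```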

History
Date	User	Action	Args
2010-04-03 11:41:36	lemburg	set	recipients: + lemburg, sjmachin, ezio.melotti, dangra
2010-04-03 11:41:34	lemburg	link	issue8271 messages
2010-04-03 11:41:33	lemburg	create