Message 107074 - Python tracker


Author	ezio.melotti
Recipients	dangra, ezio.melotti, lemburg, pitrou, sjmachin, vstinner
Date	2010-06-04.16:22:45
SpamBayes Score	0.00014426257
Marked as misclassified	No
Message-id	<1275668570.45.0.975004000598.issue8271@psf.upfronthosting.co.za>
In-reply-to

Content
I added a test for the 'ignore' error handler. I will commit the patch before the RC unless someone objects.

To summarize, the patch updates PyUnicode_DecodeUTF8 from RFC 2279 to RFC 3629, so:
1) Invalid sequences are now handled as described in http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95);
2) 5- and 6-byte-long sequences are now invalid (no changes in behavior, I just removed the "default:" of the switch/case and marked them with '0' in the first table);
3) According to RFC 3629, codepoints in the surrogate range (U+D800-U+DFFF) should be considered invalid, but rejecting them would not be backward compatible, so I added code and tests but left them commented out;
4) I changed the error message "unexpected code byte" to "invalid start byte" and "invalid data" to "invalid continuation byte";
5) I added an extensive set of tests in test_unicode;
6) I fixed test_codeccallbacks because it was failing after this change.
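
To illustrate points 2 and 4 and the 'ignore' handler, here is a minimal sketch of how the new error messages can be observed from Python. It assumes a current CPython 3 interpreter, which may differ in details from the build this patch targets; decode_reason is a hypothetical helper written for this example, not part of the patch or of test_unicode.

```python
def decode_reason(data):
    """Return the UnicodeDecodeError reason for data, or None if it decodes.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


samples = {
    # 0xFF can never start a UTF-8 sequence.
    "lone_ff": b"\xff",
    # 0xF8 would start a 5-byte sequence under RFC 2279; RFC 3629 forbids it.
    "five_byte_lead": b"\xf8\x88\x80\x80\x80",
    # 0xC3 starts a 2-byte sequence, but "(" (0x28) is not a continuation byte.
    "bad_continuation": b"\xc3(",
}

reasons = {name: decode_reason(data) for name, data in samples.items()}

# The 'ignore' error handler drops the undecodable byte and keeps the rest.
ignored = b"a\xffb".decode("utf-8", "ignore")

for name, reason in reasons.items():
    print(name, "->", reason)
print("ignore handler ->", repr(ignored))
```

The reason strings are read from the exception's reason attribute rather than parsed out of the full message, since the surrounding message text includes the codec name and byte position.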

History
Date	User	Action	Args
2010-06-04 16:22:50	ezio.melotti	set	recipients: + ezio.melotti, lemburg, sjmachin, pitrou, vstinner, dangra
2010-06-04 16:22:50	ezio.melotti	set	messageid: <1275668570.45.0.975004000598.issue8271@psf.upfronthosting.co.za>
2010-06-04 16:22:48	ezio.melotti	link	issue8271 messages
2010-06-04 16:22:48	ezio.melotti	create