Message 102320 - Python tracker


Author	ezio.melotti
Recipients	dangra, ezio.melotti, lemburg, sjmachin, vstinner
Date	2010-04-04.05:49:13
SpamBayes Score	1.3444821e-08
Marked as misclassified	No
Message-id	<1270360159.99.0.657484109192.issue8271@psf.upfronthosting.co.za>
In-reply-to

Content
This new patch (v3) should be OK.
I added a few more tests and found another corner case:
'\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string, so the trailing 'a' was lost from the output. This is now fixed in the latest patch.
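
For anyone who wants to check this corner case on their own build, here is a minimal sketch. It assumes a Python 3 interpreter, so the input is written as a bytes literal rather than the Python 2 str used above; it is not one of the tests from the patch.

```python
# Python 3 spelling of the corner case: a bytes literal instead of a Python 2 str.
# 0xE1 opens a 3-byte sequence, but 'a' (0x61) is not a continuation byte.
data = b'\xe1a'

# With the 'replace' handler only the invalid start byte should be replaced;
# the following ASCII byte must still be decoded on its own.
decoded = data.decode('utf-8', 'replace')

print(ascii(decoded))  # '\ufffda' on a fixed interpreter
```

A result of length 1 would mean the 'a' was consumed together with the truncated sequence, which is the behaviour described above.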

I also unrolled all the loops except the first one because I haven't found an elegant way to unroll it (yet).

Finally, I changed the error messages to make them clearer:
unexpected code byte -> invalid start byte;
invalid data -> invalid continuation byte.
(I can revert this if the old messages are better or if it is better to fix this with a separate commit.)
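
A quick way to see which message a given input produces is to read the reason attribute of the exception. This is only a sketch, assuming a Python 3 interpreter that already carries the renamed messages; decode_error_reason is a hypothetical helper, not something in the patch.

```python
def decode_error_reason(data):
    """Return the UnicodeDecodeError reason for invalid UTF-8, or None if it decodes."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# 0xFF can never start a UTF-8 sequence.
print(decode_error_reason(b'\xff'))

# 0xE1 is a valid start byte, but 'a' is not a valid continuation byte.
print(decode_error_reason(b'\xe1a'))
```

On such an interpreter the first call reports "invalid start byte" and the second "invalid continuation byte".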

Performance seems more or less the same: I ran some benchmarks and saw no significant changes in the results. If you have better benchmarks, let me know. My test inputs were a 320 kB file with some ASCII, ASCII mixed with some accented characters, and Japanese, and a file with a sample of several different Unicode characters.
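
For comparison, a benchmark along these lines could look like the sketch below. It assumes Python 3 and timeit; the sample strings, the make_samples and bench helpers, and the repeat count are all made up for illustration and are not the files or script used for the numbers above.

```python
import timeit


def make_samples(size=320 * 1024):
    """Build UTF-8 byte strings of at least `size` bytes for each text category."""
    texts = {
        'ascii': 'plain ascii text ',
        'accented': 'caf\u00e9 na\u00efve r\u00e9sum\u00e9 ',
        'japanese': '\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8',
    }
    samples = {}
    for name, text in texts.items():
        encoded = text.encode('utf-8')
        # Repeat whole encoded strings so no multi-byte character is cut in half.
        repeats = size // len(encoded) + 1
        samples[name] = encoded * repeats
    return samples


def bench(samples, number=20):
    """Return the total time in seconds for `number` strict decodes of each sample."""
    return {
        name: timeit.timeit(lambda d=data: d.decode('utf-8'), number=number)
        for name, data in samples.items()
    }


if __name__ == '__main__':
    for name, seconds in bench(make_samples()).items():
        print('%-10s %.4f s' % (name, seconds))
```

Running the same script against builds with and without the patch gives a rough before/after comparison; the absolute numbers depend on the machine.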

History
Date	User	Action	Args
2010-04-04 05:49:20	ezio.melotti	set	recipients: + ezio.melotti, lemburg, sjmachin, vstinner, dangra
2010-04-04 05:49:19	ezio.melotti	set	messageid: <1270360159.99.0.657484109192.issue8271@psf.upfronthosting.co.za>
2010-04-04 05:49:17	ezio.melotti	link	issue8271 messages
2010-04-04 05:49:16	ezio.melotti	create