Message102320
This new patch (v3) should be ok.
I added a few more tests and found another corner case:
'\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' instead of u'\ufffda', because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string, so the trailing 'a' was swallowed along with the invalid start byte. This is now fixed in the latest patch.
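The corner case above can be checked with a short snippet. This is only a sketch written for Python 3, where the byte string needs a b'' prefix. It assumes an interpreter that already includes the fixed decoder; the variable names are illustrative and not part of the patch.

```python
# Python 3 spelling of the Python 2 expression '\xe1a'.decode('utf-8', 'replace').
# 0xE1 announces a 3-byte sequence, but 'a' (0x61) is not a continuation byte.
data = b'\xe1a'

# With the fixed decoder only the bad start byte is replaced; 'a' is kept.
decoded = data.decode('utf-8', 'replace')
print(ascii(decoded))  # prints '\ufffda'
```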
I also unrolled all the loops except the first one because I haven't found an elegant way to unroll it (yet).
Finally, I changed the error messages to make them clearer:
unexpected code byte -> invalid start byte;
invalid data -> invalid continuation byte.
(I can revert this if the old messages are better or if it is better to fix this with a separate commit.)
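As a rough illustration of where the two new messages show up, the sketch below reads the `reason` attribute of `UnicodeDecodeError`. It assumes a Python 3 interpreter that uses the new wording; `decode_error_reason` is a hypothetical helper, not something from the patch.

```python
def decode_error_reason(data):
    """Return the error reason for strict UTF-8 decoding, or None on success."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.reason
    return None

# 0x80 is a continuation byte, so it cannot start a sequence.
print(decode_error_reason(b'\x80'))   # prints: invalid start byte

# 0xE1 is a valid start byte, but 'a' is not a continuation byte.
print(decode_error_reason(b'\xe1a'))  # prints: invalid continuation byte
```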
Performance seems more or less the same: I ran some benchmarks and saw no significant changes in the results. If you have better benchmarks, let me know. I used a file of 320 kB with some ASCII, ASCII mixed with some accented characters, Japanese, and a file with a sample of several different Unicode chars.
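For anyone who wants to run a comparable measurement, here is a minimal sketch using `timeit`. It is not the benchmark used for this patch: the sample contents, the helper names `build_samples` and `time_decode`, and the repeat count are all assumptions chosen for illustration; only the approximate 320 kB size comes from the description above.

```python
import timeit

def build_samples(size=320 * 1024):
    """Build roughly `size`-byte UTF-8 inputs (hypothetical sample data)."""
    units = {
        'ascii': 'a',
        'accented': 'abcd\u00e9',            # ASCII mixed with an accented char
        'japanese': '\u65e5\u672c\u8a9e',    # three 3-byte characters
    }
    samples = {}
    for name, unit in units.items():
        encoded = unit.encode('utf-8')
        # Repeat whole units so every sample stays valid UTF-8.
        samples[name] = encoded * (size // len(encoded))
    return samples

def time_decode(data, number=20):
    """Total seconds spent decoding `data` `number` times."""
    return timeit.timeit(lambda: data.decode('utf-8'), number=number)

samples = build_samples()
timings = {name: time_decode(data) for name, data in samples.items()}
for name, seconds in timings.items():
    print(name, round(seconds, 4))
```

Running the same script against a build with and without the patch gives a rough before/after comparison; absolute numbers depend on the machine.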
Date | User | Action | Args
2010-04-04 05:49:20 | ezio.melotti | set | recipients: + ezio.melotti, lemburg, sjmachin, vstinner, dangra
2010-04-04 05:49:19 | ezio.melotti | set | messageid: <1270360159.99.0.657484109192.issue8271@psf.upfronthosting.co.za>
2010-04-04 05:49:17 | ezio.melotti | link | issue8271 messages
2010-04-04 05:49:16 | ezio.melotti | create |