Message152420
> The followings are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
> * make the surrogatepass error handler work for utf-16 and utf-32. (I
> should be able to finish this by today)
Unfortunately this took longer than I thought, but here is the patch.
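As a rough illustration of the intended behaviour (a sketch only, assuming an interpreter that already contains this change, i.e. CPython 3.4 or later), a lone surrogate should round-trip through utf-16 and utf-32 with "surrogatepass" and be rejected by the default strict handler:

```python
# Assumption: runs on an interpreter that includes this change (CPython 3.4+).
lone = "\ud800"  # a lone high surrogate

# With "surrogatepass" the codecs encode the surrogate code point as-is.
utf16 = lone.encode("utf-16-le", "surrogatepass")
utf32 = lone.encode("utf-32-le", "surrogatepass")
print(utf16)  # b'\x00\xd8'
print(utf32)  # b'\x00\xd8\x00\x00'

# Decoding with the same handler gives the original string back.
roundtrip16 = utf16.decode("utf-16-le", "surrogatepass")
roundtrip32 = utf32.decode("utf-32-le", "surrogatepass")

# Without the handler, the lone surrogate is rejected.
try:
    lone.encode("utf-16-le")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True
print(roundtrip16 == lone, roundtrip32 == lone, strict_rejected)
```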
>> * fix an error in the error handler for utf-16-le. (In, Python3.2
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00"
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2). Rejecting surrogates will
> go in 3.3 only. (Note that lot of Unicode-related code changed between
> 3.2 and 3.3.)
This turned out to be just a two-line fix, so I fixed it along the way. I can create a separate patch with a separate test for 3.2 (certainly doable) and even for 3.3, but since the test is now part of test_lone_surrogates, I am less inclined to do that for 3.3.
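For reference, the case quoted above can be checked with a few lines. This is only a sketch of the reported input, not the test from the patch; on an interpreter with the fix, the "ignore" handler should drop the two bytes of the lone surrogate and keep the "A":

```python
# Input from the quoted report: a lone low surrogate (U+DC80, big-endian)
# followed by U+0041 ("A").
data = b"\xdc\x80\x00\x41"

ignored = data.decode("utf-16-be", "ignore")
replaced = data.decode("utf-16-be", "replace")

print(repr(ignored))   # 'A' once the fix is in; Python 3.2 gave '\x00'
print(repr(replaced))  # '\ufffdA' on a fixed interpreter
```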
You might notice the codec naming inconsistency (utf-16-be and utf16be for encoding and decoding respectively). I have filed issue #13913 for this.
Also, the strcmps are quite crappy. I am working on issue #13916 (disallow the "surrogatepass" handler for non-utf-* encodings). Once we have that, we can examine individual characters instead...
In this patch, the "encoding" attribute of UnicodeDecodeError is changed to return utf16(be|le) for utf-16. "surrogatepass" needs this information to work, although admittedly this is rather ugly. Any better ideas? A new attribute on Unicode(Decode|Encode)Error might help, but utf-16/32 are fairly uncommon encodings anyway and we should not add more burden for, say, utf-8.
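To show why the handler depends on that attribute, here is a small sketch of an error handler that only records what the decoder reports in exc.encoding. The handler name record_encoding is made up for illustration and is not part of the patch, and the exact string reported may differ between versions, so the example does not rely on it:

```python
import codecs

seen = []  # encoding names reported to the handler


def record_encoding(exc):
    """Hypothetical handler: note exc.encoding, then skip the bad bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # This is the only place a handler can learn which codec (and which
    # byte order) it is being called for.
    seen.append(exc.encoding)
    return ("", exc.end)  # replace the bad bytes with nothing, resume after them


codecs.register_error("record_encoding", record_encoding)

# Lone low surrogate U+DC00 followed by "A", both little-endian.
data = b"\x00\xdc\x41\x00"
text = data.decode("utf-16-le", "record_encoding")
print(repr(text), seen)
```

A real "surrogatepass" implementation has to map that string to a byte width and byte order before it can read or write the surrogate, which is where the strcmps come from.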
>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.
Good to know.
Date | User | Action | Args
2012-02-01 00:36:12 | kennyluck | set | recipients: + kennyluck, lemburg, gvanrossum, loewis, vstinner, ezio.melotti, tchrist
2012-02-01 00:36:11 | kennyluck | set | messageid: <1328056571.43.0.998017593435.issue12892@psf.upfronthosting.co.za>
2012-02-01 00:36:10 | kennyluck | link | issue12892 messages
2012-02-01 00:36:10 | kennyluck | create |