Author kennyluck
Recipients ezio.melotti, gvanrossum, kennyluck, lemburg, loewis, tchrist, vstinner
Date 2012-02-01.00:36:07
Message-id <1328056571.43.0.998017593435.issue12892@psf.upfronthosting.co.za>
In-reply-to
Content
> The followings are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
>  * make the surrogatepass error handler work for utf-16 and utf-32. (I
>    should be able to finish this by today)

Unfortunately, this took longer than I thought, but here is the patch.
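
For illustration, here is a minimal sketch of the behavior this TODO item aims for. It assumes an interpreter where "surrogatepass" already works for the utf-16 and utf-32 codecs (Python 3.4 or later, going by the codecs documentation); the helper name roundtrip is made up for this example and is not part of the patch.

```python
LONE = "\udc80"  # a lone low surrogate, not valid in any UTF on its own


def roundtrip(codec):
    """Hypothetical helper: report how `codec` treats a lone surrogate."""
    # Strict encoding is expected to reject the lone surrogate.
    try:
        LONE.encode(codec)
        strict_ok = True
    except UnicodeEncodeError:
        strict_ok = False
    # surrogatepass should let it through and decode it back unchanged.
    raw = LONE.encode(codec, "surrogatepass")
    assert raw.decode(codec, "surrogatepass") == LONE
    return strict_ok, raw


for codec in ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, roundtrip(codec))
```

The explicit -le/-be codec names are used so that no BOM is written and the byte order of the surrogate is visible in the output.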

>>  * fix an error in the error handler for utf-16-le. (In, Python3.2 
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" 
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2).  Rejecting surrogates will
> go in 3.3 only.  (Note that lot of Unicode-related code changed between
> 3.2 and 3.3.)

This turned out to be just a two-line change, so I fixed it along the way. I can create a separate patch with a separate test for 3.2 (certainly doable) and even for 3.3, but since the test is now part of test_lone_surrogates, I am less inclined to do that for 3.3.
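
As a quick check of the case quoted above, the following sketch decodes the same bytes with the "ignore" handler. It assumes an interpreter that already contains the fix; on an unfixed 3.2 the quoted report says the result is "\x00" instead.

```python
# Big-endian UTF-16: a lone low surrogate (U+DC80) followed by U+0041 ("A").
data = b"\xdc\x80\x00\x41"

# "ignore" should drop only the two bytes of the lone surrogate
# and then continue decoding at the next code unit.
decoded = data.decode("utf-16-be", "ignore")
print(repr(decoded))
```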

You might notice the codec naming inconsistency (utf-16-be and utf16be for encoding and decoding respectively). I have filed issue #13913 for this.

Also, the strcmp calls are quite crappy. I am working on issue #13916 (disallow the "surrogatepass" handler for non-utf-* encodings). Once we have that, we can examine individual characters instead...

In this patch, the "encoding" attribute of UnicodeDecodeError is changed to return utf16(be|le) for utf-16. "surrogatepass" needs this information to work, although admittedly this is rather ugly. Any better ideas? A new attribute on Unicode(Decode|Encode)Error might help, but utf-16/32 are fairly uncommon encodings anyway, and we should not add overhead for, say, utf-8.
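
To show why the handler cares about that attribute, here is a small sketch of an error handler that only sees the exception object, so the "encoding" attribute is its one way to learn the codec and byte order. The handler name record_and_skip is hypothetical and this is not how the patch itself is written; the exact spelling of the reported name may also differ between versions, so the sketch normalizes it before comparing.

```python
import codecs

observed = []  # (encoding name, offending bytes) pairs seen by the handler


def record_and_skip(exc):
    """Hypothetical handler: note what the codec reported, then skip the bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    observed.append((exc.encoding, exc.object[exc.start:exc.end]))
    # Return an empty replacement and resume decoding after the bad bytes.
    return ("", exc.end)


codecs.register_error("record_and_skip", record_and_skip)

decoded = b"\xdc\x80\x00\x41".decode("utf-16-be", "record_and_skip")

# Normalize the name, since its spelling (utf-16-be vs. utf16be) may vary.
name = observed[0][0].lower().replace("-", "").replace("_", "")
print(decoded, observed[0][0], name.startswith("utf16"))
```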

>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.

Good to know.
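
A short sketch of the distinction, assuming an interpreter where the stricter UTF-8/16/32 behavior is in place (Python 3.4 or later for utf-16, as far as the codecs documentation says). The helper name accepts is made up for this example.

```python
LONE = "\ud801"  # a lone high surrogate


def accepts(codec):
    """Hypothetical helper: does strict encoding with `codec` allow LONE?"""
    try:
        LONE.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for codec in ("utf-7", "utf-8", "utf-16-le"):
    print(codec, accepts(codec))

# UTF-7 is expected to round-trip the lone surrogate unchanged.
print(LONE.encode("utf-7").decode("utf-7") == LONE)
```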