Message 152420 - Python tracker


Author	kennyluck
Recipients	ezio.melotti, gvanrossum, kennyluck, lemburg, loewis, tchrist, vstinner
Date	2012-02-01.00:36:07
Message-id	<1328056571.43.0.998017593435.issue12892@psf.upfronthosting.co.za>
In-reply-to

Content

> The followings are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
>  * make the surrogatepass error handler work for utf-16 and utf-32. (I
>    should be able to finish this by today)

Unfortunately, this took longer than I thought, but here is the patch.
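As a minimal sketch of what "surrogatepass" support for utf-16 and utf-32 means in practice: the snippet below assumes an interpreter in which this handler works for those codecs (that is, with this patch or an equivalent change applied). The `roundtrip` helper and the `LONE` constant are hypothetical names used only for illustration.

```python
# A lone low surrogate, which strict UTF-16/UTF-32 codecs are meant to reject.
LONE = "\udc80"


def roundtrip(codec):
    """Encode and decode LONE with the "surrogatepass" error handler."""
    data = LONE.encode(codec, "surrogatepass")
    return data, data.decode(codec, "surrogatepass")


utf16_bytes, utf16_text = roundtrip("utf-16-be")
utf32_bytes, utf32_text = roundtrip("utf-32-be")

print(utf16_bytes, utf16_text == LONE)
print(utf32_bytes, utf32_text == LONE)
```

With the handler in place, the surrogate should survive the round trip unchanged for both codecs.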

>>  * fix an error in the error handler for utf-16-le. (In, Python3.2 
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" 
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2).  Rejecting surrogates will
> go in 3.3 only.  (Note that lot of Unicode-related code changed between
> 3.2 and 3.3.)

This turned out to be just a two-line change, so I fixed it along the way. I can create a separate patch with a separate test for 3.2 (certainly doable) and even for 3.3, but since the test is now part of test_lone_surrogates, I am less inclined to do that for 3.3.
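For reference, the behavior the fix is aiming for can be checked with the example quoted above. This is only a sketch of the expected result on an interpreter that includes the fix, not the test from the patch itself; the variable names are mine.

```python
# The first two bytes (0xDC80) form a lone low surrogate in UTF-16-BE;
# the last two bytes (0x0041) encode "A".
data = b"\xdc\x80\x00\x41"

# With the "ignore" handler, only the invalid code unit should be dropped.
decoded = data.decode("utf-16-be", "ignore")
print(repr(decoded))
```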

You might notice the codec naming inconsistency (utf-16-be and utf16be for encoding and decoding respectively). I have filed issue #13913 for this.

Also, the strcmp calls are quite crappy. I am working on issue #13916 (disallow the "surrogatepass" handler for non utf-* encodings). Once we have that, we can examine individual characters of the encoding name instead...

In this patch, the "encoding" attribute of UnicodeDecodeError is changed to return utf16(be|le) for utf-16. This information is necessary for "surrogatepass" to work, although admittedly it is rather ugly. Any better ideas? A new attribute on Unicode(Decode|Encode)Error might be helpful, but utf-16/32 are fairly uncommon encodings anyway, and we should not add more burden for, say, utf-8.
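To make the discussion concrete, here is a small sketch of how an error handler (or any caller) can see the "encoding" attribute on the exception. The exact string reported for utf-16 depends on the interpreter version and on this patch, so the snippet only prints it; `describe_decode_error` is a hypothetical helper, not part of the patch.

```python
def describe_decode_error(data, codec):
    """Return (encoding, start, end, reason) from a strict decode failure."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        # exc.encoding is what a handler such as "surrogatepass" has to
        # inspect to learn which codec (and byte order) it is working for.
        return exc.encoding, exc.start, exc.end, exc.reason
    return None


info = describe_decode_error(b"\xdc\x80\x00\x41", "utf-16-be")
print(info)
```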

>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.

Good to know.

History
Date	User	Action	Args
2012-02-01 00:36:12	kennyluck	set	recipients: + kennyluck, lemburg, gvanrossum, loewis, vstinner, ezio.melotti, tchrist
2012-02-01 00:36:11	kennyluck	set	messageid: <1328056571.43.0.998017593435.issue12892@psf.upfronthosting.co.za>
2012-02-01 00:36:10	kennyluck	link	issue12892 messages
2012-02-01 00:36:10	kennyluck	create