Message 152416 - Python tracker


Author	kennyluck
Recipients	ezio.melotti, kennyluck
Date	2012-01-31.23:51:10
Message-id	<1328053871.05.0.70184707738.issue13916@psf.upfronthosting.co.za>

Content
Currently the "surrogatepass" handler always encodes surrogates as UTF-8, whatever the target codec is. As a result, the behavior of, say, "\udc80".encode("latin-1", "surrogatepass").decode("latin-1") might be unexpected, and I don't even know what "\udc80\udc80".encode("big5", "surrogatepass").decode("big5") would return. Although the documentation says "surrogatepass" is specific to UTF-8, the current behavior is arguably not too harmful thanks to the trailing '\0' of PyBytesObject (so that ((p[0] & 0xf0) == 0xe0 || (p[1] & 0xc0) == 0x80 || (p[2] & 0xc0) == 0x80) in PyCodec_SurrogatePassErrors would not crash).
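To make the case above concrete, here is a minimal sketch that probes the handler from Python. The helper name try_surrogatepass is made up for illustration. What the latin-1 call does depends on the interpreter version: it may return the UTF-8 bytes described above, or the codec may reject the input, so the sketch reports either outcome rather than assuming one.

```python
def try_surrogatepass(text, encoding):
    """Return the encoded bytes, or None if the codec rejects the input.

    Hypothetical helper for illustration only; not part of CPython.
    """
    try:
        return text.encode(encoding, "surrogatepass")
    except (UnicodeEncodeError, LookupError):
        return None


# With UTF-8 the lone surrogate U+DC80 is written as a 3-byte sequence
# and can be read back with the same handler.
utf8_bytes = try_surrogatepass("\udc80", "utf-8")
print(utf8_bytes)
print(utf8_bytes.decode("utf-8", "surrogatepass") == "\udc80")

# With latin-1 the outcome is version-dependent: either the same UTF-8
# bytes leak into the output, or the encoder refuses the input.
latin1_bytes = try_surrogatepass("\udc80", "latin-1")
if latin1_bytes is None:
    print("latin-1: rejected")
else:
    print("latin-1:", latin1_bytes, "->", repr(latin1_bytes.decode("latin-1")))
```

If the latin-1 call does return bytes, decoding them as latin-1 yields three unrelated characters instead of the original surrogate, which is the surprising round trip described above.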

However, I suggest we have the system either 1) raise a LookupError early, or 2) raise the original UnicodeDecodeError or UnicodeEncodeError as soon as PyCodec_SurrogatePassErrors is called. I prefer the former.
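As a rough Python-level illustration of option 2 (not the C patch itself), one could wrap the built-in handler so that it re-raises the original exception for any codec outside the UTF family. The handler name "checked_surrogatepass", the prefix tuple, and the encodes_ok helper are all assumptions made up for this sketch.

```python
import codecs

# Assumption for this sketch: codecs whose normalized name starts with one
# of these prefixes are the only ones allowed to use "surrogatepass".
UTF_PREFIXES = ("utf-8", "utf-16", "utf-32")


def checked_surrogatepass(exc):
    """Hypothetical handler: delegate to "surrogatepass" only for UTF codecs."""
    codec_name = codecs.lookup(exc.encoding).name
    if not codec_name.startswith(UTF_PREFIXES):
        # Option 2: give up immediately with the original exception.
        raise exc
    return codecs.lookup_error("surrogatepass")(exc)


codecs.register_error("checked_surrogatepass", checked_surrogatepass)


def encodes_ok(text, encoding):
    """Return True if text encodes with the hypothetical handler."""
    try:
        text.encode(encoding, "checked_surrogatepass")
        return True
    except UnicodeEncodeError:
        return False


print(encodes_ok("\udc80", "utf-8"))
print(encodes_ok("\udc80", "latin-1"))
```

Option 1 would differ in that the failure is reported as a LookupError before any encoding work starts, rather than from inside the error handler.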

Having this could significantly shorten PyCodec_SurrogatePassErrors in the patch I will shortly submit for issue #12892, as all the error conditions for utf-8, utf-16 and utf-32 are predictable* and almost all the conditionals could be removed. (The * statement is arguable if someone initializes interp->codec_search_path before _PyCodecRegistry_Init and the utf-16/32 encoders are overwritten. I don't think we need to worry about this too much, though. Or am I wrong here?)

History
Date	User	Action	Args
2012-01-31 23:51:11	kennyluck	set	recipients: + kennyluck, ezio.melotti
2012-01-31 23:51:11	kennyluck	set	messageid: <1328053871.05.0.70184707738.issue13916@psf.upfronthosting.co.za>
2012-01-31 23:51:10	kennyluck	link	issue13916 messages
2012-01-31 23:51:10	kennyluck	create