Author kennyluck
Recipients ezio.melotti, kennyluck
Date 2012-01-31.23:51:10
SpamBayes Score 3.12506e-11
Marked as misclassified No
Message-id <>
Currently the "surrogatepass" handler always encodes the surrogates in UTF-8 and hence the behavior for, say, "\udc80".encode("latin-1", "surrogatepass").decode("latin-1") might be unexpected and I don't even know what would, say, "\udc80\udc80".encode("big5", "surrogatepass").decode("big5"), return. Regardless of the fact that the documentation says "surrogatepass" is specific to utf-8", the currently behavior is arguably not too harmful thanks to PyBytesObject's '\0' ending (so that ((p[0] & 0xf0) == 0xe0 || (p[1] & 0xc0) == 0x80 || (p[2] & 0xc0) == 0x80) in PyCodec_SurrogatePassErrors would not crash).

However, I suggest we have the system either 1) raise early LookupError 2) raise the original Unicode(Decode|Encoding)Exception as soon as PyCodec_SurrogatePassErrors is called. I prefer the former.

Having this could shorten PyCodec_SurrogatePassErrors significantly in the patch I will shortly submit for issue #12892 as all the error conditions for utf-8, utf-16 and utf-32 are predicable* and almost all the conditionals could be removed. (The * statement is arguable if someone initializes interp->codec_search_path before _PyCodecRegistry_Init and the utf-16/32 encoders are overwritten. I don't think we need to worry about this too much though. Or am I wrong here?)
Date User Action Args
2012-01-31 23:51:11kennylucksetrecipients: + kennyluck, ezio.melotti
2012-01-31 23:51:11kennylucksetmessageid: <>
2012-01-31 23:51:10kennylucklinkissue13916 messages
2012-01-31 23:51:10kennyluckcreate