Message 227362 - Python tracker



Author	lemburg
Recipients	Arfrever, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date	2014-09-23.14:50:12
Message-id	<542188A1.80307@egenix.com>
In-reply-to	<1411470724.13.0.309067671367.issue18814@psf.upfronthosting.co.za>

Content

On 23.09.2014 13:12, Nick Coghlan wrote:
> 
> Nick Coghlan added the comment:
> 
> Draft docstring for that version
> 
>     def convert_surrogates(data, errors='replace'):
>         """Convert escaped surrogates by applying a different error handler
> 
>         Uses the "replace" error handler by default, but any input
>         error handler may be specified.
>         """
>         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

Nick, the docstring is not correct. The function does not work on
escaped surrogates. Instead, it works on lone surrogates that were
used to encode undecodable bytes from some input data.

The longer story goes like this:

The "surrogateescape" error handler in the .decode() call that led up
to the data you want this function to take as input will convert
undecodable bytes to lone low surrogates (U+DC80..U+DCFF).
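As a minimal sketch of that first step (the sample bytes are made up
for illustration and are not taken from the issue), decoding Latin-1
bytes as UTF-8 with "surrogateescape" looks like this:

```python
# Hypothetical input: "café" encoded as Latin-1, so 0xE9 is not valid UTF-8.
raw = b"caf\xe9"

# surrogateescape maps each undecodable byte 0xNN to the lone
# low surrogate U+DC00 + 0xNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")   # 'caf\udce9'

# Encoding with the same codec and error handler restores the bytes.
restored = text.encode("utf-8", "surrogateescape")
```

Note that the resulting string cannot be encoded with the default
"strict" handler, which is why such data causes trouble later on.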

The function then turns these surrogates back into bytes by encoding
with UTF-8 (which may well not be the original encoding, as Antoine
has already pointed out, but that's not really important for the use
case), recreating the undecodable bytes, and then decodes the result
again with the UTF-8 codec using a new error handler.

So in summary, the function is supposed to retroactively apply
a different error handler to the input data, undoing the effects
of the "surrogateescape" error handler.
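To make the round trip concrete, here is a small sketch using the
draft function quoted above. The sample bytes are an assumption chosen
for illustration:

```python
def convert_surrogates(data, errors="replace"):
    # Draft from the quoted message: recreate the bytes, then redecode.
    return data.encode("utf-8", "surrogateescape").decode("utf-8", errors)

# Hypothetical input: Latin-1 bytes decoded as UTF-8 with surrogateescape.
raw = b"caf\xe9"
escaped = raw.decode("utf-8", "surrogateescape")   # 'caf\udce9'

# The lone surrogate is turned back into 0xE9, which the second
# decode then handles with the newly chosen error handler.
replaced = convert_surrogates(escaped)             # 'caf\ufffd'
ignored = convert_surrogates(escaped, "ignore")    # 'caf'
```

In this case the result equals what raw.decode('utf-8', errors) would
have produced in the first place.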

The name still doesn't match this functionality.

BTW: There's a catch in the approach. The encoding used to decode
the original data may well be 'ascii'. Now, if the original input
data was in fact UTF-8, the input decoding would have mapped the
non-ASCII bytes of the UTF-8 sequences to lone surrogates. The above
function would then turn these back into UTF-8 bytes, redecode them
and get a completely different string back (since the new error
handler would never trigger).
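The catch can be reproduced with a short sketch. The input string is
made up for illustration; the function is the draft quoted above:

```python
def convert_surrogates(data, errors="replace"):
    # Draft from the quoted message.
    return data.encode("utf-8", "surrogateescape").decode("utf-8", errors)

# Hypothetical input: valid UTF-8 bytes that were decoded as ASCII.
raw = "café".encode("utf-8")                        # b'caf\xc3\xa9'
escaped = raw.decode("ascii", "surrogateescape")    # 'caf\udcc3\udca9'

# What "replace" would have produced with the original codec:
expected = raw.decode("ascii", "replace")           # 'caf\ufffd\ufffd'

# What the function actually returns: the recreated bytes are valid
# UTF-8, so "replace" never triggers and the text is silently
# reinterpreted as UTF-8.
actual = convert_surrogates(escaped)                # 'café'
```

So the function does not apply the new error handler to the original
decoding step; it redecodes the data with a different codec.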

I'm not sure whether adding such a small function with so many
unclear implications is a good idea. Either it should be
made more specific, e.g. be reserved for use on data from input
streams with known encoding, or be put into the documentation as
an example for people to use and adapt as necessary.
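One possible shape of the "more specific" variant, purely as a sketch:
the name reapply_error_handler and its signature are hypothetical and
not part of any proposal in this issue. It assumes the caller knows
the encoding used for the original decode and that the codec supports
"surrogateescape" when encoding:

```python
def reapply_error_handler(data, encoding, errors="replace"):
    """Hypothetical helper: undo surrogateescape for a known encoding.

    `encoding` must be the codec that originally decoded the data
    with the "surrogateescape" error handler.
    """
    raw = data.encode(encoding, "surrogateescape")
    return raw.decode(encoding, errors)

# Made-up example: UTF-8 bytes that were decoded as ASCII.
escaped = "café".encode("utf-8").decode("ascii", "surrogateescape")

# Using the original codec, the new error handler does trigger.
fixed = reapply_error_handler(escaped, "ascii")     # 'caf\ufffd\ufffd'
```

With the original codec passed in explicitly, the result matches what
decoding the input with the new error handler would have given.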

History
Date	User	Action	Args
2014-09-23 14:50:13	lemburg	set	recipients: + lemburg, ncoghlan, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-09-23 14:50:13	lemburg	link	issue18814 messages
2014-09-23 14:50:12	lemburg	create