
Author lemburg
Recipients Arfrever, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date 2014-09-23.14:50:12
Message-id <542188A1.80307@egenix.com>
In-reply-to <1411470724.13.0.309067671367.issue18814@psf.upfronthosting.co.za>
Content
On 23.09.2014 13:12, Nick Coghlan wrote:
> 
> Nick Coghlan added the comment:
> 
> Draft docstring for that version
> 
>     def convert_surrogates(data, errors='replace'):
>         """Convert escaped surrogates by applying a different error handler
> 
>         Uses the "replace" error handler by default, but any input
>         error handler may be specified.
>         """
>         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

Nick, the docstring is not correct. The function does not work on
escaped surrogates. Instead, it works on lone surrogates that were
used to encode undecodable bytes from some input data.

The longer story goes like this:

The "surrogateescape" error handler in the .decode() call that led up
to the data you want this function to take as input will convert
undecodable bytes to lone low surrogates.
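
To illustrate that first step, here is a minimal sketch (the sample
bytes and variable names are made up for this example, not taken from
the patch). Each undecodable byte 0xNN ends up as the lone low
surrogate U+DCNN:

```python
# Illustrative input: two bytes that are not valid ASCII.
raw = b'abc\xff\xfe'

# "surrogateescape" maps each undecodable byte 0xNN to U+DCNN
# instead of raising UnicodeDecodeError.
text = raw.decode('ascii', 'surrogateescape')

# Collect the escaped bytes, which show up as lone low surrogates.
escaped = [hex(ord(ch)) for ch in text if '\udc80' <= ch <= '\udcff']

# Encoding with the same codec and handler restores the original bytes.
roundtrip = text.encode('ascii', 'surrogateescape')
```
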

The function then encodes the string back to bytes using the UTF-8
codec (which may well not be the original encoding, as Antoine has
already pointed out, but that's not really important for the use
case), recreating the undecodable bytes, and then decodes the result
again with the UTF-8 codec using a new error handler.
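
The two steps can be spelled out roughly as follows. This is only a
sketch: the function name reapply_error_handler is hypothetical, and
the sample bytes are assumed to be Latin-1 data that was wrongly
decoded as UTF-8.

```python
def reapply_error_handler(data, errors='replace'):
    # Hypothetical name; the body mirrors the draft quoted above.
    # Step 1: lone surrogates turn back into the undecodable bytes.
    raw = data.encode('utf-8', 'surrogateescape')
    # Step 2: decode again, this time with the requested error handler.
    return raw.decode('utf-8', errors)

# Assumed input: the Latin-1 byte 0xE9 is not valid UTF-8 on its own,
# so the input decoding escapes it as the lone surrogate U+DCE9.
text = b'caf\xe9'.decode('utf-8', 'surrogateescape')

replaced = reapply_error_handler(text)           # uses "replace"
ignored = reapply_error_handler(text, 'ignore')  # drops the bad byte
```
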

So in summary, the function is supposed to retroactively apply
a different error handler to the input data, undoing the effects
of the "surrogateescape" error handler.

The name still doesn't match this functionality.

BTW: There's a catch in the approach. The encoding used to decode
the original data may well be 'ascii'. Now, if the original input
data was in fact UTF-8, the input decoding would have mapped the
bytes of every non-ASCII UTF-8 sequence to lone surrogates. The above
function would then turn these back into the UTF-8 bytes, redecode
them and return a completely different string (since the error
handler would never trigger).
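
A short sketch of that catch, under the assumption that UTF-8 data was
read with the 'ascii' codec (the sample string is made up, and the
function is the draft from the quoted comment):

```python
def convert_surrogates(data, errors='replace'):
    # Draft implementation from the quoted comment.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

original = 'Gr\u00fc\u00dfe'     # contains two non-ASCII characters
raw = original.encode('utf-8')   # each one becomes two UTF-8 bytes

# Reading the UTF-8 bytes as ASCII escapes all four non-ASCII bytes.
text = raw.decode('ascii', 'surrogateescape')

# The escaped bytes are valid UTF-8, so "replace" never triggers and
# the result differs from the surrogate-laden input string.
result = convert_surrogates(text)
```

Here the result contains no replacement characters at all: the
function silently reinterprets the escaped bytes as UTF-8.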

I'm not sure whether adding such a small function with so many
unclear implications is a good idea. Either it should be
made more specific, e.g. be reserved for use on data from input
streams with a known encoding, or be put into the documentation as
an example for people to use and adapt as necessary.
History
Date User Action Args
2014-09-23 14:50:13 lemburg set recipients: + lemburg, ncoghlan, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-09-23 14:50:13 lemburg link issue18814 messages
2014-09-23 14:50:12 lemburg create