Author ncoghlan
Recipients Arfrever, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date 2014-09-23.15:32:35
Message-id <1411486356.01.0.767469246346.issue18814@psf.upfronthosting.co.za>
In-reply-to
Content
As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it by default on all OS interfaces (including the standard streams when we detect 'ascii' as the filesystem encoding in 3.5+).
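To make the problem concrete, here is a minimal sketch of how escaped surrogates end up in a string. The byte string and the choice of UTF-8 are assumptions for illustration; OS interfaces do the equivalent with the filesystem encoding.

```python
# Latin-1 encoded bytes that are not valid UTF-8 (illustrative input).
raw = b"caf\xe9.txt"

# Decoding with surrogateescape smuggles the undecodable byte 0xE9
# through as the lone surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", "surrogateescape")

# A strict encode, which most other APIs use, rejects the lone surrogate.
try:
    name.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

The string looks ordinary until something tries to encode it strictly, which is why the failure tends to show up far from the OS call that produced it.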

This *needs* to be a case that folks can handle without needing to spend years learning about encodings and error handlers first. That means being able to tell them "use this documented function to remove the surrogates" rather than "use this magic incantation that you don't understand, and that other people may not be able to read".

I know more about Unicode encodings than the average programmer at this point, yet I still needed to be schooled by true experts in this thread to learn how to solve the problem properly.

Look at this as an opportunity to encapsulate that knowledge in executable form, as while the code is short, it is conceptually *very* dense.

If there's a dedicated function, then replacing the encode/decode dance with a faster pure C alternative also becomes a future possibility (with only a recipe, there's no opportunity to ever optimise it).

With the additional clarification, it is also clear to me that Antoine is correct that the encoding needs to be configurable and should default to the appropriate setting to remove the surrogates from OS provided data.

With that change:

    import sys

    def convert_surrogates(data, encoding=None, errors='replace'):
        """Convert escaped surrogates by applying a different error handler.

        If no encoding is given, defaults to sys.getfilesystemencoding().
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        if encoding is None:
            encoding = sys.getfilesystemencoding()
        # Recover the original bytes, then decode them again with the
        # requested error handler.
        return data.encode(encoding, 'surrogateescape').decode(encoding, errors)

Since it's primarily intended for cleaning OS provided data, I agree os.convert_surrogates() could be a good choice. It would be appropriate to reference it from os.fsdecode() as a way to clean escaped data when the original binary data is no longer available to be decoded again with a different error handler.
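
As a rough usage sketch, the proposed function could be exercised as below. Note that os.convert_surrogates() is only a proposal in this message, so the function is defined locally here, and the input bytes and the explicit "utf-8" encoding are assumptions chosen to keep the example deterministic.

```python
import sys


def convert_surrogates(data, encoding=None, errors="replace"):
    """Local copy of the function proposed above (not an actual os API)."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return data.encode(encoding, "surrogateescape").decode(encoding, errors)


# A string carrying an escaped surrogate; the bytes it came from are
# assumed to be no longer available.
escaped = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# Default handler: the bad byte becomes U+FFFD REPLACEMENT CHARACTER.
replaced = convert_surrogates(escaped, "utf-8")

# Other input error handlers can be chosen by the caller.
ignored = convert_surrogates(escaped, "utf-8", "ignore")
backslashed = convert_surrogates(escaped, "utf-8", "backslashreplace")

# Strings without escaped surrogates pass through unchanged.
clean = convert_surrogates("plain.txt", "utf-8")
```

After conversion the results contain no lone surrogates, so a strict encode such as replaced.encode("utf-8") succeeds.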
History
Date                 User      Action  Args
2014-09-23 15:32:36  ncoghlan  set     recipients: + ncoghlan, lemburg, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-09-23 15:32:36  ncoghlan  set     messageid: <1411486356.01.0.767469246346.issue18814@psf.upfronthosting.co.za>
2014-09-23 15:32:35  ncoghlan  link    issue18814 messages
2014-09-23 15:32:35  ncoghlan  create