Add utilities to "clean" surrogate code points from strings #63014

Open

ncoghlan opened this issue Aug 23, 2013 · 55 comments
Labels
3.8 (only security fixes) · stdlib (Python modules in the Lib dir) · type-feature (A feature request or enhancement)

Comments

@ncoghlan
Contributor

BPO 18814
Nosy @malemburg, @ncoghlan, @pitrou, @vstinner, @ezio-melotti, @stevendaprano, @bitdancer, @yaseppochi, @vadmium, @serhiy-storchaka
Dependencies
  • bpo-22286: Allow backslashreplace error handler to be used on input
  • bpo-23676: Add support of UnicodeTranslateError in standard error handlers

Files
  • convert_surrogates.py
  • codecs_convert_escapes.patch
  • codecs_convert_escapes_2.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-08-23.04:02:31.968>
    labels = ['3.8', 'type-feature', 'library']
    title = 'Add utilities to "clean" surrogate code points from strings'
    updated_at = <Date 2018-03-30.06:52:39.866>
    user = 'https://github.com/ncoghlan'

    bugs.python.org fields:

    activity = <Date 2018-03-30.06:52:39.866>
    actor = 'ncoghlan'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2013-08-23.04:02:31.968>
    creator = 'ncoghlan'
    dependencies = ['22286', '23676']
    files = ['36700', '38506', '38520']
    hgrepos = []
    issue_num = 18814
    keywords = ['patch']
    message_count = 55.0
    messages = ['195937', '195959', '195960', '196047', '196049', '196054', '196063', '225791', '225793', '225800', '225801', '225806', '225809', '225811', '225817', '225818', '225822', '225858', '225869', '225875', '225889', '225972', '227334', '227339', '227340', '227341', '227342', '227343', '227344', '227345', '227346', '227362', '227364', '227365', '227367', '227368', '227369', '238182', '238193', '238277', '238285', '238286', '238309', '242799', '242810', '242939', '251682', '251693', '251694', '251701', '251708', '251740', '251836', '251854', '314682']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'ncoghlan', 'pitrou', 'vstinner', 'ezio.melotti', 'Arfrever', 'steven.daprano', 'r.david.murray', 'sjt', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18814'
    versions = ['Python 3.8']

    @ncoghlan
    Contributor Author

    Prompted by bpo-18713 and http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/, here are some possible utilities we could add to the codecs module to help deal with/debug issues related to surrogate escaped strings:

        def has_escaped_bytes(s):
            """Returns true if string contains surrogate escaped bytes"""
            ...
    
        def replace_escaped_bytes(s):
            """Replaces each surrogate escaped byte with a valid code point"""
            ...
    
        def decode_escaped_bytes(s, nominal_encoding, actual_encoding):
            """Reinterprets incorrectly decoded text using a new encoding"""
            return s.encode(nominal_encoding, 'surrogateescape').decode(actual_encoding)

    @ncoghlan ncoghlan added the type-feature A feature request or enhancement label Aug 23, 2013
    @bitdancer
    Member

    The email package needs has_escaped_bytes. Currently it tries to encode to ASCII to find out if there are any escaped bytes, which we proved by microbenchmark is the fastest way to do it as things stand.

    What does replace do? Replace it with the unknown character code point? Email also needs that.

    @serhiy-storchaka
    Member

    These functions are trivial.

        def has_escaped_bytes(s):
            try:
                s.encode()
                return False
            except UnicodeEncodeError:
                return True
    
        def replace_escaped_bytes(s):
            return s.encode('utf-8', 'replace').decode()
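
    A quick sketch of how these behave on a surrogate-escaped string; note that the 'replace' handler substitutes an ASCII '?' during encoding, not U+FFFD:

    >>> s = b'a\xe9b'.decode('ascii', 'surrogateescape')
    >>> has_escaped_bytes(s)
    True
    >>> has_escaped_bytes('abc')
    False
    >>> replace_escaped_bytes(s)
    'a?b'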

    @pitrou
    Member

    pitrou commented Aug 23, 2013

    Can you summarize the use cases for these?
    (don't want to read a blog post, sorry :-))

    @vstinner
    Member

    "The email package needs has_escaped_bytes. Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand."

    In which function do you need to check this? What do you do if there are surrogate characters?

    @bitdancer
    Member

    The email package uses surrogateescape to store unknown bytes in unicode strings, just as with the handle-bad-data-from-the-OS APIs that surrogateescape was introduced for. (For the same reason: the source data may have improperly encoded bytes that we must nevertheless preserve.) What action certain parts of the code take differs depending on whether or not there are such encoded bytes in the string. So the code needs to check and branch based on that.

    The utility function is email.utils._has_surrogates. You can grep the email source to see where it is used, if you like.
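
    As an illustration only (this is not the actual email.utils._has_surrogates implementation), such a check can be expressed as a scan for code points in the range that surrogateescape uses:

    >>> import re
    >>> _escaped = re.compile('[\udc80-\udcff]')   # range produced by surrogateescape
    >>> def _has_surrogates_sketch(s):
    ...     return _escaped.search(s) is not None
    ...
    >>> _has_surrogates_sketch(b'caf\xe9'.decode('ascii', 'surrogateescape'))
    True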

    @ncoghlan
    Contributor Author

    The use case is to take data from a surrogate escaped interface and either
    filter it out entirely or convert it to a valid Unicode string at the point
    of *input*, before letting it make its way into the rest of the
    application. For example, this approach permits the Python equivalent of
    the Linux console behaviour when it attempts to display the improperly
    encoded file name.

    This approach is likely to be easier to implement in cross-platform code
    than dropping down to the bytes interfaces on POSIX systems and using the
    Unicode APIs on Windows.

    @ncoghlan
    Contributor Author

    Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:

        # Constant in the string module (akin to string.ascii_letters et al)
        escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
    
        # Helper to ensure a string contains no escaped surrogates
        # This allows it to be safely encoded without surrogateescape
        _match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
        def clean(s, repl='\ufffd'):
            return _match_surrogates.sub(repl, s)
    
        # Helper to redecode a string that was decoded incorrectly
        # For example, WSGI strings are passed from the server to the
        # framework as latin-1 by default and may need to be redecoded
        def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
            return s.encode(old_encoding, old_errors).decode(encoding, errors)

    In addition to the concrete use cases David describes, I think these will also serve a useful documentation purpose, in highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).
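
    A hedged usage sketch of the two helpers above (the sample inputs are illustrative):

    >>> clean(b'a\xe9b.txt'.decode('ascii', 'surrogateescape'))
    'a\ufffdb.txt'
    >>> # a UTF-8 request path handed over as latin-1, as PEP 3333 specifies
    >>> redecode('caf\xc3\xa9', 'utf-8')
    'café'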

    @ncoghlan
    Contributor Author

    Note: I dropped the "has_escaped_bytes" idea, as "s != string.clean(s)" is functionally equivalent (albeit a bit more expensive if you don't actually want the cleaned string).

    @ezio-melotti
    Member

    I think similar functions should be added to the unicodedata module rather than the string module or as str methods. If I'm not mistaken this was already proposed in another issue.
    In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others to help deal with surrogates, but AFAIK there's no Python equivalent yet.
    As for the specific constants/functions/methods you propose, IMHO the name escaped_surrogates is not too clear. If it's a string of lone surrogates I would just call it unicodedata.surrogates (and .high_surrogates/.low_surrogates). These can also be used to build one-liners to check whether a string contains surrogates and/or to remove them.
    clean has a very generic name with no hint about surrogates, and its purpose is quite specific.
    I'm also not a big fan of redecode. The equivalent calls to encode/decode are not much longer and are more explicit. Also, having to redecode often indicates that there's a bug earlier on that should be fixed instead (if possible).
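
    The kind of one-liners meant here might look like this (a sketch; unicodedata.surrogates is hypothetical and does not exist today):

    >>> s = 'a\udce9b'
    >>> any('\ud800' <= c <= '\udfff' for c in s)               # contains surrogates?
    True
    >>> ''.join(c for c in s if not '\ud800' <= c <= '\udfff')  # remove them
    'ab'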

    @ncoghlan
    Contributor Author

    The purpose of these changes is to provide tools specifically for working with surrogate escaped data, not for working with arbitrary lone Unicode surrogates.

    "escaped_surrogates" is not defined by the Unicode spec, it's defined by the behaviour of the surrogateescape error handler that lets us tunnel arbitrary bytes through str objects and reproduce them faithfully at the far end. On reflection, I think codecs would be a better home than string (as that's where the error handler is defined), but it doesn't belong in unicodedata.

    I'd be OK with changing the name of the clean function to "clean_escaped_surrogates".

    Needing redecode is not a bug: it's baked into the WSGI spec in PEP-3333. I would be OK with providing it in wsgiref rather than the codecs or string modules, but I think we should provide it somewhere.

    @serhiy-storchaka
    Member

    I agree with Ezio on all points. escaped_surrogates is inefficient for any purpose and incomplete. _match_surrogates can be created in a more efficient way. clean() has too general and misleading a name. redecode() looks just ridiculous; this cumbersome 4-argument function just combines two methods.

    @ncoghlan
    Contributor Author

    Guys, you're Python 3 unicode experts, already thoroughly familiar with how surrogateescape works. These features are not for you.

    The exact implementations don't matter. These need to exist, and they need to be documented with detailed explanations. People don't yet understand what the surrogateescape error handler is, what it's for, or how to work with it.

    People consider surrogate escaping this weird mysterious thing. It's not - it's a really simple mapping of the 128 bytes with the high order bit set to a different part of the Unicode code space. These functions make it obvious.

    wsgiref.redecode is a natural consequence of the design of PEP-3333 - it makes sense to offer it there, as a hint that frameworks are likely to need it.

    We have a serious, serious, learnability problem around binary data handling in Python 3. It's nowhere near as bad as the learnability problem for text handling in Python 2, but it still needs more scaffolding to help people grow in understanding.

    @ezio-melotti
    Member

    That's why I think a function like redecode is a bad idea.
    With Python 2 I've seen a lot of people blindly trying .decode when .encode failed (and the other way around) whenever they got a UnicodeError (and the fact that decoding unicode results in an encode error didn't help).
    I'm afraid that confused developers will try to (mis)use redecode as a workaround to attempt to fix something that shouldn't be broken in the first place, without actually understanding what the real problem is.
    There's really no excuse for developers not to know Unicode nowadays (Joel Spolsky said the same thing more than 10 years ago).
    IME people also tend to overestimate how difficult it is to understand Unicode. While it is true that you can go pretty deep down the rabbit hole, the basic concepts necessary for everyday use are pretty simple and straightforward.

    @ncoghlan
    Contributor Author

    The redecode thing is a distraction from my core concern here, so I've split that out to issue bpo-22264, a separate RFE for a "wsgiref.fix_encoding" function.

    For this issue, my main concern is the function to *clean* a string of escaped binary data, so it can be displayed easily, or otherwise purged of the escaped characters. Preserving the data by default is good, but you have to know a *lot* about how Python 3 works in order to be able to figure out how to clean it out.

    For that, not knowing Unicode in general isn't the problem: it's not knowing PEP-383. If we forget the idea of exposing the constant with the escaped values (I agree that's not very useful), it suggests "codecs.clean_surrogate_escapes" as a possible name:

        # Helper to ensure a string contains no escaped surrogates
        # This allows it to be safely encoded without surrogateescape
        _extended_ascii = bytes(range(128, 256))
        _escaped_surrogates = _extended_ascii.decode('ascii',
                                        errors='surrogateescape')
        _match_escaped = re.compile('[{}]'.format(_escaped_surrogates))
        def clean_surrogate_escapes(s, repl='\ufffd'):
            return _match_escaped.sub(repl, s)

    A more efficient implementation in C would also be fine, this is just an easy way to define the exact semantics.
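
    For example (a hedged sketch of the intended semantics, with an illustrative file name decoded from a non-UTF-8 locale):

    >>> name = b'a\xe9b.txt'.decode('utf-8', 'surrogateescape')
    >>> name
    'a\udce9b.txt'
    >>> clean_surrogate_escapes(name)
    'a\ufffdb.txt'
    >>> clean_surrogate_escapes(name, repl='?')
    'a?b.txt'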

    (I also just noticed that unlike other error handlers, surrogateescape and surrogatepass do not have corresponding codecs.surrogateescape_errors and codecs.surrogatepass_errors functions)

    @serhiy-storchaka
    Member

    What problem is clean_surrogate_escapes() supposed to solve? Could you please provide a user scenario or two?

    Possible alternative implementation is:

    def clean_surrogate_escapes(s):
        return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

    It can be faster for some data (for mostly-ASCII data with rare surrogates it is superfast). For other data 'utf-16' can be a better choice.

    @ncoghlan
    Contributor Author

    My main use case is for passing data to other applications that *don't* have their Unicode handling in order - I want to be able to use Python to do the data scrubbing, but at the moment it requires intimate knowledge of the codec error handling system to do it. (I had never even heard of surrogatepass until this evening)

    Situation:

    What I have: data decoded with surrogateescape
    What I want: that same data with all the surrogates gone, replaced with either the Unicode replacement character or an ASCII question mark (which one I want will depend on the exact situation)

    Assume I am largely clueless about the codec system. I know nothing beyond the fact that Python 3 strings may have smuggled bytes in them and I want to get rid of them because they confuse the application I'm passing them to.

    The concrete example that got me thinking about this again was the task of writing filenames into a UTF-8 encoded email, and wanting to scrub the output from os.listdir before writing the list into the email (s/email/web page/ also works).

    For issue bpo-22016 I actually suggested doing this as *another* codec error handler ("surrogatereplace"), but Stephen Turnbull convinced me this original idea was better: it should just be a pure data transformation pass on the string, clearing the surrogates out, and leaving me with data that is identical to that I would have had if "surrogatereplace" had been used instead of "surrogateescape" in the first place.

    As "errors='replace'" already covers the "ASCII ?" replacement case, that means your proposed "redecode" based solution would cover the rest of my use case.

    @vstinner
    Member

    Your clean() function loses information. If a filename contains almost only undecodable characters, it will look like ����.txt. That's not very useful. I would prefer to escape the bytes. Mac OS X (HFS+ filesystem), for example, uses the %HH format: "\udc80" would be replaced with "%80".

    This format is also used in URLs. For example, "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt" in Firefox (when listing a local directory), but Firefox uses the URL "file://.../a%E9b.txt" (hexadecimal in uppercase).

    In the Gnome file browser (Nautilus), "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt (invalid encoding)".

    @ncoghlan
    Contributor Author

    Ideally we'd have string modification support for all the translations we offer as codec error handlers:

    • Unicode replacement character ('replace' on input)
    • ASCII question mark ('replace' on output)
    • Dropping them entirely ('ignore')
    • XML character reference ('xmlcharrefreplace')
    • Python escape sequence ('backslashreplace')

    The reason it's beneficial to be able to do these as string transformations rather than only in the codecs is that you may just be contributing part of the output, with the actual encoding operation handled elsewhere (e.g. you may be storing it in a data structure that will later be encoded as JSON or XML, or my earlier example of generating a list of files to be included in an email). Surrogates are great when you're just passing data straight back to the operating system. They're not so great when you're passing them on to other parts of the application as text. I'd prefer to be able to deal with them closer to the point of origin, at least in some cases.

    Now, some of these things *can* be done today using Serhiy's trick of encoding to UTF-8 and then decoding again:

    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')
    

    However, these two don't work properly:

    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')
    

    The reason those don't work is because they'll encode the *surrogate escaped bytes*, rather than the originals.

    Mapping the escaped bytes to percent encoding has the same problem - you likely want to do a two step transformation (escaped surrogate -> original byte -> percent encoded value), rather than directly percent encoding the already escaped bytes.
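
    To illustrate that two-step idea (a hypothetical helper, named here only for the example): first map each escaped surrogate back to its original byte value, then percent-encode that byte.

        def percent_encode_escapes(s):
            # Escaped surrogates created by surrogateescape live in
            # U+DC80..U+DCFF; recover the original byte and emit it as %HH,
            # leaving all other characters untouched.
            return ''.join('%{:02X}'.format(ord(c) - 0xDC00)
                           if '\udc80' <= c <= '\udcff' else c
                           for c in s)

    This reproduces the %HH style mentioned above: percent_encode_escapes('a\udce9b.txt') gives 'a%E9b.txt'.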

    @bitdancer
    Member

    Right now having has_escaped_bytes isn't too important, since I've done nothing to profile and improve the performance of the new email code. But eventually I'll need it, because detecting the existence of escaped bytes is inside some of the inner loops in the header processing code. "s != s.clean_escaped_bytes()" (or whatever you call it) just smells wrong as a way to spell has_escaped_bytes.

    @pitrou
    Member

    pitrou commented Aug 25, 2014

    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')

    Why not the reverse:

    os.fsencode(data).decode('utf-8', 'replace')
    os.fsencode(data).decode('utf-8', 'ignore')

    Note that "backslashreplace" needs to be enhanced to work when decoding too.
    Note that "xmlcharrefreplace" doesn't make sense here: it encodes a *character* reference, but you're precisely trying to represent something which fails interpreting as a character.

    (AFAIK, XML can't represent non-text data, except in NDATA sequences)

    @ncoghlan
    Contributor Author

    Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case.

    Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286

    I also agree we can ignore xmlcharrefreplace here.

    So that leaves the basic pattern as:

    data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
    data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore')
    data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')

    This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that.

    Possible function name and implementation:

        def convert_surrogateescape(data, errors='replace'):
            return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
        
    Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here)

    Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.
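
    A quick demonstration of that pattern (illustrative data; note that 'backslashreplace' on decoding relies on the enhancement tracked in bpo-22286):

    >>> data = b'caf\xe9'.decode('utf-8', 'surrogateescape')
    >>> convert_surrogateescape(data)                    # errors='replace'
    'caf\ufffd'
    >>> convert_surrogateescape(data, 'backslashreplace')
    'caf\\xe9'
    >>> convert_surrogateescape(data, 'strict')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data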

    @ncoghlan
    Contributor Author

    Updated issue title to reflect current proposal.

    @ncoghlan ncoghlan changed the title Add tools for "cleaning" surrogate escaped strings Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Sep 23, 2014
    @malemburg
    Member

    Don't like the function name :-)

    How about codecs.filter_non_utf8_data(), since that's closer
    to what the function is really doing and doesn't require
    knowledge about what "surrogateescape" is.

    @ncoghlan
    Contributor Author

    The error handler is called "surrogateescape". That means "convert_surrogateescape" is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the user other than the name of the error handler and the fact that it is used to smuggle arbitrary bytes through the Python 3 str type.

    Getting from "this string was decoded with the surrogateescape handler and may contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup function is a much bigger leap that requires more assumed knowledge on the part of the user, and also one that confuses the conceptual purpose of the function (cleaning up the output of the surrogateescape error handler to ensure it is a pure Unicode string) with the internal details of the proposed approach to implementing that cleanup operation (encoding to UTF-8 with surrogateescape, and then decoding again with a different error handler).

    @ncoghlan
    Contributor Author

    The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler

        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    

    @ncoghlan
    Contributor Author

    Note I would also be OK with "convert_surrogates", as that's the term that appears in the relevant error message:

    >>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed

    @pitrou
    Member

    pitrou commented Sep 23, 2014

    On 23/09/2014 12:57, Nick Coghlan wrote:

    The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler
    
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    

    'utf-8' is hardcoded?

    @malemburg
    Member

    On 23.09.2014 13:12, Nick Coghlan wrote:

    Nick Coghlan added the comment:

    Draft docstring for that version

    def convert_surrogates(data, errors='replace'):
        """Convert escaped surrogates by applying a different error handler
    
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    

    Nick, the doc string is not correct. It is not working on escaped
    surrogates. Instead it is working on lone surrogates that were used
    to encode undecodable bytes from some input data.

    The longer story goes like this:

    The "surrogateescape" error handler in the .decode() call that lead up
    to the data you want this function to take as input, will convert
    undecodable data to lone low surrogates.

    The function then reverts these bytes back into UTF-8 (which may well
    not be the original encoding, as Antoine has already pointed out, but
    that's not really important for the use case), recreating the
    unencodable bytes and then decodes the result again using the UTF-8
    codec using a new error handler.

    So in summary, the function is supposed to retroactively apply
    a different error handler to the input data, undoing the effects
    of the "surrogateescapes" error handler.

    The name still doesn't match this functionality.

    BTW: There's a catch in this approach. The encoding used to decode the original data may well be 'ascii'. Now, if the original input data was in fact UTF-8, the input decoding would have mapped the UTF-8 bytes to lone surrogates. The above function would then turn these back into UTF-8 bytes, redecode them and get a completely different string back (since the error handlers would not trigger).

    I'm not sure whether adding such a small function with so many unclear implications is a good idea. Either it should be made more specific, e.g. be reserved for use on data from input streams with known encoding, or be put into the documentation as an example for people to use and adapt as necessary.
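
    A worked illustration of that catch (example data only): genuinely UTF-8 input decoded as ASCII with surrogateescape is silently reinterpreted rather than flagged.

    >>> raw = 'é'.encode('utf-8')                 # b'\xc3\xa9'
    >>> s = raw.decode('ascii', 'surrogateescape')
    >>> s
    '\udcc3\udca9'
    >>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
    'é'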

    @bitdancer
    Member

    And indeed my use case for this has instances of both cases: data originally decoded using ASCII where the non-ASCII bytes must end up as replacement characters, and data originally decoded using utf-8.

    I'm also not sure that it is worth adding this. If you know what you are doing the solution is obvious, and if you don't know what you are doing you shouldn't be using surrogateescape in the first place :)

    Now, if there were, or is intended to be, a more efficient C-level implementation, that answer might be different.

    @bitdancer
    Member

    Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio. So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module. os.convert_surrogates?

    @ncoghlan
    Contributor Author

    As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it by default on all OS interfaces (including the standard streams when we detect 'ascii' as the filesystem encoding in 3.5+).

    This *needs* to be a case that folks can handle without needing to spend years learning about encodings and error handlers first. That means being able to tell them "use this documented function to remove the surrogates" rather than "use this magic incantation that you don't understand, and that other people may not be able to read".

    I know more about Unicode encodings than the average programmer at this point, yet I still needed to be schooled by true experts in this thread to learn how to solve the problem properly.

    Look at this as an opportunity to encapsulate that knowledge in executable form, as while the code is short, it is conceptually *very* dense.

    If there's a dedicated function, then replacing the encode/decode dance with a faster pure C alternative also becomes a future possibility (with only a recipe, there's no opportunity to ever optimise it).

    With the additional clarification, it is also clear to me that Antoine is correct that the encoding needs to be configurable and should default to the appropriate setting to remove the surrogates from OS provided data.

    With that change:

        def convert_surrogates(data, encoding=None, errors='replace'):
            """Convert escaped surrogates by applying a different error handler

            If no encoding is given, defaults to sys.getfilesystemencoding().
            Uses the "replace" error handler by default, but any input
            error handler may be specified.
            """
            if encoding is None:
                encoding = sys.getfilesystemencoding()
            return data.encode(encoding, 'surrogateescape').decode(encoding, errors)
    

    Since it's primarily intended for cleaning OS provided data, then I agree os.convert_surrogates() could be a good choice. It would be appropriate to reference it from os.fsdecode() as a way to clean escaped data when the original binary data was no longer available to be decoded again with a different error handler.

    @ncoghlan ncoghlan changed the title Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Add a convert_surrogates function to "clean" surrogate escaped strings Sep 23, 2014
    @serhiy-storchaka
    Member

    Good catch Antoine!

    Here is a sample of a more complicated implementation.

    @serhiy-storchaka serhiy-storchaka changed the title Add a convert_surrogates function to "clean" surrogate escaped strings Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Sep 23, 2014
    @ncoghlan
    Contributor Author

    Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding.

    @serhiy-storchaka
    Member

    Proposed preliminary patch adds three functions in the codecs module:

    convert_surrogates(data, errors) -- handle lone surrogates with specified error handler.

    >>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
    'a€\\udca4'

    convert_surrogateescape(data, errors) -- handle surrogateescaped bytes with specified error handler

    >>> codecs.convert_surrogateescape('a\u20ac\udca4', 'backslashreplace')
    'a€\\xa4'

    convert_astrals(data, errors) -- handle astral (non-BMP) characters with specified error handler.

    >>> codecs.convert_astrals('a\u20ac\U000e007f', 'backslashreplace')
    'a€\\U000e007f'

    The names are open to discussion.

    I'm also thinking about adding two functions or error handlers (that can be used with convert_surrogates and convert_astrals) for composing astral characters from surrogate pairs and vice versa.

    @serhiy-storchaka serhiy-storchaka added the stdlib Python modules in the Lib dir label Mar 16, 2015
    @ncoghlan
    Contributor Author

    (Serhiy, did you miss uploading the new patch?)

    Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the case of handling lone surrogates and escaped surrogates, what about:

        rehandle_surrogatepass(data, errors="strict")
        rehandle_surrogateescape(data, errors="strict")

    That is, we know we have data that was decoded with either surrogatepass or surrogateescape (respectively) as the error handler, and we want to process the results of that with a different error handler.

    I believe those two would be enough to address the specific cases this issue was raised to cover, so it may make sense to file a separate issue to discuss the use cases for the custom astral handling.

    Since astrals aren't actually errors in the first place, that could become:

        handle_astrals(data, errors="strict")

    As in "pass every astral code point in this string through the named error handler".

    The astral -> surrogate pair and surrogate pair -> astral converters do sound potentially interesting, but as noted above, I think they may call for a separate issue that better explains the specific use cases.

    @serhiy-storchaka
    Member

    I uploaded the patch just before your comment Nick.

    Here is an updated patch. The functions are renamed as Nick suggested, and two more functions are added: decompose_astrals() and compose_surrogate_pairs(). They are mainly here as examples; they can be committed in another issue.

    I hesitate about the rehandle_surrogatepass name. This function handles surrogates that can be created not only with the "surrogatepass" handler, but also in other ways, e.g. with the "surrogateescape" handler, with chr(), handle_astral() or decompose_astrals(). Actually it checks that the string is valid Unicode (not containing surrogates) and handles any errors found with the specified error handler.

    Maybe it is time for a wider discussion on python-dev. I especially want to hear the opinions of Ezio and Martin.

    @ncoghlan
    Contributor Author

    I'd wondered about that with respect to rehandle_surrogatepass.

    The current implementation looks like it processes *all* surrogates (even valid surrogate pairs), so "handle_surrogates" might be a suitable name.

    If the intent is for it to be "handle_lone_surrogates", I'm not sure the current implementation achieves that, as a valid surrogate pair will match re.compile('[\ud800-\uefff]+').

    The rest looks OK to me, including the decompose_astrals() and compose_surrogate_pairs() functions. Regardless of any practical utility, the latter two seem useful for *educational* purposes when it comes to unicode, by making it clear how to switch between the single code point and dual code point representations of the astrals.

    @ncoghlan
    Contributor Author

    Oh, and yes, I agree a python-dev discussion would be a good idea.

    From my perspective, "rehandle_surrogateescape" is the key function for making it easier to check for malformed input data from operating system interfaces.

    The other items I don't personally have a use case for, but they seem potentially valuable in making some key Unicode concepts a bit more discoverable.

    @serhiy-storchaka
    Member

    Note that the provided Python implementations are rather a proof of concept. After discussion I'll provide more efficient C implementations, which should be 1-2 orders of magnitude faster (and essentially free for the common case of ASCII strings).

    @yaseppochi
    Mannequin

    yaseppochi mannequin commented May 9, 2015

    Please do not add the "rehandle" functions to codecs. They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation.

    I suggest a "validation" submodule of the unicodedata package, or perhaps a new "unicodeutils" package, for these functions, as well as those that just detect the surrogates, etc.

    Because they change the semantics of data they should be documented as potentially dangerous because they can't be inverted back to bytes without knowledge of the history of transformations they perform (and not even then in the case of the "replace" error handler). This matters in applications where the input bytes may have been digitally signed, for example.

    @ncoghlan
    Contributor Author

    ncoghlan commented May 9, 2015

    surrogateescape and surrogatepass data *already* can't be inverted back to bytes reliably without knowing the original encoding - if you encode them as something else when they contain surrogates, you'll either get an exception (the default) or mojibake (if you use surrogateescape/surrogatepass as the output error handler). They only work as a transparent pass-through if the input and output encodings match.

    I'd be fine with putting these data scrubbing functions somewhere other than in codecs, though. (I'm not sure unicodedata is the right place, but a new module like "string.internals" might be, as these functions have more to do with Python's internal text representation than with anything else. A module like the latter could also be a home for things like a chunking utility that splits a string up into substrings that use as little memory as possible for feeding into a StringIO instance before throwing the original away.)

    I also don't think they're urgent - the introduction of /etc/locale.conf
    makes modern Linux far more consistent in getting locale settings right,
    and even older platforms tend to get the locale right for user processes.

    @ncoghlan
    Contributor Author

    I suggest we defer this one to 3.6 - I still think it's worth doing, but I don't think it's a major barrier to migration, and it would be good to get some real world experience with the new sys.stdin behaviour of defaulting to using surrogateescape in the POSIX locale in 3.5 before committing to a particular design for the surrogate cleaning API.

    I do like the idea of a "string.internals" submodule as a possible home for exposing the Python level API.

    @ncoghlan ncoghlan changed the title Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Add utilities to "clean" surrogate code points from strings May 12, 2015
    @vadmium
    Member

    vadmium commented Sep 27, 2015


    I think my suggested colours for the bikeshed would be handle_surrogates() and handle_surrogateescape(). “Rehandle” seems awkward and too assuming to me. And I agree with Serhiy that surrogates are a Unicode thing, not just related to the “surrogatepass” handler.

    Adding them to “codecs” makes sense to me. The most important one, handle_surrogateescape() or equivalent, is closely related to the error handler of that module.

    Having handle_surrogateescape or equivalent would probably be useful for bpo-25184 (displaying an arbitrary file path in a UTF-8 HTML file).

    @vstinner
    Member

    Hum, I suggest putting these functions in a package on PyPI, or as recipes on a website like Stack Overflow, and closing the issue.

    I'm still not convinced that these functions are useful. Usually we take a function from an existing project used in applications and put it in the stdlib. Here the use case still looks artificial. For example, which application requires escaping non-BMP characters? How does it handle them currently?

    There are too many ways to handle surrogate characters. The common ways to show undecodable bytes are not supported by the functions proposed by Serhiy. Example: %80 on Mac OS X. Gnome uses something else.

    It was said that one reason to add new functions is performance. I'm also not convinced that such a function is the bottleneck of any application.

    I prefer to wait until users experiment with their own implementation and
    see if a common function can be extracted from this to put it in the stdlib.

    @ncoghlan
    Contributor Author

    I think moving this forward mainly needs someone with the time and energy to wrangle a python-ideas/python-dev discussion to get some additional feedback on the API design. As I see it, there are 2 main questions to be resolved:

    1. Where to expose these functions

    The default location would be the codecs module, as they're closely related to the error handlers in that module, and the main reasons for needing to clean data at all are handling dirty data produced by an interface that uses surrogatepass or surrogateescape when decoding (handle_surrogates, handle_surrogateescape), or encoding data for use in a context which doesn't correctly handle code points outside the basic multilingual plane (handle_astrals).

    If added to the codecs module, they could be documented in new sections on "Postprocessing decoded text" and "Preprocessing text for encoding".

    The main argument against that would be Stephen's one, which is that these aren't themselves encoding or decoding operations, but rather internal state manipulations on Python strings.

    2. The exact function set to be provided.

    The three potential data cleaning cases currently being considered:

    • process_surrogates: reprocessing all surrogates in the string, including lone surrogates and valid surrogate pairs. Such strings may be produced by using the "surrogatepass" handler when decoding, or by decomposing astral characters to surrogate pairs.
    • process_surrogateescape: reprocessing only lone surrogates in the U+DC80 to U+DCFF range, with other surrogate pairs or lone surrogates triggering UnicodeTranslateError. Such strings may be produced by using the "surrogateescape" error handler when decoding.
    • process_astrals: reprocessing all code points in the astral plane.

    These seem to cover the essentials to me, and I changed the proposed prefix to "process_*" based on the idea of documenting them as preprocessing and postprocessing steps for encoding and decoding.

    @ncoghlan
    Contributor Author

    As far as the rationale for adding the functions at all goes, my main interest is still in having somewhere in the codecs module documentation to *define the problem*, and to my mind that entails also offering a simple way to do the relevant pre-/post-processing.

    The nice aspect of building any related capabilities atop the standard error handlers is that it also means that third party modules can provide custom error handlers to support further escaping techniques, and those will also be available for use in decoding and encoding operations, rather than being specific to pre-/post-processing of the data.

    However, it's also the case that we're generally going to be talking about the combination of encoding misconfiguration *and* processing data that gets potentially corrupted by the misconfiguration *and* doing something with it that isn't already handled by a surrogateescape round-trip, which is why I suspect in practice most applications are going to be able to get away with ignoring the problem entirely (especially with C.UTF-8 support coming to Fedora 24, so the Fedora/RHEL/CentOS ecosystem will be joining the Debian/Ubuntu ecosystem in offering that by default)

    @bitdancer
    Member

    I also want "detect if there are any surrogates".

    @stevendaprano
    Member

    On Sun, Sep 27, 2015 at 04:17:45PM +0000, R. David Murray wrote:

    I also want "detect if there are any surrogates".

    I think that's useful enough that it should be a str method. Here's a pure-Python implementation:

    def is_surrogate(s):
        return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

    The actual Flexible String Representation implementation can be even
    more efficient. All-ASCII and all-Latin1 strings cannot possibly include
    surrogates, so is_surrogate will be a constant-time operation.

    (If we care about making this a constant-time operation for all
    strings, the constructor could set a flag on the string object. Since
    the constructor has to walk the string anyway, I don't think that will
    cost much time-wise.)

    @vstinner
    Member

    I also want "detect if there are any surrogates".

    Could you please open a separate issue for this function/method?

    I believe that it's very different than other proposed functions/methods.

    It was proposed before to add methods like "is_ascii()", but the request was rejected because other Python implementations don't implement Unicode using PEP 393, and so the method would be less efficient than expected, depending on the implementation of Python.

    Well, I don't know if we should add methods relying on the PEP-393 or not. It's probably better to discuss such "political" question on python-dev.

    @bitdancer
    Member

    Done: bpo-25269.

    @ncoghlan
    Contributor Author

    With PEPs 538 and 540 implemented for 3.7, my thinking on this has evolved a bit.

    A recent discussion on python-ideas [1] also introduced me to the third-party library "ftfy", which offers a wide range of tools for cleaning up improperly decoded data: https://ftfy.readthedocs.io/en/latest/

    That includes a lone surrogate fixer: https://ftfy.readthedocs.io/en/latest/#ftfy.fixes.fix_surrogates

    So a potential way to go here would be to add a section on "Handling Improperly Decoded Text Data" to the codecs module documentation, and include ftfy as a See Also link in that new section.

    If folks think that would be a reasonable way to go, then I think the clearest way to handle it would be to close this issue as "later" (which still implies "maybe never", but not as strongly as "rejected" does), and open a new issue for the suggested new section in the docs.

    [1] https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

    @ncoghlan ncoghlan added the 3.8 only security fixes label Mar 30, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022