Add utilities to "clean" surrogate code points from strings #63014

Open

ncoghlan opened this issue Aug 23, 2013 · 55 comments
Labels
3.8 (only security fixes) · stdlib (Python modules in the Lib dir) · type-feature (A feature request or enhancement)

Comments

@ncoghlan
Contributor

BPO 18814
Nosy @malemburg, @ncoghlan, @pitrou, @vstinner, @ezio-melotti, @stevendaprano, @bitdancer, @yaseppochi, @vadmium, @serhiy-storchaka
Dependencies
  • bpo-22286: Allow backslashreplace error handler to be used on input
  • bpo-23676: Add support of UnicodeTranslateError in standard error handlers

Files
  • convert_surrogates.py
  • codecs_convert_escapes.patch
  • codecs_convert_escapes_2.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-08-23.04:02:31.968>
    labels = ['3.8', 'type-feature', 'library']
    title = 'Add utilities to "clean" surrogate code points from strings'
    updated_at = <Date 2018-03-30.06:52:39.866>
    user = 'https://github.com/ncoghlan'

    bugs.python.org fields:

    activity = <Date 2018-03-30.06:52:39.866>
    actor = 'ncoghlan'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2013-08-23.04:02:31.968>
    creator = 'ncoghlan'
    dependencies = ['22286', '23676']
    files = ['36700', '38506', '38520']
    hgrepos = []
    issue_num = 18814
    keywords = ['patch']
    message_count = 55.0
    messages = ['195937', '195959', '195960', '196047', '196049', '196054', '196063', '225791', '225793', '225800', '225801', '225806', '225809', '225811', '225817', '225818', '225822', '225858', '225869', '225875', '225889', '225972', '227334', '227339', '227340', '227341', '227342', '227343', '227344', '227345', '227346', '227362', '227364', '227365', '227367', '227368', '227369', '238182', '238193', '238277', '238285', '238286', '238309', '242799', '242810', '242939', '251682', '251693', '251694', '251701', '251708', '251740', '251836', '251854', '314682']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'ncoghlan', 'pitrou', 'vstinner', 'ezio.melotti', 'Arfrever', 'steven.daprano', 'r.david.murray', 'sjt', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18814'
    versions = ['Python 3.8']

    @ncoghlan
    Contributor Author

    Prompted by bpo-18713 and http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/, here are some possible utilities we could add to the codecs module to help deal with/debug issues related to surrogate escaped strings:

        def has_escaped_bytes(s):
            """Returns true if string contains surrogate escaped bytes"""
            ...
    
        def replace_escaped_bytes(s):
            """Replaces each surrogate escaped byte with a valid code point"""
            ...
    
        def decode_escaped_bytes(s, nominal_encoding, actual_encoding):
            """Reinterprets incorrectly decoded text using a new encoding"""
            return s.encode(nominal_encoding, 'surrogateescape').decode(actual_encoding)

    @ncoghlan ncoghlan added the type-feature A feature request or enhancement label Aug 23, 2013
    @bitdancer
    Member

    The email package needs has_escaped_bytes. Currently it tries to encode to ASCII to find out if there are any escaped bytes, which we proved by microbenchmark is the fastest way to do it as things stand.

    What does replace do? Replace it with the unknown character code point? Email also needs that.

    @serhiy-storchaka
    Member

    These functions are trivial.

        def has_escaped_bytes(s):
            try:
                s.encode()
                return False
            except UnicodeEncodeError:
                return True
    
        def replace_escaped_bytes(s):
            return s.encode('utf-8', 'replace').decode()
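
    A quick sketch of how these behave on a surrogate-escaped string; note that the 'replace' handler substitutes an ASCII '?' during encoding, not U+FFFD:

    >>> s = b'a\xe9b'.decode('ascii', 'surrogateescape')
    >>> has_escaped_bytes(s)
    True
    >>> has_escaped_bytes('abc')
    False
    >>> replace_escaped_bytes(s)
    'a?b'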

    @pitrou
    Member

    pitrou commented Aug 23, 2013

    Can you summarize the use cases for these?
    (don't want to read a blog post, sorry :-))

    @vstinner
    Member

    "The email package needs has_escaped_bytes. Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand."

    In which function do you need to check this? What do you do if there are surrogate characters?

    @bitdancer
    Member

    The email package uses surrogateescape to store unknown bytes in unicode strings, just as with the handle-bad-data-from-the-OS APIs that surrogateescape was introduced for. (For the same reason: the source data may have improperly encoded bytes that we must nevertheless preserve.) What action certain parts of the code take differs depending on whether or not there are such encoded bytes in the string. So the code needs to check and branch based on that.

    The utility function is email.utils._has_surrogates. You can grep the email source to see where it is used, if you like.
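
    As an illustration only (this is not the actual email.utils._has_surrogates implementation), such a check can be expressed as a scan for code points in the range that surrogateescape uses:

    >>> import re
    >>> _escaped = re.compile('[\udc80-\udcff]')   # range produced by surrogateescape
    >>> def _has_surrogates_sketch(s):
    ...     return _escaped.search(s) is not None
    ...
    >>> _has_surrogates_sketch(b'caf\xe9'.decode('ascii', 'surrogateescape'))
    True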

    @ncoghlan
    Contributor Author

    The use case is to take data from a surrogate escaped interface and either
    filter it out entirely or convert it to a valid Unicode string at the point
    of *input*, before letting it make its way into the rest of the
    application. For example, this approach permits the Python equivalent of
    the Linux console behaviour when it attempts to display the improperly
    encoded file name.

    This approach is likely to be easier to implement in cross-platform code
    than dropping down to the bytes interfaces on POSIX systems and using the
    Unicode APIs on Windows.

    @ncoghlan
    Contributor Author

    Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:

        # Constant in the string module (akin to string.ascii_letters et al)
        escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
    
        # Helper to ensure a string contains no escaped surrogates
        # This allows it to be safely encoded without surrogateescape
        _match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
        def clean(s, repl='\ufffd'):
            return _match_surrogates.sub(repl, s)
    
        # Helper to redecode a string that was decoded incorrectly
        # For example, WSGI strings are passed from the server to the
        # framework as latin-1 by default and may need to be redecoded
        def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
            return s.encode(old_encoding, old_errors).decode(encoding, errors)

    In addition to the concrete use cases David describes, I think these will also serve a useful documentation purpose, in highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).
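
    A hedged usage sketch of the two helpers above (the sample inputs are illustrative):

    >>> clean(b'a\xe9b.txt'.decode('ascii', 'surrogateescape'))
    'a\ufffdb.txt'
    >>> # a UTF-8 request path handed over as latin-1, as PEP 3333 specifies
    >>> redecode('caf\xc3\xa9', 'utf-8')
    'café'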

    @ncoghlan
    Contributor Author

    Note: I dropped the "has_escaped_bytes" idea, as "s != string.clean(s)" is functionally equivalent (albeit a bit more expensive if you don't actually want the cleaned string).

    @ezio-melotti
    Member

    I think similar functions should be added to the unicodedata module rather than the string module or as str methods. If I'm not mistaken this was already proposed in another issue.
    In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others to help deal with surrogates, but AFAIK there's no Python equivalent yet.
    As for the specific constants/functions/methods you propose, IMHO the name escaped_surrogates is not too clear. If it's a string of lone surrogates I would just call it unicodedata.surrogates (and .high_surrogates/.low_surrogates). These can also be used to build one-liners to check whether a string contains surrogates and/or to remove them.
    clean has a very generic name with no hint about surrogates, and its purpose is quite specific.
    I'm also not a big fan of redecode. The equivalent calls to encode/decode are not much longer and are more explicit. Also, having to redecode often indicates that there's a bug earlier on that should be fixed instead (if possible).
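
    The kind of one-liners meant here might look like this (a sketch; unicodedata.surrogates is hypothetical and does not exist today):

    >>> s = 'a\udce9b'
    >>> any('\ud800' <= c <= '\udfff' for c in s)               # contains surrogates?
    True
    >>> ''.join(c for c in s if not '\ud800' <= c <= '\udfff')  # remove them
    'ab'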

    @ncoghlan
    Contributor Author

    The purpose of these changes is to provide tools specifically for working with surrogate escaped data, not for working with arbitrary lone Unicode surrogates.

    "escaped_surrogates" is not defined by the Unicode spec, it's defined by the behaviour of the surrogateescape error handler that lets us tunnel arbitrary bytes through str objects and reproduce them faithfully at the far end. On reflection, I think codecs would be a better home than string (as that's where the error handler is defined), but it doesn't belong in unicodedata.

    I'd be OK with changing the name of the clean function to "clean_escaped_surrogates".

    Needing redecode is not a bug: it's baked into the WSGI spec in PEP-3333. I would be OK with providing it in wsgiref rather than the codecs or string modules, but I think we should provide it somewhere.

    @serhiy-storchaka
    Member

    I agree with Ezio on all points. escaped_surrogates is inefficient for any purpose and incomplete. _match_surrogates can be created in a more efficient way. clean() has too general and misleading a name. redecode() looks just ridiculous; this cumbersome 4-argument function just combines two methods.

    @ncoghlan
    Contributor Author

    Guys, you're Python 3 unicode experts, already thoroughly familiar with how surrogateescape works. These features are not for you.

    The exact implementations don't matter. These need to exist, and they need to be documented with detailed explanations. People don't yet understand what the surrogateescape error handler is, what it's for, or how to work with it.

    People consider surrogate escaping this weird mysterious thing. It's not - it's a really simple mapping of the 128 bytes with the high order bit set to a different part of the Unicode code space. These functions make it obvious.

    wsgiref.redecode is a natural consequence of the design of PEP-3333 - it makes sense to offer it there, as a hint that frameworks are likely to need it.

    We have a serious, serious, learnability problem around binary data handling in Python 3. It's nowhere near as bad as the learnability problem for text handling in Python 2, but it still needs more scaffolding to help people grow in understanding.

    @ezio-melotti
    Member

    That's why I think a function like redecode is a bad idea.
    With Python 2 I've seen a lot of people blindly trying .decode when .encode failed (and the other way around) whenever they got a UnicodeError (and the fact that decoding unicode results in an encode error didn't help).
    I'm afraid that confused developers will try to (mis)use redecode as a workaround to attempt to fix something that shouldn't be broken in the first place, without actually understanding what the real problem is.
    There's really no excuse for developers not to know Unicode nowadays (Joel Spolsky said the same thing more than 10 years ago).
    IME people also tend to overestimate how difficult it is to understand Unicode. While it is true that you can go pretty deep down the rabbit hole, the basic concepts necessary for everyday use are pretty simple and straightforward.

    @ncoghlan
    Contributor Author

    The redecode thing is a distraction from my core concern here, so I've split that out to issue bpo-22264, a separate RFE for a "wsgiref.fix_encoding" function.

    For this issue, my main concern is the function to *clean* a string of escaped binary data, so it can be displayed easily, or otherwise purged of the escaped characters. Preserving the data by default is good, but you have to know a *lot* about how Python 3 works in order to be able to figure out how to clean it out.

    For that, not knowing Unicode in general isn't the problem: it's not knowing PEP-383. If we forget the idea of exposing the constant with the escaped values (I agree that's not very useful), it suggests "codecs.clean_surrogate_escapes" as a possible name:

        # Helper to ensure a string contains no escaped surrogates
        # This allows it to be safely encoded without surrogateescape
        _extended_ascii = bytes(range(128, 256))
        _escaped_surrogates = _extended_ascii.decode('ascii',
                                        errors='surrogateescape')
        _match_escaped = re.compile('[{}]'.format(_escaped_surrogates))
        def clean_surrogate_escapes(s, repl='\ufffd'):
            return _match_escaped.sub(repl, s)

    A more efficient implementation in C would also be fine, this is just an easy way to define the exact semantics.
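
    For example (a hedged sketch of the intended semantics, with an illustrative file name decoded from a non-UTF-8 locale):

    >>> name = b'a\xe9b.txt'.decode('utf-8', 'surrogateescape')
    >>> name
    'a\udce9b.txt'
    >>> clean_surrogate_escapes(name)
    'a\ufffdb.txt'
    >>> clean_surrogate_escapes(name, repl='?')
    'a?b.txt'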

    (I also just noticed that unlike other error handlers, surrogateescape and surrogatepass do not have corresponding codecs.surrogateescape_errors and codecs.surrogatepass_errors functions)

    @serhiy-storchaka
    Member

    What problem is clean_surrogate_escapes() supposed to solve? Could you please provide a user scenario or two?

    Possible alternative implementation is:

    def clean_surrogate_escapes(s):
        return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

    It can be faster for some data (for mostly-ASCII data with rare surrogates it is superfast). For other data 'utf-16' can be a better choice.

    @ncoghlan
    Contributor Author

    My main use case is for passing data to other applications that *don't* have their Unicode handling in order - I want to be able to use Python to do the data scrubbing, but at the moment it requires intimate knowledge of the codec error handling system to do it. (I had never even heard of surrogatepass until this evening)

    Situation:

    What I have: data decoded with surrogateescape
    What I want: that same data with all the surrogates gone, replaced with either the Unicode replacement character or an ASCII question mark (which one I want will depend on the exact situation)

    Assume I am largely clueless about the codec system. I know nothing beyond the fact that Python 3 strings may have smuggled bytes in them and I want to get rid of them because they confuse the application I'm passing them to.

    The concrete example that got me thinking about this again was the task of writing filenames into a UTF-8 encoded email, and wanting to scrub the output from os.listdir before writing the list into the email (s/email/web page/ also works).

    For issue bpo-22016 I actually suggested doing this as *another* codec error handler ("surrogatereplace"), but Stephen Turnbull convinced me this original idea was better: it should just be a pure data transformation pass on the string, clearing the surrogates out, and leaving me with data that is identical to that I would have had if "surrogatereplace" had been used instead of "surrogateescape" in the first place.

    As "errors='replace'" already covers the "ASCII ?" replacement case, that means your proposed "redecode" based solution would cover the rest of my use case.

    @vstinner
    Member

    Your clean() function loses information. If a filename contains almost only undecodable characters, it will look like ����.txt. That's not very useful. I would prefer to escape the bytes. Mac OS X (HFS+ filesystem), for example, uses the %HH format: "\udc80" would be replaced with "%80".

    This format is also used in URLs. For example, "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt" in Firefox (when listing a local directory), but Firefox uses the URL "file://.../a%E9b.txt" (hexadecimal in uppercase).

    In the Gnome file browser (Nautilus), "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt (invalid encoding)".

    @ncoghlan
    Contributor Author

    Ideally we'd have string modification support for all the translations we offer as codec error handlers:

    • Unicode replacement character ('replace' on input)
    • ASCII question mark ('replace' on output)
    • Dropping them entirely ('ignore')
    • XML character reference ('xmlcharrefreplace')
    • Python escape sequence ('backslashreplace')

    The reason it's beneficial to be able to do these as string transformations rather than only in the codecs is that you may just be contributing part of the output, with the actual encoding operation handled elsewhere (e.g. you may be storing it in a data structure that will later be encoded as JSON or XML, or my earlier example of generating a list of files to be included in an email). Surrogates are great when you're just passing data straight back to the operating system. They're not so great when you're passing them on to other parts of the application as text. I'd prefer to be able to deal with them closer to the point of origin, at least in some cases.

    Now, some of these things *can* be done today using Serhiy's trick of encoding to UTF-8 and then decoding again:

    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')
    

    However, these two don't work properly:

    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')
    

    The reason those don't work is because they'll encode the *surrogate escaped bytes*, rather than the originals.

    Mapping the escaped bytes to percent encoding has the same problem - you likely want to do a two step transformation (escaped surrogate -> original byte -> percent encoded value), rather than directly percent encoding the already escaped bytes.
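
    To illustrate that two-step idea (a hypothetical helper, named here only for the example): first map each escaped surrogate back to its original byte value, then percent-encode that byte.

        def percent_encode_escapes(s):
            # Escaped surrogates created by surrogateescape live in
            # U+DC80..U+DCFF; recover the original byte and emit it as %HH,
            # leaving all other characters untouched.
            return ''.join('%{:02X}'.format(ord(c) - 0xDC00)
                           if '\udc80' <= c <= '\udcff' else c
                           for c in s)

    This reproduces the %HH style mentioned above: percent_encode_escapes('a\udce9b.txt') gives 'a%E9b.txt'.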

    @bitdancer
    Member

    Right now having has_escaped_bytes isn't too important, since I've done nothing to profile and improve the performance of the new email code. But eventually I'll need it, because detecting the existence of escaped bytes is inside some of the inner loops in the header processing code. "s != s.clean_escaped_bytes()" (or whatever you call it) just smells wrong as a way to spell has_escaped_bytes.

    @pitrou
    Member

    pitrou commented Aug 25, 2014

    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')

    Why not the reverse:

    os.fsencode(data).decode('utf-8', 'replace')
    os.fsencode(data).decode('utf-8', 'ignore')

    Note that "backslashreplace" needs to be enhanced to work when decoding too.
    Note that "xmlcharrefreplace" doesn't make sense here: it encodes a *character* reference, but you're precisely trying to represent something which fails interpreting as a character.

    (AFAIK, XML can't represent non-text data, except in NDATA sequences)

    @ncoghlan
    Contributor Author

    Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case.

    Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286

    I also agree we can ignore xmlcharrefreplace here.

    So that leaves the basic pattern as:

    data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
    data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore')
    data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')

    This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that.

    Possible function name and implementation:

        def convert_surrogateescape(data, errors='replace'):
            return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
        
    Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here)

    Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.
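
    A quick demonstration of that pattern (illustrative data; note that 'backslashreplace' on decoding relies on the enhancement tracked in bpo-22286):

    >>> data = b'caf\xe9'.decode('utf-8', 'surrogateescape')
    >>> convert_surrogateescape(data)                    # errors='replace'
    'caf\ufffd'
    >>> convert_surrogateescape(data, 'backslashreplace')
    'caf\\xe9'
    >>> convert_surrogateescape(data, 'strict')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data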

    @ncoghlan
    Contributor Author

    Updated issue title to reflect current proposal.

    @ncoghlan ncoghlan changed the title Add tools for "cleaning" surrogate escaped strings Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Sep 23, 2014
    @malemburg
    Member

    Don't like the function name :-)

    How about codecs.filter_non_utf8_data(), since that's closer
    to what the function is really doing and doesn't require
    knowledge about what "surrogateescape" is.

    @ncoghlan
    Contributor Author

    The error handler is called "surrogateescape". That means "convert_surrogateescape" is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the user other than the name of the error handler and the fact that it is used to smuggle arbitrary bytes through the Python 3 str type.

    Getting from "this string was decoded with the surrogateescape handler and may contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup function is a much bigger leap that requires more assumed knowledge on the part of the user, and also one that confuses the conceptual purpose of the function (cleaning up the output of the surrogateescape error handler to ensure it is a pure Unicode string) with the internal details of the proposed approach to implementing that cleanup operation (encoding to UTF-8 with surrogateescape, and then decoding again with a different error handler).

    @ncoghlan
    Contributor Author

    The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler

        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    

    @ncoghlan
    Contributor Author

    Note I would also be OK with "convert_surrogates", as that's the term that appears in the relevant error message:

    >>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed

    @pitrou
    Member

    pitrou commented Sep 23, 2014

    On 23/09/2014 12:57, Nick Coghlan wrote:

    The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler
    
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    

    'utf-8' is hardcoded?

    @malemburg
    Member

    On 23.09.2014 13:12, Nick Coghlan wrote:

    Nick Coghlan added the comment:

    Draft docstring for that version

    def convert_surrogates(data, errors='replace'):
        """Convert escaped surrogates by applying a different error handler
    
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    

    Nick, the doc string is not correct. It is not working on escaped
    surrogates. Instead it is working on lone surrogates that were used
    to encode undecodable bytes from some input data.

    The longer story goes like this:

    The "surrogateescape" error handler in the .decode() call that lead up
    to the data you want this function to take as input, will convert
    undecodable data to lone low surrogates.

    The function then reverts these bytes back into UTF-8 (which may well
    not be the original encoding, as Antoine has already pointed out, but
    that's not really important for the use case), recreating the
    unencodable bytes and then decodes the result again using the UTF-8
    codec using a new error handler.

    So in summary, the function is supposed to retroactively apply
    a different error handler to the input data, undoing the effects
    of the "surrogateescapes" error handler.

    The name still doesn't match this functionality.

    BTW: There's a catch in this approach. The encoding used to decode the original data may well be 'ascii'. Now, if the original input data was in fact UTF-8, the input decoding would have mapped the UTF-8 bytes to lone surrogates. The above function would then turn these back into UTF-8 bytes, redecode them and get a completely different string back (since the error handlers would not trigger).

    I'm not sure whether adding such a small function with so many unclear implications is a good idea. Either it should be made more specific, e.g. be reserved for use on data from input streams with known encoding, or be put into the documentation as an example for people to use and adapt as necessary.
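
    A worked illustration of that catch (example data only): genuinely UTF-8 input decoded as ASCII with surrogateescape is silently reinterpreted rather than flagged.

    >>> raw = 'é'.encode('utf-8')                 # b'\xc3\xa9'
    >>> s = raw.decode('ascii', 'surrogateescape')
    >>> s
    '\udcc3\udca9'
    >>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
    'é'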

    @bitdancer
    Member

    And indeed my use case for this has instances of both cases: data originally decoded using ASCII where the non-ASCII bytes must end up as replacement characters, and data originally decoded using utf-8.

    I'm also not sure that it is worth adding this. If you know what you are doing the solution is obvious, and if you don't know what you are doing you shouldn't be using surrogateescape in the first place :)

    Now, if there were, or is intended to be, a more efficient C-level implementation, that answer might be different.

    @bitdancer
    Member

    Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio. So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module. os.convert_surrogates?

    @ncoghlan
    Contributor Author

    As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it by default on all OS interfaces (including the standard streams when we detect 'ascii' as the filesystem encoding in 3.5+).

    This *needs* to be a case that folks can handle without needing to spend years learning about encodings and error handlers first. That means being able to tell them "use this documented function to remove the surrogates" rather than "use this magic incantation that you don't understand, and that other people may not be able to read".

    I know more about Unicode encodings than the average programmer at this point, yet I still needed to be schooled by true experts in this thread to learn how to solve the problem properly.

    Look at this as an opportunity to encapsulate that knowledge in executable form, as while the code is short, it is conceptually *very* dense.

    If there's a dedicated function, then replacing the encode/decode dance with a faster pure C alternative also becomes a future possibility (with only a recipe, there's no opportunity to ever optimise it).

    With the additional clarification, it is also clear to me that Antoine is correct that the encoding needs to be configurable and should default to the appropriate setting to remove the surrogates from OS provided data.

    With that change:

        def convert_surrogates(data, encoding=None, errors='replace'):
            """Convert escaped surrogates by applying a different error handler

            If no encoding is given, defaults to sys.getfilesystemencoding().
            Uses the "replace" error handler by default, but any input
            error handler may be specified.
            """
            if encoding is None:
                encoding = sys.getfilesystemencoding()
            return data.encode(encoding, 'surrogateescape').decode(encoding, errors)
    

    Since it's primarily intended for cleaning OS provided data, then I agree os.convert_surrogates() could be a good choice. It would be appropriate to reference it from os.fsdecode() as a way to clean escaped data when the original binary data was no longer available to be decoded again with a different error handler.

    @ncoghlan ncoghlan changed the title Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Add a convert_surrogates function to "clean" surrogate escaped strings Sep 23, 2014
    @serhiy-storchaka
    Member

    Good catch Antoine!

    Here is a sample of a more complicated implementation.

    @serhiy-storchaka serhiy-storchaka changed the title Add a convert_surrogates function to "clean" surrogate escaped strings Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Sep 23, 2014
    @ncoghlan
    Contributor Author

    Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding.

    @serhiy-storchaka
    Member

    Proposed preliminary patch adds three functions in the codecs module:

    convert_surrogates(data, errors) -- handle lone surrogates with specified error handler.

    >>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
    'a€\\udca4'

    convert_surrogateescape(data, errors) -- handle surrogateescaped bytes with specified error handler

    >>> codecs.convert_surrogateescape('a\u20ac\udca4', 'backslashreplace')
    'a€\\xa4'

    convert_astrals(data, errors) -- handle astral (non-BMP) characters with specified error handler.

    >>> codecs.convert_astrals('a\u20ac\U000e007f', 'backslashreplace')
    'a€\\U000e007f'

    The names are open to discussion.

    I'm also thinking about adding two functions or error handlers (that can be used with convert_surrogates and convert_astrals) for composing astral characters from surrogate pairs and vice versa.

    @serhiy-storchaka serhiy-storchaka added the stdlib Python modules in the Lib dir label Mar 16, 2015
    @ncoghlan
    Contributor Author

    (Serhiy, did you miss uploading the new patch?)

    Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the case of handling lone surrogates and escaped surrogates, what about:

        rehandle_surrogatepass(data, errors="strict")
        rehandle_surrogateescape(data, errors="strict")

    That is, we know we have data that was decoded with either surrogatepass or surrogateescape (respectively) as the error handler, and we want to process the results of that with a different error handler.

    I believe those two would be enough to address the specific cases this issue was raised to cover, so it may make sense to file a separate issue to discuss the use cases for the custom astral handling.

    Since astrals aren't actually errors in the first place, that could become:

        handle_astrals(data, errors="strict")

    As in "pass every astral code point in this string through the named error handler".

    The astral -> surrogate pair and surrogate pair -> astral converters do sound potentially interesting, but as noted above, I think they may call for a separate issue that better explains the specific use cases.

    @serhiy-storchaka
    Member

    I uploaded the patch just before your comment Nick.

    Here is an updated patch. The functions are renamed as Nick suggested, and two more functions are added: decompose_astrals() and compose_surrogate_pairs(). They are mainly here as examples; they can be committed in another issue.

    I hesitate about the rehandle_surrogatepass name. This function handles surrogates that can be created not only with the "surrogatepass" handler, but also in other ways, e.g. with the "surrogateescape" handler, with chr(), handle_astral() or decompose_astrals(). Actually it checks that the string is valid Unicode (not containing surrogates) and handles any errors found with the specified error handler.

    Maybe it is time for a wider discussion on python-dev. I especially want to hear the opinions of Ezio and Martin.

    @ncoghlan
    Contributor Author

    I'd wondered about that with respect to rehandle_surrogatepass.

    The current implementation looks like it processes *all* surrogates (even valid surrogate pairs), so "handle_surrogates" might be a suitable name.

    If the intent is for it to be "handle_lone_surrogates", I'm not sure the current implementation achieves that, as a valid surrogate pair will match re.compile('[\ud800-\uefff]+').

    The rest looks OK to me, including the decompose_astrals() and compose_surrogate_pairs() functions. Regardless of any practical utility, the latter two seem useful for *educational* purposes when it comes to unicode, by making it clear how to switch between the single code point and dual code point representations of the astrals.

    @ncoghlan
    Contributor Author

    Oh, and yes, I agree a python-dev discussion would be a good idea.

    From my perspective, "rehandle_surrogateescape" is the key function for making it easier to check for malformed input data from operating system interfaces.

    The other items I don't personally have a use case for, but they seem potentially valuable in making some key Unicode concepts a bit more discoverable.

    @serhiy-storchaka
    Member

    Note that the provided Python implementations are rather a proof of concept. After discussion I'll provide more efficient C implementations, which should be 1-2 orders of magnitude faster (and essentially free for the common case of ASCII strings).

    @yaseppochi
    Mannequin

    yaseppochi mannequin commented May 9, 2015

    Please do not add the "rehandle" functions to codecs. They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation.

    I suggest a "validation" submodule of the unicodedata package, or perhaps a new "unicodeutils" package, for these functions, as well as those that just detect the surrogates, etc.

    Because they change the semantics of data they should be documented as potentially dangerous because they can't be inverted back to bytes without knowledge of the history of transformations they perform (and not even then in the case of the "replace" error handler). This matters in applications where the input bytes may have been digitally signed, for example.

    @ncoghlan
    Contributor Author

    ncoghlan commented May 9, 2015

    surrogateescape and surrogatepass data *already* can't be inverted back to bytes reliably without knowing the original encoding - if you encode them as something else when they contain surrogates, you'll either get an exception (the default) or mojibake (if you use surrogateescape/surrogatepass as the output error handler). They only work as a transparent pass-through if the input and output encodings match.

    I'd be fine with putting these data scrubbing functions somewhere other than in codecs, though. (I'm not sure unicodedata is the right place, but a new module like "string.internals" might be, as these functions have more to do with Python's internal text representation than with anything else. A module like the latter could also be a home for things like a chunking utility that splits a string up into substrings that use as little memory as possible for feeding into a StringIO instance before throwing the original away.)

    I also don't think they're urgent - the introduction of /etc/locale.conf
    makes modern Linux far more consistent in getting locale settings right,
    and even older platforms tend to get the locale right for user processes.

    @ncoghlan
    Contributor Author

    I suggest we defer this one to 3.6 - I still think it's worth doing, but I don't think it's a major barrier to migration, and it would be good to get some real world experience with the new sys.stdin behaviour of defaulting to using surrogateescape in the POSIX locale in 3.5 before committing to a particular design for the surrogate cleaning API.

    I do like the idea of a "string.internals" submodule as a possible home for exposing the Python level API.

    @ncoghlan ncoghlan changed the title Add codecs.convert_surrogateescape to "clean" surrogate escaped strings Add utilities to "clean" surrogate code points from strings May 12, 2015
    @vadmium
    Member

    vadmium commented Sep 27, 2015


    I think my suggested colours for the bikeshed would be handle_surrogates() and handle_surrogateescape(). “Rehandle” seems awkward and too assuming to me. And I agree with Serhiy that surrogates are a Unicode thing, not just related to the “surrogatepass” handler.

    Adding them to “codecs” makes sense to me. The most important one, handle_surrogateescape() or equivalent, is closely related to the error handler of that module.

    Having handle_surrogateescape or equivalent would probably be useful for bpo-25184 (displaying an arbitrary file path in a UTF-8 HTML file).

    @vstinner
    Member

    Hum, I suggest putting these functions in a package on PyPI, or as recipes on a website like Stack Overflow, and closing the issue.

    I'm still not convinced that these functions are useful. Usually we take a function from an existing project used in applications and put it in the stdlib. Here the use case still looks artificial. For example, which application requires escaping non-BMP characters? How does it handle them currently?

    There are too many ways to handle surrogate characters. The common ways to show undecodable bytes are not supported by the functions proposed by Serhiy. Example: %80 on Mac OS X. Gnome uses something else.

    It was said that one reason to add new functions is performance. I'm also not convinced that such a function is the bottleneck of any application.

    I prefer to wait until users experiment with their own implementation and
    see if a common function can be extracted from this to put it in the stdlib.

    @ncoghlan
    Contributor Author

    I think moving this forward mainly needs someone with the time and energy to wrangle a python-ideas/python-dev discussion to get some additional feedback on the API design. As I see it, there are 2 main questions to be resolved:

    1. Where to expose these functions

    The default location would be the codecs module, as they're closely related to the error handlers in that module, and the main reasons for needing to clean data at all are handling dirty data produced by an interface that uses surrogatepass or surrogateescape when decoding (handle_surrogates, handle_surrogateescape), or encoding data for use in a context which doesn't correctly handle code points outside the basic multilingual plane (handle_astrals).

    If added to the codecs module, they could be documented in new sections on "Postprocessing decoded text" and "Preprocessing text for encoding".

    The main argument against that would be Stephen's one, which is that these aren't themselves encoding or decoding operations, but rather internal state manipulations on Python strings.

    2. The exact function set to be provided.

    The three potential data cleaning cases currently being considered:

    • process_surrogates: reprocessing all surrogates in the string, including lone surrogates and valid surrogate pairs. Such strings may be produced by using the "surrogatepass" handler when decoding, or by decomposing astral characters to surrogate pairs.
    • process_surrogateescape: reprocessing only lone surrogates in the U+DC80 to U+DCFF range, with other surrogate pairs or lone surrogates triggering UnicodeTranslateError. Such strings may be produced by using the "surrogateescape" error handler when decoding.
    • process_astrals: reprocessing all code points in the astral plane.

    These seem to cover the essentials to me, and I changed the proposed prefix to "process_*" based on the idea of documenting them as preprocessing and postprocessing steps for encoding and decoding.

    @ncoghlan
    Contributor Author

    As far as the rationale for adding the functions at all goes, my main interest is still in having somewhere in the codecs module documentation to *define the problem*, and to my mind that entails also offering a simple way to do the relevant pre-/post-processing.

    The nice aspect of building any related capabilities atop the standard error handlers is that it also means that third party modules can provide custom error handlers to support further escaping techniques, and those will also be available for use in decoding and encoding operations, rather than being specific to pre-/post-processing of the data.

    However, it's also the case that we're generally going to be talking about the combination of encoding misconfiguration *and* processing data that gets potentially corrupted by the misconfiguration *and* doing something with it that isn't already handled by a surrogateescape round-trip, which is why I suspect in practice most applications are going to be able to get away with ignoring the problem entirely (especially with C.UTF-8 support coming to Fedora 24, so the Fedora/RHEL/CentOS ecosystem will be joining the Debian/Ubuntu ecosystem in offering that by default)

    @bitdancer
    Member

    I also want "detect if there are any surrogates".

    @stevendaprano
    Member

    On Sun, Sep 27, 2015 at 04:17:45PM +0000, R. David Murray wrote:

    I also want "detect if there are any surrogates".

    I think that's useful enough that it should be a str method. Here's a pure-Python implementation:

    def is_surrogate(s):
        return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

    The actual Flexible String Representation implementation can be even
    more efficient. All-ASCII and all-Latin1 strings cannot possibly include
    surrogates, so is_surrogate will be a constant-time operation.

    (If we care about making this a constant-time operation for all
    strings, the constructor could set a flag on the string object. Since
    the constructor has to walk the string anyway, I don't think that will
    cost much time-wise.)

    @vstinner
    Member

    I also want "detect if there are any surrogates".

    Could you please open a separate issue for this function/method?

    I believe that it's very different than other proposed functions/methods.

    It was proposed before to add methods like "is_ascii()", but the request was rejected because other Python implementations don't implement Unicode using PEP 393, and so the method would be less efficient than expected, depending on the implementation of Python.

    Well, I don't know if we should add methods relying on the PEP-393 or not. It's probably better to discuss such "political" question on python-dev.

    @bitdancer
    Member

    Done: bpo-25269.

    @ncoghlan
    Contributor Author

    With PEPs 538 and 540 implemented for 3.7, my thinking on this has evolved a bit.

    A recent discussion on python-ideas [1] also introduced me to the third-party library "ftfy", which offers a wide range of tools for cleaning up improperly decoded data: https://ftfy.readthedocs.io/en/latest/

    That includes a lone surrogate fixer: https://ftfy.readthedocs.io/en/latest/#ftfy.fixes.fix_surrogates

    So a potential way to go here would be to add a section on "Handling Improperly Decoded Text Data" to the codecs module documentation, and include ftfy as a See Also link in that new section.

    If folks think that would be a reasonable way to go, then I think the clearest way to handle it would be to close this issue as "later" (which still implies "maybe never", but not as strongly as "rejected" does), and open a new issue for the suggested new section in the docs.

    [1] https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

    @ncoghlan ncoghlan added the 3.8 only security fixes label Mar 30, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022