This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Add utilities to "clean" surrogate code points from strings
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: 22286 23676 Superseder:
Assigned To: Nosy List: Arfrever, ezio.melotti, lemburg, martin.panter, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, sjt, steven.daprano, vstinner
Priority: normal Keywords: patch

Created on 2013-08-23 04:02 by ncoghlan, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
convert_surrogates.py serhiy.storchaka, 2014-09-23 15:34
codecs_convert_escapes.patch serhiy.storchaka, 2015-03-16 10:52 review
codecs_convert_escapes_2.patch serhiy.storchaka, 2015-03-17 11:30 review
Messages (55)
msg195937 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-08-23 04:02
Prompted by issue 18713 and http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/, here are some possible utilities we could add to the codecs module to help deal with/debug issues related to surrogate escaped strings:

    def has_escaped_bytes(s):
        """Returns true if string contains surrogate escaped bytes"""
        ...

    def replace_escaped_bytes(s):
        """Replaces each surrogate escaped byte with a valid code point"""
        ...

    def decode_escaped_bytes(s, nominal_encoding, actual_encoding):
        """Reinterprets incorrectly decoded text using a new encoding"""
        return s.encode(nominal_encoding, 'surrogateescape').decode(actual_encoding)
msg195959 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-23 11:02
The email package needs has_escaped_bytes.  Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand.

What does replace do?  Replace it with the unknown character code point?  Email also needs that.
msg195960 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-23 11:55
These functions are trivial.

    def has_escaped_bytes(s):
        try:
            s.encode()
            return False
        except UnicodeEncodeError:
            return True

    def replace_escaped_bytes(s):
        return s.encode('utf-8', 'replace').decode()
msg196047 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-23 22:02
Can you sum the use cases for these?
(don't want to read a blog post, sorry :-))
msg196049 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-08-23 22:06
"The email package needs has_escaped_bytes.  Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand."

In which function you need to check this? What do you do if there are surrogate characters?
msg196054 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-23 23:28
The email package uses surrogateescape to store unknown bytes in unicode strings, just as with the handle-bad-data-from-os API surrogateescape was introduced for.  (For the same reason: the source data may have improperly encoded bytes that we must nevertheless preserve).  What action certain parts of the code takes differs depending on whether or not there are such encoded bytes in the string.  So the code needs to check and branch based on that.

The utility function is email.utils._has_surrogates.  You can grep the email source to see where it is used, if you like.
msg196063 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-08-24 00:26
The use case is to take data from a surrogate escaped interface and either
filter it out entirely or convert it to a valid Unicode string at the point
of *input*, before letting it make its way into the rest of the
application. For example, this approach permits the Python equivalent of
the Linux console behaviour when it attempts to display the improperly
encoded file name.

This approach is likely to be easier to implement in cross-platform code
than dropping down to the bytes interfaces on POSIX systems and using the
Unicode APIs on Windows.
msg225791 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-24 03:00
Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:

    # Constant in the string module (akin to string.ascii_letters et al)
    escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')

    # Helper to ensure a string contains no escaped surrogates
    # This allows it to be safely encoded without surrogateescape
    _match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
    def clean(s, repl='\ufffd'):
        return _match_surrogates.sub(repl, s)

    # Helper to redecode a string that was decoded incorrectly
    # For example, WSGI strings are passed from the server to the
    # framework as latin-1 by default and may need to be redecoded
    def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
        return s.encode(old_encoding, old_errors).decode(encoding, errors)

In addition to the concrete use cases David describes, I think these will also serve a useful documentation purpose, in highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).
msg225793 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-24 04:44
Note: I dropped the "has_escaped_bytes" idea, as "s != string.clean(s)" is functionally equivalent, albeit a bit more expensive if you don't actually want the cleaned string)
msg225800 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-08-24 07:58
I think similar functions should be added in the unicodedata module rather than the string module or as str methods.  If I'm not mistaken this was already proposed in another issue.
In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others to help dealing with surrogates but AFAIK there's no Python equivalent yet.
As for the specific constants/functions/methods you propose, IMHO the name escaped_surrogates is not too clear.  If it's a string of lone surrogates I would just call it unicodedata.surrogates (and .high_surrogates/.low_surrogates).  These can also be used to build oneliner to check if a string contains surrogates and/or to remove them.
clean has a very generic name with no hints about surrogates, and its purpose is quite specific.
I'm also not a big fan of redecode.  The equivalent calls to encode/decode are not much longer and more explicit.  Also having to redecode often indicates that there's a bug before that should be fixed instead (if possible).
msg225801 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-24 08:24
The purpose of these changes it to provide tools specifically for working with surrogate escaped data, not for working with arbitrary lone Unicode surrogates.

"escaped_surrogates" is not defined by the Unicode spec, it's defined by the behaviour of the surrogateescape error handler that lets us tunnel arbitrary bytes through str objects and reproduce them faithfully at the far end. On reflection, I think codecs would be a better home than string (as that's where the error handler is defined), but it doesn't belong in unicodedata.

I'd be OK with changing the name of the clean function to "clean_escaped_surrogates".

Needing redecode is not a bug: it's baked into the WSGI spec in PEP 3333. I would be OK with providing it in wsgiref rather than the codecs or string modules, but I think we should provide it somewhere.
msg225806 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-08-24 09:11
I agree with Ezio in all points. escaped_surrogates is inefficient for any purposes and incomplete. _match_surrogates can be created in more efficient way. clean() has too general and misleading name. redecode() looks just ridiculous, this cumbersome 4-argument function just combine two methods.
msg225809 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-24 11:02
Guys, you're Python 3 unicode experts, already thoroughly familiar with how surrogateescape works. These features are not for you.

The exact implementations don't matter. These need to exist, and they need to be documented with detailed explanations. People don't yet understand what the surrogateescape error handler is, and what it's for, or how to work with it.

People consider surrogate escaping this weird mysterious thing. It's not - it's a really simple mapping of the 128 bytes with the high order bit set to a different part of the Unicode code space. These functions make it obvious.

wsgiref.redecode is a natural consequence of the design of PEP 3333 - it makes sense to offer it there, as a hint that frameworks are likely to need it.

We have a serious, serious, learnability problem around binary data handling in Python 3. It's nowhere near as bad as the learnability problem for text handling in Python 2, but it still needs more scaffolding to help people grow in understanding.
msg225811 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-08-24 11:46
That's why I think a function like redecode is a bad idea.
With Python 2 I've seen lot of people blindingly trying .decode when .encode failed (and the other way around) whenever they were getting an UnicodeError (and the fact that decoding Unicode results in an encode error didn't help).
I'm afraid that confused developers will try to (mis)use redecode as a workaround to attempt to fix something that shouldn't be broken in the first place, without actually understanding what the real problem is.
There's really no excuse for developers not to know Unicode nowadays (Joel Spolsky said the same thing more than 10 years ago).
IME people also tend to overestimate how difficult it is to understand Unicode.  While it is true that you can go pretty deep down the rabbit hole, the basic concepts necessary for everyday use are pretty simple and straightforward.
msg225817 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-24 13:00
The redecode thing is a distraction from my core concern here, so I've split that out to issue #22264, a separate RFE for a "wsgiref.fix_encoding" function.

For this issue, my main concern is the function to *clean* a string of escaped binary data, so it can be displayed easily, or otherwise purged of the escaped characters. Preserving the data by default is good, but you have to know a *lot* about how Python 3 works in order to be able figure out how to clean it out.

For that, not knowing Unicode in general isn't the problem: it's not knowing PEP 383. If we forget the idea of exposing the constant with the escaped values (I agree that's not very useful), it suggests "codecs.clean_surrogate_escapes" as a possible name:


    # Helper to ensure a string contains no escaped surrogates
    # This allows it to be safely encoded without surrogateescape
    _extended_ascii = bytes(range(128, 256))
    _escaped_surrogates = _extended_ascii.decode('ascii',
                                    errors='surrogateescape')
    _match_escaped = re.compile('[{}]'.format(_escaped_surrogates))
    def clean_surrogate_escapes(s, repl='\ufffd'):
        return _match_escaped.sub(repl, s)

A more efficient implementation in C would also be fine, this is just an easy way to define the exact semantics.

(I also just noticed that unlike other error handlers, surrogateespace and surrogatepass do not have corresponding codecs.surrogateescape_errors and codecs.surrogatepass_errors functions)
msg225818 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-08-24 13:49
What problem is purposed to solve clean_surrogate_escapes()? Could you please provide user scenario or two?

Possible alternative implementation is:

def clean_surrogate_escapes(s):
    return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

It can be faster for some data (for mostly ASCII with rare surrogates it is superfast). For other data 'utf-16' can be better choice.
msg225822 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-24 14:16
My main use case is for passing data to other applications that *don't* have their Unicode handling in order - I want to be able to use Python to do the data scrubbing, but at the moment it requires intimate knowledge of the codec error handling system to do it. (I had never even heard of surrogatepass until this evening)

Situation:

What I have: data decoded with surrogateescape
What I want: that same data with all the surrogates gone, replaced with either the Unicode replacement character or an ASCII question mark (which I want will depend on the exact situation)

Assume I am largely clueless about the codec system. I know nothing beyond the fact that Python 3 strings may have smuggled bytes in them and I want to get rid of them because they confuse the application I'm passing them to.

The concrete example that got me thinking about this again was the task of writing filenames into a UTF-8 encoded email, and wanting to scrub the output from os.listdir before writing the list into the email (s/email/web page/ also works).

For issue #22016 I actually suggested doing this as *another* codec error handler ("surrogatereplace"), but Stephen Turnbull convinced me this original idea was better: it should just be a pure data transformation pass on the string, clearing the surrogates out, and leaving me with data that is identical to that I would have had if "surrogatereplace" had been used instead of "surrogateescape" in the first place.

As "errors='replace'" already covers the "ASCII ?" replacement case, that means your proposed "redecode" based solution would cover the rest of my use case.
msg225858 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-08-25 01:16
Your clean() function looses information. If a filename contains almost only undecodable characters, it will looks like ����.txt. It's not very useful. I would prefer to escape the byte. Mac OS X (HFS+ filesystem) uses for example %HH format: "\udc80" would be replaced with "%80" for example.

This format is also used in URLs. For example, "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt" in Firefox (when listing a local directory), but Firefox uses the URL "file://.../a%E9b.txt" (hexadecimal in uppercase).

In the Gnome file browser (Nautilus), "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt (invalid encoding)".
msg225869 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-25 06:59
Ideally we'd have string modification support for all the translations we offer as codec error handlers:

* Unicode replacement character ('replace' on input)
* ASCII question mark ('replace' on output)
* Dropping them entirely ('ignore')
* XML character reference ('xmlcharrefreplace')
* Python escape sequence ('backslashreplace')

The reason it's beneficial to be able to do these as string transformations rather than only in the codecs is that you may just be contributing part of the output, with the actual encoding operation handled elsewhere (e.g. you may be storing it in a data structure that will later be encoded as JSON or XML, or my earlier example of generating a list of files to be included in an email). Surrogates are great when you're just passing data straight back to the operating system. They're not so great when you're passing them on to other parts of the application as text. I'd prefer to be able to deal with them closer to the point of origin, at least in some cases.

Now, some of these things *can* be done today using Serhiy's trick of encoding to UTF-8 and then decoding again:

    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')

However, these two don't work properly:

    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')

The reason those don't work is because they'll encode the *surrogate escaped bytes*, rather than the originals.

Mapping the escaped bytes to percent encoding has the same problem - you likely want to do a two step transformation (escaped surrogate -> original byte -> percent encoded value), rather than directly percent encoding the already escaped bytes.
msg225875 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-08-25 14:50
Right now having has_escaped_bytes isn't too important, since I've done nothing to profile and improve the performance of the new email code.  But eventually I'll need it, because detecting the existence of escaped bytes is inside some of the inner loops in the header processing code.  "s != s.clean_escaped_bytes()" (or whatever you call it) just smells wrong as a way to spell has_escaped_bytes.
msg225889 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-08-25 18:55
>    data.encode('utf-8', 'replace').decode('utf-8')
>    data.encode('utf-8', 'ignore').decode('utf-8')

Why not the reverse:

os.fsencode(data).decode('utf-8', 'replace')
os.fsencode(data).decode('utf-8', 'ignore')

Note that "backslashreplace" needs to be enhanced to work when decoding too.
Note that "xmlcharrefreplace" doesn't make sense here: it encodes a *character* reference, but you're precisely trying to represent something which fails interpreting as a character.

(AFAIK, XML can't represent non-text data, except in NDATA sequences)
msg225972 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-27 11:00
Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case.

Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286

I also agree we can ignore xmlcharrefreplace here.

So that leaves the basic pattern as:

data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')

This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that.

Possible function name and implementation:

    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    
Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here)

Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.
msg227334 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 09:59
Updated issue title to reflect current proposal.
msg227339 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-09-23 10:39
Don't like the function name :-)

How about codecs.filter_non_utf8_data(), since that's closer
to what the function is really doing and doesn't require
knowledge about what "surrogateescape" is.
msg227340 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 10:51
The error handler is called "surrogateescape". That means "convert_surrogateescape" is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the user other than the name of the error handler and the fact that it is used to smuggle arbitrary bytes through the Python 3 str type.

Getting from "this string was decoded with the surrogateescape handler and may contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup function is a much bigger leap that requires more assumed knowledge on the part of the user, and also one that confuses the conceptual purpose of the function (cleaning up the output of the surrogateescape error handler to ensure it is a pure Unicode string) with the internal details of the proposed approach to implementing that cleanup operation (encoding to UTF-8 with surrogateescape, and then decoding again with a different error handler).
msg227341 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 10:57
The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler

        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
msg227342 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 11:00
Note I would also be OK with "convert_surrogates", as that's the term that appears in the relevant error message:

>>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed
msg227343 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-23 11:04
Le 23/09/2014 12:57, Nick Coghlan a écrit :
> The function definition again, this time with a draft docstring:
> 
>     def convert_surrogateescape(data, errors='replace'):
>         """Convert escaped raw bytes by applying a different error handler
> 
>         Uses the "replace" error handler by default, but any input
>         error handler may be specified.
>         """
>         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

'utf-8' is hardcoded?
msg227344 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 11:12
Draft docstring for that version

    def convert_surrogates(data, errors='replace'):
        """Convert escaped surrogates by applying a different error handler

        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
msg227345 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 11:13
Antoine: what would be the use case for using a different encoding for the temporary bytes object? It's discarded anyway, so the encoding used isn't externally visible.
msg227346 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-23 11:23
The encoding used impacts the result:

>>> s = 'abc\udcc3\udca9'
>>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace')
'abc��'
>>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
'abcé'

The original string ('abc\udcc3\udca9') was obtained by decoding a valid utf-8 string with the 'ascii' codec and the 'surrogateescape' error handler.

If anything, the default encoding should probably be sys.getfilesystemencoding().
msg227362 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-09-23 14:50
On 23.09.2014 13:12, Nick Coghlan wrote:
> 
> Nick Coghlan added the comment:
> 
> Draft docstring for that version
> 
>     def convert_surrogates(data, errors='replace'):
>         """Convert escaped surrogates by applying a different error handler
> 
>         Uses the "replace" error handler by default, but any input
>         error handler may be specified.
>         """
>         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

Nick, the doc string is not correct. It is not working on escaped
surrogates. Instead it is working on lone surrogates that were used
to encode undecodable bytes from some input data.

The longer story goes like this:

The "surrogateescape" error handler in the .decode() call that lead up
to the data you want this function to take as input, will convert
undecodable data to lone low surrogates.

The function then reverts these bytes back into UTF-8 (which may well
not be the original encoding, as Antoine has already pointed out, but
that's not really important for the use case), recreating the
unencodable bytes and then decodes the result again using the UTF-8
codec using a new error handler.

So in summary, the function is supposed to retroactively apply
a different error handler to the input data, undoing the effects
of the "surrogateescapes" error handler.

The name still doesn't match this functionality.

BTW: There's a catch in the approach. The encoding used to decode
the original data may well be 'ascii'. Now, if the original input
data was in fact UTF-8, the input decoding would have mapped the
UTF-8 code points to lone surrogates. The above function would then
turn these back into UTF-8, redecode and get a completely different
string back (since the error handlers would not trigger).

I'm not sure whether adding such a small function with so many
unclear implications is a good idea. Either it should be
made more specific, e.g. be reserved for use on data from input
streams with known encoding, or be put into the documentation as
example for people to use and adapt as necessary.
msg227364 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-09-23 15:06
And indeed my use case for this has instances of both cases: originally decoded using ASCII and the non-ascii bytes must end up as replaced characters, and originally decoded using utf-8.

I'm also not sure that it is worth adding this.  If you know what you are doing the solution is obvious, and if you don't know what you are doing you shouldn't be using surrogateescape in the first place :)

Now, if there were or there is intended to be a more efficient C level implementation, that answer might be different.
msg227365 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-09-23 15:15
Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio.  So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module.  os.convert_surrogates?
msg227367 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 15:32
As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it by default on all OS interfaces (including the standard streams when we detect 'ascii' as the filesystem encoding in 3.5+).

This *needs* to be a case that folks can handle without needing to spend years learning about encodings and error handlers first. That means being able to tell them "use this documented function to remove the surrogates" rather than "use this magic incantation that you don't understand, and that other people may not be able to read".

I know more about Unicode encodings than the average programmer at this point, yet I still needed to be schooled by true experts in this thread to learn how to solve the problem properly.

Look at this as an opportunity to encapsulate that knowledge in executable form, as while the code is short, it is conceptually *very* dense.

If there's a dedicated function, then replacing the encode/decode dance with a faster pure C alternative also becomes a future possibility (with only a recipe, there's no opportunity to ever optimise it).

With the additional clarification, it is also clear to me that Antoine is correct that the encoding needs to be configurable and should default to the appropriate setting to remove the surrogates from OS provided data.

With that change:

    def convert_surrogates(data, encoding=None, errors='replace'):
        """Convert escaped surrogates by applying a different error handler

        If no encoding is given, defaults to sys.getfilesystemencoding()
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        if encoding is None:
            encoding = sys.getfilesystemencoding()
        return data.encode(encoding, 'surrogateescape').decode(encoding, errors)

Since it's primarily intended for cleaning OS provided data, then I agree os.convert_surrogates() could be a good choice. It would be appropriate to reference it from os.fsdecode() as a way to clean escaped data when the original binary data was no longer available to be decoded again with a different error handler.
msg227368 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-23 15:34
Good catch Antoine!

Here is a sample of more complicated implementation.
msg227369 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-09-23 15:46
Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding.
msg238182 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-16 07:12
Proposed preliminary patch adds three functions in the codecs module:

convert_surrogates(data, errors) -- handle lone surrogates with specified error handler.

>>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
'a€\\udca4'

convert_surrogateescape(data, errors) -- handle surrogateescaped bytes with specified error handler

>>> codecs.convert_surrogateescape('a\u20ac\udca4', 'backslashreplace')
'a€\\xa4'

convert_astrals(data, errors) -- handle astral (non-BMP) characters with specified error handler.

>>> codecs.convert_astral('a\u20ac\U000e007f', 'backslashreplace')
'a€\\U000e007f'

Names are discussable.

I think also about adding two functions or error handlers (that can used with convert_surrogates and convert_astrals) for composing astral characters from surrogate pairs and vice versa.
msg238193 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-03-16 11:00
(Serhiy, did you miss uploading the new patch?)

Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the case of handling lone surrogates and escaped surrogates, what about:

    rehandle_surrogatepass(data, errors="strict")
    rehandle_surrogateescape(data, errors="strict")

That is, we know we have data that was decoded with either surrogatepass or surrogateespace (respectively) as the error handler, and we want to process the results of that with a different error handler.

I believe those two would be enough to address the specific cases this issue was raised to cover, so it may make sense to file a separate issue to discuss the use cases for the custom astral handling.

Since astrals aren't actually errors in the first place, that could become:

    handle_astrals(data, errors="strict")

As in "pass every astral code point in this string through the named error handler".

The astral -> surrogate pair and surrogate pair -> astral converters do sound potentially interesting, but as noted above, I think they may call for a separate issue that better explains the specific use cases.
msg238277 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-17 11:13
I uploaded the patch just before your comment Nick.

Here is updated patch. Functions are renamed as Nick suggested, added two more functions: decompose_astrals() and compose_surrogate_pairs(). They are mainly for example here, they can be committed in other issue.

I hesitate about the rehandle_surrogatepass name. This function handles surrogates than can be created not only with the "surrogatepass" handler, but also with different ways, e.g. with the "surrogateescape" handler, with chr(), handle_astral() or decompose_astrals(). Actually it checks that the string is valid Unicode (not containing surrogates) and handle errors if found with specified error handler.

May be there is a time for wider discussion on Python-Dev. I especially want to hear opinions of Ezio and Martin.
msg238285 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-03-17 12:18
I'd wondered about that with respect to rehandle_surrogatepass.

The current implementation looks like it processes *all* surrogates (even valid surrogate pairs), so "handle_surrogates" might be a suitable name.

If the intent is for it to be "handle_lone_surrogates", I'm not sure the current implementation achieves that, as a valid surrogate pair will match re.compile('[\ud800-\uefff]+').

The rest looks OK to me, including the decompose_astrals() and compose_surrogate_pairs() functions. Regardless of any practical utility, the latter two seem useful for *educational* purposes when it comes to unicode, by making it clear how to switch between the single code point and dual code point representations of the astrals.
msg238286 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-03-17 12:22
Oh, and yes, I agree a python-dev discussion would be a good idea.

From my perspective, "rehandle_surrogateescape" is the key function for making it easier to check for malformed input data from operating system interfaces.

The other items I don't personally have a use case for, but they seem potentially valuable in make some key Unicode concepts a bit more discoverable.
msg238309 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-17 16:14
Note that provided Python implementations are rather a proof of concept. After discussion I'll provide more efficient C implementations, that should be 1-2 orders faster (and infinitely fast for common case of ASCII strings).
msg242799 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2015-05-09 07:53
Please do not add the "rehandle" functions to codecs.  They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation.

I suggest a "validation" submodule of the unicodedata package, or perhaps a new "unicodeutils" package, for these functions, as well as those that just detect the surrogates, etc.

Because they change the semantics of data they should be documented as potentially dangerous because they can't be inverted back to bytes without knowledge of the history of transformations they perform (and not even then in the case of the "replace" error handler).  This matters in applications where the input bytes may have been digitally signed, for example.
msg242810 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-05-09 13:15
surrogateescape and surrogateepass data *already* can't be inverted back to
bytes reliably without knowing the original encoding - if you encode them
as something else when they contain surrogates, you'll either get an
exception (the default) or mojibake (if you use
surrogateescape/surrogateepass as the output error handler). They only work
as a transparent pass through if the input and output encodings match.

I'd be fine with putting these data scrubbing functions somewhere other
than in codecs, though (I'm not sure unicodedata is the right place, but a
new module like "string.internals" might be, as these functions have more
to do with Python's internal text representation than they do anything
else. A module like the latter could also be a home for things like a
chunking utility that splits a string up into substrings that use as little
memory as possible for feeding into a StringIO instance before throwing the
original away).

I also don't think they're urgent - the introduction of /etc/locale.conf
makes modern Linux far more consistent in getting locale settings right,
and even older platforms tend to get the locale right for user processes.
msg242939 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-05-12 04:53
I suggest we defer this one to 3.6 - I still think it's worth doing, but I don't think it's a major barrier to migration, and it would be good to get some real world experience with the new sys.stdin behaviour of defaulting to using surrogateescape in the POSIX locale in 3.5 before committing to a particular design for the surrogate cleaning API.

I do like the idea of a "string.internals" submodule as a possible home for exposing the Python level API.
msg251682 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-09-27 05:00
[padding]

I think my suggested colours for the bikeshed would be handle_surrogates() and handle_surrogateescape(). “Rehandle” seems awkward and too assuming to me. And I agree with Serhiy that surrogates are a Unicode thing, not just related to the “surrogatepass” handler.

Adding them to “codecs” makes sense to me. The most important one, handle_surrogateescape() or equivalent, is closely related to the error handler of that module.

Having handle_surrogateescape or equivalent would probably be useful for Issue 25184 (displaying an arbitrary file path in a UTF-8 HTML file).
msg251693 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-09-27 08:55
Hum, I suggest to put these functions in a package on PyPI, or recipes on a
website like stackoverfkow., and close the issue.

I'm still not convinced that these functions are useful . Usually we take a
function from an existing project used in applications to put it in the
stdlib. Here the use case still looks artifical. For example which
application requires to escape non-BMP character? How does it handle them
currently?

Threre are too many ways to handle surrogate characters. The common ways to
show undecodable bytes are not supported by functions proposed by Serhiy.
Example: %80 on Mac OS X. Gnome uses something else.

It was said that one reason to add new functions is performance. I'm not
convinced neither that such function is the bottleneck on any application.

I prefer to wait until users experiment with their own implementation and
see if a common function can be extracted from this to put it in the stdlib.
msg251694 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-09-27 09:30
I think moving this forward mainly needs someone with the time and energy wrangle a python-ideas/dev discussion to get some additional feedback on the API design. As I see it, there are 2 main questions to be resolved:

1. Where to expose these functions

The default location would be the codecs module, as they're closely related to the error handlers in that module, and the main reasons for needing to clean data at all are handling dirty data produced by an interface that uses surrogatepass or surrogateescape when decoding (handle_surrogates, handle_surrogateescape), or encoding data for use in a context which doesn't correctly handle code points outside the basic multilingual plane (handle_astrals).

If added to the codecs module, they could be documented in new sections on "Postprocessing decoded text" and "Preprocessing text for encoding".

The main argument against that would be Stephen's one, which is that these aren't themselves encoding or decoding operations, but rather internal state manipulations on Python strings.

2. The exact function set to be provided.

The three potential data cleaning cases currently being considered:

* process_surrogates: reprocessing all surrogates in the string, including lone surrogates and valid surrogate pairs. Such strings may be produced by using the "surrogatepass" handler when decoding, or by decomposing astral characters to surrogate pairs.
* process_surrogateescape: reprocessing only lone surrogates in the U+DC80 to U+DCFF range, with other surrogate pairs or lone surrogates triggering UnicodeTranslateError. Such strings may be produced by using the "surrogateescape" error handler when decoding.
* process_astrals: reprocessing all code points in the astral plane.

These seem to cover the essentials to me, and I changed the proposed prefix to "process_*" based on the idea of documentating them as preprocessing and postprocessing steps for encoding and decoding.
msg251701 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-09-27 12:32
As far as the rationale for adding the functions at all goes, my main interest is still in having somewhere in the codecs module documentation to *define the problem*, and to my mind that entails also offering a simple way to do the relevant pre-/post-processing.

The nice aspect of building any related capabilities atop the standard error handlers is that it also means that third party modules can provide custom error handlers to support further escaping techniques, and those will also be available for use in decoding and encoding operations, rather than being specific to pre-/post-processing of the data.

However, it's also the case that we're generally going to be talking about the combination of encoding misconfiguration *and* processing data that gets potentially corrupted by the misconfiguration *and* doing something with it that isn't already handled by a surrogateescape round-trip, which is why I suspect in practice most applications are going to be able to get away with ignoring the problem entirely (especially with C.UTF-8 support coming to Fedora 24, so the Fedora/RHEL/CentOS ecosystem will be joining the Debian/Ubuntu ecosystem in offering that by default)
msg251708 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-09-27 16:17
I also want "detect if there are any surrogates".
msg251740 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-09-28 02:23
On Sun, Sep 27, 2015 at 04:17:45PM +0000, R. David Murray wrote:
> 
> I also want "detect if there are any surrogates".

I think that's useful enough it should be a str method. Here's a 
pure-Python implementation:

def is_surrogate(s):
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

The actual Flexible String Representation implementation can be even 
more efficient. All-ASCII and all-Latin1 strings cannot possibly include 
surrogates, so is_surrogate will be a constant-time operation.

(If we care about making this a constant-time operation for all 
strings, the constructor could set a flag on the string object. Since 
the constructor has to walk the string anyway, I don't think that will 
cost much time-wise.)
msg251836 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-09-29 09:52
> I also want "detect if there are any surrogates".

Could you please open a separated issue for this function/method?

I believe that it's very different than other proposed functions/methods.

It was proposed before to add methods like "is_ascii()" but the request was rejected because other Python implementations don't implement Unicode using the PEP 393 and so the method would be less efficient than expected, depending on the implementation of Python.

Well, I don't know if we should add methods relying on the PEP 393 or not. It's probably better to discuss such "political" question on python-dev.
msg251854 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-09-29 12:48
Done: issue 25269.
msg314682 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-03-30 06:52
With PEPs 538 and 540 implemented for 3.7, my thinking on this has evolved a bit.

A recent discussion on python-ideas [1] also introduced me to the third party library, "ftfy", which offers a wide range of tools for cleaning up improperly decoded data: https://ftfy.readthedocs.io/en/latest/

That includes a lone surrogate fixer: https://ftfy.readthedocs.io/en/latest/#ftfy.fixes.fix_surrogates

So a potential way to go here would be to a section on "Handling Improperly Decoded Text Data" to the codecs module documentation, and include ftfy as a See Also link in that new section.

If folks think that would be a reasonable way to go, then I think the clearest way to handle it would be to close this issue as "later" (which still implies "maybe never", but not as strongly as "rejected" does), and open a new issue for the suggested new section in the docs.

[1] https://mail.python.org/pipermail/python-ideas/2018-January/048583.html
History
Date User Action Args
2022-04-11 14:57:49adminsetgithub: 63014
2018-06-09 11:03:02ncoghlanunlinkissue22555 dependencies
2018-03-30 06:52:39ncoghlansetmessages: + msg314682
versions: + Python 3.8, - Python 3.6
2015-09-29 12:48:38r.david.murraysetmessages: + msg251854
2015-09-29 09:52:34vstinnersetmessages: + msg251836
2015-09-28 02:23:56steven.dapranosetmessages: + msg251740
2015-09-27 16:17:44r.david.murraysetmessages: + msg251708
2015-09-27 12:32:15ncoghlansetmessages: + msg251701
2015-09-27 09:30:58ncoghlansetmessages: + msg251694
2015-09-27 08:55:47vstinnersetmessages: + msg251693
2015-09-27 05:00:34martin.pantersetnosy: + martin.panter
messages: + msg251682
2015-06-07 15:39:00steven.dapranosetnosy: + steven.daprano
2015-05-12 04:54:00ncoghlansettitle: Add codecs.convert_surrogateescape to "clean" surrogate escaped strings -> Add utilities to "clean" surrogate code points from strings
messages: + msg242939
versions: + Python 3.6, - Python 3.5
2015-05-09 13:15:33ncoghlansetmessages: + msg242810
2015-05-09 07:53:04sjtsetnosy: + sjt
messages: + msg242799
2015-03-17 16:14:15serhiy.storchakasetmessages: + msg238309
2015-03-17 12:22:21ncoghlansetmessages: + msg238286
2015-03-17 12:18:58ncoghlansetmessages: + msg238285
2015-03-17 11:30:08serhiy.storchakasetfiles: + codecs_convert_escapes_2.patch
2015-03-17 11:13:28serhiy.storchakasetmessages: + msg238277
2015-03-16 11:00:12ncoghlansetmessages: + msg238193
2015-03-16 10:52:49serhiy.storchakasetfiles: + codecs_convert_escapes.patch
keywords: + patch
2015-03-16 07:12:52serhiy.storchakasetmessages: + msg238182
components: + Library (Lib)
versions: + Python 3.5, - Python 3.4
2015-03-15 22:09:04serhiy.storchakasetdependencies: + Add support of UnicodeTranslateError in standard error handlers
2014-10-05 04:09:50ncoghlanlinkissue22555 dependencies
2014-09-23 15:46:19ncoghlansetmessages: + msg227369
2014-09-23 15:34:11serhiy.storchakasetfiles: + convert_surrogates.py

messages: + msg227368
title: Add a convert_surrogates function to "clean" surrogate escaped strings -> Add codecs.convert_surrogateescape to "clean" surrogate escaped strings
2014-09-23 15:32:36ncoghlansetmessages: + msg227367
title: Add codecs.convert_surrogateescape to "clean" surrogate escaped strings -> Add a convert_surrogates function to "clean" surrogate escaped strings
2014-09-23 15:15:19r.david.murraysetmessages: + msg227365
2014-09-23 15:06:57r.david.murraysetmessages: + msg227364
2014-09-23 14:50:13lemburgsetmessages: + msg227362
2014-09-23 11:23:47pitrousetmessages: + msg227346
2014-09-23 11:13:12ncoghlansetmessages: + msg227345
2014-09-23 11:12:04ncoghlansetmessages: + msg227344
2014-09-23 11:04:47pitrousetmessages: + msg227343
2014-09-23 11:00:44ncoghlansetmessages: + msg227342
2014-09-23 10:57:54ncoghlansetmessages: + msg227341
2014-09-23 10:51:12ncoghlansetmessages: + msg227340
2014-09-23 10:39:50lemburgsetnosy: + lemburg
messages: + msg227339
2014-09-23 09:59:43ncoghlansetmessages: + msg227334
title: Add tools for "cleaning" surrogate escaped strings -> Add codecs.convert_surrogateescape to "clean" surrogate escaped strings
2014-08-27 11:00:40ncoghlansetdependencies: + Allow backslashreplace error handler to be used on input
messages: + msg225972
2014-08-25 18:55:02pitrousetmessages: + msg225889
2014-08-25 14:50:42r.david.murraysetmessages: + msg225875
2014-08-25 06:59:28ncoghlansetmessages: + msg225869
2014-08-25 01:16:14vstinnersetmessages: + msg225858
2014-08-24 14:16:53ncoghlansetmessages: + msg225822
2014-08-24 13:49:12serhiy.storchakasetmessages: + msg225818
2014-08-24 13:00:36ncoghlansetmessages: + msg225817
2014-08-24 11:46:14ezio.melottisetmessages: + msg225811
2014-08-24 11:02:05ncoghlansetmessages: + msg225809
2014-08-24 09:11:24serhiy.storchakasetmessages: + msg225806
2014-08-24 08:24:10ncoghlansetmessages: + msg225801
2014-08-24 07:58:21ezio.melottisetmessages: + msg225800
2014-08-24 04:44:12ncoghlansetmessages: + msg225793
2014-08-24 03:00:11ncoghlansetmessages: + msg225791
2013-08-24 00:27:00ncoghlansetmessages: + msg196063
2013-08-23 23:28:45r.david.murraysetmessages: + msg196054
2013-08-23 22:06:39vstinnersetnosy: + vstinner
messages: + msg196049
2013-08-23 22:02:00pitrousetnosy: + pitrou
messages: + msg196047
2013-08-23 20:28:58ezio.melottisetnosy: + ezio.melotti
2013-08-23 11:55:16serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg195960
2013-08-23 11:02:55r.david.murraysetnosy: + r.david.murray
messages: + msg195959
2013-08-23 04:39:55Arfreversetnosy: + Arfrever
2013-08-23 04:02:32ncoghlancreate