Issue 18814: Add utilities to "clean" surrogate code points from strings

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63014

classification

Title:	Add utilities to "clean" surrogate code points from strings
Type:	enhancement	Stage:	needs patch
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:	22286 23676	Superseder:
Assigned To:		Nosy List:	Arfrever, ezio.melotti, lemburg, martin.panter, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, sjt, steven.daprano, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-08-23 04:02 by ncoghlan, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
convert_surrogates.py	serhiy.storchaka, 2014-09-23 15:34
codecs_convert_escapes.patch	serhiy.storchaka, 2015-03-16 10:52		review
codecs_convert_escapes_2.patch	serhiy.storchaka, 2015-03-17 11:30		review

Messages (55)
msg195937 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-08-23 04:02
Prompted by issue 18713 and http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/, here are some possible utilities we could add to the codecs module to help deal with/debug issues related to surrogate escaped strings: def has_escaped_bytes(s): """Returns true if string contains surrogate escaped bytes""" ... def replace_escaped_bytes(s): """Replaces each surrogate escaped byte with a valid code point""" ... def decode_escaped_bytes(s, nominal_encoding, actual_encoding): """Reinterprets incorrectly decoded text using a new encoding""" return s.encode(nominal_encoding, 'surrogateescape').decode(actual_encoding)
msg195959 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-23 11:02
The email package needs has_escaped_bytes. Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand. What does replace do? Replace it with the unknown character code point? Email also needs that.
msg195960 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-08-23 11:55
These functions are trivial. def has_escaped_bytes(s): try: s.encode() return False except UnicodeEncodeError: return True def replace_escaped_bytes(s): return s.encode('utf-8', 'replace').decode()
msg196047 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-08-23 22:02
Can you sum the use cases for these? (don't want to read a blog post, sorry :-))
msg196049 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-08-23 22:06
"The email package needs has_escaped_bytes. Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand." In which function you need to check this? What do you do if there are surrogate characters?
msg196054 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-23 23:28
The email package uses surrogateescape to store unknown bytes in unicode strings, just as with the handle-bad-data-from-os API surrogateescape was introduced for. (For the same reason: the source data may have improperly encoded bytes that we must nevertheless preserve). What action certain parts of the code takes differs depending on whether or not there are such encoded bytes in the string. So the code needs to check and branch based on that. The utility function is email.utils._has_surrogates. You can grep the email source to see where it is used, if you like.
msg196063 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-08-24 00:26
The use case is to take data from a surrogate escaped interface and either filter it out entirely or convert it to a valid Unicode string at the point of input, before letting it make its way into the rest of the application. For example, this approach permits the Python equivalent of the Linux console behaviour when it attempts to display the improperly encoded file name. This approach is likely to be easier to implement in cross-platform code than dropping down to the bytes interfaces on POSIX systems and using the Unicode APIs on Windows.
msg225791 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-24 03:00
Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal: # Constant in the string module (akin to string.ascii_letters et al) escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape') # Helper to ensure a string contains no escaped surrogates # This allows it to be safely encoded without surrogateescape _match_surrogates = re.compile('[{}]'.format(escaped_surrogates)) def clean(s, repl='\ufffd'): return _match_surrogates.sub(repl, s) # Helper to redecode a string that was decoded incorrectly # For example, WSGI strings are passed from the server to the # framework as latin-1 by default and may need to be redecoded def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'): return s.encode(old_encoding, old_errors).decode(encoding, errors) In addition to the concrete use cases David describes, I think these will also serve a useful documentation purpose, in highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).
msg225793 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-24 04:44
Note: I dropped the "has_escaped_bytes" idea, as "s != string.clean(s)" is functionally equivalent, albeit a bit more expensive if you don't actually want the cleaned string)
msg225800 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-08-24 07:58
I think similar functions should be added in the unicodedata module rather than the string module or as str methods. If I'm not mistaken this was already proposed in another issue. In C we already added macros like IS_{HIGH\|LOW\|}_SURROGATE and possibly others to help dealing with surrogates but AFAIK there's no Python equivalent yet. As for the specific constants/functions/methods you propose, IMHO the name escaped_surrogates is not too clear. If it's a string of lone surrogates I would just call it unicodedata.surrogates (and .high_surrogates/.low_surrogates). These can also be used to build oneliner to check if a string contains surrogates and/or to remove them. clean has a very generic name with no hints about surrogates, and its purpose is quite specific. I'm also not a big fan of redecode. The equivalent calls to encode/decode are not much longer and more explicit. Also having to redecode often indicates that there's a bug before that should be fixed instead (if possible).
msg225801 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-24 08:24
The purpose of these changes it to provide tools specifically for working with surrogate escaped data, not for working with arbitrary lone Unicode surrogates. "escaped_surrogates" is not defined by the Unicode spec, it's defined by the behaviour of the surrogateescape error handler that lets us tunnel arbitrary bytes through str objects and reproduce them faithfully at the far end. On reflection, I think codecs would be a better home than string (as that's where the error handler is defined), but it doesn't belong in unicodedata. I'd be OK with changing the name of the clean function to "clean_escaped_surrogates". Needing redecode is not a bug: it's baked into the WSGI spec in PEP 3333. I would be OK with providing it in wsgiref rather than the codecs or string modules, but I think we should provide it somewhere.
msg225806 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-08-24 09:11
I agree with Ezio in all points. escaped_surrogates is inefficient for any purposes and incomplete. _match_surrogates can be created in more efficient way. clean() has too general and misleading name. redecode() looks just ridiculous, this cumbersome 4-argument function just combine two methods.
msg225809 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-24 11:02
Guys, you're Python 3 unicode experts, already thoroughly familiar with how surrogateescape works. These features are not for you. The exact implementations don't matter. These need to exist, and they need to be documented with detailed explanations. People don't yet understand what the surrogateescape error handler is, and what it's for, or how to work with it. People consider surrogate escaping this weird mysterious thing. It's not - it's a really simple mapping of the 128 bytes with the high order bit set to a different part of the Unicode code space. These functions make it obvious. wsgiref.redecode is a natural consequence of the design of PEP 3333 - it makes sense to offer it there, as a hint that frameworks are likely to need it. We have a serious, serious, learnability problem around binary data handling in Python 3. It's nowhere near as bad as the learnability problem for text handling in Python 2, but it still needs more scaffolding to help people grow in understanding.
msg225811 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-08-24 11:46
That's why I think a function like redecode is a bad idea. With Python 2 I've seen lot of people blindingly trying .decode when .encode failed (and the other way around) whenever they were getting an UnicodeError (and the fact that decoding Unicode results in an encode error didn't help). I'm afraid that confused developers will try to (mis)use redecode as a workaround to attempt to fix something that shouldn't be broken in the first place, without actually understanding what the real problem is. There's really no excuse for developers not to know Unicode nowadays (Joel Spolsky said the same thing more than 10 years ago). IME people also tend to overestimate how difficult it is to understand Unicode. While it is true that you can go pretty deep down the rabbit hole, the basic concepts necessary for everyday use are pretty simple and straightforward.
msg225817 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-24 13:00
The redecode thing is a distraction from my core concern here, so I've split that out to issue #22264, a separate RFE for a "wsgiref.fix_encoding" function. For this issue, my main concern is the function to clean a string of escaped binary data, so it can be displayed easily, or otherwise purged of the escaped characters. Preserving the data by default is good, but you have to know a lot about how Python 3 works in order to be able figure out how to clean it out. For that, not knowing Unicode in general isn't the problem: it's not knowing PEP 383. If we forget the idea of exposing the constant with the escaped values (I agree that's not very useful), it suggests "codecs.clean_surrogate_escapes" as a possible name: # Helper to ensure a string contains no escaped surrogates # This allows it to be safely encoded without surrogateescape _extended_ascii = bytes(range(128, 256)) _escaped_surrogates = _extended_ascii.decode('ascii', errors='surrogateescape') _match_escaped = re.compile('[{}]'.format(_escaped_surrogates)) def clean_surrogate_escapes(s, repl='\ufffd'): return _match_escaped.sub(repl, s) A more efficient implementation in C would also be fine, this is just an easy way to define the exact semantics. (I also just noticed that unlike other error handlers, surrogateespace and surrogatepass do not have corresponding codecs.surrogateescape_errors and codecs.surrogatepass_errors functions)
msg225818 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-08-24 13:49
What problem is purposed to solve clean_surrogate_escapes()? Could you please provide user scenario or two? Possible alternative implementation is: def clean_surrogate_escapes(s): return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace') It can be faster for some data (for mostly ASCII with rare surrogates it is superfast). For other data 'utf-16' can be better choice.
msg225822 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-24 14:16
My main use case is for passing data to other applications that don't have their Unicode handling in order - I want to be able to use Python to do the data scrubbing, but at the moment it requires intimate knowledge of the codec error handling system to do it. (I had never even heard of surrogatepass until this evening) Situation: What I have: data decoded with surrogateescape What I want: that same data with all the surrogates gone, replaced with either the Unicode replacement character or an ASCII question mark (which I want will depend on the exact situation) Assume I am largely clueless about the codec system. I know nothing beyond the fact that Python 3 strings may have smuggled bytes in them and I want to get rid of them because they confuse the application I'm passing them to. The concrete example that got me thinking about this again was the task of writing filenames into a UTF-8 encoded email, and wanting to scrub the output from os.listdir before writing the list into the email (s/email/web page/ also works). For issue #22016 I actually suggested doing this as another codec error handler ("surrogatereplace"), but Stephen Turnbull convinced me this original idea was better: it should just be a pure data transformation pass on the string, clearing the surrogates out, and leaving me with data that is identical to that I would have had if "surrogatereplace" had been used instead of "surrogateescape" in the first place. As "errors='replace'" already covers the "ASCII ?" replacement case, that means your proposed "redecode" based solution would cover the rest of my use case.
msg225858 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-08-25 01:16
Your clean() function looses information. If a filename contains almost only undecodable characters, it will looks like ��.txt. It's not very useful. I would prefer to escape the byte. Mac OS X (HFS+ filesystem) uses for example %HH format: "\udc80" would be replaced with "%80" for example. This format is also used in URLs. For example, "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt" in Firefox (when listing a local directory), but Firefox uses the URL "file://.../a%E9b.txt" (hexadecimal in uppercase). In the Gnome file browser (Nautilus), "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt (invalid encoding)".
msg225869 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-25 06:59
Ideally we'd have string modification support for all the translations we offer as codec error handlers: * Unicode replacement character ('replace' on input) * ASCII question mark ('replace' on output) * Dropping them entirely ('ignore') * XML character reference ('xmlcharrefreplace') * Python escape sequence ('backslashreplace') The reason it's beneficial to be able to do these as string transformations rather than only in the codecs is that you may just be contributing part of the output, with the actual encoding operation handled elsewhere (e.g. you may be storing it in a data structure that will later be encoded as JSON or XML, or my earlier example of generating a list of files to be included in an email). Surrogates are great when you're just passing data straight back to the operating system. They're not so great when you're passing them on to other parts of the application as text. I'd prefer to be able to deal with them closer to the point of origin, at least in some cases. Now, some of these things can be done today using Serhiy's trick of encoding to UTF-8 and then decoding again: data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace') data.encode('utf-8', 'replace').decode('utf-8') data.encode('utf-8', 'ignore').decode('utf-8') However, these two don't work properly: data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8') data.encode('utf-8', 'backslashreplace').decode('utf-8') The reason those don't work is because they'll encode the surrogate escaped bytes, rather than the originals. Mapping the escaped bytes to percent encoding has the same problem - you likely want to do a two step transformation (escaped surrogate -> original byte -> percent encoded value), rather than directly percent encoding the already escaped bytes.
msg225875 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-08-25 14:50
Right now having has_escaped_bytes isn't too important, since I've done nothing to profile and improve the performance of the new email code. But eventually I'll need it, because detecting the existence of escaped bytes is inside some of the inner loops in the header processing code. "s != s.clean_escaped_bytes()" (or whatever you call it) just smells wrong as a way to spell has_escaped_bytes.
msg225889 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-08-25 18:55
> data.encode('utf-8', 'replace').decode('utf-8') > data.encode('utf-8', 'ignore').decode('utf-8') Why not the reverse: os.fsencode(data).decode('utf-8', 'replace') os.fsencode(data).decode('utf-8', 'ignore') Note that "backslashreplace" needs to be enhanced to work when decoding too. Note that "xmlcharrefreplace" doesn't make sense here: it encodes a character reference, but you're precisely trying to represent something which fails interpreting as a character. (AFAIK, XML can't represent non-text data, except in NDATA sequences)
msg225972 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-08-27 11:00
Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case. Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286 I also agree we can ignore xmlcharrefreplace here. So that leaves the basic pattern as: data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace') This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that. Possible function name and implementation: def convert_surrogateescape(data, errors='replace'): return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here) Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.
msg227334 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 09:59
Updated issue title to reflect current proposal.
msg227339 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2014-09-23 10:39
Don't like the function name :-) How about codecs.filter_non_utf8_data(), since that's closer to what the function is really doing and doesn't require knowledge about what "surrogateescape" is.
msg227340 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 10:51
The error handler is called "surrogateescape". That means "convert_surrogateescape" is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the user other than the name of the error handler and the fact that it is used to smuggle arbitrary bytes through the Python 3 str type. Getting from "this string was decoded with the surrogateescape handler and may contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup function is a much bigger leap that requires more assumed knowledge on the part of the user, and also one that confuses the conceptual purpose of the function (cleaning up the output of the surrogateescape error handler to ensure it is a pure Unicode string) with the internal details of the proposed approach to implementing that cleanup operation (encoding to UTF-8 with surrogateescape, and then decoding again with a different error handler).
msg227341 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 10:57
The function definition again, this time with a draft docstring: def convert_surrogateescape(data, errors='replace'): """Convert escaped raw bytes by applying a different error handler Uses the "replace" error handler by default, but any input error handler may be specified. """ return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
msg227342 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 11:00
Note I would also be OK with "convert_surrogates", as that's the term that appears in the relevant error message: >>> b'\xe9'.decode('ascii', 'surrogateescape').encode() Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed
msg227343 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-09-23 11:04
Le 23/09/2014 12:57, Nick Coghlan a écrit : > The function definition again, this time with a draft docstring: > > def convert_surrogateescape(data, errors='replace'): > """Convert escaped raw bytes by applying a different error handler > > Uses the "replace" error handler by default, but any input > error handler may be specified. > """ > return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) 'utf-8' is hardcoded?
msg227344 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 11:12
Draft docstring for that version def convert_surrogates(data, errors='replace'): """Convert escaped surrogates by applying a different error handler Uses the "replace" error handler by default, but any input error handler may be specified. """ return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
msg227345 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 11:13
Antoine: what would be the use case for using a different encoding for the temporary bytes object? It's discarded anyway, so the encoding used isn't externally visible.
msg227346 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-09-23 11:23
The encoding used impacts the result: >>> s = 'abc\udcc3\udca9' >>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace') 'abc��' >>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace') 'abcé' The original string ('abc\udcc3\udca9') was obtained by decoding a valid utf-8 string with the 'ascii' codec and the 'surrogateescape' error handler. If anything, the default encoding should probably be sys.getfilesystemencoding().
msg227362 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2014-09-23 14:50
On 23.09.2014 13:12, Nick Coghlan wrote: > > Nick Coghlan added the comment: > > Draft docstring for that version > > def convert_surrogates(data, errors='replace'): > """Convert escaped surrogates by applying a different error handler > > Uses the "replace" error handler by default, but any input > error handler may be specified. > """ > return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) Nick, the doc string is not correct. It is not working on escaped surrogates. Instead it is working on lone surrogates that were used to encode undecodable bytes from some input data. The longer story goes like this: The "surrogateescape" error handler in the .decode() call that lead up to the data you want this function to take as input, will convert undecodable data to lone low surrogates. The function then reverts these bytes back into UTF-8 (which may well not be the original encoding, as Antoine has already pointed out, but that's not really important for the use case), recreating the unencodable bytes and then decodes the result again using the UTF-8 codec using a new error handler. So in summary, the function is supposed to retroactively apply a different error handler to the input data, undoing the effects of the "surrogateescapes" error handler. The name still doesn't match this functionality. BTW: There's a catch in the approach. The encoding used to decode the original data may well be 'ascii'. Now, if the original input data was in fact UTF-8, the input decoding would have mapped the UTF-8 code points to lone surrogates. The above function would then turn these back into UTF-8, redecode and get a completely different string back (since the error handlers would not trigger). I'm not sure whether adding such a small function with so many unclear implications is a good idea. Either it should be made more specific, e.g. be reserved for use on data from input streams with known encoding, or be put into the documentation as example for people to use and adapt as necessary.
msg227364 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-09-23 15:06
And indeed my use case for this has instances of both cases: originally decoded using ASCII and the non-ascii bytes must end up as replaced characters, and originally decoded using utf-8. I'm also not sure that it is worth adding this. If you know what you are doing the solution is obvious, and if you don't know what you are doing you shouldn't be using surrogateescape in the first place :) Now, if there were or there is intended to be a more efficient C level implementation, that answer might be different.
msg227365 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-09-23 15:15
Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio. So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module. os.convert_surrogates?
msg227367 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 15:32
As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it by default on all OS interfaces (including the standard streams when we detect 'ascii' as the filesystem encoding in 3.5+). This needs to be a case that folks can handle without needing to spend years learning about encodings and error handlers first. That means being able to tell them "use this documented function to remove the surrogates" rather than "use this magic incantation that you don't understand, and that other people may not be able to read". I know more about Unicode encodings than the average programmer at this point, yet I still needed to be schooled by true experts in this thread to learn how to solve the problem properly. Look at this as an opportunity to encapsulate that knowledge in executable form, as while the code is short, it is conceptually very dense. If there's a dedicated function, then replacing the encode/decode dance with a faster pure C alternative also becomes a future possibility (with only a recipe, there's no opportunity to ever optimise it). With the additional clarification, it is also clear to me that Antoine is correct that the encoding needs to be configurable and should default to the appropriate setting to remove the surrogates from OS provided data. With that change: def convert_surrogates(data, encoding=None, errors='replace'): """Convert escaped surrogates by applying a different error handler If no encoding is given, defaults to sys.getfilesystemencoding() Uses the "replace" error handler by default, but any input error handler may be specified. """ if encoding is None: encoding = sys.getfilesystemencoding() return data.encode(encoding, 'surrogateescape').decode(encoding, errors) Since it's primarily intended for cleaning OS provided data, then I agree os.convert_surrogates() could be a good choice. It would be appropriate to reference it from os.fsdecode() as a way to clean escaped data when the original binary data was no longer available to be decoded again with a different error handler.
msg227368 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-09-23 15:34
Good catch Antoine! Here is a sample of more complicated implementation.
msg227369 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-09-23 15:46
Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding.
msg238182 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-16 07:12
Proposed preliminary patch adds three functions in the codecs module: convert_surrogates(data, errors) -- handle lone surrogates with specified error handler. >>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace') 'a€\\udca4' convert_surrogateescape(data, errors) -- handle surrogateescaped bytes with specified error handler >>> codecs.convert_surrogateescape('a\u20ac\udca4', 'backslashreplace') 'a€\\xa4' convert_astrals(data, errors) -- handle astral (non-BMP) characters with specified error handler. >>> codecs.convert_astral('a\u20ac\U000e007f', 'backslashreplace') 'a€\\U000e007f' Names are discussable. I think also about adding two functions or error handlers (that can used with convert_surrogates and convert_astrals) for composing astral characters from surrogate pairs and vice versa.
msg238193 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-03-16 11:00
(Serhiy, did you miss uploading the new patch?) Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the case of handling lone surrogates and escaped surrogates, what about: rehandle_surrogatepass(data, errors="strict") rehandle_surrogateescape(data, errors="strict") That is, we know we have data that was decoded with either surrogatepass or surrogateespace (respectively) as the error handler, and we want to process the results of that with a different error handler. I believe those two would be enough to address the specific cases this issue was raised to cover, so it may make sense to file a separate issue to discuss the use cases for the custom astral handling. Since astrals aren't actually errors in the first place, that could become: handle_astrals(data, errors="strict") As in "pass every astral code point in this string through the named error handler". The astral -> surrogate pair and surrogate pair -> astral converters do sound potentially interesting, but as noted above, I think they may call for a separate issue that better explains the specific use cases.
msg238277 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-17 11:13
I uploaded the patch just before your comment Nick. Here is updated patch. Functions are renamed as Nick suggested, added two more functions: decompose_astrals() and compose_surrogate_pairs(). They are mainly for example here, they can be committed in other issue. I hesitate about the rehandle_surrogatepass name. This function handles surrogates than can be created not only with the "surrogatepass" handler, but also with different ways, e.g. with the "surrogateescape" handler, with chr(), handle_astral() or decompose_astrals(). Actually it checks that the string is valid Unicode (not containing surrogates) and handle errors if found with specified error handler. May be there is a time for wider discussion on Python-Dev. I especially want to hear opinions of Ezio and Martin.
msg238285 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-03-17 12:18
I'd wondered about that with respect to rehandle_surrogatepass. The current implementation looks like it processes all surrogates (even valid surrogate pairs), so "handle_surrogates" might be a suitable name. If the intent is for it to be "handle_lone_surrogates", I'm not sure the current implementation achieves that, as a valid surrogate pair will match re.compile('[\ud800-\uefff]+'). The rest looks OK to me, including the decompose_astrals() and compose_surrogate_pairs() functions. Regardless of any practical utility, the latter two seem useful for educational purposes when it comes to unicode, by making it clear how to switch between the single code point and dual code point representations of the astrals.
msg238286 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-03-17 12:22
Oh, and yes, I agree a python-dev discussion would be a good idea. From my perspective, "rehandle_surrogateescape" is the key function for making it easier to check for malformed input data from operating system interfaces. The other items I don't personally have a use case for, but they seem potentially valuable in make some key Unicode concepts a bit more discoverable.
msg238309 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-17 16:14
Note that provided Python implementations are rather a proof of concept. After discussion I'll provide more efficient C implementations, that should be 1-2 orders faster (and infinitely fast for common case of ASCII strings).
msg242799 - (view)	Author: Stephen J. Turnbull (sjt) *	Date: 2015-05-09 07:53
Please do not add the "rehandle" functions to codecs. They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation. I suggest a "validation" submodule of the unicodedata package, or perhaps a new "unicodeutils" package, for these functions, as well as those that just detect the surrogates, etc. Because they change the semantics of data they should be documented as potentially dangerous because they can't be inverted back to bytes without knowledge of the history of transformations they perform (and not even then in the case of the "replace" error handler). This matters in applications where the input bytes may have been digitally signed, for example.
msg242810 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-05-09 13:15
surrogateescape and surrogateepass data already can't be inverted back to bytes reliably without knowing the original encoding - if you encode them as something else when they contain surrogates, you'll either get an exception (the default) or mojibake (if you use surrogateescape/surrogateepass as the output error handler). They only work as a transparent pass through if the input and output encodings match. I'd be fine with putting these data scrubbing functions somewhere other than in codecs, though (I'm not sure unicodedata is the right place, but a new module like "string.internals" might be, as these functions have more to do with Python's internal text representation than they do anything else. A module like the latter could also be a home for things like a chunking utility that splits a string up into substrings that use as little memory as possible for feeding into a StringIO instance before throwing the original away). I also don't think they're urgent - the introduction of /etc/locale.conf makes modern Linux far more consistent in getting locale settings right, and even older platforms tend to get the locale right for user processes.
msg242939 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-05-12 04:53
I suggest we defer this one to 3.6 - I still think it's worth doing, but I don't think it's a major barrier to migration, and it would be good to get some real world experience with the new sys.stdin behaviour of defaulting to using surrogateescape in the POSIX locale in 3.5 before committing to a particular design for the surrogate cleaning API. I do like the idea of a "string.internals" submodule as a possible home for exposing the Python level API.
msg251682 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-09-27 05:00
[padding] I think my suggested colours for the bikeshed would be handle_surrogates() and handle_surrogateescape(). “Rehandle” seems awkward and too assuming to me. And I agree with Serhiy that surrogates are a Unicode thing, not just related to the “surrogatepass” handler. Adding them to “codecs” makes sense to me. The most important one, handle_surrogateescape() or equivalent, is closely related to the error handler of that module. Having handle_surrogateescape or equivalent would probably be useful for Issue 25184 (displaying an arbitrary file path in a UTF-8 HTML file).
msg251693 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-09-27 08:55
Hum, I suggest to put these functions in a package on PyPI, or recipes on a website like stackoverfkow., and close the issue. I'm still not convinced that these functions are useful . Usually we take a function from an existing project used in applications to put it in the stdlib. Here the use case still looks artifical. For example which application requires to escape non-BMP character? How does it handle them currently? Threre are too many ways to handle surrogate characters. The common ways to show undecodable bytes are not supported by functions proposed by Serhiy. Example: %80 on Mac OS X. Gnome uses something else. It was said that one reason to add new functions is performance. I'm not convinced neither that such function is the bottleneck on any application. I prefer to wait until users experiment with their own implementation and see if a common function can be extracted from this to put it in the stdlib.
msg251694 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-09-27 09:30
I think moving this forward mainly needs someone with the time and energy wrangle a python-ideas/dev discussion to get some additional feedback on the API design. As I see it, there are 2 main questions to be resolved: 1. Where to expose these functions The default location would be the codecs module, as they're closely related to the error handlers in that module, and the main reasons for needing to clean data at all are handling dirty data produced by an interface that uses surrogatepass or surrogateescape when decoding (handle_surrogates, handle_surrogateescape), or encoding data for use in a context which doesn't correctly handle code points outside the basic multilingual plane (handle_astrals). If added to the codecs module, they could be documented in new sections on "Postprocessing decoded text" and "Preprocessing text for encoding". The main argument against that would be Stephen's one, which is that these aren't themselves encoding or decoding operations, but rather internal state manipulations on Python strings. 2. The exact function set to be provided. The three potential data cleaning cases currently being considered: * process_surrogates: reprocessing all surrogates in the string, including lone surrogates and valid surrogate pairs. Such strings may be produced by using the "surrogatepass" handler when decoding, or by decomposing astral characters to surrogate pairs. * process_surrogateescape: reprocessing only lone surrogates in the U+DC80 to U+DCFF range, with other surrogate pairs or lone surrogates triggering UnicodeTranslateError. Such strings may be produced by using the "surrogateescape" error handler when decoding. * process_astrals: reprocessing all code points in the astral plane. These seem to cover the essentials to me, and I changed the proposed prefix to "process_*" based on the idea of documentating them as preprocessing and postprocessing steps for encoding and decoding.
msg251701 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-09-27 12:32
As far as the rationale for adding the functions at all goes, my main interest is still in having somewhere in the codecs module documentation to define the problem, and to my mind that entails also offering a simple way to do the relevant pre-/post-processing. The nice aspect of building any related capabilities atop the standard error handlers is that it also means that third party modules can provide custom error handlers to support further escaping techniques, and those will also be available for use in decoding and encoding operations, rather than being specific to pre-/post-processing of the data. However, it's also the case that we're generally going to be talking about the combination of encoding misconfiguration and processing data that gets potentially corrupted by the misconfiguration and doing something with it that isn't already handled by a surrogateescape round-trip, which is why I suspect in practice most applications are going to be able to get away with ignoring the problem entirely (especially with C.UTF-8 support coming to Fedora 24, so the Fedora/RHEL/CentOS ecosystem will be joining the Debian/Ubuntu ecosystem in offering that by default)
msg251708 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-09-27 16:17
I also want "detect if there are any surrogates".
msg251740 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2015-09-28 02:23
On Sun, Sep 27, 2015 at 04:17:45PM +0000, R. David Murray wrote: > > I also want "detect if there are any surrogates". I think that's useful enough it should be a str method. Here's a pure-Python implementation: def is_surrogate(s): return any(0xD800 <= ord(c) <= 0xDFFF for c in s) The actual Flexible String Representation implementation can be even more efficient. All-ASCII and all-Latin1 strings cannot possibly include surrogates, so is_surrogate will be a constant-time operation. (If we care about making this a constant-time operation for all strings, the constructor could set a flag on the string object. Since the constructor has to walk the string anyway, I don't think that will cost much time-wise.)
msg251836 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-09-29 09:52
> I also want "detect if there are any surrogates". Could you please open a separated issue for this function/method? I believe that it's very different than other proposed functions/methods. It was proposed before to add methods like "is_ascii()" but the request was rejected because other Python implementations don't implement Unicode using the PEP 393 and so the method would be less efficient than expected, depending on the implementation of Python. Well, I don't know if we should add methods relying on the PEP 393 or not. It's probably better to discuss such "political" question on python-dev.
msg251854 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-09-29 12:48
Done: issue 25269.
msg314682 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2018-03-30 06:52
With PEPs 538 and 540 implemented for 3.7, my thinking on this has evolved a bit. A recent discussion on python-ideas [1] also introduced me to the third party library, "ftfy", which offers a wide range of tools for cleaning up improperly decoded data: https://ftfy.readthedocs.io/en/latest/ That includes a lone surrogate fixer: https://ftfy.readthedocs.io/en/latest/#ftfy.fixes.fix_surrogates So a potential way to go here would be to a section on "Handling Improperly Decoded Text Data" to the codecs module documentation, and include ftfy as a See Also link in that new section. If folks think that would be a reasonable way to go, then I think the clearest way to handle it would be to close this issue as "later" (which still implies "maybe never", but not as strongly as "rejected" does), and open a new issue for the suggested new section in the docs. [1] https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

History
Date	User	Action	Args
2022-04-11 14:57:49	admin	set	github: 63014
2018-06-09 11:03:02	ncoghlan	unlink	issue22555 dependencies
2018-03-30 06:52:39	ncoghlan	set	messages: + msg314682 versions: + Python 3.8, - Python 3.6
2015-09-29 12:48:38	r.david.murray	set	messages: + msg251854
2015-09-29 09:52:34	vstinner	set	messages: + msg251836
2015-09-28 02:23:56	steven.daprano	set	messages: + msg251740
2015-09-27 16:17:44	r.david.murray	set	messages: + msg251708
2015-09-27 12:32:15	ncoghlan	set	messages: + msg251701
2015-09-27 09:30:58	ncoghlan	set	messages: + msg251694
2015-09-27 08:55:47	vstinner	set	messages: + msg251693
2015-09-27 05:00:34	martin.panter	set	nosy: + martin.panter messages: + msg251682
2015-06-07 15:39:00	steven.daprano	set	nosy: + steven.daprano
2015-05-12 04:54:00	ncoghlan	set	title: Add codecs.convert_surrogateescape to "clean" surrogate escaped strings -> Add utilities to "clean" surrogate code points from strings messages: + msg242939 versions: + Python 3.6, - Python 3.5
2015-05-09 13:15:33	ncoghlan	set	messages: + msg242810
2015-05-09 07:53:04	sjt	set	nosy: + sjt messages: + msg242799
2015-03-17 16:14:15	serhiy.storchaka	set	messages: + msg238309
2015-03-17 12:22:21	ncoghlan	set	messages: + msg238286
2015-03-17 12:18:58	ncoghlan	set	messages: + msg238285
2015-03-17 11:30:08	serhiy.storchaka	set	files: + codecs_convert_escapes_2.patch
2015-03-17 11:13:28	serhiy.storchaka	set	messages: + msg238277
2015-03-16 11:00:12	ncoghlan	set	messages: + msg238193
2015-03-16 10:52:49	serhiy.storchaka	set	files: + codecs_convert_escapes.patch keywords: + patch
2015-03-16 07:12:52	serhiy.storchaka	set	messages: + msg238182 components: + Library (Lib) versions: + Python 3.5, - Python 3.4
2015-03-15 22:09:04	serhiy.storchaka	set	dependencies: + Add support of UnicodeTranslateError in standard error handlers
2014-10-05 04:09:50	ncoghlan	link	issue22555 dependencies
2014-09-23 15:46:19	ncoghlan	set	messages: + msg227369
2014-09-23 15:34:11	serhiy.storchaka	set	files: + convert_surrogates.py messages: + msg227368 title: Add a convert_surrogates function to "clean" surrogate escaped strings -> Add codecs.convert_surrogateescape to "clean" surrogate escaped strings
2014-09-23 15:32:36	ncoghlan	set	messages: + msg227367 title: Add codecs.convert_surrogateescape to "clean" surrogate escaped strings -> Add a convert_surrogates function to "clean" surrogate escaped strings
2014-09-23 15:15:19	r.david.murray	set	messages: + msg227365
2014-09-23 15:06:57	r.david.murray	set	messages: + msg227364
2014-09-23 14:50:13	lemburg	set	messages: + msg227362
2014-09-23 11:23:47	pitrou	set	messages: + msg227346
2014-09-23 11:13:12	ncoghlan	set	messages: + msg227345
2014-09-23 11:12:04	ncoghlan	set	messages: + msg227344
2014-09-23 11:04:47	pitrou	set	messages: + msg227343
2014-09-23 11:00:44	ncoghlan	set	messages: + msg227342
2014-09-23 10:57:54	ncoghlan	set	messages: + msg227341
2014-09-23 10:51:12	ncoghlan	set	messages: + msg227340
2014-09-23 10:39:50	lemburg	set	nosy: + lemburg messages: + msg227339
2014-09-23 09:59:43	ncoghlan	set	messages: + msg227334 title: Add tools for "cleaning" surrogate escaped strings -> Add codecs.convert_surrogateescape to "clean" surrogate escaped strings
2014-08-27 11:00:40	ncoghlan	set	dependencies: + Allow backslashreplace error handler to be used on input messages: + msg225972
2014-08-25 18:55:02	pitrou	set	messages: + msg225889
2014-08-25 14:50:42	r.david.murray	set	messages: + msg225875
2014-08-25 06:59:28	ncoghlan	set	messages: + msg225869
2014-08-25 01:16:14	vstinner	set	messages: + msg225858
2014-08-24 14:16:53	ncoghlan	set	messages: + msg225822
2014-08-24 13:49:12	serhiy.storchaka	set	messages: + msg225818
2014-08-24 13:00:36	ncoghlan	set	messages: + msg225817
2014-08-24 11:46:14	ezio.melotti	set	messages: + msg225811
2014-08-24 11:02:05	ncoghlan	set	messages: + msg225809
2014-08-24 09:11:24	serhiy.storchaka	set	messages: + msg225806
2014-08-24 08:24:10	ncoghlan	set	messages: + msg225801
2014-08-24 07:58:21	ezio.melotti	set	messages: + msg225800
2014-08-24 04:44:12	ncoghlan	set	messages: + msg225793
2014-08-24 03:00:11	ncoghlan	set	messages: + msg225791
2013-08-24 00:27:00	ncoghlan	set	messages: + msg196063
2013-08-23 23:28:45	r.david.murray	set	messages: + msg196054
2013-08-23 22:06:39	vstinner	set	nosy: + vstinner messages: + msg196049
2013-08-23 22:02:00	pitrou	set	nosy: + pitrou messages: + msg196047
2013-08-23 20:28:58	ezio.melotti	set	nosy: + ezio.melotti
2013-08-23 11:55:16	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg195960
2013-08-23 11:02:55	r.david.murray	set	nosy: + r.david.murray messages: + msg195959
2013-08-23 04:39:55	Arfrever	set	nosy: + Arfrever
2013-08-23 04:02:32	ncoghlan	create