classification
Title: include a codec to handle escaping only control characters but not any others
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: martin.panter, r.david.murray, serhiy.storchaka, underrun
Priority: normal Keywords:

Created on 2013-08-07 21:22 by underrun, last changed 2014-12-13 01:42 by martin.panter.

Messages (9)
msg194625 - Author: Derek Wilson (underrun) Date: 2013-08-07 21:22
Escaping strings for serialization or display is a common problem. Currently, in Python 3, in order to escape a string, you need to do this:

'my\tstring'.encode('unicode_escape').decode('ascii')

This would give you a string that was represented like this:

'my\\tstring'

But this does not present a suitable representation when the string contains unicode characters. Consider this example:

s = 'Α\tΩ'

There is no method to write this string with only the control character escaped.

Even python itself recognizes this as a problem and implemented a "solution" for it.

>>> s = 'Α\tΩ'
>>> print(s)
Α	Ω
>>> print(repr(s))
'Α\tΩ'
>>> print(s.encode('unicode_escape').decode('ascii'))
\u0391\t\u03a9

What I want is public exposure of the functionality to represent control characters with their common \ escape sequences (or \x## for control characters where necessary - for instance unit and record separators).

I have numerous use cases for this, and python's own str.__repr__ implementation shows that the functionality is valuable. I would bet that in the majority of cases where people use unicode_escape, something like a control_escape is closer to what is actually desired.

And while we're at it, it would be great if this were a unicode->unicode codec like the rot_13 codec. My desired solution would look like this:

>>> import codecs
>>> s = 'Α\tΩ'
>>> e = codecs.encode(s, 'control_escape')
>>> print(e)
Α\tΩ
>>> print(codecs.decode(e, 'control_escape'))
Α	Ω
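
For illustration only, here is a minimal sketch of how a str-to-str codec of this kind could be registered with the existing codecs machinery. The codec name control_escape comes from the request above; the table, the helper names and the choice of which controls get short escapes are assumptions, not anything that exists in the stdlib.

```python
import codecs
import re

# Hypothetical escape table: C0 controls (0-31) plus the backslash itself.
_ENCODE_TABLE = {k: "\\x{:02x}".format(k) for k in range(32)}
_ENCODE_TABLE.update({9: "\\t", 10: "\\n", 13: "\\r", ord("\\"): "\\\\"})

# Reverse mapping for the short escapes; \xNN is handled separately.
_SHORT = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}
_ESCAPE = re.compile(r"\\(x[0-9a-fA-F]{2}|.)", re.DOTALL)


def _unescape(match):
    body = match.group(1)
    if len(body) == 3:  # matched the xNN alternative
        return chr(int(body[1:], 16))
    # Unknown escapes are left untouched rather than rejected.
    return _SHORT.get(body, match.group(0))


def control_escape_encode(text, errors="strict"):
    # Codec functions return (result, number of input items consumed).
    return text.translate(_ENCODE_TABLE), len(text)


def control_escape_decode(text, errors="strict"):
    return _ESCAPE.sub(_unescape, text), len(text)


def _search(name):
    # Search function handed to codecs.register(); None means "not mine".
    if name == "control_escape":
        return codecs.CodecInfo(
            name="control_escape",
            encode=control_escape_encode,
            decode=control_escape_decode,
        )
    return None


codecs.register(_search)

s = "Α\tΩ"
e = codecs.encode(s, "control_escape")   # non-ASCII text stays as it is
d = codecs.decode(e, "control_escape")   # round-trips back to s
```

Because both directions map str to str, this only works through codecs.encode() and codecs.decode(), the same way rot_13 does, and not through str.encode().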

If this is something that could be included in python 3.4, that would be awesome. I am willing to work on this if so.
msg194632 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-07 22:18
In what way does repr(x)[1:-1] not serve your use case?
msg194685 - Author: Derek Wilson (underrun) Date: 2013-08-08 15:23
using repr(x)[1:-1] is not safe for my use case as i need this for encoding and decoding data. the "deserialization" of repr would be eval, and aside from the security issues with that, if I strip the quotes off I can't reliably eval the result and get back the original. On top of that, quote escape handling makes this non-portable to other languages/tools that do understand control character escapes. Consider:

>>> s = """Α""\t'''Ω"""
>>> print(s)
Α""	'''Ω
>>> e = repr(s)[1:-1]
>>> print(e)
Α""\t\'\'\'Ω

how do i know what to quote e with before I eval it to get back the value? I can't even try all the quoting options and stop when i don't get a syntax error because more than one could work and give me a bad result:

>>> d = eval('"{}"'.format(e))
>>> d == s
False
>>> print(d)
Α	'''Ω

Aside from python not being able to handle the repr(x)[1:-1] case itself, the goal is to use output generated in common tools from cut to hadoop where tab is a field separator (aside: wouldn't adoption of ascii 0x1f as a common unit separator be great). Sometimes it is useful to separate newlines in data from a literal new line in formats (again like hadoop or unix utilities) that treat lines as records (and here again ascii 0x1e would have been a nice solution).

But we have to work with what we've got and there are many tools that care about tab separated fields and per line records. In these cases, the right tool for the interoperability job is a codec that simply backslash escapes control characters and nothing else.
msg194689 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-08 15:45
ast.literal_eval("'%s'" % e)
e.encode().decode('unicode-escape').encode('latin1').decode()
e.encode('latin1', 'backslashescape').decode('unicode-escape')
msg194690 - Author: Derek Wilson (underrun) Date: 2013-08-08 16:23
> ast.literal_eval("'%s'" % e)

this doesn't work if you use the wrong quote. without introspecting the data in e you can't reliably choose whether to use "'%s'" '"%s"' '"""%s"""' or "'''%s'''". which ones break (and break silently) depends on the data.


> e.encode().decode('unicode-escape').encode('latin1').decode()

so ... encode the repr()[1:-1] string in utf-8 bytes, decode backslash escape sequences and individual bytes as if they are latin1, encode as latin1 (which is just byte for byte serialization), then decode the byte representation as if it is utf-8 encoded to recombine the characters that were broken with the 'unicode-escape' decode earlier? 

this may work for my example, but this looks and feels very hacky for something that should be simple and straightforward. and again tools other than python will run into escaped quotes in the data which may cause problems.
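
For reference, the four steps described above can be traced one at a time. This is only a sketch of the suggested chain applied to the example string from this thread; the variable names are made up for the illustration.

```python
# Escaped text as produced earlier in the thread: Α, backslash, t, Ω.
e = "Α\\tΩ"

# 1. Encode to UTF-8: each Greek letter becomes two bytes.
step1 = e.encode()

# 2. unicode-escape decoding resolves \t and reads every other byte as Latin-1,
#    so the two-byte sequences are temporarily split into separate characters.
step2 = step1.decode("unicode-escape")

# 3. Latin-1 maps code points 0-255 to single bytes, restoring the UTF-8 bytes.
step3 = step2.encode("latin1")

# 4. Decoding as UTF-8 recombines the pairs into the original characters.
step4 = step3.decode()
```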

> e.encode('latin1', 'backslashescape').decode('unicode-escape')

when i execute this i get a traceback

LookupError: unknown error handler name 'backslashescape'
msg194700 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-08 17:31
> this doesn't work if you use the wrong quote. without introspecting the data in e you can't reliably choose whether to use "'%s'" '"%s"' '"""%s"""' or "'''%s'''".

Indeed.

> and again tools other than python will run into escaped quotes in the data which may cause problems.

Then use s.translate() or re.sub() for encoding.
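
As a rough illustration of the re.sub() route, the following sketch escapes only the C0 control characters and the backslash, leaving quotes and non-ASCII text alone. The function name and the set of short escapes are assumptions chosen for the example.

```python
import re

# Characters to escape: C0 controls (0x00-0x1f) and the backslash.
_CONTROL = re.compile(r"[\x00-\x1f\\]")

# Short escapes for the common cases; everything else becomes \xNN.
_NAMED = {"\t": "\\t", "\n": "\\n", "\r": "\\r", "\\": "\\\\"}


def escape_controls(text):
    def repl(match):
        ch = match.group(0)
        return _NAMED.get(ch) or "\\x{:02x}".format(ord(ch))

    return _CONTROL.sub(repl, text)


# Quotes pass through unchanged, unlike with repr(s)[1:-1].
escaped = escape_controls("Α\"\"\t'''Ω")
```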

> when i execute this i get a traceback

Sorry, it should be

e.encode('latin1', 'backslashreplace').decode('unicode-escape').
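
To see why the corrected expression works, the intermediate bytes can be inspected. This is just a sketch using the example string from this thread.

```python
# Escaped text: Α, backslash, t, Ω.
e = "Α\\tΩ"

# Characters outside Latin-1 cannot be encoded, so backslashreplace turns
# them into ASCII \uXXXX escapes; the existing \t escape is left alone.
as_bytes = e.encode("latin1", "backslashreplace")

# unicode-escape then resolves every escape in a single pass.
decoded = as_bytes.decode("unicode-escape")
```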
msg194708 - Author: Derek Wilson (underrun) Date: 2013-08-08 20:47
> e.encode('latin1', 'backslashreplace').decode('unicode-escape')

this works, but still the quotes are backslash escaped. 

translate will do what i need for my use case, but it doesn't support streaming for larger chunks of data.

it is nice that there is a workaround but i do still think this is a valuable enough feature that there should be a builtin codec for it.
msg198965 - Author: Derek Wilson (underrun) Date: 2013-10-04 20:54
Any update on this? Just so you can see what my workaround is, I'll paste in the code I'm using. The major issue I have with this is that performance doesn't scale to large strings.

This is also a bytes-to-bytes or str-to-str encoding, because this is the type of operation that one plans to do with the data one has.

Having a full fledged streaming codec to handle this would be very helpful when writing applications that stream tab and newline separated utf-8 data over stdin/stdout.
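
text_types = (str, )

# Start from the escape that repr() produces, e.g. \t, \n or \x1f.
escape_tm = dict((k, repr(chr(k))[1:-1]) for k in range(32))
# Short forms must be two-character strings (a backslash plus a letter).
# NUL keeps \x00: a \0 followed by a digit would decode as a longer octal escape.
escape_tm[7] = '\\a'
escape_tm[8] = '\\b'
escape_tm[11] = '\\v'
escape_tm[12] = '\\f'
escape_tm[ord('\\')] = '\\\\'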

def escape_control(s):
    if isinstance(s, text_types):
        return s.translate(escape_tm)
    else:
        return s.decode('utf-8', 'surrogateescape').translate(escape_tm).encode('utf-8', 'surrogateescape')

def unescape_control(s):
    if isinstance(s, text_types):
        return s.encode('latin1', 'backslashreplace').decode('unicode_escape')
    else:
        return s.decode('utf-8', 'surrogateescape').encode('latin1', 'backslashreplace').decode('unicode_escape').encode('utf-8', 'surrogateescape')
msg199210 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-08 15:38
Well, you could write a streaming codec.  Even if it didn't get accepted for the stdlib, you could put it up on pypi.
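
As a starting point, the decoding half of such a streaming codec might look roughly like the sketch below. It subclasses codecs.IncrementalDecoder and holds back an escape sequence that is split across chunk boundaries. The class name and the supported escapes (\t, \n, \r, \\ and \xNN) are assumptions, and the input is assumed to be well formed. The encoding half needs no state, since each character is escaped independently.

```python
import codecs
import re

_SHORT = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}

# One token: a run of ordinary text, a complete \xNN escape,
# or a complete two-character escape.
_TOKEN = re.compile(r"[^\\]+|\\x[0-9a-fA-F]{2}|\\[^x]")


class ControlEscapeIncrementalDecoder(codecs.IncrementalDecoder):
    """Hypothetical str-to-str incremental decoder for control escapes."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self._pending = ""  # incomplete escape carried over from the last chunk

    def decode(self, input, final=False):
        data = self._pending + input
        out = []
        pos = 0
        while pos < len(data):
            match = _TOKEN.match(data, pos)
            if match is None:
                break  # escape is cut off here; wait for the next chunk
            token = match.group(0)
            if token[0] != "\\":
                out.append(token)
            elif token[1] == "x":
                out.append(chr(int(token[2:], 16)))
            else:
                # Unknown two-character escapes are passed through unchanged.
                out.append(_SHORT.get(token[1], token))
            pos = match.end()
        self._pending = data[pos:]
        if final and self._pending:
            raise ValueError("truncated escape: %r" % self._pending)
        return "".join(out)

    def reset(self):
        self._pending = ""


# Escapes split at arbitrary chunk boundaries still decode correctly.
decoder = ControlEscapeIncrementalDecoder()
chunks = ["Α\\", "tΩ\\x", "1", "f\\\\"]
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode("", final=True))
result = "".join(parts)
```

At most three characters are ever held back, so memory use stays constant regardless of the size of the stream.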
History
Date                 User              Action  Args
2014-12-13 01:42:12  martin.panter     set     nosy: + martin.panter
2013-10-08 15:38:17  r.david.murray    set     messages: + msg199210
2013-10-04 20:54:34  underrun          set     messages: + msg198965
2013-08-08 20:47:28  underrun          set     messages: + msg194708
2013-08-08 17:31:07  serhiy.storchaka  set     messages: + msg194700
2013-08-08 16:23:03  underrun          set     messages: + msg194690
2013-08-08 15:45:07  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg194689
2013-08-08 15:23:09  underrun          set     messages: + msg194685
2013-08-07 22:18:32  r.david.murray    set     nosy: + r.david.murray; messages: + msg194632
2013-08-07 21:22:45  underrun          create