Message194625
Escaping strings for serialization or display is a common problem. Currently, in Python 3, in order to escape a string, you need to do this:
'my\tstring'.encode('unicode_escape').decode('ascii')
This would give you a string that was represented like this:
'my\\tstring'
But this is not a suitable representation when the string contains non-ASCII characters. Consider this example:
s = 'Α\tΩ'
There is no way to write this string with only the control character escaped.
Even Python itself recognizes this as a problem and has implemented a "solution" for it in repr():
>>> s = 'Α\tΩ'
>>> print(s)
Α Ω
>>> print(repr(s))
'Α\tΩ'
>>> print(s.encode('unicode_escape').decode('ascii'))
\u0391\t\u03a9
What I want is public exposure of the functionality to represent control characters with their common \ escape sequences (or \x## for control characters where necessary - for instance unit and record separators).
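As a rough illustration of the behavior described above, here is a minimal sketch of such an escaping function. The name control_escape comes from this proposal and is not an existing Python API. The exact escape set is an assumption: backslash, tab, newline and carriage return get short forms, and every other character in Unicode category Cc becomes \x##. Escaping the backslash itself is also an assumption, made so that the output can be decoded unambiguously.

```python
import unicodedata

# Assumed short escapes; the proposal does not fix the exact set.
_SHORT_ESCAPES = {'\\': '\\\\', '\t': '\\t', '\n': '\\n', '\r': '\\r'}

def control_escape(text):
    """Escape control characters only, leaving other characters untouched.

    Hypothetical helper, not part of the standard library.
    """
    pieces = []
    for ch in text:
        if ch in _SHORT_ESCAPES:
            pieces.append(_SHORT_ESCAPES[ch])
        elif unicodedata.category(ch) == 'Cc':
            # Remaining C0/C1 controls and DEL, e.g. the unit separator
            # U+001F, all fit in two hex digits.
            pieces.append('\\x%02x' % ord(ch))
        else:
            pieces.append(ch)
    return ''.join(pieces)
```

With this sketch, control_escape('Α\tΩ') leaves Α and Ω as they are and replaces only the tab with a backslash followed by t.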
I have numerous use cases for this, and Python's own str.__repr__ implementation shows that the functionality is valuable. I would bet that in the majority of cases where people use unicode_escape, something like a control_escape is closer to what they actually want.
And while we're at it, it would be great if this were a unicode->unicode codec like the rot_13 codec. My desired solution would look like this:
>>> import codecs
>>> s = 'Α\tΩ'
>>> e = codecs.encode(s, 'control_escape')
>>> print(e)
Α\tΩ
>>> print(codecs.decode(e, 'control_escape'))
Α Ω
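Until something like this exists in the standard library, the session above can be approximated by registering a custom str-to-str codec through codecs.register. The following is only a sketch under stated assumptions: the escape set (backslash, \t, \n, \r, and \x## for the other C0/C1 controls and DEL) and the private helper names are made up for illustration, and the errors argument is accepted but ignored.

```python
import codecs
import re

# Assumed escape set: every C0/C1 control and DEL becomes \xNN, then the
# common ones (and the backslash itself) are overridden with short forms.
_TABLE = {cp: '\\x%02x' % cp for cp in [*range(0x20), *range(0x7f, 0xa0)]}
_TABLE.update({ord('\\'): '\\\\', ord('\t'): '\\t',
               ord('\n'): '\\n', ord('\r'): '\\r'})

_SHORT = {'\\': '\\', 't': '\t', 'n': '\n', 'r': '\r'}
_ESCAPE_RE = re.compile(r'\\(?:x([0-9a-fA-F]{2})|([\\tnr]))')

def _unescape(match):
    hex_digits, short = match.groups()
    return chr(int(hex_digits, 16)) if hex_digits else _SHORT[short]

def _encode(text, errors='strict'):
    # Codec functions return (output, number of input items consumed).
    return text.translate(_TABLE), len(text)

def _decode(text, errors='strict'):
    # Unrecognized backslash sequences are left as they are.
    return _ESCAPE_RE.sub(_unescape, text), len(text)

def _search(name):
    # Hypothetical codec name taken from the proposal.
    if name == 'control_escape':
        return codecs.CodecInfo(name='control_escape',
                                encode=_encode, decode=_decode)
    return None

codecs.register(_search)
```

Once the search function is registered, codecs.encode(s, 'control_escape') and codecs.decode(e, 'control_escape') round-trip the example string in the same way as the desired session.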
If this is something that could be included in Python 3.4, that would be awesome. I am willing to work on this if so.
Date | User | Action | Args
2013-08-07 21:22:45 | underrun | set | recipients: + underrun
2013-08-07 21:22:45 | underrun | set | messageid: <1375910565.85.0.834240880509.issue18679@psf.upfronthosting.co.za>
2013-08-07 21:22:45 | underrun | link | issue18679 messages
2013-08-07 21:22:45 | underrun | create |