Message225791
Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:
# Constant in the string module (akin to string.ascii_letters et al)
escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
import re
# Helper to ensure a string contains no escaped surrogates
# This allows it to be safely encoded without surrogateescape
_match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
def clean(s, repl='\ufffd'):
    return _match_surrogates.sub(repl, s)
# Helper to redecode a string that was decoded incorrectly
# For example, WSGI strings are passed from the server to the
# framework as latin-1 by default and may need to be redecoded
def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
    return s.encode(old_encoding, old_errors).decode(encoding, errors)
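As a rough illustration of how the two helpers might be used, here is a minimal sketch. The definitions are repeated so that it runs on its own; the sample filename and the WSGI-style value are hypothetical inputs chosen for the example, not taken from the discussion.

```python
import re

# Definitions repeated from the proposal so this sketch is self-contained.
escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
_match_surrogates = re.compile('[{}]'.format(escaped_surrogates))

def clean(s, repl='\ufffd'):
    return _match_surrogates.sub(repl, s)

def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
    return s.encode(old_encoding, old_errors).decode(encoding, errors)

# Hypothetical input: a filename containing a byte that is not valid UTF-8.
raw_name = b'caf\xe9.txt'
name = raw_name.decode('utf-8', errors='surrogateescape')

# A strict encode of the escaped string fails because of the lone surrogate.
try:
    name.encode('utf-8')
except UnicodeEncodeError:
    strict_encode_failed = True
else:
    strict_encode_failed = False

# After clean(), the surrogate is replaced and a strict encode succeeds.
cleaned = clean(name)
cleaned_bytes = cleaned.encode('utf-8')

# Hypothetical WSGI-style value: UTF-8 bytes that were decoded as latin-1.
wsgi_value = 'caf\xe9'.encode('utf-8').decode('latin-1')
fixed = redecode(wsgi_value, 'utf-8')
```

Note that clean() is lossy by design: the original byte cannot be recovered once it has been replaced, so it is only appropriate when the text is headed somewhere that cannot accept escaped surrogates.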
In addition to the concrete use cases David describes, I think these will also serve a useful documentation purpose, in highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).
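To make the two smuggling mechanisms concrete, the following sketch round-trips every possible byte value through a str using each of them. It uses only built-in codecs; the helper name encodes_strictly is hypothetical and exists purely for this example.

```python
all_bytes = bytes(range(256))

# Mechanism 1: surrogate escapes. Bytes that cannot be decoded as ASCII
# become lone surrogates in the range U+DC80..U+DCFF.
via_surrogates = all_bytes.decode('ascii', errors='surrogateescape')
surrogate_round_trip = via_surrogates.encode('ascii', errors='surrogateescape')

# Mechanism 2: latin-1 decoding. Each byte maps to the code point with
# the same numeric value, so decoding never fails.
via_latin1 = all_bytes.decode('latin-1')
latin1_round_trip = via_latin1.encode('latin-1')

def encodes_strictly(s, encoding='utf-8'):
    """Hypothetical helper: report whether a strict encode succeeds."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Escaped surrogates are rejected by a strict encode, so the smuggled
# bytes are noticed. Latin-1 smuggled data encodes without complaint,
# but the resulting bytes differ from the original ones.
surrogates_encode_ok = encodes_strictly(via_surrogates)
latin1_encode_ok = encodes_strictly(via_latin1)
latin1_bytes_changed = via_latin1.encode('utf-8') != all_bytes
```

Both round trips are lossless, but only the surrogate form fails loudly under a strict encode; latin-1 smuggled data passes silently, which is one reason a helper that redecodes it explicitly can be useful.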
Date | User | Action | Args
2014-08-24 03:00:11 | ncoghlan | set | recipients: + ncoghlan, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-08-24 03:00:11 | ncoghlan | set | messageid: <1408849211.6.0.994491412192.issue18814@psf.upfronthosting.co.za>
2014-08-24 03:00:11 | ncoghlan | link | issue18814 messages
2014-08-24 03:00:10 | ncoghlan | create |