Message 225791 - Python tracker


Author	ncoghlan
Recipients	Arfrever, ezio.melotti, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date	2014-08-24.03:00:10
Message-id	<1408849211.6.0.994491412192.issue18814@psf.upfronthosting.co.za>

Content
Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:

    import re

    # Constant in the string module (akin to string.ascii_letters et al)
    escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')

    # Helper to ensure a string contains no escaped surrogates
    # This allows it to be safely encoded without surrogateescape
    _match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
    def clean(s, repl='\ufffd'):
        return _match_surrogates.sub(repl, s)

    # Helper to redecode a string that was decoded incorrectly
    # For example, WSGI strings are passed from the server to the
    # framework as latin-1 by default and may need to be redecoded
    def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
        return s.encode(old_encoding, old_errors).decode(encoding, errors)

In addition to the concrete use cases David describes, I think these will also serve a useful documentation purpose, in highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).
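
As a rough illustration of those two mechanisms, here is a minimal sketch that repeats the proposed helpers so it runs on its own. The sample byte strings and the variable names (raw, smuggled, cleaned, wsgi_value, fixed) are made up for the example and are not part of the proposal.

```python
import re

# The proposed helpers, repeated here so the sketch is self-contained.
escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
_match_surrogates = re.compile('[{}]'.format(escaped_surrogates))

def clean(s, repl='\ufffd'):
    return _match_surrogates.sub(repl, s)

def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
    return s.encode(old_encoding, old_errors).decode(encoding, errors)

# Mechanism 1: surrogate escapes.
# Hypothetical input: latin-1 bytes that are not valid UTF-8.
raw = b'caf\xe9'
smuggled = raw.decode('utf-8', errors='surrogateescape')

# The undecodable byte 0xE9 is carried along as the lone surrogate U+DCE9,
# so the original bytes can be recovered with the same error handler.
round_tripped = smuggled.encode('utf-8', errors='surrogateescape')

# A strict encode of `smuggled` would raise UnicodeEncodeError;
# clean() swaps the escaped byte for U+FFFD so strict encoding works.
cleaned = clean(smuggled)
encoded = cleaned.encode('utf-8')

# Mechanism 2: latin-1 decoding.
# Hypothetical WSGI-style value: UTF-8 bytes that were decoded as latin-1.
wsgi_value = 'caf\u00e9'.encode('utf-8').decode('latin-1')
fixed = redecode(wsgi_value, 'utf-8')
```

Note the difference between the two helpers: clean() is lossy, since the original byte is replaced, while redecode() keeps all the data because latin-1 maps every byte to exactly one code point.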
