Message 225869 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author	ncoghlan
Recipients	Arfrever, ezio.melotti, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date	2014-08-25.06:59:28
Message-id	<1408949968.78.0.485567815592.issue18814@psf.upfronthosting.co.za>
In-reply-to

Content

Ideally we'd have string modification support for all the translations we offer as codec error handlers:

* Unicode replacement character ('replace' on input)
* ASCII question mark ('replace' on output)
* Dropping them entirely ('ignore')
* XML character reference ('xmlcharrefreplace')
* Python escape sequence ('backslashreplace')

The reason it's beneficial to be able to do these as string transformations rather than only in the codecs is that you may just be contributing part of the output, with the actual encoding operation handled elsewhere (e.g. you may be storing it in a data structure that will later be encoded as JSON or XML, or my earlier example of generating a list of files to be included in an email). Surrogates are great when you're just passing data straight back to the operating system. They're not so great when you're passing them on to other parts of the application as text. I'd prefer to be able to deal with them closer to the point of origin, at least in some cases.

Now, some of these things *can* be done today using Serhiy's trick of encoding to UTF-8 and then decoding again:

    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')
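As a rough illustration of those three round trips, here is a minimal sketch. The sample file name and the variable names are made up for the example; only the `encode`/`decode` calls come from the lines above.

```python
# A byte string that is not valid UTF-8, decoded the way the OS interfaces
# do it: the bad byte 0xE9 becomes the lone surrogate U+DCE9.
data = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')

# 'replace' on input: escaped bytes end up as U+FFFD replacement characters.
as_replacement_char = data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

# 'replace' on output: each escaped byte becomes an ASCII question mark.
as_question_mark = data.encode('utf-8', 'replace').decode('utf-8')

# 'ignore': escaped bytes are dropped entirely.
dropped = data.encode('utf-8', 'ignore').decode('utf-8')

print(ascii(as_replacement_char))
print(as_question_mark)  # caf?.txt
print(dropped)           # caf.txt
```

In all three cases the result no longer contains surrogates, so it can be handed to other parts of the application as ordinary text.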

However, these two don't work properly:

    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')

Those don't work because they'll encode the *surrogate code points used for the escaped bytes*, rather than the original byte values.
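To make the problem concrete, here is a small sketch using the same kind of made-up sample data (the file name and variable names are assumptions for illustration):

```python
# The undecodable byte 0xE9 is escaped as the lone surrogate U+DCE9.
data = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')

xml_style = data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
backslash_style = data.encode('utf-8', 'backslashreplace').decode('utf-8')

# Both results describe the surrogate U+DCE9 (decimal 56553),
# not the original byte 0xE9 (decimal 233).
print(xml_style)        # caf&#56553;.txt
print(backslash_style)  # caf\udce9.txt
```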

Mapping the escaped bytes to percent encoding has the same problem: you likely want to do a two-step transformation (escaped surrogate -> original byte -> percent-encoded value), rather than directly percent-encoding the already escaped bytes.
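A minimal sketch of that two-step transformation might look like the following. `percent_encode_escaped` is a hypothetical helper written for this example, not an existing API, and it assumes the surrogates came from the 'surrogateescape' error handler (which uses U+DC80 to U+DCFF). The `urllib.parse.quote` call is only there to show what directly encoding the surrogate produces.

```python
from urllib.parse import quote


def percent_encode_escaped(text):
    """Hypothetical helper: percent-encode only surrogate-escaped bytes."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if 0xDC80 <= code_point <= 0xDCFF:
            # Step 1: escaped surrogate -> original byte value.
            original_byte = code_point - 0xDC00
            # Step 2: original byte -> percent-encoded value.
            parts.append('%%%02X' % original_byte)
        else:
            parts.append(ch)
    return ''.join(parts)


data = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')

two_step = percent_encode_escaped(data)
# Percent-encoding the UTF-8 form of the surrogate itself loses the original byte.
direct = quote(data, errors='surrogatepass')

print(two_step)  # caf%E9.txt
print(direct)    # caf%ED%B3%A9.txt
```

Note that the helper leaves every other character alone, including any literal `%` already in the text, so a real implementation would need to decide how to handle those.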

History
Date	User	Action	Args
2014-08-25 06:59:28	ncoghlan	set	recipients: + ncoghlan, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-08-25 06:59:28	ncoghlan	set	messageid: <1408949968.78.0.485567815592.issue18814@psf.upfronthosting.co.za>
2014-08-25 06:59:28	ncoghlan	link	issue18814 messages
2014-08-25 06:59:28	ncoghlan	create