Message 225972 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	ncoghlan
Recipients	Arfrever, ezio.melotti, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date	2014-08-27.11:00:40
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1409137240.61.0.949352197472.issue18814@psf.upfronthosting.co.za>
In-reply-to

Content
Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case. Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286 I also agree we can ignore xmlcharrefreplace here. So that leaves the basic pattern as: data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace') This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that. Possible function name and implementation: def convert_surrogateescape(data, errors='replace'): return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors) Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here) Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.

Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case.

Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286

I also agree we can ignore xmlcharrefreplace here.

So that leaves the basic pattern as:

data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')

This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that.

Possible function name and implementation:

    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    
Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here)

Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.

History
Date	User	Action	Args
2014-08-27 11:00:40	ncoghlan	set	recipients: + ncoghlan, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-08-27 11:00:40	ncoghlan	set	messageid: <1409137240.61.0.949352197472.issue18814@psf.upfronthosting.co.za>
2014-08-27 11:00:40	ncoghlan	link	issue18814 messages
2014-08-27 11:00:40	ncoghlan	create