Message225972
Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing: fsencode uses the filesystem encoding, which need not be UTF-8. It would work for the default C locale (since that's ASCII, a subset of UTF-8), but not in the general case.
Enhancing backslashreplace to also work on input is an interesting idea, but worth making its own RFE: http://bugs.python.org/issue22286
I also agree we can ignore xmlcharrefreplace here.
So that leaves the basic pattern as:
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore')
data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')
This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that.
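To make the three variants concrete, here is a minimal sketch of what each one produces for a string carrying one escaped byte. The sample input and the variable names are illustrative only, and the backslashreplace line assumes an interpreter where that handler works on decoding (Python 3.5 and later, as far as I know):

```python
# Simulate undecodable input: 0xFF is never valid in UTF-8, so decoding with
# surrogateescape carries it through as the lone surrogate U+DCFF.
raw = b"caf\xff"
data = raw.decode("utf-8", "surrogateescape")

# Encoding with surrogateescape restores the original bytes...
restored = data.encode("utf-8", "surrogateescape")

# ...and the second decode decides what happens to the bad byte.
replaced = restored.decode("utf-8", "replace")           # U+FFFD, not '?'
ignored = restored.decode("utf-8", "ignore")             # bad byte dropped
escaped = restored.decode("utf-8", "backslashreplace")   # assumes decode support

print(ascii(data), ascii(replaced), ascii(ignored), ascii(escaped))
```

Note that 'replace' on decoding yields U+FFFD rather than an ASCII question mark, which is the limitation mentioned above.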
Possible function name and implementation:
def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
Added bonus: pass errors='strict' and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here.)
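A rough sketch of how the 'strict' behaviour could double as a detection check. The helper name contains_surrogateescape is hypothetical (it is not part of the proposal), and the sample strings are made up for illustration:

```python
def convert_surrogateescape(data, errors="replace"):
    # Same body as the proposed function above.
    return data.encode("utf-8", "surrogateescape").decode("utf-8", errors)


def contains_surrogateescape(data):
    # Hypothetical helper: relies on 'strict' raising UnicodeDecodeError
    # whenever an escaped byte comes back out of the encode step.
    try:
        convert_surrogateescape(data, errors="strict")
    except UnicodeDecodeError:
        return True
    return False


clean = "caf\u00e9"                                    # ordinary text
dirty = b"caf\xff".decode("utf-8", "surrogateescape")  # carries U+DCFF

print(contains_surrogateescape(clean), contains_surrogateescape(dirty))
```

One caveat with this sketch: a lone surrogate outside the U+DC80..U+DCFF range (for example '\ud800') is rejected by the encode step itself with a UnicodeEncodeError, so callers that may see such strings would need to handle that case separately.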
Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.
Date | User | Action | Args
2014-08-27 11:00:40 | ncoghlan | set | recipients: + ncoghlan, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-08-27 11:00:40 | ncoghlan | set | messageid: <1409137240.61.0.949352197472.issue18814@psf.upfronthosting.co.za>
2014-08-27 11:00:40 | ncoghlan | link | issue18814 messages
2014-08-27 11:00:40 | ncoghlan | create |