
Author ncoghlan
Recipients Arfrever, ezio.melotti, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date 2014-08-24.14:16:52
Message-id <1408889813.23.0.681022056196.issue18814@psf.upfronthosting.co.za>
In-reply-to
Content
My main use case is passing data to other applications that *don't* have their Unicode handling in order. I want to be able to use Python to do the data scrubbing, but at the moment that requires intimate knowledge of the codec error handling system. (I had never even heard of surrogatepass until this evening.)

Situation:

What I have: data decoded with surrogateescape
What I want: that same data with all the surrogates gone, replaced with either the Unicode replacement character or an ASCII question mark (which one I want depends on the exact situation)

Assume I am largely clueless about the codec system. I know nothing beyond the fact that Python 3 strings may have smuggled bytes in them and I want to get rid of them because they confuse the application I'm passing them to.
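
To make the situation concrete, here is a minimal sketch of how such a string comes about and why it confuses downstream consumers. The byte string and the helper name has_smuggled_bytes are made up for illustration; the only thing assumed is the documented behaviour of the "surrogateescape" error handler, which maps each undecodable byte to a code point in the range U+DC80..U+DCFF.

```python
# Hypothetical input: Latin-1 encoded bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding with surrogateescape never fails; the bad byte 0xE9 is
# smuggled into the string as the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")


def has_smuggled_bytes(text):
    """Return True if text contains any surrogateescape'd byte."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


# A strict UTF-8 encode (what most other applications effectively do)
# rejects the string outright.
try:
    name.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", "surrogateescape")
```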

The concrete example that got me thinking about this again was the task of writing filenames into a UTF-8 encoded email, and wanting to scrub the output from os.listdir before writing the list into the email (s/email/web page/ also works).

For issue #22016 I actually suggested doing this as *another* codec error handler ("surrogatereplace"), but Stephen Turnbull convinced me this original idea was better: it should just be a pure data transformation pass on the string, clearing the surrogates out, and leaving me with data that is identical to what I would have had if "surrogatereplace" had been used instead of "surrogateescape" in the first place.
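
As a rough sketch of what such a pure transformation pass could look like, the following replaces every code point that surrogateescape can produce (U+DC80..U+DCFF) without going through a codec at all. The name scrub_surrogates and its signature are hypothetical, not an existing or proposed API, and the result may not match a real decode-time handler in every case (for example when several escaped bytes originally belonged to one truncated sequence).

```python
import re

# Matches exactly the code points that the surrogateescape handler
# uses for undecodable bytes 0x80..0xFF.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def scrub_surrogates(text, replacement="\ufffd"):
    """Return text with each escaped byte replaced by `replacement`.

    Hypothetical helper: a plain str -> str pass, no codec involved.
    """
    # A function is used as the replacement so that the replacement
    # string is inserted literally, never parsed as a regex template.
    return _ESCAPED_BYTE.sub(lambda match: replacement, text)
```

Calling scrub_surrogates(name, "?") would give the ASCII question mark variant, while the default gives the Unicode replacement character.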

As "errors='replace'" already covers the "ASCII ?" replacement case, that means your proposed "redecode" based solution would cover the rest of my use case.
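
For anyone following along, both variants can be approximated today with the existing error handlers. This is only a sketch under assumptions: the function names are hypothetical, UTF-8 is assumed as the codec, and the second function is one reading of the "redecode" idea (round-trip through bytes), not the proposed API itself.

```python
def scrub_to_question_mark(text):
    """Replace escaped bytes with an ASCII '?'.

    Encoding with errors='replace' turns anything UTF-8 cannot encode
    (here: lone surrogates) into b'?'; all other characters survive.
    """
    return text.encode("utf-8", "replace").decode("utf-8")


def scrub_to_replacement_char(text):
    """Replace escaped bytes with U+FFFD.

    Step 1 restores the original bytes, step 2 decodes them again as
    if errors='replace' had been used in the first place. Note that
    step 1 raises UnicodeEncodeError for surrogates outside
    U+DC80..U+DCFF, since those were not produced by surrogateescape.
    """
    original_bytes = text.encode("utf-8", "surrogateescape")
    return original_bytes.decode("utf-8", "replace")
```

The second function is the case that currently requires knowing how the two error handlers interact, which is what makes a dedicated helper attractive.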
History
Date User Action Args
2014-08-24 14:16:53 ncoghlan set recipients: + ncoghlan, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, serhiy.storchaka
2014-08-24 14:16:53 ncoghlan set messageid: <1408889813.23.0.681022056196.issue18814@psf.upfronthosting.co.za>
2014-08-24 14:16:53 ncoghlan link issue18814 messages
2014-08-24 14:16:52 ncoghlan create