Author ncoghlan
Recipients Arfrever, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, sjt, vstinner
Date 2015-05-09.13:15:33
Data decoded with surrogateescape or surrogatepass *already* can't be
converted back to bytes reliably without knowing the original encoding: if
you encode it with a different codec while it contains surrogates, you'll
either get an exception (the default) or mojibake (if you use
surrogateescape/surrogatepass as the output error handler). The handlers
only work as a transparent pass-through if the input and output encodings
match.
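
To make the three outcomes above concrete, here is a minimal sketch. The sample string and the choice of ASCII, UTF-8 and Latin-1 are my own assumptions for illustration, not something taken from the patch under discussion:

```python
# Bytes that are really UTF-8, but get decoded with the wrong codec (ASCII).
raw = "café".encode("utf-8")                      # b'caf\xc3\xa9'
text = raw.decode("ascii", errors="surrogateescape")

# Matching codec on output: the escaped bytes pass through unchanged.
same_codec = text.encode("ascii", errors="surrogateescape")

# Different codec, default (strict) handler: the lone surrogates are rejected.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Different codec with surrogateescape: the original UTF-8 bytes are emitted
# into what is nominally Latin-1 output, so a Latin-1 reader sees mojibake.
other_codec = text.encode("latin-1", errors="surrogateescape")
as_seen_by_reader = other_codec.decode("latin-1")
```

Only the first encode is a faithful round trip in the sense that the declared encoding of the output still describes the bytes. The last one silently produces output whose bytes don't match its label.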

I'd be fine with putting these data scrubbing functions somewhere other
than in codecs, though. I'm not sure unicodedata is the right place, but a
new module like "string.internals" might be, as these functions have more
to do with Python's internal text representation than with anything else.
Such a module could also be a home for things like a chunking utility that
splits a string up into substrings that use as little memory as possible
for feeding into a StringIO instance before throwing the original away.

I also don't think they're urgent - the introduction of /etc/locale.conf
makes modern Linux far more consistent in getting locale settings right,
and even older platforms tend to get the locale right for user processes.
Date                 User      Action  Args
2015-05-09 13:15:33  ncoghlan  set     recipients: + ncoghlan, lemburg, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, sjt, serhiy.storchaka
2015-05-09 13:15:33  ncoghlan  link    issue18814 messages
2015-05-09 13:15:33  ncoghlan  create