Message 242810 - Python tracker

Author	ncoghlan
Recipients	Arfrever, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, sjt, vstinner
Date	2015-05-09.13:15:33
Message-id	<CADiSq7fe3jgLxTNAVnSdzKa-_2jaotcZPpQyaD7o4dJEmfWoeQ@mail.gmail.com>
In-reply-to	<1431157984.91.0.636458618195.issue18814@psf.upfronthosting.co.za>

Content
surrogateescape and surrogatepass data *already* can't be inverted back to
bytes reliably without knowing the original encoding - if you encode them
as something else when they contain surrogates, you'll either get an
exception (the default) or mojibake (if you use
surrogateescape/surrogatepass as the output error handler). They only work
as a transparent pass-through if the input and output encodings match.
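
To make that concrete, here is a minimal sketch of the three outcomes using
surrogateescape. The sample bytes and variable names are mine, picked purely
for illustration: one valid UTF-8 character followed by a stray 0xFF byte.

```python
# Illustrative input (an assumption, not data from the issue): a valid
# UTF-8 "é" followed by a byte that can never appear in UTF-8.
raw = "\u00e9".encode("utf-8") + b"\xff"

# Decoding with surrogateescape smuggles the bad byte through as the lone
# surrogate U+DCFF instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Same codec and same handler on output: the original bytes come back.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A different codec with the default (strict) handler: an exception.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A different codec with surrogateescape: no error, but "é" is transcoded
# to Latin-1 while the escaped byte is written back untouched, so the
# output is neither the original bytes nor clean Latin-1 text.
mixed = text.encode("latin-1", errors="surrogateescape")
```

Only the first encode is a transparent pass-through; the last one silently
produces bytes that a Latin-1 reader would display as "éÿ".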

I'd be fine with putting these data scrubbing functions somewhere other
than in codecs, though. I'm not sure unicodedata is the right place, but a
new module like "string.internals" might be, as these functions have more
to do with Python's internal text representation than with anything else.
Such a module could also be a home for things like a chunking utility that
splits a string up into substrings that use as little memory as possible,
for feeding into a StringIO instance before throwing the original away.
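
As a rough idea of what a scrubbing function of this kind could look like,
here is a hypothetical helper. The name scrub_surrogates, its signature, and
the choice of U+FFFD as the default replacement are all assumptions made for
illustration; this is not the API under discussion in the issue.

```python
def scrub_surrogates(text, replacement="\ufffd"):
    """Return text with every lone surrogate code point replaced.

    Hypothetical helper: the name and default replacement are
    illustrative assumptions, not an existing stdlib function.
    """
    return "".join(
        # U+D800..U+DFFF is the surrogate range; in a Python 3 str these
        # only appear as lone code points (e.g. from surrogateescape).
        replacement if "\ud800" <= ch <= "\udfff" else ch
        for ch in text
    )


cleaned = scrub_surrogates("caf\udce9")
# The cleaned string can now be encoded with the strict handler.
encoded = cleaned.encode("utf-8")
```

The scrubbing is lossy by design: once the surrogates are replaced, the
original bytes can no longer be recovered.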

I also don't think these functions are urgent - the introduction of
/etc/locale.conf has made modern Linux far more consistent in getting locale
settings right, and even older platforms tend to get the locale right for
user processes.

History
Date	User	Action	Args
2015-05-09 13:15:33	ncoghlan	set	recipients: + ncoghlan, lemburg, pitrou, vstinner, ezio.melotti, Arfrever, r.david.murray, sjt, serhiy.storchaka
2015-05-09 13:15:33	ncoghlan	link	issue18814 messages
2015-05-09 13:15:33	ncoghlan	create