
Author ncoghlan
Recipients Arfrever, ezio.melotti, lemburg, martin.panter, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, sjt, steven.daprano, vstinner
Date 2015-09-27.09:30:58
Message-id <1443346258.51.0.297563428023.issue18814@psf.upfronthosting.co.za>
In-reply-to
Content
I think moving this forward mainly needs someone with the time and energy to wrangle a python-ideas/python-dev discussion to get some additional feedback on the API design. As I see it, there are 2 main questions to be resolved:

1. Where to expose these functions

The default location would be the codecs module, as they're closely related to the error handlers in that module. The main reasons for needing to clean data at all are handling dirty data produced by an interface that uses surrogatepass or surrogateescape when decoding (process_surrogates, process_surrogateescape), and encoding data for use in a context which doesn't correctly handle code points outside the basic multilingual plane (process_astrals).
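To make the "dirty data" cases concrete, here is a small illustration using only existing codec error handlers. The sample byte strings are made up for the example; nothing here depends on the proposed functions.

```python
# 0xFF can never appear in valid UTF-8, so strict decoding would fail here.
raw = b"caf\xff"

# surrogateescape smuggles each undecodable byte 0xNN through as U+DCNN.
escaped = raw.decode("utf-8", errors="surrogateescape")

# The resulting string cannot be encoded strictly...
try:
    escaped.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but it round-trips if the same error handler is used when encoding.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

# surrogatepass lets an encoded surrogate through as a lone surrogate.
lone = b"\xed\xa0\x80".decode("utf-8", errors="surrogatepass")

# An astral code point needs two UTF-16 code units (a surrogate pair),
# which is what trips up consumers that only handle the BMP.
astral = "\U0001F600"
utf16_units = len(astral.encode("utf-16-le")) // 2
```

Strings like `escaped` and `lone` are what the first two cleaning functions would operate on, while `astral` is the kind of input the third one targets.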

If added to the codecs module, they could be documented in new sections on "Postprocessing decoded text" and "Preprocessing text for encoding".

The main argument against that is Stephen's: these aren't themselves encoding or decoding operations, but rather internal state manipulations on Python strings.

2. The exact function set to be provided.

The three potential data cleaning cases currently being considered:

* process_surrogates: reprocessing all surrogates in the string, including lone surrogates and valid surrogate pairs. Such strings may be produced by using the "surrogatepass" error handler when decoding, or by decomposing astral characters to surrogate pairs.
* process_surrogateescape: reprocessing only lone surrogates in the U+DC80 to U+DCFF range, with other surrogate pairs or lone surrogates triggering UnicodeTranslateError. Such strings may be produced by using the "surrogateescape" error handler when decoding.
* process_astrals: reprocessing all code points in the astral planes (that is, all code points above U+FFFF).
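
As a rough sketch of the three cases above (not the patch attached to this issue): the function names come from the list, but the signatures, the `replacement` parameter, and the choice to treat "reprocessing" as plain substitution are assumptions made only for illustration.

```python
import re

# Hypothetical signatures: "replacement" is an assumed parameter,
# not part of any agreed API.

def process_surrogates(text, replacement="\ufffd"):
    """Replace every surrogate code point, whether lone or part of a pair."""
    # Each code point is replaced individually, so a pair yields two replacements.
    return re.sub("[\ud800-\udfff]", replacement, text)

def process_surrogateescape(text, replacement="\ufffd"):
    """Replace only U+DC80..U+DCFF; any other surrogate is an error."""
    # A valid pair always contains a high surrogate, so pairs are rejected too.
    bad = re.search("[\ud800-\udc7f\udd00-\udfff]", text)
    if bad:
        raise UnicodeTranslateError(
            text, bad.start(), bad.end(),
            "surrogate not produced by surrogateescape")
    return re.sub("[\udc80-\udcff]", replacement, text)

def process_astrals(text, replacement="\ufffd"):
    """Replace every code point above U+FFFF."""
    return re.sub("[\U00010000-\U0010ffff]", replacement, text)
```

For example, under these assumptions `process_surrogateescape("caf\udcff", "?")` gives `"caf?"`, while passing a string containing `"\ud800"` raises UnicodeTranslateError. A real API would presumably offer more than one way of handling the matched code points, which is part of what the API discussion would need to settle.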

These seem to cover the essentials to me, and I changed the proposed prefix to "process_*" based on the idea of documenting them as preprocessing and postprocessing steps for encoding and decoding.
History
Date User Action Args
2015-09-27 09:30:58 ncoghlan set recipients: + ncoghlan, lemburg, pitrou, vstinner, ezio.melotti, Arfrever, steven.daprano, r.david.murray, sjt, martin.panter, serhiy.storchaka
2015-09-27 09:30:58 ncoghlan set messageid: <1443346258.51.0.297563428023.issue18814@psf.upfronthosting.co.za>
2015-09-27 09:30:58 ncoghlan link issue18814 messages
2015-09-27 09:30:58 ncoghlan create