Title: 'codecs' module functionality + its docs -- concerning custom codecs, especially non-string ones
Type: Stage:
Components: Documentation, Library (Lib) Versions: Python 3.4, Python 3.5
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, doerwalter, lemburg, martin.panter, ncoghlan, zuo
Priority: normal Keywords:

Created on 2015-01-13 15:04 by zuo, last changed 2015-01-27 16:02 by doerwalter.

Messages (6)
msg233940 - Author: Jan Kaliszewski (zuo) Date: 2015-01-13 15:04
To some extent, this issue is a follow-up of Issue 20132. It concerns some parts of functionality + documentation of the 'codecs' module related to registering custom codecs, especially non-string ones (i.e., codecs that encode/decode between arbitrary types, not necessarily the str and bytes types).

A few fragments of documented behaviour and/or documentation itself bother me:

0. Ad "7.2.1. Codec Base Classes"

"Each codec has to define four interfaces to make it usable as codec in Python: stateless encoder, stateless decoder, stream reader and stream writer. The stream reader and writers typically reuse the stateless encoder/decoder to implement the file protocols. Codec authors also need to define how the codec will handle encoding and decoding errors."

IMHO it is still unclear:

a) What is the relation between codecs in this sense and CodecInfo objects? (In particular, CodecInfo contains information about six interfaces, not four.)

b) How do codec authors define "how the codec will handle encoding and decoding errors"? What is the relation between this and the error handling schemes (defined as generic, not per-codec ones) documented below?
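To make the "six interfaces" concrete, here is a minimal sketch that inspects the CodecInfo object returned by codecs.lookup() for the built-in "utf-8" codec. The variable names are only illustrative, and the grouping into stateless/incremental/stream pairs is my reading of the attribute names, not something the quoted paragraph states:

```python
import codecs

# Look up a built-in text codec and list the six interfaces CodecInfo carries.
info = codecs.lookup("utf-8")
interfaces = (
    "encode", "decode",                          # stateless pair
    "incrementalencoder", "incrementaldecoder",  # incremental pair
    "streamwriter", "streamreader",              # stream pair
)
available = {name: getattr(info, name) is not None for name in interfaces}

# The stateless encoder receives the error scheme by name as its second
# argument and returns (output, number_of_input_items_consumed).
encoded, consumed = info.encode("h\u00e9llo", "strict")
```

For a built-in text codec all six attributes are set; for a custom codec only encode and decode are required by the CodecInfo constructor.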

1. Ad " Error Handlers" and "codecs.strict_errors(exception)"

"'strict' 	Raise UnicodeError (or a subclass); this is the default. Implemented in strict_errors()."

"Implements the 'strict' error handling: each encoding or decoding error raises a UnicodeError."

Is it true that it is always a UnicodeError or a subclass of it, and not just a ValueError or a subclass of it (as described in other fragments of the module documentation)?

Please note that 'strict' is documented as a universal (and not, e.g., text-encoding-only) error handling scheme. So what about non-string codecs?
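A small sketch of the discrepancy, assuming the built-in "ascii" text codec and the bytes-to-bytes "hex_codec" as the two test cases (the variable names are illustrative only). Both are called with 'strict', but only the text codec reports a UnicodeError:

```python
import binascii
import codecs

# A text codec in 'strict' mode raises a UnicodeError subclass.
try:
    "\u00e9".encode("ascii", "strict")
except ValueError as exc:
    text_error = exc

# A bytes-to-bytes codec signals bad input with another ValueError subclass.
try:
    codecs.decode(b"zz", "hex_codec", "strict")
except ValueError as exc:
    binary_error = exc

text_is_unicode_error = isinstance(text_error, UnicodeError)
binary_is_unicode_error = isinstance(binary_error, UnicodeError)
binary_is_binascii_error = isinstance(binary_error, binascii.Error)
```

So "ValueError (or a subclass)" describes both cases, while "UnicodeError (or a subclass)" describes only the first.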

2. Ad "codecs.register_error(name, error_handler)"

"For encoding, error_handler will be called with a UnicodeEncodeError instance..." "Decoding and translating works similarly, except UnicodeDecodeError or UnicodeTranslateError will be passed..."

Again: what about non-string codecs? UnicodeError subclasses do not seem to be appropriate for them.
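For reference, this is what the documented text-codec case looks like. A minimal sketch, where the handler name "hexescape_sketch" and the function hex_escape are made up for illustration; it only covers encoding with a text codec and says nothing about what a non-string codec would pass to the handler:

```python
import codecs

def hex_escape(exc):
    # The handler receives the exception instance and returns a tuple of
    # (replacement, position_at_which_to_resume).
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
        return replacement, exc.end
    # Anything else is outside this sketch: re-raise it unchanged.
    raise exc

codecs.register_error("hexescape_sketch", hex_escape)

result = "a\u00e9".encode("ascii", "hexescape_sketch")
```

The handler relies on the object, start and end attributes, which are specific to the UnicodeError subclasses; that is exactly why the non-string case is unclear.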

3. It would be nice to address Zoinkity's concerns from Issue 20132 (partially related to the above points):

One glaring omission is any information about multibyte codecs--the class, its methods, and how to even define one.  

Also, the primary use for codecs.register would be to append a single codec to the lookup registry.  Simple usage of the method only provides lookup for the provided codecs and will not include regularly-accessible ones such as "utf-8".  It would be enormously helpful to provide an example of proper, safe usage.
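As a starting point for such an example, here is a minimal sketch of a search function that answers for one codec only. The codec name "reverse_sketch" and all function names are hypothetical; the assumption being illustrated is that a search function which returns None for names it does not know leaves lookup of the standard codecs untouched:

```python
import codecs

def _reverse_encode(text, errors="strict"):
    # Stateless encoder: returns (output, length_of_input_consumed).
    return text[::-1].encode("utf-8", errors), len(text)

def _reverse_decode(data, errors="strict"):
    # Stateless decoder: bytes() accepts any bytes-like input.
    return bytes(data).decode("utf-8", errors)[::-1], len(data)

def search_reverse_sketch(name):
    # Answer only for our own codec name; return None for everything else so
    # that lookup falls through to the other registered search functions.
    if name == "reverse_sketch":
        return codecs.CodecInfo(_reverse_encode, _reverse_decode,
                                name="reverse_sketch")
    return None

codecs.register(search_reverse_sketch)

encoded = codecs.encode("abc", "reverse_sketch")
roundtrip = codecs.decode(encoded, "reverse_sketch")
builtin_name = codecs.lookup("utf-8").name
```

The name deliberately avoids hyphens and spaces, because codecs.lookup() normalizes the encoding name before it reaches the search function.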
msg233941 - Author: Jan Kaliszewski (zuo) Date: 2015-01-13 15:08

s/Issue 20132/Issue 19548/g

Issue 20132 is also somewhat related, but what I meant is that this is a follow-up of Issue 19548; the concerns of Zoinkity that I cited are also from Issue 19548, not from 20132.
msg234602 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-01-24 11:17
Unfortunately, a lot of these things aren't well defined in the docs because they're not especially well defined, period. The codecs module works very well if you stick to the common, well-tested paths (primarily the text encodings), but it's complex enough that there are quite a few dark corners as well.

The functional changes in 3.4 and Martin's documentation updates in issue 19548 certainly improved things a bit further.

I'm inclined to agree with Marc-Andre's comment on 20132, that we're a bit down in the weeds at the moment, without a clear shared vision of where we *want* to be for the codecs module. A couple of other big issues with the current design of the module are the fact you can't register a codec directly, you have to register a search function (which you then can't unregister) and the fact that the "is a text encoding" flag I added for 3.4 is private, rather than a generally available capability.

In terms of this issue, until Martin's last patch, the error handling documentation basically all assumed text codecs. The changes in that patch clarified some areas that could be tested with the bytes-bytes codecs, but left others still vague because it isn't clear what's intended behaviour, and what's an implementation accident in CPython.

I've added MAL to the nosy list here as well, since if anyone is going to know the *intended* interaction between error handlers and arbitrary codecs, it's MAL.
msg234603 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-01-24 11:20
Regarding the 6 vs 4 interfaces, what's really needed there is a clearer explanation of what functionality depends on each of the three interfaces (basic, stream, incremental), so that a codec developer has a clearer understanding of what won't work if they don't provide a particular interface.
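One way to see what stops working is to register a codec that provides only the basic (stateless) interface and probe the rest. A minimal sketch; the codec name "stateless_only_sketch" and the helper names are hypothetical, and it checks only the codecs-module accessors, not every consumer of each interface:

```python
import codecs

def _upper_encode(text, errors="strict"):
    return text.upper().encode("ascii", errors), len(text)

def _upper_decode(data, errors="strict"):
    return bytes(data).decode("ascii", errors), len(data)

def search_stateless_only(name):
    # Only the two stateless functions are supplied; the incremental and
    # stream attributes of CodecInfo are left at their default of None.
    if name == "stateless_only_sketch":
        return codecs.CodecInfo(_upper_encode, _upper_decode, name=name)
    return None

codecs.register(search_stateless_only)

info = codecs.lookup("stateless_only_sketch")
stateless_result = codecs.encode("abc", "stateless_only_sketch")

# The incremental accessor refuses a codec that lacks that interface.
try:
    codecs.getincrementaldecoder("stateless_only_sketch")
    has_incremental = True
except LookupError:
    has_incremental = False
```

Here the stateless call works, while anything that needs an incremental or stream class has nothing to instantiate.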
msg234644 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-25 01:01
I am certainly no expert, but this is how I understand the three different kinds of codecs are used:

* Stateless codecs: str.encode(), bytes.decode(), etc
* Incremental codecs: TextIOWrapper, IncrementalNewlineDecoder
* Stream codecs: only stuff inside the “codecs” module as far as I know: EncodedFile(), etc.
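The three kinds can be sketched side by side with the built-in "utf-8" codec. This is only an illustration of the calling conventions (variable names are mine), not a statement about which parts of the standard library use which kind:

```python
import codecs
import io

# Stateless: the whole input is converted in a single call.
data = "h\u00e9llo".encode("utf-8")

# Incremental: input arrives in arbitrary pieces and the decoder keeps state,
# so a multi-byte sequence may be split across calls.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(data[:2]), decoder.decode(data[2:], final=True)]
incremental_text = "".join(pieces)

# Stream: the codec's StreamReader wraps a file-like object.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
stream_text = reader.read()
```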
msg234824 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2015-01-27 16:02
That analysis seems correct to me.

Stateless and stream codecs were the original implementation. In 2006 I implemented incremental codecs:

The intent was to have stateful codecs that can work with iterators and generators.

When Guido began reimplementing the io machinery for Python 3 he used incremental codecs as the basis.
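A minimal sketch of the iterator/generator use case, using codecs.iterdecode() and codecs.iterencode(), which drive an incremental codec over an iterable. The generator chunks() is a made-up input source that splits a two-byte UTF-8 sequence across two chunks:

```python
import codecs

def chunks():
    # Hypothetical input source: the two bytes of U+00E9 arrive separately.
    yield b"h\xc3"
    yield b"\xa9llo"

# The incremental decoder carries the dangling byte over to the next chunk.
decoded = "".join(codecs.iterdecode(chunks(), "utf-8"))

# Encoding works the same way over an iterable of strings.
encoded = b"".join(codecs.iterencode(["h\u00e9", "llo"], "utf-8"))
```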
Date                 User           Action  Args
2015-01-27 16:02:42  doerwalter     set     nosy: + doerwalter; messages: + msg234824
2015-01-25 01:01:49  martin.panter  set     messages: + msg234644
2015-01-24 11:20:18  ncoghlan       set     messages: + msg234603
2015-01-24 11:17:08  ncoghlan       set     nosy: + lemburg; messages: + msg234602
2015-01-13 21:35:39  ned.deily      set     nosy: + ncoghlan
2015-01-13 20:54:38  martin.panter  set     nosy: + martin.panter
2015-01-13 15:08:39  zuo            set     messages: + msg233941
2015-01-13 15:04:27  zuo            create