Issue 19548: 'codecs' module docs improvements

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63747

classification

Title:	'codecs' module docs improvements
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 3.4, Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ncoghlan	Nosy List:	Zoinkity.., berker.peksag, docs@python, doerwalter, ezio.melotti, lemburg, martin.panter, ncoghlan, python-dev, vstinner, zuo
Priority:	normal	Keywords:	patch

Created on 2013-11-11 01:29 by zuo, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
codecs-doc.patch	martin.panter, 2014-12-18 02:20		review
codecs-doc.v2.patch	martin.panter, 2014-12-22 11:51		review
codecs-doc.v3.patch	martin.panter, 2014-12-23 07:22		review
issue19548-codecs-doc.py34.patch	ncoghlan, 2014-12-29 09:10	Updated version of patch for Python 3.4 branch	review
issue19548-codecs-doc.v5.py3.4.patch	martin.panter, 2015-01-06 06:05		review
default-branch-followup.patch	martin.panter, 2015-01-06 22:02		review

Messages (23)
msg202593 - (view)	Author: Jan Kaliszewski (zuo)	Date: 2013-11-11 01:29
When learning about the 'codecs' module I encountered several places in the docs of the module that, I believe, could be improved to be clearer and easier for codecs-begginers:

1. Ad `codecs.encode` and `codecs.decode` descriptions: I believe it would be worth mentioning that, unlike str.encode()/bytes.decode(), these functions (and all their counterparts in the classes the module contains) support not only "traditional str/bytes encodings", but also bytes-to-bytes as well as str-to-str encodings.

2. Ad `codecs.register`: in two places there is such a text: `These have to be factory functions providing the following interface: factory([...] errors='strict')` -- `errors='strict'` may be confusing (at first sight it may suggest that the only valid value is 'strict'; maybe `factory(errors=<error handler label>)` with an appropriate description below would be better?).

3. Ad `codecs.open`: I believe there should be a reference to the built-in open() as an alternative that is better in most cases.

4. Ad `codecs.BOM`: `These constants define various encodings of the Unicode byte order mark (BOM).` -- the world `encodings` seems to be confusing here; maybe `These constants define various byte sequences being Unicode byte order marks (BOMs) for several encodings. They are used...` would be better?

5. Ad `7.2.1. Codec Base Classes` + `codecs.IncrementalEncoder`/`codecs.IncrementalDecoder`:
   * `Each codec has to define four interfaces to make it usable as codec in Python: stateless encoder, stateless decoder, stream reader and stream writer` -- only four? Not six? What about incremental encoder/decoder???
   * Comparing the fragments (and tables) about error handling methods (Codec Base Classes, IncrementalEncoder, IncrementalDecoder) with the similar fragment in the `codecs.register` description and with the `codecs.register_error` description, I was confused: is it up to a particular codec implementation or to a registered error handler to implement a particular way of error handling? I believe it would be worth describing clearly the relations between these elements of the API. A more detailed description of the differences between error handling for encoding, decoding, and translation would also be a good thing.

6. Ad `7.2.1.6. StreamReaderWriter Objects` and `7.2.1.7. StreamRecoder Objects`: It would be worth saying explicitly that, contrary to the previously described abstract classes (IncrementalEncoder/Decoder, StreamReader/Writer), these classes are concrete ones (if I understand it correctly).

7. Ad `7.2.4. Python Specific Encodings`:
   * `raw_unicode_encoding` -- see: ticket #19539.
   * `unicode_encoding` -- `Produce a string that is suitable as Unicode literal in Python source code` but it is not a string; it's a bytes object (which could be used in source code using an `ascii`-compatible encoding).
   * `bytes-to-bytes` and `str-to-str` encodings -- maybe it would be nice to mention that these encodings cannot be used with the str.encode()/bytes.decode() methods (and to mention again that they can be used with the functions/methods provided by the `codecs` module).
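Points 1 and 7 in the message above can be illustrated with a minimal sketch. It assumes a Python 3 version (3.4 or later) in which the `rot13` and `hex` codecs are reachable through `codecs.encode()`/`codecs.decode()` but rejected by the str method; the variable names are chosen only for illustration.

```python
import codecs

# str-to-str transform: works through the codecs module functions
rotated = codecs.encode("hello", "rot13")

# bytes-to-bytes transform: likewise only reachable through the codecs functions
hexed = codecs.encode(b"hello", "hex")
restored = codecs.decode(hexed, "hex")

# str.encode() only accepts text encodings, so the same codec name
# is rejected with a LookupError.
try:
    "hello".encode("rot13")
    str_method_error = None
except LookupError as exc:
    str_method_error = exc

print(rotated, hexed, restored, type(str_method_error).__name__)
```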
msg202594 - (view)	Author: Jan Kaliszewski (zuo)	Date: 2013-11-11 01:31
s/world/word s/begginers/beginners (sorry, it's late night here)
msg202595 - (view)	Author: Jan Kaliszewski (zuo)	Date: 2013-11-11 01:56
8. Again ad `codecs.open`: the default file mode is actually 'rb', not 'r'.

9. Several places in the docs -- ad: `codecs.register_error`, `codecs.open`, `codecs.EncodedFile`, `Codec.encode/decode`, `codecs.StreamWriter/StreamReader` -- do not cover cases of using bytes-to-bytes and/or str-to-str encodings (especially when using `string`/`bytes` and `text`/`binary` terms).

10. `codecs.replace_errors` -- `bytestring` should be replaced with `bytes-like object` (as in other places).
msg202598 - (view)	Author: Jan Kaliszewski (zuo)	Date: 2013-11-11 02:14
11. Ad encoding 'undefined': The sentence `Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired.` was suitable for Python 2.x, but not for Python 3.x. I believe this sentence should be removed.
msg202604 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-11-11 05:27
A few more:

- codec name normalisation (lower case, space to hyphen) is not mentioned in the codecs.register description
- search function registration is not reversible, which doesn't play well with module reloading
- codecs.CodecInfo init signature is not covered
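The observable effect of name handling during lookup can be sketched as follows. This only shows that several spellings resolve to the same codec; the exact normalisation rules applied before a search function is called differ between Python versions and are deliberately not asserted here. The list of spellings is an illustrative choice.

```python
import codecs

# Different spellings and a registered alias all resolve to one codec.
spellings = ["UTF-8", "utf8", "Utf_8", "U8"]
resolved = {codecs.lookup(name).name for name in spellings}

# lookup() returns a CodecInfo object carrying the canonical name
# plus the encoder/decoder entry points.
info = codecs.lookup("UTF-8")
print(resolved, info.name)
```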
msg203038 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-11-16 13:25
Another big one: the encodings module API is not documented in the prose docs, and nor is the interface between the default search function and the individual encoding definitions. There's some decent info in help(encoding) though. The interaction with the import system could also be documented better - you can actually blacklist codecs by manipulating sys.modules and the encodings namespace, and you can search additional locations for codec modules by manipulating encodings.__path__ (even without it being declared as a namespace package)
msg203040 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-11-16 13:33
On 16.11.2013 14:25, Nick Coghlan wrote:
>
> Nick Coghlan added the comment:
>
> Another big one: the encodings module API is not documented in the prose docs, and nor is the interface between the default search function and the individual encoding definitions. There's some decent info in help(encoding) though.
>
> The interaction with the import system could also be documented better - you can actually blacklist codecs by manipulating sys.modules and the encodings namespace, and you can search additional locations for codec modules by manipulating encodings.__path__ (even without it being declared as a namespace package)

Those were not documented on purpose, since they are an implementation detail of the encodings package search function.

If you document them now, you'll set the implementation in stone, making future changes to the logic difficult. I'd advise against this to stay flexible, unless you want to open up the encodings package as namespace package - then you'd have to add documentation for the import interface.
msg203043 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2013-11-16 14:03
Could they be documented with a massive warning in red "Cpython implementation detail - subject to change without notice"? Or documented in a place that is only accessible to developers and not users? Or...???
msg203044 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-11-16 14:13
On 16.11.2013 15:03, Mark Lawrence wrote:
>
> Mark Lawrence added the comment:
>
> Could they be documented with a massive warning in red "Cpython implementation detail - subject to change without notice"? Or documented in a place that is only accessible to developers and not users? Or...???

The API is documented in encodings/__init__.py for developers.
msg203049 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-11-16 14:55
On 16 November 2013 23:33, Marc-Andre Lemburg <report@bugs.python.org> wrote:
> On 16.11.2013 14:25, Nick Coghlan wrote:
> Those were not documented on purpose, since they are an implementation
> detail of the encodings package search function.
>
> If you document them now, you'll set the implementation in stone,
> making future changes to the logic difficult. I'd advise against
> this to stay flexible, unless you want to open up the encodings
> package as namespace package - then you'd have to add documentation
> for the import interface.

Yes, that was what got me thinking along those lines, but to make that possible, the contents of encodings/__init__.py would need to be moved somewhere else. So this probably isn't on the table for 3.4.
msg207375 - (view)	Author: Martin Panter (martin.panter) *	Date: 2014-01-05 13:51
Addition to the list of improvements: * Under codecs.IncrementalEncoder.reset() it mentions calling encode('', final=True). This call does not work as written for the byte encoders in my experience, because they do not accept empty text strings. Perhaps it should just say to use the final=True flag with no data.
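The behaviour described above can be sketched with the bytes-to-bytes `zlib` codec. This is a minimal illustration, assuming the standard `zlib` codec's incremental encoder; the variable names are hypothetical. Flushing with an empty text string fails, while flushing with empty bytes completes the stream.

```python
import codecs
import zlib

# Incremental encoder for a bytes-to-bytes codec
encoder = codecs.getincrementalencoder("zlib")()
chunks = [encoder.encode(b"hello "), encoder.encode(b"world")]

# Flush with empty *bytes*; this finishes the compressed stream.
chunks.append(encoder.encode(b"", final=True))
round_trip = zlib.decompress(b"".join(chunks))

# The call as literally written in the old docs, with an empty text
# string, is rejected by a byte encoder.
other = codecs.getincrementalencoder("zlib")()
try:
    other.encode("", final=True)
    text_flush_error = None
except TypeError as exc:
    text_flush_error = exc

print(round_trip, type(text_flush_error).__name__)
```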
msg219830 - (view)	Author: Zoinkity . (Zoinkity..)	Date: 2014-06-05 18:40
One glaring omission is any information about multibyte codecs--the class, its methods, and how to even define one. Also, the primary use for codecs.register would be to append a single codec to the lookup registry. Simple usage of the method only provides lookup for the provided codecs and will not include regularly-accessible ones such as "utf-8". It would be enormously helpful to provide an example of proper, safe usage.
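A minimal sketch of registering one extra codec is given here, under stated assumptions: the codec name `demoupper` and the helper functions are hypothetical, invented only for this illustration. The search function returns `None` for names it does not know, so lookup falls through to the other registered search functions and built-in encodings such as "utf-8" stay available.

```python
import codecs

def _demo_encode(text, errors="strict"):
    # Stateless encoder: returns (output, length of input consumed)
    return text.upper().encode("ascii", errors), len(text)

def _demo_decode(data, errors="strict"):
    # Stateless decoder: returns (output, length of input consumed)
    return bytes(data).decode("ascii", errors), len(data)

def demo_search(name):
    # Hypothetical codec name; any other name is left to other search functions.
    if name == "demoupper":
        return codecs.CodecInfo(
            encode=_demo_encode,
            decode=_demo_decode,
            name="demoupper",
        )
    return None

# Adds one more search function to the lookup chain.
codecs.register(demo_search)

custom = "abc".encode("demoupper")
builtin_still_works = "abc".encode("utf-8")
print(custom, builtin_still_works)
```

Registration is process-wide, so a sketch like this is best run once at import time rather than repeatedly.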
msg232849 - (view)	Author: Martin Panter (martin.panter) *	Date: 2014-12-18 02:20
Here is a patch addressing many of the points raised. Please have a look and give any feedback. Beware I am not very familiar with the Restructured Text markup and haven’t tried compiling it.

1. Mentioned bytes-to-bytes and text-to-text in general right at the top. Any APIs (e.g. see Issue 20132) that don't support them should be pointed out as exceptions to the rule.

8. The underlying mode is forced to binary, so 'r' is the same as 'rb'. I removed the 'b' from the signature for clarity.

## Jan’s points not yet addressed: ##

3. I expect the built-in open() function would already be much more obvious and advertised, so I didn't add any cross-reference from codecs.open().

5. Both points still need addressing:
   * Lack of requirement for implementing incremental codecs
   * Responsibility of implementing error handlers

9. First point left unaddressed:
   * register_error() error_handler replacement data type (unsure of details)

## Numbering Nick’s points: ##

12. Codec name normalization: Not addressed; what should be written?

[13. Registration not reversible: Added in patch]

[14. Added CodecInfo class, pulling out some existing details from register().]

15. “encodings” module: not done

16. Import system: not done

## My (Martin’s) point: ##

[17. IncrementalEncoder.reset(): done]

## Zoinkity’s points, not addressed: ##

18. Multibyte codecs

19. register() usage example

## Some new points of my own that need fixing: ##

20. The doc string for register() says the search function is also allowed to return a tuple of functions, but the reference manual does not mention this. Which is more accurate? (I notice CodecInfo is a subclass of “tuple”.)

21. EncodedFile() seems to return StreamRecoder instances. Perhaps move them closer together? Should probably warn that EncodedFile's data_encoding is handled by a stateless codec.

22. The Codec.encode() and decode() methods return a length consumed, but I suspect they have to consume everything they are supplied because the code I have seen ignores this return value.
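Points 20 and 22 can be made concrete with a short sketch using the UTF-8 codec. It only shows the shape of the return values (an output object plus a length consumed) and that `CodecInfo` is a tuple subclass; it does not settle whether a codec may consume less than it is given. Variable names are illustrative.

```python
import codecs

info = codecs.lookup("utf-8")

# Stateless encode/decode return (output, length consumed).
encoded, chars_consumed = info.encode("h\u00e9llo")
decoded, bytes_consumed = info.decode(encoded)

# CodecInfo is a tuple subclass (point 20).
is_tuple = isinstance(info, tuple)

print(encoded, chars_consumed, decoded, bytes_consumed, is_tuple)
```

The encoder reports 5 characters consumed and the decoder 6 bytes, since the accented character takes two bytes in UTF-8.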
msg233017 - (view)	Author: Martin Panter (martin.panter) *	Date: 2014-12-22 11:51
Adding patch v2 after learning how to compile the docs and fixing my errors. I also simplified the descriptions of the CodecInfo attributes by deferring the constructor signatures to where they are fully defined under “Codec base classes”, and merged the list of error handlers there as well. A side effect of merging error handler lists is that “surrogatepass” is now defined for codecs in general, not just Codec.encode() and decode(). Also I noticed that “unicode_escape” actually does Latin-1 decoding.
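The Latin-1 observation about “unicode_escape” can be checked with a small sketch; it also shows that encoding with this codec yields a bytes object rather than a str. The sample strings are arbitrary.

```python
# Encoding produces bytes containing only ASCII escape sequences.
escaped = "\u00e9\u20ac".encode("unicode_escape")

# Decoding interprets escape sequences, and treats any other byte
# (here the raw byte 0xE9) as Latin-1.
decoded = b"caf\xe9 \\u20ac".decode("unicode_escape")

print(escaped, decoded)
```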
msg233020 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-12-22 15:26
Thanks for those drafts, Martin - they look like a strong improvement to me. While I still had plenty of comments/questions on v2, I think that's more a reflection on how long it has been since we gave these docs a thorough overall review than a reflection on the proposed changes. Victor - I added you to the nosy list for this one, as I'd specifically like your comments on the StreamReader/Writer docs updates. I'd like to make it clear that these are distinct from the "text encoding only" APIs in the io module, while still accurately describing the behaviour of the standard codecs.
msg233035 - (view)	Author: Martin Panter (martin.panter) *	Date: 2014-12-23 07:22
New patch version addressing many of the comments; thanks for reviewing! Also adds and extends some unit tests to confirm some of the corner cases I am documenting.
msg233167 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-12-29 09:10
I started making a few edits based on Zuo and Walter's comments while getting this patch ready for merging, and decided the end result could benefit from an additional round of feedback before committing it. This particular patch is also aimed at the Python 3.4 maintenance branch rather than at trunk - the introduction of the new namereplace error handler in 3.5 means that the previous patch didn't apply cleanly to the maintenance branch. While Zoinkity's feedback is also valid (i.e. multibyte codecs aren't documented properly, custom codec registration is both harder than it really should be and not well documented), I think those are better filed and handled as separate issues, rather than trying to handle them here as part of the general "bring the current content of the codec module documentation up to date with the current state of Python 3".
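For readers unfamiliar with the namereplace handler mentioned above, here is a minimal sketch comparing it with the older standard error handlers when encoding to ASCII. It assumes Python 3.5 or later, where namereplace is available; the sample text is arbitrary.

```python
text = "caf\u00e9"  # the last character cannot be encoded as ASCII

handlers = ["ignore", "replace", "xmlcharrefreplace",
            "backslashreplace", "namereplace"]

# Each handler decides what to emit for the unencodable character.
results = {name: text.encode("ascii", name) for name in handlers}

# The default handler, 'strict', raises instead.
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

for name, value in results.items():
    print(name, value)
```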
msg233505 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-01-06 06:05
Adding patch v5, for the 3.4 branch. There is at least one reference that still needs fixing in the default branch that is not applicable to the 3.4 branch.

Main changes from Nick’s patch:

* Removed sentence now redundant with introduction to open() and EncodedFile()
* Fixed wording to allow for missing surrogateescape_errors() etc
* Changed heading to clarify Codec objects are stateless
* Restored relaxation for StreamWriter writing to text stream
* New wording under “Encodings and Unicode”
* Update cross references to new “Error Handlers” section
msg233543 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-01-06 14:38
New changeset 0646eee8296a by Nick Coghlan in branch '3.4':
Issue 19548: update codecs module documentation
https://hg.python.org/cpython/rev/0646eee8296a

New changeset 4d00d0109147 by Nick Coghlan in branch 'default':
Merge issue 19548 changes from 3.4
https://hg.python.org/cpython/rev/4d00d0109147
msg233544 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-01-06 14:42
Thanks for the work on this, folks: Jan for the feedback, Martin for the writing, and everyone else for their comments. I don't believe we addressed all of Jan's comments, but I'd like to request that any further comments be filed as separate issues, now that the larger restructure of the content is out of the way.
msg233552 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-01-06 22:02
Thanks Nick. Here is a small followup patch for the default (3.5) branch to keep things consistent.
msg233560 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-01-07 03:15
New changeset 20a5a56ce090 by Nick Coghlan in branch 'default': Issue #19548: clean up merge issues in codecs docs https://hg.python.org/cpython/rev/20a5a56ce090
msg233562 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-01-07 04:59
Thanks for the follow-up patch Martin - I missed those when I did the merge forward from 3.4.

History
Date	User	Action	Args
2022-04-11 14:57:53	admin	set	github: 63747
2021-03-17 22:14:42	iritkatriel	link	issue22128 superseder
2015-01-07 04:59:02	ncoghlan	set	messages: + msg233562
2015-01-07 03:15:02	python-dev	set	messages: + msg233560
2015-01-06 22:02:33	martin.panter	set	files: + default-branch-followup.patch messages: + msg233552
2015-01-06 14:42:05	ncoghlan	set	status: open -> closed resolution: fixed messages: + msg233544 stage: commit review -> resolved
2015-01-06 14:38:55	python-dev	set	nosy: + python-dev messages: + msg233543
2015-01-06 06:06:04	martin.panter	set	files: + issue19548-codecs-doc.v5.py3.4.patch messages: + msg233505
2014-12-29 09:10:35	ncoghlan	set	files: + issue19548-codecs-doc.py34.patch assignee: docs@python -> ncoghlan messages: + msg233167 stage: patch review -> commit review
2014-12-27 21:04:02	berker.peksag	link	issue19539 superseder
2014-12-23 07:22:29	martin.panter	set	files: + codecs-doc.v3.patch messages: + msg233035
2014-12-22 17:14:58	ezio.melotti	set	nosy: + ezio.melotti
2014-12-22 15:26:21	ncoghlan	set	nosy: + vstinner messages: + msg233020 versions: - Python 2.7
2014-12-22 11:51:12	martin.panter	set	files: + codecs-doc.v2.patch messages: + msg233017
2014-12-18 02:56:40	berker.peksag	set	nosy: + berker.peksag stage: needs patch -> patch review
2014-12-18 02:21:07	martin.panter	set	files: + codecs-doc.patch keywords: + patch messages: + msg232849
2014-12-17 09:49:38	serhiy.storchaka	set	stage: needs patch type: enhancement versions: + Python 2.7, Python 3.5, - Python 3.3
2014-06-05 18:40:34	Zoinkity..	set	nosy: + Zoinkity.. messages: + msg219830
2014-02-03 15:37:55	BreamoreBoy	set	nosy: - BreamoreBoy
2014-01-05 15:18:01	serhiy.storchaka	set	nosy: + doerwalter
2014-01-05 13:51:51	martin.panter	set	nosy: + martin.panter messages: + msg207375
2013-11-16 14:55:12	ncoghlan	set	messages: + msg203049
2013-11-16 14:13:42	lemburg	set	messages: + msg203044
2013-11-16 14:03:59	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg203043
2013-11-16 13:33:13	lemburg	set	nosy: + lemburg messages: + msg203040
2013-11-16 13:25:02	ncoghlan	set	messages: + msg203038
2013-11-16 00:44:05	terry.reedy	set	versions: - Python 3.2
2013-11-11 05:27:27	ncoghlan	set	messages: + msg202604
2013-11-11 04:51:59	ned.deily	set	nosy: + ncoghlan
2013-11-11 02:15:10	zuo	set	versions: - Python 2.6, Python 3.1, Python 2.7
2013-11-11 02:14:06	zuo	set	messages: + msg202598
2013-11-11 01:56:00	zuo	set	messages: + msg202595
2013-11-11 01:31:10	zuo	set	messages: + msg202594
2013-11-11 01:29:13	zuo	create