Issue 8838: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode()


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53084

classification

Title:	Remove codecs.readbuffer_encode() and codecs.charbuffer_encode()
Type:		Stage:
Components:	Interpreter Core	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	doerwalter, eric.araujo, lemburg, loewis, pitrou, vstinner
Priority:	normal	Keywords:

Created on 2010-05-27 23:13 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (27)
msg106625 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-27 23:13
readbuffer_encode() and charbuffer_encode() are not really encoders, nor are they related to encodings: they are related to PyBuffer. readbuffer_encode() uses the "s#" format and charbuffer_encode() uses the "t#" format to parse their arguments. Both functions were introduced with the creation of the _codecs module 10 years ago (r14660).

I think that these functions should be removed. memoryview() should be used instead.

Note: charbuffer_encode() is the last function using one of the "t" formats (t, t#, t*) in Python3.
msg106626 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-27 23:17
A search in Google doesn't show anything interesting: it looks like these functions were never used outside Python test suite. I just noticed r41461: "Add tests for various error cases and for readbuffer_encode() and charbuffer_encode(). This increases code coverage in Modules/_codecsmodule.c from 83% to 95%." (4 years ago)
msg106640 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-28 07:54
STINNER Victor wrote:
> readbuffer_encode() and charbuffer_encode() are not really encoder nor related to encodings: they are related to PyBuffer. readbuffer_encode() uses "s#" format and charbuffer_encode() uses "t#" format to parse their arguments. Both functions were introduced by the creation of the _codecs module 10 years ago (r14660).
>
> I think that these functions should be removed. memoryview() should be used instead.
>
> Note: charbuffer_encode() is the last function using one of the "t" format (t, t#, t*) in Python3.

Those two encoder functions were meant to be used by Python codec implementations which want to use the readbuffer and charbuffer interfaces available in Python via "s#" and "t#" to access input object data.

They are not used by the builtin codecs, but may well be in use by 3rd party codecs.

I'm not sure why you think those functions are not encoders.
msg106645 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-28 11:14
> Those two encoder functions were meant to be used by Python codec
> implementations which want to use the readbuffer and charbuffer
> interfaces available in Python via "s#" and "t#" to access input
> object data.

Ah ok.

> They are not used by the builtin codecs,
> but may well be in use by 3rd party codecs.

My quick Google search didn't find any of those. I suppose that str and bytes are enough for most people. Do you know a use case of text or bytes stored in different types than str and bytes? (I suppose that bytearray is compatible with bytes, and so it can be used instead of bytes.)

> I'm not sure why you think those functions are not encoders.

I consider that the Python3 codecs module only encodes and decodes text to/from an encoding, whereas Python2 had extra unrelated codecs like "base64" or "hex" (but it was decided to remove them to clean up the codecs module).
msg106650 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-28 11:35
STINNER Victor wrote:
>> Those two encoder functions were meant to be used by Python codec
>> implementations which want to use the readbuffer and charbuffer
>> interfaces available in Python via "s#" and "t#" to access input
>> object data.
>
> Ah ok.
>
>> They are not used by the builtin codecs,
>> but may well be in use by 3rd party codecs.
>
> My quick Google search didn't find any of those. I suppose that str and bytes are enough for most people. Do you know a use case of text or bytes stored in different types than str and bytes? (I suppose the bytearray is compatible with bytes, and so it can be used instead of bytes)

Any Python object can expose a buffer interface and the above functions then allow accessing these interfaces from within Python. Think of e.g. memory mapped files, image/audio/video objects, database BLOBs, scientific data types, numeric arrays, etc. There are lots of such object types.

>> I'm not sure why you think those functions are not encoders.
>
> I consider that Python3 codecs module only encode and decode text to/from an encoding, whereas Python2 had extra unrelated codecs like "base64" or "hex" (but it was decided to remove them to cleanup the codecs module).

Those codecs will be reenabled in Python 3.2. Removing them was a mistake.

The codec machinery is not limited to only working on Unicode and bytes. It can work on arbitrary type combinations, depending on what a codec wants to implement.
msg106653 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-28 12:19
> Any Python object can expose a buffer interface and the above
> functions then allow accessing these interfaces from within
> Python.

What's the point? The codecs functions already support objects exposing the buffer interface:

>>> b = b"\xe9"
>>> codecs.latin_1_decode(memoryview(b))
('é', 1)
>>> codecs.latin_1_decode(array.array("b", b))
('é', 1)

Those two functions are undocumented. They serve no useful purpose (you can call the bytes(...) constructor instead, or even use the buffer object directly as shown above). They are badly named since they don't have anything to do with codecs. Google Code Search shows them not appearing anywhere else than implementations of the Python stdlib.

Removing them only seems reasonable.
msg106656 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-28 12:39
Antoine Pitrou wrote:
>> Any Python object can expose a buffer interface and the above
>> functions then allow accessing these interfaces from within
>> Python.
>
> What's the point? The codecs functions already support objects exposing the buffer interface:
>
> >>> b = b"\xe9"
> >>> codecs.latin_1_decode(memoryview(b))
> ('é', 1)
> >>> codecs.latin_1_decode(array.array("b", b))
> ('é', 1)
>
> Those two functions are undocumented. They serve no useful purpose (you can call the bytes(...) constructor instead, or even use the buffer object directly as shown above). They are badly named since they don't have anything to do with codecs. Google Code Search shows them not appearing anywhere else than implementations of the Python stdlib. Removing them only seems reasonable.

readbuffer_encode and charbuffer_encode convert objects to bytes and provide a codec encoder interface for this, hence the naming.

They are meant to be used as encode methods for codecs, just like the other *_encode functions exposed in the _codecs module, e.g.

    class BinaryDataCodec(codecs.Codec):

        # Note: Binding these as C functions will result in the class not
        # converting them to methods. This is intended.
        encode = codecs.readbuffer_encode
        decode = codecs.latin_1_decode

While it's possible to emulate the functions via other methods, these methods always introduce intermediate objects, which isn't necessary and only costs performance.

Given that "t#" was basically rendered useless in Python3 (see issue8839), removing charbuffer_encode() is indeed possible, so

+1 on removing charbuffer_encode()
-1 on removing readbuffer_encode()
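A minimal sketch of how the BinaryDataCodec class above behaves once instantiated. It assumes a CPython build that still exposes the undocumented codecs.readbuffer_encode(), so the lookup is guarded; the array input and the bytes() fallback are illustrative assumptions, not part of the original proposal.

```python
import array
import codecs

# readbuffer_encode() is undocumented, so look it up defensively.
readbuffer_encode = getattr(codecs, "readbuffer_encode", None)


class BinaryDataCodec(codecs.Codec):
    # Built-in (C) functions stored as class attributes are not turned
    # into bound methods, so no "self" argument is passed to them.
    if readbuffer_encode is not None:
        encode = readbuffer_encode
    decode = codecs.latin_1_decode


codec = BinaryDataCodec()
data = array.array("B", [0xE9, 0x41])  # any object exposing a buffer

if readbuffer_encode is not None:
    encoded, consumed = codec.encode(data)
else:
    # Assumed fallback for interpreters that lack the function.
    encoded, consumed = bytes(data), len(data)

# Decoding returns a str, not the original array: the codec is
# deliberately non-symmetrical.
decoded, _ = codec.decode(encoded)
```

The encoder only looks at the raw buffer contents, which is why decoding yields text rather than an array object.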
msg106657 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-05-28 12:45
I’d be grateful if someone could post links to discussion about the removal of codecs like hex and rot13 and about their coming back. It may be useful for a NEWS entry too, not just for my personal curiosity ;) I’ll try to find them next week or so if nobody posts them before. Thanks.
msg106658 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-28 13:02
> class BinaryDataCodec(codecs.Codec):
>
>     # Note: Binding these as C functions will result in the class not
>     # converting them to methods. This is intended.
>     encode = codecs.readbuffer_encode
>     decode = codecs.latin_1_decode

What's the point, though? Creating a non-symmetrical codec doesn't sound like a very useful or recommendable thing to do. Especially in the py3k codec model where encode() only works on unicode objects.

> While it's possible to emulate the functions via other methods,
> these methods always introduce intermediate objects, which isn't
> necessary and only costs performance.

The bytes() constructor doesn't (shouldn't) create any more intermediate objects than read/charbuffer_encode() do.

And all this doesn't address the fact that these functions have never been documented, and don't seem used in the outside world (understandably so, since there's no way to know about their existence, and their intended use).
msg106659 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-28 13:09
> I’d be grateful if someone could post links to discussion
> about the removal of codecs like hex and rot13

r55932 (~3 years ago):

"Rip out all codecs that can't work in a unicode/bytes world: base64, uu, zlib, rot_13, hex, quopri, bz2, string_escape. However codecs.escape_encode() and codecs.escape_decode() still exist, as they are used for pickling str8 objects (so those two functions can go, when the str8 type is removed)."

They were removed a year and a half before the Python 3.0 release.

> ... and about their coming back

Which coming back?
msg106660 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-05-28 13:12
Thanks for the link. Do you have a pointer to the PEP or ML thread discussing that change? “Which coming back?” Martin said these codecs are coming back in 3.2.
msg106661 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-28 13:18
> Martin said these codecs are coming back in 3.2.

Oh, there is the issue #7485 where Martin wrote:
 * 2009-12-10 23:15: "It was a mistake that they were integrated"
 * 2009-12-12 19:25: "I would still be opposed to such a change (...) adding them would be really confusing."
msg106662 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2010-05-28 13:20
> > I’d be grateful if someone could post links to discussion
> > about the removal of codecs like hex and rot13
> r55932 (~3 years ago):

That was my commit. ;)

> Thanks for the link. Do you have a pointer to the PEP or ML thread
> discussing that change?

The removal is documented here: http://www.artima.com/weblogs/viewpost.jsp?thread=208549

"""
We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API).
"""

A post by Georg Brandl about this is at http://mail.python.org/pipermail/python-3000/2007-June/008420.html

(Note that this thread began in private email between Guido, MvL, Georg and myself. If needed I can dig up the emails.)
msg106663 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-28 13:23
> Oh, there is the issue #7485 where Martin wrote: Copy/paste failure: issue #7475.
msg106664 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-28 13:23
Antoine Pitrou wrote:
>> class BinaryDataCodec(codecs.Codec):
>>
>>     # Note: Binding these as C functions will result in the class not
>>     # converting them to methods. This is intended.
>>     encode = codecs.readbuffer_encode
>>     decode = codecs.latin_1_decode
>
> What's the point, though? Creating a non-symmetrical codec doesn't sound
> like a very useful or recommendable thing to do.

Why not? If you're only interested in the binary data and don't care about the original input object type, that's a very natural thing to do. E.g. you could use a memory mapped file as input to the encoder. Would you really expect the codec to recreate such a file object when decoding the binary data?

> Especially in the py3k
> codec model where encode() only works on unicode objects.

That's a common misunderstanding. The codec system does not mandate a specific type combination. Only the helper methods .encode() and .decode() on bytes and str objects in Python3 do.

>> While it's possible to emulate the functions via other methods,
>> these methods always introduce intermediate objects, which isn't
>> necessary and only costs performance.
>
> The bytes() constructor doesn't (shouldn't) create any more intermediate
> objects than read/charbuffer_encode() do.

Looking at the code, the data takes quite a long path through the whole machinery. For non-Unicode objects, it always tries to create an integer and only if that fails reverts back to the buffer interface after a few more function calls.

Furthermore, the bytes() constructor accepts a lot more objects than the "s#" parser marker, e.g. lists of integers, plain integers, arbitrary iterators, which a codec just interested in the binary representation of an object via the buffer interface most likely doesn't want to accept.

> And all this doesn't address the fact that these functions have never
> been documented, and don't seem used in the outside world
> (understandably so, since there's no way to know about their existence,
> and their intended use).

That's a documentation bug and probably the result of the fact that none of the exposed encoder/decoder APIs are documented.
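The difference in accepted inputs described above can be sketched as follows. buffer_to_bytes() is a hypothetical helper, not an existing API. It goes through memoryview(), so it does create the intermediate object that the performance argument objects to, and unlike the "s#" parser marker it also rejects str.

```python
import array


def buffer_to_bytes(obj):
    """Copy obj's buffer contents to bytes; reject non-buffer objects."""
    # memoryview() raises TypeError for objects without the buffer
    # protocol (ints, lists, iterators, str), unlike bytes().
    with memoryview(obj) as view:
        return view.tobytes()


# bytes() accepts far more than buffer objects:
zero_filled = bytes(3)        # an int gives that many zero bytes
from_list = bytes([65, 66])   # an iterable of ints gives those byte values

# The strict helper only accepts real buffers:
from_array = buffer_to_bytes(array.array("B", [65, 66]))

# Collect the type names of the inputs the strict helper refuses.
rejected = []
for candidate in (3, [65, 66], iter([65]), "AB"):
    try:
        buffer_to_bytes(candidate)
    except TypeError:
        rejected.append(type(candidate).__name__)
```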
msg106665 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-28 13:25
STINNER Victor wrote:
>> Martin said these codecs are coming back in 3.2.

I said that and it was discussed on the python-dev mailing list a while back.

We'll also add .transform() methods on bytes and str objects to access same-type codecs.
msg106666 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-28 13:33
> readbuffer_encode() and charbuffer_encode() are not really encoder
> nor related to encodings: they are related to PyBuffer

That was the initial problem: codecs is specific to encodings (in Python3), encodes str to bytes, and decodes bytes (or any read buffer) to str.

I don't like the readbuffer_encode and charbuffer_encode function names, because they are different from the other codecs: they encode bytes to bytes (and not str to bytes). I think that these functions should be removed or moved somewhere else under a different name.
msg106667 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-28 13:35
> > And all this doesn't address the fact that these functions have never
> > been documented, and don't seem used in the outside world
> > (understandably so, since there's no way to know about their existence,
> > and their intended use).
>
> That's a documentation bug and probably the result of the fact
> that none of the exposed encoder/decoder APIs are documented.

Are you planning to fix it? It is not obvious anybody else is able to properly document those functions.
msg106672 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-05-28 13:52
> I don't like readbuffer_encode and charbuffer_encode
> function names, because there are different than other codecs

“transform” as hinted by MvL seems perfect.

Thanks everyone for the pointers here and in #7475! I’ll search for the missing one (“it was discussed on the python-dev mailing list a while back”) later.
msg106693 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-05-28 22:29
> Martin said these codecs are coming back in 3.2.

I think you are confusing me with MAL. I remain opposed to adding them back. Users ought to use the modules that provide these conversions as functions.
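As a sketch of what using the modules directly looks like, the conversions named earlier in the thread (base64, hex, zlib) are all reachable through standard-library functions without going through the codec machinery. The payload value is only an illustration.

```python
import base64
import binascii
import zlib

payload = b"hello"

# Each conversion is a plain bytes-to-bytes function call.
b64 = base64.b64encode(payload)
hexed = binascii.hexlify(payload)
packed = zlib.compress(payload)

# The matching inverse functions restore the original data.
restored = (
    base64.b64decode(b64),
    binascii.unhexlify(hexed),
    zlib.decompress(packed),
)
```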
msg107288 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-06-07 22:48
MAL agreed to remove the "t#" parsing format (#8839), whereas charbuffer_encode()'s main goal was to offer the "t#" parsing format to Python object space. charbuffer_encode() is now useless in Python3. bytes() accepts any buffer object (read-only and read/write buffer), so readbuffer_encode() became useless in Python3.

readbuffer_encode() and charbuffer_encode() were never documented, and are not used by any 3rd party library.

Can we remove these two functions?
msg107307 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-08 08:00
STINNER Victor wrote:
> MAL agreed to remove "t#" parsing format (#8839), whereas charbuffer_encode() main goal was to offer "t#" parsing format to Python object space. charbuffer_encode() is now useless in Python3. bytes() accepts any buffer object (read-only and read/write buffer), so readbuffer_encode() became useless in Python3.
>
> readbuffer_encode() and charbuffer_encode() were never documented, and are not used by any 3rd party library.
>
> Can we remove these two functions?

Like I said before: We can remove charbuffer_encode() now and perhaps add it again later on when buffers have learned (again) to provide access to a text version of their data. In this case, we'd likely add t# back again as well.

Please leave readbuffer_encode() as-is.
msg107318 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-08 12:42
> Please leave readbuffer_encode() as-is.

Then please add documentation for it.
msg107319 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-08 12:44
Antoine Pitrou wrote:
>> Please leave readbuffer_encode() as-is.
>
> Then please add documentation for it.

Will do.
msg107363 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-06-08 23:13
r81854 removes codecs.charbuffer_encode() (and t# parsing format) from Python 3.2 (blocked in 3.1: r81855).

--

My problem with codecs.readbuffer_encode() is that it accepts both byte strings and character strings. If you want to get a byte string, just use bytes(input). If you want to convert a character string to a byte string, use input.encode("utf-8"). But accepting both types may lead to mojibake as we had in Python2.

MAL> That's a common misunderstanding. The codec system does not
MAL> mandate a specific type combination. Only the helper methods
MAL> .encode() and .decode() on bytes and str objects in Python3 do.

This is related to #7475: we have to decide if we drop this (currently unused) feature completely (e.g. remove codecs.readbuffer_encode()), or if we "reenable" this feature again (reintroduce hex, bz2, rot13, ... codecs). This discussion should occur on the mailing list.
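A small sketch of the two explicit paths recommended above, and of the kind of mojibake an implicit text-to-bytes conversion can hide. The sample character and the latin-1 mismatch are illustrative choices.

```python
text = "\xe9"   # the character 'é'
raw = b"\xe9"   # a single byte with value 0xE9

# Explicit paths: text is encoded with a named encoding, bytes are copied.
text_as_utf8 = text.encode("utf-8")
text_as_latin1 = text.encode("latin-1")
raw_copy = bytes(raw)

# bytes() refuses a str unless an encoding is given, so the choice of
# encoding can never happen silently.
try:
    bytes(text)
    str_rejected = False
except TypeError:
    str_rejected = True

# Mojibake: UTF-8 bytes read back under a different assumed encoding.
garbled = text_as_utf8.decode("latin-1")
```

The same character yields different byte strings depending on the encoding, which is why an API that silently picks one for str input while passing bytes through unchanged can mix the two representations.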
msg107373 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-09 08:20
STINNER Victor wrote:
> r81854 removes codecs.charbuffer_encode() (and t# parsing format) from Python 3.2 (blocked in 3.1: r81855).
>
> --
>
> My problem with codecs.readbuffer_encode() is that it does accept byte and character strings. If you want to get a byte string, just use bytes(input). If you want to convert a character string to a byte string, use input.encode("utf-8"). But accepting both types may lead to mojibake as we had in Python2.

The point is to have an interface to the "s#" parser marker from Python. This accepts bytes, objects with a buffer interface and Unicode objects (via the default encoding). It does not accept e.g. lists, tuples or plain integers like bytes() does.

> MAL> That's a common misunderstanding. The codec system does not
> MAL> mandate a specific type combination. Only the helper methods
> MAL> .encode() and .decode() on bytes and str objects in Python3 do.
>
> This is related to #7475: we have to decide if we drop completely this (currently unused) feature (eg. remove codecs.readbuffer_encode()), or if we "reenable" this feature again (reintroduce hex, bz2, rot13, ... codecs). This discussion should occur on the mailing list.

We are not going to drop this design feature of the codec system and we've already had the discussion in 2008.

The statement that it is an unused feature is plain wrong. Please don't forget that people are actually using these things in their applications, many of which have not been ported to Python3. We're not just talking about code that you find in CPython or the stdlib.

The removed codecs will go back into 3.2.
msg107807 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-06-14 19:02
This issue was about removing codecs.readbuffer_encode() and codecs.charbuffer_encode(). codecs.charbuffer_encode() was removed, but MAL explained that codecs.readbuffer_encode() should be kept. So I am closing this issue because there is nothing more to do on this topic.

@lemburg: You still have to write some doc (and tests?) for codecs.readbuffer_encode() ;-)

History
Date	User	Action	Args
2022-04-11 14:57:01	admin	set	github: 53084
2010-06-14 19:03:04	vstinner	set	status: open -> closed resolution: fixed
2010-06-14 19:02:42	vstinner	set	messages: + msg107807
2010-06-09 08:20:32	lemburg	set	messages: + msg107373
2010-06-08 23:13:33	vstinner	set	messages: + msg107363
2010-06-08 12:44:27	lemburg	set	messages: + msg107319
2010-06-08 12:42:15	pitrou	set	messages: + msg107318
2010-06-08 08:00:05	lemburg	set	messages: + msg107307
2010-06-07 22:48:44	vstinner	set	messages: + msg107288
2010-05-28 22:29:27	loewis	set	messages: + msg106693
2010-05-28 13:52:36	eric.araujo	set	messages: + msg106672
2010-05-28 13:35:01	pitrou	set	messages: + msg106667
2010-05-28 13:33:41	vstinner	set	messages: + msg106666
2010-05-28 13:25:43	lemburg	set	messages: + msg106665
2010-05-28 13:23:29	lemburg	set	messages: + msg106664
2010-05-28 13:23:18	vstinner	set	messages: + msg106663
2010-05-28 13:20:50	doerwalter	set	nosy: + doerwalter messages: + msg106662
2010-05-28 13:18:42	vstinner	set	messages: + msg106661
2010-05-28 13:12:04	eric.araujo	set	messages: + msg106660
2010-05-28 13:09:57	vstinner	set	messages: + msg106659
2010-05-28 13:02:21	pitrou	set	messages: + msg106658
2010-05-28 12:45:25	eric.araujo	set	nosy: + eric.araujo messages: + msg106657
2010-05-28 12:39:40	lemburg	set	messages: + msg106656
2010-05-28 12:19:19	pitrou	set	nosy: + loewis, pitrou messages: + msg106653
2010-05-28 11:35:25	lemburg	set	messages: + msg106650
2010-05-28 11:14:56	vstinner	set	messages: + msg106645
2010-05-28 07:54:17	lemburg	set	nosy: + lemburg messages: + msg106640
2010-05-27 23:17:21	vstinner	set	messages: + msg106626
2010-05-27 23:13:35	vstinner	create