classification
Title: Remove codecs.readbuffer_encode() and codecs.charbuffer_encode()
Type:
Stage:
Components: Interpreter Core
Versions: Python 3.2

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: doerwalter, eric.araujo, lemburg, loewis, pitrou, vstinner
Priority: normal
Keywords:

Created on 2010-05-27 23:13 by vstinner, last changed 2010-06-14 19:03 by vstinner. This issue is now closed.

Messages (27)
msg106625 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-27 23:13
readbuffer_encode() and charbuffer_encode() are not really encoders, nor are they related to encodings: they are related to PyBuffer. readbuffer_encode() uses the "s#" format and charbuffer_encode() uses the "t#" format to parse their arguments. Both functions were introduced with the creation of the _codecs module 10 years ago (r14660).

I think that these functions should be removed. memoryview() should be used instead.

Note: charbuffer_encode() is the last function using one of the "t" formats (t, t#, t*) in Python3.
msg106626 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-27 23:17
A Google search doesn't show anything interesting: it looks like these functions were never used outside the Python test suite. I just noticed r41461: "Add tests for various error cases and for readbuffer_encode() and charbuffer_encode(). This increases code coverage in Modules/_codecsmodule.c from 83% to 95%." (4 years ago)
msg106640 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 07:54
STINNER Victor wrote:
> 
> New submission from STINNER Victor <victor.stinner@haypocalc.com>:
> 
> readbuffer_encode() and charbuffer_encode() are not really encoders, nor are they related to encodings: they are related to PyBuffer. readbuffer_encode() uses the "s#" format and charbuffer_encode() uses the "t#" format to parse their arguments. Both functions were introduced with the creation of the _codecs module 10 years ago (r14660).
> 
> I think that these functions should be removed. memoryview() should be used instead.
> 
> Note: charbuffer_encode() is the last function using one of the "t" formats (t, t#, t*) in Python3.

Those two encoder functions were meant to be used by Python codec
implementations which want to use the readbuffer and charbuffer
interfaces available in Python via "s#" and "t#" to access input
object data.

They are not used by the builtin codecs, but may well be in use
by 3rd party codecs.

I'm not sure why you think those functions are not encoders.
msg106645 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 11:14
> Those two encoder functions were meant to be used by Python codec
> implementations which want to use the readbuffer and charbuffer
> interfaces available in Python via "s#" and "t#" to access input
> object data.

Ah ok.

> They are not used by the builtin codecs, 
> but may well be in use by 3rd party codecs.

My quick Google search didn't find any of those. I suppose that str and bytes are enough for most people. Do you know of a use case for text or bytes stored in types other than str and bytes? (I suppose that bytearray is compatible with bytes, and so it can be used instead of bytes.)

> I'm not sure why you think those functions are not encoders.

I consider that the Python3 codecs module only encodes and decodes text to/from an encoding, whereas Python2 had extra unrelated codecs like "base64" or "hex" (but it was decided to remove them to clean up the codecs module).
msg106650 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 11:35
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> Those two encoder functions were meant to be used by Python codec
>> implementations which want to use the readbuffer and charbuffer
>> interfaces available in Python via "s#" and "t#" to access input
>> object data.
> 
> Ah ok.
> 
>> They are not used by the builtin codecs, 
>> but may well be in use by 3rd party codecs.
> 
> My quick Google search didn't find any of those. I suppose that str and bytes are enough for most people. Do you know of a use case for text or bytes stored in types other than str and bytes? (I suppose that bytearray is compatible with bytes, and so it can be used instead of bytes.)

Any Python object can expose a buffer interface and the above
functions then allow accessing these interfaces from within
Python.

Think of e.g. memory mapped files, image/audio/video objects,
database BLOBs, scientific data types, numeric arrays, etc.
There are lots of such object types.

>> I'm not sure why you think those functions are not encoders.
> 
> I consider that the Python3 codecs module only encodes and decodes text to/from an encoding, whereas Python2 had extra unrelated codecs like "base64" or "hex" (but it was decided to remove them to clean up the codecs module).

Those codecs will be reenabled in Python 3.2. Removing them was
a mistake. The codec machinery is not limited to only working
on Unicode and bytes. It can work on arbitrary type combinations,
depending on what a codec wants to implement.
msg106653 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-28 12:19
> Any Python object can expose a buffer interface and the above
> functions then allow accessing these interfaces from within
> Python.

What's the point? The codecs functions already support objects exposing the buffer interface:

>>> import array, codecs
>>> b = b"\xe9"
>>> codecs.latin_1_decode(memoryview(b))
('é', 1)
>>> codecs.latin_1_decode(array.array("b", b))
('é', 1)

Those two functions are undocumented. They serve no useful purpose (you can call the bytes(...) constructor instead, or even use the buffer object directly as shown above). They are badly named since they don't have anything to do with codecs. Google Code Search shows them appearing nowhere other than in implementations of the Python stdlib. Removing them seems only reasonable.
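The two alternatives named above can be sketched as follows. This is a minimal illustration assuming a current CPython 3 interpreter; the variable names are made up for the example.

```python
import array
import codecs

raw = b"\xe9"
view = memoryview(raw)
arr = array.array("b", raw)  # signed chars initialised from the same byte

# Alternative 1: bytes() copies the contents of any buffer-exposing object.
copied_from_view = bytes(view)
copied_from_array = bytes(arr)

# Alternative 2: pass the buffer object to the decoder directly.
decoded_view = codecs.latin_1_decode(view)
decoded_array = codecs.latin_1_decode(arr)

print(copied_from_view, copied_from_array)
print(decoded_view, decoded_array)
```

In both cases no call to readbuffer_encode() is involved; whether that is preferable is the point under discussion in this issue.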
msg106656 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 12:39
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> Any Python object can expose a buffer interface and the above
>> functions then allow accessing these interfaces from within
>> Python.
> 
> What's the point? The codecs functions already support objects exposing the buffer interface:
> 
> >>> b = b"\xe9"
> >>> codecs.latin_1_decode(memoryview(b))
> ('é', 1)
> >>> codecs.latin_1_decode(array.array("b", b))
> ('é', 1)
>
> Those two functions are undocumented. They serve no useful purpose (you can call the bytes(...) constructor instead, or even use the buffer object directly as shown above). They are badly named since they don't have anything to do with codecs. Google Code Search shows them appearing nowhere other than in implementations of the Python stdlib. Removing them seems only reasonable.

readbuffer_encode and charbuffer_encode convert objects to bytes
and provide a codec encoder interface for this, hence the naming.

They are meant to be used as encode methods for codecs, just like
the other *_encode functions exposed in the _codecs module, e.g.

class BinaryDataCodec(codecs.Codec):

    # Note: Binding these as C functions will result in the class not
    # converting them to methods. This is intended.
    encode = codecs.readbuffer_encode
    decode = codecs.latin_1_decode

While it's possible to emulate the functions via other methods,
these methods always introduce intermediate objects, which isn't
necessary and only costs performance.

Given that "t#" was basically rendered useless in Python3 (see
issue8839), removing charbuffer_encode() is indeed possible,
so

+1 on removing charbuffer_encode()
-1 on removing readbuffer_encode()
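A minimal sketch of how the BinaryDataCodec class above could be exercised, assuming a CPython 3 interpreter in which codecs.readbuffer_encode() is still available. The input data and variable names are illustrative only.

```python
import codecs


class BinaryDataCodec(codecs.Codec):
    # Built-in C functions are not turned into bound methods when stored
    # as class attributes, so no "self" argument is passed to them.
    encode = codecs.readbuffer_encode
    decode = codecs.latin_1_decode


codec = BinaryDataCodec()

# Any object exposing the buffer interface can be used as encoder input.
encoded, consumed = codec.encode(memoryview(b"\xe9\xff"))

# Decoding yields a str, not the original input object type.
text, length = codec.decode(encoded)

print(encoded, consumed)
print(text, length)
```

The encoder returns a bytes copy of the buffer contents together with the number of bytes consumed, following the (output, length) convention of the other *_encode functions.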
msg106657 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-05-28 12:45
I’d be grateful if someone could post links to discussion about the removal of codecs like hex and rot13 and about their coming back. It may be useful for a NEWS entry too, not just for my personal curiosity ;) I’ll try to find them next week or so if nobody posts them before. Thanks.
msg106658 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-28 13:02
> class BinaryDataCodec(codecs.Codec):
> 
>     # Note: Binding these as C functions will result in the class not
>     # converting them to methods. This is intended.
>     encode = codecs.readbuffer_encode
>     decode = codecs.latin_1_decode

What's the point, though? Creating a non-symmetrical codec doesn't sound
like a very useful or recommendable thing to do. Especially in the py3k
codec model where encode() only works on unicode objects.

> While it's possible to emulate the functions via other methods,
> these methods always introduce intermediate objects, which isn't
> necessary and only costs performance.

The bytes() constructor doesn't (shouldn't) create any more intermediate
objects than read/charbuffer_encode() do.

And all this doesn't address the fact that these functions have never
been documented, and don't seem used in the outside world
(understandably so, since there's no way to know about their existence,
and their intended use).
msg106659 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 13:09
> I’d be grateful if someone could post links to discussion
> about the removal of codecs like hex and rot13

r55932 (~3 years ago):

"Rip out all codecs that can't work in a unicode/bytes world:
base64, uu, zlib, rot_13, hex, quopri, bz2, string_escape.

However codecs.escape_encode() and codecs.escape_decode()
still exist, as they are used for pickling str8 objects
(so those two functions can go, when the str8 type is removed)."

They were removed a year and a half before the Python 3.0 release.

> ... and about their coming back

which coming back?
msg106660 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-05-28 13:12
Thanks for the link. Do you have a pointer to the PEP or ML thread
discussing that change?

“Which coming back?”
Martin said these codecs are coming back in 3.2.
msg106661 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 13:18
> Martin said these codecs are coming back in 3.2.

Oh, there is the issue #7485 where Martin wrote:
* 2009-12-10 23:15: "It was a mistake that they were integrated"
* 2009-12-12 19:25: "I would still be opposed to such a change (...) adding them would be really confusing."
msg106662 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2010-05-28 13:20
> > I’d be grateful if someone could post links to discussion
> > about the removal of codecs like hex and rot13
> r55932 (~3 years ago):

That was my commit. ;)

> Thanks for the link. Do you have a pointer to the PEP or ML thread
> discussing that change?

The removal is documented here: http://www.artima.com/weblogs/viewpost.jsp?thread=208549

"""
We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API).
"""

A post by Georg Brandl about this is at http://mail.python.org/pipermail/python-3000/2007-June/008420.html

(Note that this thread began in private email between Guido, MvL, Georg and myself. If needed I can dig up the emails.)
msg106663 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 13:23
> Oh, there is the issue #7485 where Martin wrote:

Copy/paste failure: issue #7475.
msg106664 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 13:23
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> class BinaryDataCodec(codecs.Codec):
>>
>>     # Note: Binding these as C functions will result in the class not
>>     # converting them to methods. This is intended.
>>     encode = codecs.readbuffer_encode
>>     decode = codecs.latin_1_decode
> 
> What's the point, though? Creating a non-symmetrical codec doesn't sound
> like a very useful or recommendable thing to do.

Why not ? If you're only interested in the binary data and
don't care about the original input object type, that's a
very natural thing to do.

E.g. you could use a memory mapped file as input to the encoder.
Would you really expect the codec to recreate such a file object when
decoding the binary data ?

> Especially in the py3k
> codec model where encode() only works on unicode objects.

That's a common misunderstanding. The codec system does not
mandate a specific type combination. Only the helper methods
.encode() and .decode() on bytes and str objects in Python3 do.

>> While it's possible to emulate the functions via other methods,
>> these methods always introduce intermediate objects, which isn't
>> necessary and only costs performance.
> 
> The bytes() constructor doesn't (shouldn't) create any more intermediate
> objects than read/charbuffer_encode() do.

Looking at the code, the data takes quite a long path through
the whole machinery. For non-Unicode objects, it always tries to create
an integer and only if that fails reverts back to the buffer
interface after a few more function calls.

Furthermore, the bytes() constructor accepts a lot more
objects than the "s#" parser marker, e.g. lists of integers,
plain integers, arbitrary iterators, which a codec
just interested in the binary representation of an
object via the buffer interface most likely doesn't
want to accept.
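The difference in accepted input types can be sketched as follows, assuming a current CPython 3 interpreter. The helper accepted_by_readbuffer() is a hypothetical name introduced only for this illustration.

```python
import codecs


def accepted_by_readbuffer(obj):
    """Return True if codecs.readbuffer_encode() accepts obj."""
    try:
        codecs.readbuffer_encode(obj)
    except TypeError:
        return False
    return True


# bytes() accepts integers, lists of integers and arbitrary iterators.
zeros = bytes(3)
from_list = bytes([1, 2, 3])
from_iter = bytes(iter([65, 66]))

# readbuffer_encode() rejects those, but accepts buffer-exposing objects.
results = {
    "int": accepted_by_readbuffer(3),
    "list": accepted_by_readbuffer([1, 2, 3]),
    "bytes": accepted_by_readbuffer(b"abc"),
    "bytearray": accepted_by_readbuffer(bytearray(b"abc")),
    "memoryview": accepted_by_readbuffer(memoryview(b"abc")),
}

print(zeros, from_list, from_iter)
print(results)
```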

> And all this doesn't address the fact that these functions have never
> been documented, and don't seem used in the outside world
> (understandably so, since there's no way to know about their existence,
> and their intended use).

That's a documentation bug and probably the result of the fact
that none of the exposed encoder/decoder APIs are documented.
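The distinction drawn above between the codec system and the .encode()/.decode() helper methods can be illustrated with a bytes-to-bytes codec. This sketch assumes CPython 3.4 or later, where the "hex_codec" codec is available through codecs.encode() and codecs.decode() and the str helper method rejects it with a LookupError.

```python
import codecs

# The module-level functions accept codecs that are not text encodings.
hex_bytes = codecs.encode(b"abc", "hex_codec")
round_trip = codecs.decode(hex_bytes, "hex_codec")

# The str.encode() helper method only accepts text encodings.
try:
    "abc".encode("hex_codec")
except LookupError as exc:
    helper_error = exc
else:
    helper_error = None

print(hex_bytes, round_trip)
print(helper_error)
```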
msg106665 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 13:25
STINNER Victor wrote:
> 
>> Martin said these codecs are coming back in 3.2.

I said that and it was discussed on the python-dev mailing list
a while back.

We'll also add .transform() methods on bytes and str objects
to access same-type codecs.
msg106666 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 13:33
> readbuffer_encode() and charbuffer_encode() are not really encoder 
> nor related to encodings: they are related to PyBuffer

That was the initial problem: codecs is specific to encodings (in Python3), encodes str to bytes, and decodes bytes (or any read buffer) to str.

I don't like the readbuffer_*encode* and charbuffer_*encode* function names, because they are different from other codecs: they encode *bytes* to bytes (and not str to bytes). I think that these functions should be removed or moved somewhere else under a different name.
msg106667 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-28 13:35
> > And all this doesn't address the fact that these functions have never
> > been documented, and don't seem used in the outside world
> > (understandably so, since there's no way to know about their existence,
> > and their intended use).
> 
> That's a documentation bug and probably the result of the fact
> that none of the exposed encoder/decoder APIs are documented.

Are you planning to fix it? It is not obvious anybody else is able to
properly document those functions.
msg106672 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-05-28 13:52
> I don't like readbuffer_*encode* and *charbuffer_encode*
> function names, because they are different from other codecs
“transform” as hinted by MvL seems perfect.

Thanks everyone for the pointers here and in #7475! I’ll search for the missing one (“it was discussed on the python-dev mailing list a while back”) later.
msg106693 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-05-28 22:29
> Martin said these codecs are coming back in 3.2.

I think you are confusing me with MAL. I remain opposed to adding them
back. Users ought to use the modules that provide these
conversions as functions.
msg107288 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-07 22:48
MAL agreed to remove the "t#" parsing format (#8839), and the main goal of charbuffer_encode() was to offer the "t#" parsing format to the Python object space, so charbuffer_encode() is now useless in Python3. bytes() accepts any buffer object (read-only and read/write buffers), so readbuffer_encode() became useless in Python3 too.

readbuffer_encode() and charbuffer_encode() were never documented, and are not used by any 3rd party library.

Can we remove these two functions?
msg107307 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-08 08:00
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> MAL agreed to remove the "t#" parsing format (#8839), and the main goal of charbuffer_encode() was to offer the "t#" parsing format to the Python object space, so charbuffer_encode() is now useless in Python3. bytes() accepts any buffer object (read-only and read/write buffers), so readbuffer_encode() became useless in Python3 too.
> 
> readbuffer_encode() and charbuffer_encode() were never documented, and are not used by any 3rd party library.
> 
> Can we remove these two functions?

Like I said before:

We can remove charbuffer_encode() now and perhaps
add it again later on when buffers have learned (again) to
provide access to a text version of their data. In this
case, we'd likely add t# back again as well.

Please leave readbuffer_encode() as-is.
msg107318 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-08 12:42
> Please leave readbuffer_encode() as-is.

Then please add documentation for it.
msg107319 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-08 12:44
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> Please leave readbuffer_encode() as-is.
> 
> Then please add documentation for it.

Will do.
msg107363 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-08 23:13
r81854 removes codecs.charbuffer_encode() (and t# parsing format) from Python 3.2 (blocked in 3.1: r81855).

--

My problem with codecs.readbuffer_encode() is that it does accept byte *and* character strings. If you want to get a byte string, just use bytes(input). If you want to convert a character string to a byte string, use input.encode("utf-8"). But accepting both types may lead to mojibake as we had in Python2.

MAL> That's a common misunderstanding. The codec system does not
MAL> mandate a specific type combination. Only the helper methods
MAL> .encode() and .decode() on bytes and str objects in Python3 do.

This is related to #7475: we have to decide whether we completely drop this (currently unused) feature (e.g. remove codecs.readbuffer_encode()), or "reenable" it (reintroduce the hex, bz2, rot13, ... codecs). This discussion should occur on the mailing list.
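The concern about accepting both byte and character strings can be sketched as follows, assuming a current CPython 3 interpreter in which readbuffer_encode() converts str input through UTF-8. The variable names are illustrative only.

```python
import codecs

# Byte string input is passed through unchanged.
from_bytes, _ = codecs.readbuffer_encode(b"\xe9")

# Character string input is silently converted to its UTF-8 bytes.
from_text, _ = codecs.readbuffer_encode("\xe9")

# If the caller assumes a one-byte-per-character encoding such as
# Latin-1, the two results no longer agree: this is mojibake.
as_latin_1 = from_text.decode("latin-1")

print(from_bytes, from_text)
print(as_latin_1)
```

Using bytes(input) for byte strings and input.encode("utf-8") for character strings, as suggested above, makes the conversion explicit at the call site.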
msg107373 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-09 08:20
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> r81854 removes codecs.charbuffer_encode() (and t# parsing format) from Python 3.2 (blocked in 3.1: r81855).
> 
> --
> 
> My problem with codecs.readbuffer_encode() is that it does accept byte *and* character strings. If you want to get a byte string, just use bytes(input). If you want to convert a character string to a byte string, use input.encode("utf-8"). But accepting both types may lead to mojibake as we had in Python2.

The point is to have an interface to the "s#" parser marker
from Python. This accepts bytes, objects with a buffer interface
and Unicode objects (via the default encoding).

It does not accept e.g. lists, tuples or plain integers like
bytes() does.

> MAL> That's a common misunderstanding. The codec system does not
> MAL> mandate a specific type combination. Only the helper methods
> MAL> .encode() and .decode() on bytes and str objects in Python3 do.
> 
> This is related to #7475: we have to decide whether we completely drop this (currently unused) feature (e.g. remove codecs.readbuffer_encode()), or "reenable" it (reintroduce the hex, bz2, rot13, ... codecs). This discussion should occur on the mailing list.

We are not going to drop this design feature of the codec system
and we've already had the discussion in 2008.

The statement that it is an unused feature is plain wrong. Please
don't forget that people are actually using these things in their
applications, many of which have not been ported to Python3.
We're not just talking about code that you find in CPython or the
stdlib.

The removed codecs will go back into 3.2.
msg107807 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-14 19:02
This issue was about removing codecs.readbuffer_encode() and codecs.charbuffer_encode(). codecs.charbuffer_encode() was removed, but MAL explained that codecs.readbuffer_encode() should be kept. So I am closing this issue because there is nothing more to do on this topic.

@lemburg: You still have to write some doc (and tests?) for codecs.readbuffer_encode() ;-)
History
Date                 User         Action  Args
2010-06-14 19:03:04  vstinner     set     status: open -> closed; resolution: fixed
2010-06-14 19:02:42  vstinner     set     messages: + msg107807
2010-06-09 08:20:32  lemburg      set     messages: + msg107373
2010-06-08 23:13:33  vstinner     set     messages: + msg107363
2010-06-08 12:44:27  lemburg      set     messages: + msg107319
2010-06-08 12:42:15  pitrou       set     messages: + msg107318
2010-06-08 08:00:05  lemburg      set     messages: + msg107307
2010-06-07 22:48:44  vstinner     set     messages: + msg107288
2010-05-28 22:29:27  loewis       set     messages: + msg106693
2010-05-28 13:52:36  eric.araujo  set     messages: + msg106672
2010-05-28 13:35:01  pitrou       set     messages: + msg106667
2010-05-28 13:33:41  vstinner     set     messages: + msg106666
2010-05-28 13:25:43  lemburg      set     messages: + msg106665
2010-05-28 13:23:29  lemburg      set     messages: + msg106664
2010-05-28 13:23:18  vstinner     set     messages: + msg106663
2010-05-28 13:20:50  doerwalter   set     nosy: + doerwalter; messages: + msg106662
2010-05-28 13:18:42  vstinner     set     messages: + msg106661
2010-05-28 13:12:04  eric.araujo  set     messages: + msg106660
2010-05-28 13:09:57  vstinner     set     messages: + msg106659
2010-05-28 13:02:21  pitrou       set     messages: + msg106658
2010-05-28 12:45:25  eric.araujo  set     nosy: + eric.araujo; messages: + msg106657
2010-05-28 12:39:40  lemburg      set     messages: + msg106656
2010-05-28 12:19:19  pitrou       set     nosy: + loewis, pitrou; messages: + msg106653
2010-05-28 11:35:25  lemburg      set     messages: + msg106650
2010-05-28 11:14:56  vstinner     set     messages: + msg106645
2010-05-28 07:54:17  lemburg      set     nosy: + lemburg; messages: + msg106640
2010-05-27 23:17:21  vstinner     set     messages: + msg106626
2010-05-27 23:13:35  vstinner     create