codecs missing: base64 bz2 hex zlib hex_codec ... #51724

florentx · 2009-12-10T22:27:39Z

BPO	7475
Nosy	@malemburg, @loewis, @warsaw, @birkenfeld, @gpshead, @jcea, @cben, @ncoghlan, @abalkin, @vstinner, @benjaminp, @jwilk, @ezio-melotti, @merwok, @bitdancer, @ssbarnea, @florentx, @akheron, @serhiy-storchaka, @phmc
Dependencies	bpo-17828: More informative error handling when encoding and decoding; bpo-17839: base64 module should use memoryview; bpo-17844: Add link to alternatives for bytes-to-bytes codecs
Files	issue7475_warning.diff: Patch for documentation and warnings in 2.7; issue7475_missing_codecs_py3k.diff: Patch, apply to trunk; issue7475_restore_codec_aliases_in_py34.diff: Patch to restore the transform aliases.

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/ncoghlan'
closed_at = <Date 2013-11-23.01:14:37.682>
created_at = <Date 2009-12-10.22:27:38.811>
labels = ['type-feature', 'library', 'expert-unicode']
title = 'codecs missing: base64 bz2 hex zlib hex_codec ...'
updated_at = <Date 2014-03-14.00:55:23.273>
user = 'https://github.com/florentx'

bugs.python.org fields:

activity = <Date 2014-03-14.00:55:23.273>
actor = 'python-dev'
assignee = 'ncoghlan'
closed = True
closed_date = <Date 2013-11-23.01:14:37.682>
closer = 'python-dev'
components = ['Library (Lib)', 'Unicode']
creation = <Date 2009-12-10.22:27:38.811>
creator = 'flox'
dependencies = ['17828', '17839', '17844']
files = ['15523', '15526', '32663']
hgrepos = []
issue_num = 7475
keywords = ['patch']
message_count = 95.0
messages = ['96218', '96223', '96226', '96227', '96228', '96232', '96236', '96237', '96240', '96242', '96243', '96251', '96253', '96265', '96277', '96295', '96296', '96301', '96374', '96632', '106669', '106670', '106674', '107057', '107794', '109872', '109876', '109879', '109894', '109904', '109905', '123090', '123154', '123206', '123435', '123436', '123462', '123693', '125073', '145246', '145656', '145693', '145897', '145900', '145979', '145980', '145982', '145986', '145991', '145998', '149439', '153304', '153317', '164224', '164226', '164237', '165435', '170414', '187630', '187631', '187634', '187636', '187638', '187644', '187649', '187651', '187652', '187653', '187660', '187668', '187670', '187673', '187676', '187695', '187696', '187698', '187701', '187702', '187705', '187707', '187764', '187770', '198845', '198846', '202130', '202264', '202515', '203124', '203378', '203751', '203936', '203942', '203944', '207283', '213502']
nosy_count = 22.0
nosy_names = ['lemburg', 'loewis', 'barry', 'georg.brandl', 'gregory.p.smith', 'jcea', 'cben', 'ncoghlan', 'belopolsky', 'vstinner', 'benjamin.peterson', 'jwilk', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'ssbarnea', 'flox', 'python-dev', 'petri.lehtinen', 'serhiy.storchaka', 'pconnell', 'isoschiz']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue7475'
versions = ['Python 3.4']

florentx · 2009-12-10T22:27:38Z

AFAIK these codecs were not ported to Python 3.

I found no hint in documentation on this matter.
Is it possible to contribute some of them, or is there a good reason
to look elsewhere?

loewis · 2009-12-10T23:15:12Z

These are not encodings, in that they don't convert characters to bytes.
It was a mistake that they were integrated into the codecs interfaces in
Python 2.x; this mistake is corrected in 3.x.

malemburg · 2009-12-10T23:25:03Z

Martin v. Löwis wrote:

> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> These are not encodings, in that they don't convert characters to bytes.
> It was a mistake that they were integrated into the codecs interfaces in
> Python 2.x; this mistake is corrected in 3.x.

Martin, I beg your pardon, but these codecs indeed implement valid
encodings and the fact that these codecs were removed was a
mistake.

They should be readded to Python 3.x.

Note that just because a codec doesn't convert between bytes
and characters only, doesn't make it wrong in any way. The codec
architecture in Python is designed to support same type encodings
just as well as ones between bytes and characters.

malemburg · 2009-12-10T23:26:10Z

Reopening the ticket.

loewis · 2009-12-10T23:28:52Z

It's not possible to add these codecs back. Bytes objects (correctly)
don't have an encode method, and string objects (correctly) don't have a
decode method. The codec architecture of Python 3.x just doesn't support
this kind of application; the codec architecture of 2.x was flawed.

benjaminp · 2009-12-11T02:09:19Z

I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal
strictly with unicode -> bytes.

florentx · 2009-12-11T08:21:38Z

«Everything you thought you knew about binary data and Unicode
has changed.»

Reopening for the documentation part.

This "mistake" deserves some words in the documentation:
docs.python.org/dev/py3k/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

And the conversion may be automated with 2to3, maybe.

florentx · 2009-12-11T08:31:17Z

Is it possible to add "DeprecationWarning" for these codecs
when using "python -3" ?

>>> {}.has_key('a')
__main__:1: DeprecationWarning: dict.has_key() not supported in 3.x;
            use the in operator
False
>>> print `123`
<stdin>:1: SyntaxWarning: backquote not supported in 3.x; use repr()
123
>>> 'abc'.encode('base64')
'YWJj\n'

malemburg · 2009-12-11T09:46:55Z

Martin v. Löwis wrote:

> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> It's not possible to add these codecs back. Bytes objects (correctly)
> don't have an encode method, and string objects (correctly) don't have a
> decode method. The codec architecture of Python 3.x just doesn't support
> this kind of application; the codec architecture of 2.x was flawed.

Of course it does support these kinds of codecs. The codec
architecture hasn't changed between 2.x and 3.x, just the way
a few methods work.

All we agreed to is that unicode.encode() will only return bytes,
while bytes.decode() will only return unicode. So the methods won't
support same type conversions, because Guido didn't want to
have methods that return different types based on the chosen
parameter (the codec name in this case).

However, you can still use codecs.encode() and codecs.decode()
to work with codecs that return different combinations of
types. I explicitly added that support back to 3.0.

You can't argue that just because two methods don't support
a certain type combination, the whole architecture doesn't
support this anymore.

Also note that codecs allow a much more far-reaching use
than just through the unicode and bytes methods: you can
use them as seamless wrappers for streams, subclass from
them, use their methods directly, etc. etc.

So your argument that just because the two methods don't
support these codecs anymore is just not good enough
to warrant their removal.
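
A minimal sketch of the point about `codecs.encode()` and `codecs.decode()`, assuming a Python 3 release that ships the `rot_13` and `hex_codec` modules under those full names (the short aliases are deliberately not used here). The variable names are illustrative only. It shows that the codec, not the calling function, decides which input and output types are used:

```python
import codecs

# str -> str: rot_13 is a text transform, so input and output are both str.
scrambled = codecs.encode("hello world", "rot_13")

# bytes -> bytes: hex_codec is a binary transform.
hexed = codecs.encode(b"abc", "hex_codec")
restored = codecs.decode(hexed, "hex_codec")

print(type(scrambled).__name__, scrambled)
print(type(hexed).__name__, hexed, restored)
```

Neither call goes through `str.encode()` or `bytes.decode()`, so the type restriction on those methods does not apply.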

malemburg · 2009-12-11T09:56:57Z

Benjamin Peterson wrote:

> Benjamin Peterson <benjamin@python.org> added the comment:
>
> I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal
> strictly with unicode -> bytes.

Sorry, Benjamin, but that's simply not true.

Codecs can work with arbitrary types, it's just that the helper
methods on unicode and bytes objects only support one combination
of types in Python 3.x.

codecs.encode()/.decode() provide access to all codecs, regardless
of their supported type combinations and of course, you can use
them directly via the codec registry, subclass from them, etc.
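
As a rough illustration of using a codec directly through the registry, the sketch below looks up `hex_codec` and drives its incremental encoder chunk by chunk. It assumes a Python 3 release where `hex_codec` is available; the chunk data and variable names are made up for the example.

```python
import codecs

# CodecInfo object returned by the codec registry.
info = codecs.lookup("hex_codec")

# Stateful encoder instance; useful when data arrives in pieces.
encoder = info.incrementalencoder()

chunks = [b"ab", b"cd", b"ef"]
pieces = [encoder.encode(chunk) for chunk in chunks]
pieces.append(encoder.encode(b"", final=True))  # signal end of input
hex_output = b"".join(pieces)

# The stateless decode function returns (output, length_consumed).
decoded, consumed = info.decode(hex_output)
print(hex_output, decoded, consumed)
```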

florentx · 2009-12-11T10:22:23Z

Thinking about it, I am +1 to reimplement the codecs.

We could implement new methods to replace the old one.
(similar to base64.encodebytes and base64.decodebytes)

>>> b'abc'.encodebytes('base64')
b'YWJj\n'
>>> b'abc'.encodebytes('zlib').encodebytes('base64')
b'eJxLTEoGAAJNASc=\n'
>>> b'UHl0aG9u'.decodebytes('base64').decode('utf-8')
'Python'
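
The `encodebytes`/`decodebytes` methods above are a proposal and do not exist on `bytes`. For comparison, here is a sketch of the same chain written with the existing `zlib` and `base64` module functions (variable names are illustrative):

```python
import base64
import zlib

data = b"abc"

# zlib first, then base64, mirroring the chained method calls in the proposal.
packed = base64.encodebytes(zlib.compress(data))

# Undo the steps in reverse order.
unpacked = zlib.decompress(base64.decodebytes(packed))

# base64 -> bytes -> str, as in the last line of the proposal.
text = base64.decodebytes(b"UHl0aG9u").decode("utf-8")
print(packed, unpacked, text)
```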

benjaminp · 2009-12-11T12:54:39Z

2009/12/11 Marc-Andre Lemburg <report@bugs.python.org>:

> codecs.encode()/.decode() provide access to all codecs, regardless
> of their supported type combinations and of course, you can use
> them directly via the codec registry, subclass from them, etc.

Didn't you have a proposal for bytes.transform/untransform for
operations like this?

malemburg · 2009-12-11T13:13:50Z

Benjamin Peterson wrote:

> Benjamin Peterson <benjamin@python.org> added the comment:
>
> 2009/12/11 Marc-Andre Lemburg <report@bugs.python.org>:
> > codecs.encode()/.decode() provide access to all codecs, regardless
> > of their supported type combinations and of course, you can use
> > them directly via the codec registry, subclass from them, etc.
>
> Didn't you have a proposal for bytes.transform/untransform for
> operations like this?

Yes. At the time it was postponed, since I brought it up late
in the 3.0 release process. Perhaps I should bring it up again.

Note that those methods are just convenient helpers to access
the codecs and as such only provide limited functionality.

The full machinery itself is accessible via the codecs module and
the code in the encodings package. Any decision to include a codec
or not needs to be based on whether it fits the framework in those
modules/packages, not the functionality we expose on unicode and
bytes objects.

florentx · 2009-12-11T17:05:42Z

I've ported the codecs from Py2:
base64, bytes_escape, bz2, hex, quopri, rot13, uu and zlib

It's not a big deal. Basically:

StringIO.StringIO --> io.BytesIO
'string_escape' --> 'bytes_escape'

Will add documentation if we agree on the feature.

loewis · 2009-12-11T23:09:08Z

> codecs.encode()/.decode() provide access to all codecs, regardless
> of their supported type combinations and of course, you can use
> them directly via the codec registry, subclass from them, etc.

I presume that the OP didn't talk about codecs.encode, but about
the methods on string objects. flox, can you clarify what precisely
it is that you miss?

ncoghlan · 2013-04-23T21:46:42Z

No, transform/untransform as methods are a bad idea, but these *codecs*
should definitely come back.

The minimal change needed for that to be feasible is to give errors raised
during encoding and decoding more context information (at least the codec
name and error mode, and switching to the right kind of error).

MAL also stated on python-dev that codecs.encode and codecs.decode already
exist, so it should just be a matter of documenting them properly.

gpshead · 2013-04-23T22:19:41Z

okay, but i don't personally find any of these to be good ideas as "codecs" given they don't have anything to do with translating between bytes<->unicode.

ncoghlan · 2013-04-23T23:07:33Z

The codecs module is generic, text encodings are just the most common use
case (hence the associated method API).

loewis · 2013-04-24T11:45:24Z

I don't see any point in merely bringing the codecs back, without any convenience API to use them. If I need to do

  import codecs
  result = codecs.getencoder("base64").encode(data)

I don't think people would actually prefer this over

  import base64
  result = base64.encodebytes(data)

It's (IMO) only the convenience method (.encode) that made people love these codecs.

ezio-melotti · 2013-04-24T12:20:47Z

IMHO it's also a documentation problem. Once people figure out that they can't use encode/decode anymore, it's not immediately clear what they should do instead. By reading the codecs docs it's not obvious that it can be done with codecs.getencoder("...").encode/decode, so people waste time finding a solution, get annoyed, and blame Python 3 because it removed a simple way to use these codecs without making clear what should be used instead.
FWIW I don't care about having to do an extra import, but indeed something simpler than codecs.getencoder("...").encode/decode would be nice.
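
For reference, a sketch of the three spellings being compared, assuming Python 3 and the full codec name `base64_codec`. One detail worth noting: `codecs.getencoder()` returns a plain callable rather than an object with an `.encode` attribute, and that callable returns an `(output, length_consumed)` tuple.

```python
import base64
import codecs

data = b"example"

# Registry lookup: the encoder is a function returning (output, consumed).
encoder = codecs.getencoder("base64_codec")
encoded, consumed = encoder(data)

# Convenience function from the codecs module: returns only the output.
via_function = codecs.encode(data, "base64_codec")

# Dedicated module function.
via_module = base64.encodebytes(data)

print(encoded, consumed, via_function == via_module)
```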

ncoghlan · 2013-04-24T13:43:13Z

It turns out MAL added the convenience API I'm looking for back in 2004, it just didn't get documented, and is hidden behind the "from _codecs import *" call in the codecs.py source code:

http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

So, all the way from 2.4 to 2.7 you can write:

  from codecs import encode
  result = encode(data, "base64")

It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:

>>> encode(b"example", "base64_codec")
b'ZXhhbXBsZQ==\n'
>>> decode(b"ZXhhbXBsZQ==\n", "base64_codec")
b'example'

Note that the convenience functions omit the extra checks that are part of the methods (although I admit the specific error here is rather quirky):

>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
    return (base64.decodebytes(input), len(input))
  File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not memoryview

I'm going to create some additional issues, so this one can return to just being about restoring the missing aliases.

malemburg · 2013-04-24T13:47:10Z

Just copying some details here about codecs.encode() and
codecs.decode() from python-dev:

"""
Just as reminder: we have the general purpose
encode()/decode() functions in the codecs module:

import codecs
r13 = codecs.encode('hello world', 'rot-13')

These interface directly to the codec interfaces, without
enforcing type restrictions. The codec defines the supported
input and output types.
"""

As Nick found, these aren't documented, which is a documentation
bug (I probably forgot to add documentation back then).
They have been in Python since 2004:

http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

These APIs are nice for general purpose codec work and
that's why I added them back in 2004.

For the codecs in question, it would still be nice to have
a more direct way to access them via methods on the types
that you typically use them with.

ezio-melotti · 2013-04-24T13:53:35Z

> It works in 3.x as well, you just need to add the "_codec" to the end
> to account for the missing aliases:

FTR this is because of ff1261a14573 (see bpo-10807).

ncoghlan · 2013-04-24T14:11:28Z

bpo-17827 covers adding documentation for codecs.encode and codecs.decode

bpo-17828 covers adding exception handling improvements for all encoding and decoding operations

ncoghlan · 2013-04-24T14:22:38Z

For me, the killer argument *against* a method based API is memoryview (and, equivalently, array.array). It should be possible to use those as inputs for the bytes->bytes codecs, and once you endorse codecs.encode and codecs.decode for that use case, it's hard to justify adding more exclusive methods to the already broad bytes and bytearray APIs (particularly given the problems with conveying direction of conversion unambiguously).

By contrast, I think "the codecs functions are generic while the str, bytes and bytearray methods are specific to text encodings" is something we can explain fairly easily, thus allowing the aliases mentioned in this issue to be restored for use with the codecs module functions. To avoid reintroducing the quirky errors described in bpo-10807, the encoding and decoding error messages should first be improved as discussed in bpo-17828.
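
A small sketch of the memoryview argument, assuming a Python 3 release where `zlib_codec` is available under its full name. The payload and variable names are invented for the example. Because `codecs.encode()` is a plain function, it can take any object the codec itself accepts, including a `memoryview` slice that has no `encode` method of its own:

```python
import codecs

payload = b"example " * 100

# Slicing a memoryview does not copy the underlying bytes.
view = memoryview(payload)[8:]

# The bytes->bytes codec receives the view directly.
compressed = codecs.encode(view, "zlib_codec")
roundtrip = codecs.decode(compressed, "zlib_codec")

print(len(view), len(compressed), roundtrip == payload[8:])
```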

ncoghlan · 2013-04-25T07:49:12Z

Also adding 17839 as a dependency, since part of the reason the base64 errors in particular are so cryptic is because the base64 module doesn't accept arbitrary PEP-3118 compliant objects as input.

ncoghlan · 2013-04-25T08:31:46Z

I also created bpo-17841 to cover the fact that the 3.3 documentation incorrectly states that these aliases still exist, even though they were removed before 3.2 was released.

ncoghlan · 2013-10-02T15:08:16Z

With bpo-17839 fixed, the error from invoking the base64 codec through the method API is now substantially more sensible:

>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoder did not return a str object (type=bytes)

ncoghlan · 2013-10-02T15:13:49Z

I just wanted to note something I realised in chatting to Armin Ronacher recently: in both Python 2.x and 3.x, the encode/decode method APIs are constrained by the text model, it's just that in 2.x that model was effectively basestring<->basestring, and thus still covered every codec in the standard library. This greatly limited the use cases for the codecs.encode/decode convenience functions, which is why the fact they were undocumented went unnoticed.

In 3.x, the changed text model meant the method API became limited to the Unicode codecs, making the function based API more important.

ncoghlan · 2013-11-04T13:21:33Z

For anyone interested, I have a patch up on bpo-17828 that produces the following output for various codec usage errors:

>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types
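
The messages above come from a patch under review, so the exact exception type and wording may differ between Python versions. The sketch below therefore catches both `LookupError` and `TypeError` instead of assuming one. It only demonstrates the split in behaviour: the method API rejects a bytes->bytes codec, while the function API accepts it.

```python
import codecs

encoded = codecs.encode(b"example", "base64_codec")

try:
    # Method API: restricted to text encodings (bytes -> str).
    encoded.decode("base64_codec")
except (LookupError, TypeError) as exc:
    method_error = exc
else:
    method_error = None

# Function API: works with whatever types the codec supports.
decoded = codecs.decode(encoded, "base64_codec")

print(type(method_error).__name__, method_error)
print(decoded)
```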

ncoghlan · 2013-11-06T12:41:41Z

Providing the 2to3 fixers in bpo-17823 now depends on this issue rather than the other way around (since not having to translate the names simplifies the fixer a bit).

ncoghlan · 2013-11-10T09:25:10Z

bpo-17823 is now closed, but not because it has been implemented. It turns out that the data driven nature of the incompatibility means it isn't really amenable to being detected and fixed automatically via 2to3.

bpo-19543 is a replacement proposal for the introduction of some additional codec related Py3k warnings in Python 2.7.7.

ncoghlan · 2013-11-17T07:41:27Z

Attached patch restores the aliases for the binary and text transforms, adds a test to ensure they exist and restores the "Aliases" column to the relevant tables in the documentation. It also updates the relevant section in the What's New document.

I also tweaked the wording in the docs to use the phrases "binary transform" and "text transform" for the affected tables and version added/changed notices.

Given the discussions on python-dev, the main condition that needs to be met before I commit this is for Victor to change his current -1 to a -0 or higher.

ncoghlan · 2013-11-19T14:25:41Z

Victor is still -1, so to Python 3.5 it goes.

ncoghlan · 2013-11-22T12:44:25Z

The 3.4 portion of bpo-19619 has been addressed, so removing it as a dependency again.

ncoghlan · 2013-11-23T00:46:51Z

With bpo-19619 resolved for Python 3.4 (the issue itself remains open awaiting a backport to 3.3), Victor has softened his stance on this topic and given the go ahead to restore the codec aliases: http://bugs.python.org/issue19619#msg203897

I'll be committing this shortly, after adjusting the patch to account for the bpo-19619 changes to the tests and What's New.

python-dev · 2013-11-23T01:14:38Z

New changeset 5e960d2c2156 by Nick Coghlan in branch 'default':
Close bpo-7475: Restore binary & text transform codecs
http://hg.python.org/cpython/rev/5e960d2c2156
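
A quick way to check the restored aliases, assuming Python 3.4 or later where this changeset is included. The mapping below is written out by hand for the example and is an assumption, not read from the `encodings` package; it covers only the short names this sketch relies on.

```python
import codecs

# Short alias -> full codec module name (hand-written for this check).
aliases = {
    "base64": "base64_codec",
    "bz2": "bz2_codec",
    "hex": "hex_codec",
    "zlib": "zlib_codec",
    "rot13": "rot_13",
}

# Each alias should resolve to the same codec as its full name.
resolved = {
    short: codecs.lookup(short).name == codecs.lookup(full).name
    for short, full in aliases.items()
}

hexed = codecs.encode(b"abc", "hex")
based = codecs.encode(b"abc", "base64")
print(resolved, hexed, based)
```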

ncoghlan · 2013-11-23T01:16:23Z

Note that I still plan to do a documentation-only PEP for 3.4, proposing some adjustments to the way the codecs module is documented, making binary and text transform defined terms in the glossary, etc.

I'll probably aim for beta 2 for that.

serhiy-storchaka · 2014-01-04T13:34:05Z

Docstrings for new codecs mention bytes.transform() and bytes.untransform() which are nonexistent.

python-dev · 2014-03-14T00:55:23Z

New changeset d7950e916f20 by R David Murray in branch '3.3':
bpo-7475: Remove references to '.transform' from transform codec docstrings.
http://hg.python.org/cpython/rev/d7950e916f20

New changeset 83d54ab5c696 by R David Murray in branch 'default':
Merge bpo-7475: Remove references to '.transform' from transform codec docstrings.
http://hg.python.org/cpython/rev/83d54ab5c696

florentx mannequin added the stdlib Python modules in the Lib dir label Dec 10, 2009

loewis mannequin closed this as completed Dec 10, 2009

loewis mannequin added the invalid label Dec 10, 2009

malemburg reopened this Dec 10, 2009

benjaminp closed this as completed Dec 11, 2009

florentx mannequin added docs Documentation in the Doc dir topic-2to3 and removed stdlib Python modules in the Lib dir labels Dec 11, 2009

florentx mannequin changed the title ~~codecs missing: base64 bz2 hex zlib ...~~ No hint about codecs removed : base64 bz2 hex zlib ... Dec 11, 2009

florentx mannequin assigned birkenfeld Dec 11, 2009

florentx mannequin reopened this Dec 11, 2009

florentx mannequin changed the title ~~No hint about codecs removed : base64 bz2 hex zlib ...~~ No hint about codecs removed: base64 bz2 hex zlib ... Dec 11, 2009

malemburg changed the title ~~No hint about codecs removed: base64 bz2 hex zlib ...~~ codecs missing: base64 bz2 hex zlib ... Dec 11, 2009

malemburg removed the invalid label Dec 11, 2009

florentx mannequin added the stdlib Python modules in the Lib dir label Dec 11, 2009

gpshead closed this as completed Apr 23, 2013

gpshead reopened this Apr 23, 2013

ncoghlan self-assigned this Nov 23, 2013

python-dev mannequin closed this as completed Nov 23, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
