This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: codecs missing: base64 bz2 hex zlib hex_codec ...
Type: enhancement Stage: resolved
Components: Library (Lib), Unicode Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: 17828 17839 17844 Superseder:
Assigned To: ncoghlan Nosy List: barry, belopolsky, benjamin.peterson, cben, eric.araujo, ezio.melotti, flox, georg.brandl, gregory.p.smith, isoschiz, jcea, jwilk, lemburg, loewis, ncoghlan, pconnell, petri.lehtinen, python-dev, r.david.murray, serhiy.storchaka, ssbarnea, vstinner
Priority: normal Keywords: patch

Created on 2009-12-10 22:27 by flox, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name | Uploaded | Description
issue7475_warning.diff | flox, 2009-12-11 09:26 | Patch for documentation and warnings in 2.7
issue7475_missing_codecs_py3k.diff | flox, 2009-12-11 17:05 | Patch, apply to trunk
issue7475_restore_codec_aliases_in_py34.diff | ncoghlan, 2013-11-17 07:41 | Patch to restore the transform aliases.
Messages (95)
msg96218 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-10 22:27
AFAIK these codecs were not ported to Python 3.

1. I found no hint in documentation on this matter.

2. Is it possible to contribute some of them, or is there a good reason
to look elsewhere?
msg96223 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-12-10 23:15
These are not encodings, in that they don't convert characters to bytes.
It was a mistake that they were integrated into the codecs interfaces in
Python 2.x; this mistake is corrected in 3.x.
msg96226 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-10 23:25
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> These are not encodings, in that they don't convert characters to bytes.
> It was a mistake that they were integrated into the codecs interfaces in
> Python 2.x; this mistake is corrected in 3.x.

Martin, I beg your pardon, but these codecs indeed implement valid
encodings and the fact that these codecs were removed was a
mistake.

They should be readded to Python 3.x.

Note that just because a codec doesn't convert between bytes
and characters only, doesn't make it wrong in any way. The codec
architecture in Python is designed to support same type encodings
just as well as ones between bytes and characters.
msg96227 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-10 23:26
Reopening the ticket.
msg96228 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-12-10 23:28
It's not possible to add these codecs back. Bytes objects (correctly)
don't have an encode method, and string objects (correctly) don't have a
decode method. The codec architecture of Python 3.x just doesn't support
this kind of application; the codec architecture of 2.x was flawed.
msg96232 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-12-11 02:09
I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal
strictly with unicode -> bytes.
msg96236 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-11 08:21
«Everything you thought you knew about binary data and Unicode
 has changed.»

Reopening for the documentation part.

This "mistake" deserves some words in the documentation:
  docs.python.org/dev/py3k/whatsnew/3.0.html
      #text-vs-data-instead-of-unicode-vs-8-bit

And the conversion may be automated with 2to3, maybe.
msg96237 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-11 08:31
Is it possible to add "DeprecationWarning" for these codecs
when using "python -3" ?

>>> {}.has_key('a')
__main__:1: DeprecationWarning: dict.has_key() not supported in 3.x;
            use the in operator
False
>>> print `123`
<stdin>:1: SyntaxWarning: backquote not supported in 3.x; use repr()
123
>>> 'abc'.encode('base64')
'YWJj\n'
msg96240 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-11 09:46
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> It's not possible to add these codecs back. Bytes objects (correctly)
> don't have an encode method, and string objects (correctly) don't have a
> decode method. The codec architecture of Python 3.x just doesn't support
> this kind of application; the codec architecture of 2.x was flawed.

Of course it does support these kinds of codecs. The codec
architecture hasn't changed between 2.x and 3.x, just the way
a few methods work.

All we agreed to is that unicode.encode() will only return bytes,
while bytes.decode() will only return unicode. So the methods won't
support same type conversions, because Guido didn't want to
have methods that return different types based on the chosen
parameter (the codec name in this case).

However, you can still use codecs.encode() and codecs.decode()
to work with codecs that return different combinations of
types. I explicitly added that support back to 3.0.

You can't argue that just because two methods don't support
a certain type combination, the whole architecture doesn't
support this anymore.

Also note that codecs allow a much more far-reaching use
than just through the unicode and bytes methods: you can
use them as seamless wrappers for streams, subclass from
them, use their methods directly, etc. etc.

So your argument that just because the two methods don't
support these codecs anymore is just not good enough
to warrant their removal.
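
A minimal sketch of the point about codecs.encode() and codecs.decode(), assuming Python 3.4 or later (where the same-type codecs and their short aliases such as 'base64' and 'rot13' are available). The sample values are illustrative only:

```python
import codecs

# bytes -> bytes: the base64 codec appends a trailing newline, as in Python 2.
encoded = codecs.encode(b"abc", "base64")
decoded = codecs.decode(encoded, "base64")

# str -> str: rot13 maps text to text.
rotated = codecs.encode("Python", "rot13")

print(encoded)   # b'YWJj\n'
print(decoded)   # b'abc'
print(rotated)   # Clguba
```

The str.encode() and bytes.decode() methods are not involved here; only the module-level functions are used.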
msg96242 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-11 09:56
Benjamin Peterson wrote:
> 
> Benjamin Peterson <benjamin@python.org> added the comment:
> 
> I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal
> strictly with unicode -> bytes.

Sorry, Benjamin, but that's simply not true.

Codecs can work with arbitrary types, it's just that the helper
methods on unicode and bytes objects only support one combination
of types in Python 3.x.

codecs.encode()/.decode() provide access to all codecs, regardless
of their supported type combinations and of course, you can use
them directly via the codec registry, subclass from them, etc.
msg96243 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-11 10:22
Thinking about it, I am +1 to reimplement the codecs.

We could implement new methods to replace the old ones.
(similar to base64.encodebytes and base64.decodebytes)

>>> b'abc'.encodebytes('base64')
b'YWJj\n'
>>> b'abc'.encodebytes('zlib').encodebytes('base64')
b'eJxLTEoGAAJNASc=\n'
>>> b'UHl0aG9u'.decodebytes('base64').decode('utf-8')
'Python'
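
The encodebytes()/decodebytes() methods shown above are only a proposal and do not exist on bytes objects. As a rough sketch, assuming Python 3.4 or later, the same chaining can be written with the module-level codecs functions:

```python
import codecs

# Compress, then base64-armor, the way the proposed method chain would.
compressed = codecs.encode(b"abc", "zlib")
armored = codecs.encode(compressed, "base64")

# Reverse the chain: base64 first, then zlib.
restored = codecs.decode(codecs.decode(armored, "base64"), "zlib")

# bytes -> bytes via the codec, then bytes -> str via the regular method.
text = codecs.decode(b"UHl0aG9u", "base64").decode("utf-8")

print(restored)  # b'abc'
print(text)      # Python
```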
msg96251 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-12-11 12:54
2009/12/11 Marc-Andre Lemburg <report@bugs.python.org>:
> codecs.encode()/.decode() provide access to all codecs, regardless
> of their supported type combinations and of course, you can use
> them directly via the codec registry, subclass from them, etc.

Didn't you have a proposal for bytes.transform/untransform for
operations like this?
msg96253 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-11 13:13
Benjamin Peterson wrote:
> 
> Benjamin Peterson <benjamin@python.org> added the comment:
> 
> 2009/12/11 Marc-Andre Lemburg <report@bugs.python.org>:
>> codecs.encode()/.decode() provide access to all codecs, regardless
>> of their supported type combinations and of course, you can use
>> them directly via the codec registry, subclass from them, etc.
> 
> Didn't you have a proposal for bytes.transform/untransform for
> operations like this?

Yes. At the time it was postponed, since I brought it up late
in the 3.0 release process. Perhaps I should bring it up again.

Note that those methods are just convenient helpers to access
the codecs and as such only provide limited functionality.

The full machinery itself is accessible via the codecs module and
the code in the encodings package. Any decision to include a codec
or not needs to be based on whether it fits the framework in those
modules/packages, not the functionality we expose on unicode and
bytes objects.
msg96265 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-11 17:05
I've ported the codecs from Py2:
    base64, bytes_escape, bz2, hex, quopri, rot13, uu and zlib

It's not a big deal. Basically:
 - StringIO.StringIO --> io.BytesIO
 - 'string_escape' --> 'bytes_escape'

Will add documentation if we agree on the feature.
msg96277 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-12-11 23:09
> codecs.encode()/.decode() provide access to all codecs, regardless
> of their supported type combinations and of course, you can use
> them directly via the codec registry, subclass from them, etc.

I presume that the OP didn't talk about codecs.encode, but about
the methods on string objects. flox, can you clarify what precisely
it is that you miss?
msg96295 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-12 15:40
Martin,

actually, I was trying to convert some piece of code from python2 to
python3. And this statement was not converted by 2to3:
  "x.decode('base64').decode('zlib')"

So, I read the official documentation, and found no hint about the
removal of these codecs.
For my specific use case, I can use "zlib.decompress" and
"base64.decodebytes", but I find that the ".encode()" and ".decode()"
helpers were useful in Python 2.
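
A minimal sketch of that replacement, using only the zlib and base64 module functions named above. The helper names pack() and unpack() are made up for illustration:

```python
import base64
import zlib


def unpack(data: bytes) -> bytes:
    """Python 3 spelling of the Python 2 call data.decode('base64').decode('zlib')."""
    return zlib.decompress(base64.decodebytes(data))


def pack(data: bytes) -> bytes:
    """Inverse of unpack(): compress first, then base64-encode."""
    return base64.encodebytes(zlib.compress(data))


payload = b"hello world"
round_trip = unpack(pack(payload))
print(round_trip == payload)  # True
```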

I don't know all the background of the removal of these codecs. But I
try to contribute to Python, and help Python 3 become at least as
featureful, and useful, as Python 2.

So, after reading the above comments, I think we may end up with
following changes:
 * restore the "bytes-to-bytes" codecs in the "encodings" package
 * then create new helpers on bytes objects (either
   ".transform()/.untransform()" or ".encodebytes()/.decodebytes")
msg96296 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-12 15:44
> And this statement was not converted

s/this statement/this method call/
msg96301 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-12-12 19:25
> So, after reading the above comments, I think we may end up with
> following changes:
>  * restore the "bytes-to-bytes" codecs in the "encodings" package
>  * then create new helpers on bytes objects (either
>    ".transform()/.untransform()" or ".encodebytes()/.decodebytes")

I would still be opposed to such a change, and I think it needs a PEP.
If the codecs are restored, one half of them becomes available to
.encode/.decode methods, since the codec registry cannot tell which
ones implement real character encodings, and which ones are other
conversion methods. So adding them would be really confusing.

I also wonder why you are opposed to the import statement. My
recommendation is indeed that you use the official API for these
libraries (and indeed, there is an official API for each of them,
unlike real codecs, which don't have any other documented API).
msg96374 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-14 10:30
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> So, after reading the above comments, I think we may end up with
>> following changes:
>>  * restore the "bytes-to-bytes" codecs in the "encodings" package

+1

>>  * then create new helpers on bytes objects (either
>>    ".transform()/.untransform()" or ".encodebytes()/.decodebytes")

+1 - the names are still up for debate, IIRC.

> I would still be opposed to such a change, and I think it needs a PEP.

All this has already been discussed and the only reason it didn't
go in earlier was timing. No need for a PEP.

> If the codecs are restored, one half of them becomes available to
> .encode/.decode methods, since the codec registry cannot tell which
> ones implement real character encodings, and which ones are other
> conversion methods. So adding them would be really confusing.

Not at all. The helper methods check the return types and raise an
exception if the types don't match the expected types.

The codecs registry itself doesn't need to know about the possible
input/output types of codecs, since this information is not
required to match a name to an implementation.

What we could do, is add that information to the CodecInfo object
used for registering the codec. codecs.lookup() would then
return the information to the application.

E.g.

.encode_input_types = (str,)
.encode_output_types = (bytes,)
.decode_input_types = (bytes,)
.decode_output_types = (str,)

Codecs not supporting these CodecInfo attributes would simply
return None.

> I also wonder why you are opposed to the import statement. My
> recommendation is indeed that you use the official API for these
> libraries (and indeed, there is an official API for each of them,
> unlike real codecs, which don't have any other documented API).

That's not the point. The codec API provides a standardized API for
all these encodings. The hex, zlib, bz2, etc. codecs are just
adapters of the different pre-existing APIs to the codec API.
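
To make the CodecInfo idea concrete, here is a small sketch. The four attribute names come from the proposal in this message and are not part of the real codecs.CodecInfo API; the helper proposed_types() and the codec name "utf-8-tagged" are invented for the example:

```python
import codecs

# Attribute names come from the proposal above. They are NOT part of the
# real codecs.CodecInfo API, so existing codecs report None for all of them.
PROPOSED_ATTRS = (
    "encode_input_types",
    "encode_output_types",
    "decode_input_types",
    "decode_output_types",
)


def proposed_types(info):
    """Read the proposed type tags from a CodecInfo, defaulting to None."""
    return {attr: getattr(info, attr, None) for attr in PROPOSED_ATTRS}


utf8 = codecs.lookup("utf-8")
untagged = proposed_types(utf8)

# A codec author could attach the tags to a CodecInfo they construct.
tagged_info = codecs.CodecInfo(utf8.encode, utf8.decode, name="utf-8-tagged")
tagged_info.encode_input_types = (str,)
tagged_info.encode_output_types = (bytes,)
tagged_info.decode_input_types = (bytes,)
tagged_info.decode_output_types = (str,)
tagged = proposed_types(tagged_info)

print(untagged)
print(tagged)
```

The registry itself is untouched in this sketch; only the object returned to the caller carries the extra information.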
msg96632 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-12-19 18:09
I also seem to recall that adding .transform()/.untransform() was
already accepted at some point.
msg106669 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 13:45
I agree with Martin: codecs chose the wrong direction in Python 2, and it's fixed in Python 3. The codecs module is related to charsets (encodings); it should encode str to bytes and decode bytes (or any read buffer) to str.

Eg. rot13 "encodes" str to str.

"base64 bz2 hex zlib ...": use base64, bz2, binascii and zlib modules for that.

The documentation should be fixed (explain how to port code from Python2 to Python3).
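
As a hedged porting sketch using the modules named above: in Python 3 these functions take bytes rather than str, and base64.b64encode() omits the trailing newline that the Python 2 'base64' codec produced (base64.encodebytes() keeps it). The sample data is arbitrary:

```python
import base64
import bz2
import zlib

data = b"spam and eggs"

# Compression round trips through the module-level functions.
zlib_round_trip = zlib.decompress(zlib.compress(data))
bz2_round_trip = bz2.decompress(bz2.compress(data))

# Two base64 spellings: without and with the trailing newline.
without_newline = base64.b64encode(b"abc")
with_newline = base64.encodebytes(b"abc")

print(zlib_round_trip == data, bz2_round_trip == data)  # True True
print(without_newline)  # b'YWJj'
print(with_newline)     # b'YWJj\n'
```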

It may be possible to write some 2to3 fixers for the following examples:

"...".encode("base64") => base64.b64encode("...")
"...".encode("rot13") => do nothing (but display a warning?)
"...".encode("zlib") => zlib.compress("...")
"...".encode("hex") => base64.b16encode("...")
"...".encode("bz2") => bz2.compress("...")

"...".decode("base64") => base64.b64decode("...")
"...".decode("rot13") => do nothing (but display a warning?)
"...".decode("zlib") => zlib.decompress("...")
"...".decode("hex") => base64.b16decode("...")
"...".decode("bz2") => bz2.decompress("...")
msg106670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 13:48
Explanation of the change in Python 3 by Guido:

"We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."

http://www.artima.com/weblogs/viewpost.jsp?thread=208549

--

See also issue #8838.
msg106674 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 14:17
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> I agree with Martin: codecs chose the wrong direction in Python 2, and it's fixed in Python 3. The codecs module is related to charsets (encodings); it should encode str to bytes and decode bytes (or any read buffer) to str.

No, that's just not right: the codec system in Python does not
mandate the types used or accepted by the codecs.

The only change that was applied in Python3 was to make sure
that the str.encode() and bytes.decode() methods always return
the same type to assure type-safety.

Python2 does not apply that check, but instead provides a
direct interface to codecs.encode() and codecs.decode().

Please don't mix the helper methods on those objects with what
the codec system was designed for. The helper methods apply
a strategy that's more constrained than the codec system.

The addition of .transform() and .untransform() for same
type conversions was discussed in 2008, but didn't make it into 3.0
since I hadn't had time to add the methods:

http://mail.python.org/pipermail/python-3000/2008-August/014533.html
http://mail.python.org/pipermail/python-3000/2008-August/014534.html

The removed codecs don't rely on the helper methods in any way.
They are easily usable via codecs.encode() and codecs.decode()
even without .transform() and .untransform().

Esp. the hex codec is very handy and at least in our eGenix
code base in wide-spread use. Using a single well-defined
interface to such encodings is just much more user friendly
than having to research the different APIs for each of them.
msg107057 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-06-04 14:12
Related: bytes vs. str for base64 encoding in email, #8896
msg107794 - (view) Author: Sorin Sbarnea (ssbarnea) * Date: 2010-06-14 15:35
I would like to know what happened to hex_codec and what the Python 3 replacement for it is.

Also, it would be really helpful to see DeprecationWarnings for all these codecs in py2x and to include a note in the py3 changelist.

The official Python documentation at http://docs.python.org/library/codecs.html lists them as valid, without any sign of their being dropped or replaced.
msg109872 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-07-10 14:24
> I would like to know what happened to hex_codec and what the Python 3 replacement for it is.

If you had read this bug report, you'd know that the codec was removed
in Python 3. Use binascii.hexlify/binascii.unhexlify instead (as you
should in 2.x, also).
msg109876 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-07-10 15:24
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> I would like to know what happened to hex_codec and what the Python 3 replacement for it is.
> 
> If you had read this bug report, you'd know that the codec was removed
> in Python 3. Use binascii.hexlify/binascii.unhexlify instead (as you
> should in 2.x, also).

... or wait for Python 3.2 which will readd them :-)
msg109879 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-10 15:36
... but don't wait too long to add them!
msg109894 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-07-10 17:06
Georg Brandl wrote:
> 
> Georg Brandl <georg@python.org> added the comment:
> 
> ... but don't wait too long to add them!

I plan to work on that after EuroPython. Florent already provided
the patch for the codecs, so what's left is adding the .transform()/
.untransform() methods, and perhaps tweak the codec input/output
types in a couple of cases.
msg109904 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-07-10 18:14
I am confused by MvL’s reply. From the first paragraph of the documentation for binascii: “Normally, you will not use these functions directly but use wrapper modules like uu, base64, or binhex instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.”

Is the doc not accurate?

Also, can someone who is sure about the status of this report edit the type, stage, component and resolution? It would be helpful.
msg109905 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-07-10 18:35
> I am confused by MvL’s reply. From the first paragraph documentation
> for binascii: “Normally, you will not use these functions directly
> but use wrapper modules like uu, base64, or binhex instead. The
> binascii module contains low-level functions written in C for greater
> speed that are used by the higher-level modules.”
> 
> Is the doc not accurate?

It is correct. So use base64.b16encode/b16decode then.
It's just that I personally prefer hexlify/unhexlify, because I can
memorize the function name better.
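
A short sketch comparing the two spellings mentioned above. One practical difference is letter case: binascii.hexlify() produces lowercase digits, while base64.b16encode() produces uppercase, and base64.b16decode() only accepts lowercase input when casefold=True is passed. The sample bytes are arbitrary:

```python
import base64
import binascii

raw = b"\xde\xad\xbe\xef"

lower = binascii.hexlify(raw)   # lowercase hex digits
upper = base64.b16encode(raw)   # uppercase hex digits

from_lower = binascii.unhexlify(lower)
from_upper = base64.b16decode(upper)

# b16decode() needs casefold=True to accept lowercase input.
from_lower_b16 = base64.b16decode(lower, casefold=True)

print(lower)  # b'deadbeef'
print(upper)  # b'DEADBEEF'
```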
msg123090 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-12-02 18:08
Codecs brought back and (un)transform implemented in r86934.
msg123154 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-03 01:40
I am probably a bit late to this discussion, but why should these things be called "codecs" and why should they share the registry with the encodings?  It looks like the proper term would be "transformations" or "transforms".
msg123206 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-12-03 08:46
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> I am probably a bit late to this discussion, but why should these things be called "codecs" and why should they share the registry with the encodings?  It looks like the proper term would be "transformations" or "transforms".

.transform() is just the name of the method. The codecs are still just
that: codecs, i.e. objects that encode and decode data. The types they
support are defined by the codecs, not by the helper methods.

In Python3, the str and bytes methods .encode() and .decode() will
only support str->bytes->str conversions. The new
str and bytes .transform() method adds back str->str and
bytes->bytes.

The codec subsystem does not impose restrictions on the type combinations
a codec can support, and that's per design.
msg123435 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-12-05 19:04
As per 

http://mail.python.org/pipermail/python-dev/2010-December/106374.html

I think this checkin should be reverted, as it's breaking the language moratorium.
msg123436 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-12-05 19:12
I leave this to MAL, on whose behalf I finished this to be in time for beta.
msg123462 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-12-06 11:49
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> As per 
> 
> http://mail.python.org/pipermail/python-dev/2010-December/106374.html
> 
> I think this checkin should be reverted, as it's breaking the language moratorium.

I've asked Guido. We may have to revert the addition of the new
methods and then readd them for 3.3, but I don't really see
them as difficult to implement for the other Python implementations,
since they are just interfaces to the codec sub-system.

The readdition of the codecs and changes to support them in the
codec system do not fall under the moratorium, since they are
stdlib changes.
msg123693 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-09 18:43
With Georg's approval, I am reopening this issue until a decision is made on whether {str,bytes,bytearray}.{transform,untransform} methods should go into 3.2.

I am adding Guido to "nosy" because the decision may turn on the interpretation of his post. [1]

I also started a python-dev thread on this issue. [2]

[1] http://mail.python.org/pipermail/python-dev/2010-December/106374.html
[2] http://mail.python.org/pipermail/python-dev/2010-December/106617.html
msg125073 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-02 19:01
See issue #10807: 'base64' can be used with bytes.decode() (and str.encode()), but it raises a confusing exception (TypeError: expected bytes, not memoryview).
msg145246 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-09 09:18
So.  This was reverted before 3.2 was out, right?  What is the status for 3.3?
msg145656 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-10-17 00:53
What is the status of this issue?

rot13 codecs & friends were added back to Python 3.2 with {bytes,str}.(un)transform() methods: commit 7e4833764c88. Codecs were disabled because of surprising error messages before the release of Python 3.2 final: issue #10807, commit ff1261a14573. transform() and untransform() methods were also removed, I don't remember why/how exactly, maybe because new codecs were disabled.

So we have rot13 & friends in Python 3.2 and 3.3, but they cannot be used with the regular str.encode('rot13'), you have to write (for example):

>>> codecs.getdecoder('rot_13')('rot13')
('ebg13', 5)
>>> codecs.getencoder('rot_13')('ebg13')
('rot13', 5)

The major issue with {bytes,str}.(un)transform() is that we have only one registry for all codecs, and the registry was changed in Python 3 to ensure:
 * encode: str->bytes
 * decode: bytes->str

To implement str.transform(), we need another registry. Marc-Andre suggested (msg96374) to add tags to codecs:
"""
.encode_input_types = (str,)
.encode_output_types = (bytes,)
.decode_input_types = (bytes,)
.decode_output_types = (str,)
"""

I'm still opposed to str->str (rot13) and bytes->bytes (hex, gzip, ...) operations using the codecs API. Developers have to use the right module. If the API of these modules is too complex, we should add helpers to these modules, but not to builtin types. Builtin types have to be and stay simple and well defined.
msg145693 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-17 13:38
> transform() and untransform() methods were also removed, I don't remember why/how exactly,
I don’t remember either; maybe it was too late in the release process, or we lacked enough consensus.

> So we have rot13 & friends in Python 3.2 and 3.3, but they cannot be used with the regular
> str.encode('rot13'), you have to write (for example): codecs.getdecoder('rot_13')
Ah, great, I thought they were not available at all!

> The major issue with {bytes,str}.(un)transform() is that we have only one registry for all
> codecs, and the registry was changed in Python 3 [...] To implement str.transform(), we need
> another registry. Marc-Andre suggested (msg96374) to add tags to codecs
I’m confused: does the tags idea replace the idea of adding another registry?

> I'm still opposed to str->str (rot13) and bytes->bytes (hex, gzip, ...) operations using the
> codecs API. Developers have to use the right module.
Well, here I disagree with you and agree with MAL: str.encode and bytes.decode are strict, but the codec API in general is not restricted to str→bytes and bytes→str directions.  Using the zlib or base64 modules vs. the codecs is a matter of style; sometimes you think it looks hacky, sometimes you think it’s very handy.  And rot13 only exists as a codec!
msg145897 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-10-19 11:35
They were removed because adding new methods to builtin types violated the language moratorium.

Now that the language moratorium is over, the transform/untransform convenience APIs should be added again for 3.3. It's an approved change, the original timing was just wrong.
msg145900 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-10-19 11:58
Sorry, I meant to state my rationale for the unassignment - I'm assuming this issue is covered by MAL's recent decision to step away from Unicode and codec maintenance issues. If that's incorrect, MAL can reclaim the issue, otherwise unassigning leaves it open for whoever wants to move it forward.
msg145979 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-10-19 22:09
Some further comments after getting back up to speed with the actual status of this problem (i.e. that we had issues with the error checking and reporting in the original 3.2 commit).

1. I agree with the position that the codecs module itself is intended to be a type neutral codec registry. It encodes and decodes things, but shouldn't actually care about the types involved. If that is currently not the case in 3.x, it needs to be fixed.

This type neutrality was blurred in 2.x by the fact that it only implemented str->str translations, and even further obscured by the coupling to the .encode() and .decode() convenience APIs. The fact that the type neutrality of the registry itself is currently broken in 3.x is a *regression*, not an improvement. (The convenience APIs, on the other hand, are definitely *not* type neutral, and aren't intended to be)

2. To assist in producing nice error messages, and to allow restrictions to be enforced on type-specific convenience APIs, the CodecInfo objects should grow additional state as MAL suggests. To avoid redundancy (and inaccurate overspecification), my suggested colour for that particular bikeshed is:

Character encoding codec:
  .decoded_format = 'text'
  .encoded_format = 'binary'

Binary transform codec:
  .decoded_format = 'binary'
  .encoded_format = 'binary'

Text transform codec:
  .decoded_format = 'text'
  .encoded_format = 'text'

I suggest using the fuzzy format labels mainly due to the existence of the buffer API - most codec operations that consume binary data will accept anything that implements the buffer API, so referring specifically to 'bytes' in error messages would be inaccurate.

The convenience APIs can then emit errors like:

  'a'.encode('rot_13') ==>
  CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

  'a'.decode('rot_13') ==>
  CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

  'a'.transform('bz2') ==>
  CodecLookupError: text <-> text codec expected ('bz2' is binary <-> binary)

  'a'.transform('ascii') ==>
  CodecLookupError: text <-> text codec expected ('ascii' is text <-> binary)

  b'a'.transform('ascii') ==>
  CodecLookupError: binary <-> binary codec expected ('ascii' is text <-> binary)

For backwards compatibility with 3.2, codecs that do not specify their formats should be treated as character encoding codecs (i.e. decoded format is 'text', encoded format is 'binary')
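
For comparison, a sketch of how such a mismatch can be observed, assuming CPython 3.4 or later. CodecLookupError is only a name from the proposal above, so this example catches the built-in LookupError instead; the helper attempt() is invented for illustration:

```python
import codecs


def attempt(func, *args):
    """Return the call's result, or a short description of a LookupError."""
    try:
        return func(*args)
    except LookupError as exc:
        return "LookupError: {}".format(exc)


# The str/bytes convenience methods only accept text encodings...
str_result = attempt("a".encode, "rot_13")
bytes_result = attempt(b"a".decode, "rot_13")

# ...while the type-neutral module functions accept any registered codec.
module_result = attempt(codecs.encode, "a", "rot_13")

print(str_result)
print(bytes_result)
print(module_result)
```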
msg145980 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-10-19 22:12
Oops, typo in my second error example. The command should be:

  b'a'.decode('rot_13')

(Since str objects don't offer a decode() method any more)
msg145982 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-10-19 22:34
> *.encode('rot_13') ==> CodecLookupError

I like the idea of raising a lookup error on .encode/.decode if the codec is not a classic text codec (like ASCII or UTF-8).

> *.transform('ascii') ==> CodecLookupError

Same comment.

> str.transform('bz2') ==> CodecLookupError

A lookup error is surprising here. It could be a TypeError instead. The bz2 codec can be used with .transform, but not on str. So:

 - Lookup error if the codec cannot be used with encode/decode or transform/untransform
 - Type error if the value type is invalid

(CodecLookupError doesn't exist; do you propose to define a new exception that inherits from LookupError?)
msg145986 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-10-19 22:54
On Thu, Oct 20, 2011 at 8:34 AM, STINNER Victor <report@bugs.python.org> wrote:
>> str.transform('bz2') ==> CodecLookupError
>
> A lookup error is surprising here. It could be a TypeError instead. The bz2 codec can be used with .transform, but not on str. So:

No, it's the same concept as the other cases - we found a codec with
the requested name, but it's not the kind of codec we wanted in the
current context (i.e. str.transform). It may be that the problem is
the user has a str when they expected to have a bytearray or a bytes
object, but there's no way for the codec lookup process to know that.

>  - Lookup error if the codec cannot be used with encode/decode or transform/untransform
>  - Type error if the value type is invalid

There's no way for str.transform to tell the difference between "I
asked for the wrong codec" and "I expected to have a bytes object
here, not a str object". That's why I think we need to think in terms
of format checks rather than type checks.

> (CodecLookupError doesn't exist; do you propose to define a new exception that inherits from LookupError?)

Yeah, and I'd get that to handle the process of creating the nice
error messages. I think it may even make sense to build the filtering
options into codecs.lookup() itself:

  def lookup(encoding, decoded_format=None, encoded_format=None):
      info = _lookup(encoding)  # The existing codec lookup algorithm
      if ((decoded_format is not None and decoded_format != info.decoded_format) or
          (encoded_format is not None and encoded_format != info.encoded_format)):
          raise CodecLookupError(info, decoded_format, encoded_format)
      return info

Then the various encode, decode and transform methods can just pass
the appropriate arguments to 'codecs.lookup' without all having to
reimplement the format checking logic.
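
A self-contained sketch of the filtering lookup described above. It rests on assumptions: real CodecInfo objects carry no decoded_format/encoded_format attributes, so the tags live in a hypothetical _FORMATS table, and CodecLookupError is an assumed name rather than an existing exception. Untagged codecs default to text <-> binary, as suggested earlier in the thread.

```python
import codecs


class CodecLookupError(LookupError):
    """Hypothetical: the codec exists but does not have the requested formats."""


def _tag(name, decoded_format, encoded_format):
    # Key the table on the canonical name reported by the codec registry.
    return codecs.lookup(name).name, (decoded_format, encoded_format)


# Hypothetical format tags for a few non-text codecs.
_FORMATS = dict([
    _tag("rot_13", "text", "text"),
    _tag("hex_codec", "binary", "binary"),
    _tag("base64_codec", "binary", "binary"),
])


def lookup(encoding, decoded_format=None, encoded_format=None):
    info = codecs.lookup(encoding)  # the existing codec lookup algorithm
    # Untagged codecs are treated as character encodings (text <-> binary).
    actual = _FORMATS.get(info.name, ("text", "binary"))
    # A format left as None places no constraint on that side.
    wanted = (decoded_format or actual[0], encoded_format or actual[1])
    if wanted != actual:
        raise CodecLookupError(
            "%s <-> %s codec expected (%r is %s <-> %s)"
            % (wanted + (info.name,) + actual))
    return info
```

With this sketch, lookup('ascii', 'binary', 'binary') fails with a message in the style proposed earlier, while lookup('rot_13', 'text', 'text') returns the codec info.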
msg145991 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-10-19 23:10
> I think it may even make sense to build the filtering
> options into codecs.lookup() itself:
> 
>   def lookup(encoding, decoded_format=None, encoded_format=None):
>       info = _lookup(encoding) # The existing codec lookup algorithm
>       if ((decoded_format is not None and decoded_format != info.decoded_format) or
>           (encoded_format is not None and encoded_format != info.encoded_format)):
>           raise CodecLookupError(info, decoded_format, encoded_format)

lookup('rot13') should fail with a lookup error to keep backward 
compatibility. You can just change the default values to:

def lookup(encoding, decoded_format='text',  encoded_format='binary'): ...

If you patch lookup, what about the following functions?

- getencoder()
- getdecoder()
- getincrementalencoder()
- getincrementaldecoder()
- getreader()
- getwriter()
- iterencode()
msg145998 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-10-20 01:53
I'm fine with people needing to drop down to the lower level lookup() API if they want the filtering functionality in Python code. For most purposes, constraining the expected codec input and output formats really isn't a major issue - we just need it in the core in order to emit sane error messages when people misuse the convenience APIs based on things that used to work in 2.x (like 'a'.encode('base64')).

At the C level, I'd adjust _PyCodec_Lookup to accept the two extra arguments and add _PyCodec_EncodeText, _PyCodec_DecodeBinary, _PyCodec_TransformText and _PyCodec_TransformBinary to support the convenience APIs (rather than needing the individual objects to know about the details of the codec tagging mechanism).

Making new codecs available isn't a backwards compatibility problem - anyone relying on a particular key being absent from an extensible registry is clearly doing the wrong thing.

Regarding the particular formats, I'd suggest that hex, base64, quopri, uu, bz2 and zlib all be flagged as binary transforms, but rot13 be implemented as a text transform (Florent's patch has rot13 as another binary transform, but it makes more sense in the text domain - this should just be a matter of adjusting some of the data types in the implementation from bytes to str)
msg149439 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2011-12-14 10:51
Issue 13600 has been marked as a duplicate of this issue.

FTR, +1 to the idea of adding encoded_format and decoded_format attributes to CodecInfo, and also to adding {str,bytes}.{transform,untransform} back.
msg153304 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-02-13 21:17
What is the status of this issue? Is there still a fan of this issue motivated to write a PEP, a patch or something like that?
msg153317 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-14 03:25
It's still on my radar to come back and have a look at it. Feedback from the web folks doing Python 3 migrations is that it would have helped them in quite a few cases.

I want to get a couple of other open PEPs out of the way first, though (mainly 394 and 409)
msg164224 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-06-28 07:13
My current opinion is that this should be a PEP for 3.4, to make sure we flush out all the corner cases and other details correctly.
msg164226 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-06-28 07:26
For that matter, with the relevant codecs restored in 3.2, a transform() helper could probably be added to six (or a new project on PyPI) to prototype the approach.
msg164237 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-06-28 10:41
Setting as a release blocker for 3.4 - this is important.
msg165435 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-07-14 07:36
FWIW, I've been thinking further about this recently and I think implementing this feature as builtin methods is the wrong way to approach it.

Instead, I propose the addition of codecs.encode and codecs.decode functions that are type neutral (leaving any type checks entirely up to the codecs themselves), while the str.encode and bytes.decode methods retain their current strict text model related type restrictions.

Also, I now think my previous proposal for nice error messages was massively over-engineered. A much simpler approach is to just replace the status quo:

>>> "".encode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ncoghlan/devel/py3k/Lib/encodings/bz2_codec.py", line 17, in bz2_encode
    return (bz2.compress(input), len(input))
  File "/home/ncoghlan/devel/py3k/Lib/bz2.py", line 443, in compress
    return comp.compress(data) + comp.flush()
TypeError: 'str' does not support the buffer interface

with a better error with more context like:

UnicodeEncodeError: encoding='bz2_codec', errors='strict', codec_error="TypeError: 'str' does not support the buffer interface"

A similar change would be straightforward on the decoding side.

This would be a good use case for __cause__, but the codec error should still be included in the string representation.
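
A minimal sketch of that error-wrapping idea, under stated assumptions: encode_with_context and CodecContextError are hypothetical names (the message above suggests UnicodeEncodeError as the type), and hex_codec stands in for bz2_codec so the example does not depend on the bz2 module being available.

```python
import codecs


class CodecContextError(TypeError):
    """Hypothetical wrapper type that adds codec context to the message."""


def encode_with_context(obj, encoding, errors="strict"):
    encoder = codecs.getencoder(encoding)  # the codec's stateless encode function
    try:
        result, _consumed = encoder(obj, errors)
    except TypeError as exc:
        message = "encoding=%r, errors=%r, codec_error=%r" % (
            encoding, errors, "%s: %s" % (type(exc).__name__, exc))
        # "raise ... from" stores the original error in __cause__ while
        # the codec error text stays visible in the string representation.
        raise CodecContextError(message) from exc
    return result
```

Calling encode_with_context("", "hex_codec") then fails with an error that names the codec and the error mode, and the underlying TypeError remains reachable through __cause__.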
msg170414 - (view) Author: Uzume (uzume) Date: 2012-09-12 19:09
Many have chimed in on this topic but I thought I would lend my stance--for whatever it is worth.

I also believe most of these do not fit the concept of a character codec and some sort of transforms would likely be useful; however, most are sort of specialized (e.g., there should probably be a generalized compression library interface à la hashlib):

rot13: an (albeit simplistic) text cipher (str to str; though bytes to bytes could be argued, since many crypto functions do that)

zlib, bz2, etc. (lzma/xz should also be here): all bytes to bytes compression transforms

hex(adecimal), uu, base64, etc.: these more or less fit the description of a character codec as they map between bytes and str; however, I am not sure they are really the same thing, as these are basically doing a radix transformation to character symbols and the mapping is not strictly from bytes to a single character and back as a true character codec seems to imply. As evidenced by int(), format(), bytes.fromhex(), float.hex(), float.fromhex(), etc., these are more generalized conversions for serializing strings of bits into a textual representation (possibly for human consumption).

I personally feel any <type/class>.hex(), etc. method would be better off as a format() style formatter if they are to exist in such a space at all (i.e., not some more generalized conversion library--which we have but since 3.x could probably use to be updated and cleaned up).
msg187630 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2013-04-23 12:05
Another rant, because it matters to many of us:
http://lucumr.pocoo.org/2012/8/11/codec-confusion/

IMHO, the solution to restore str.decode and bytes.encode and raise TypeError for improper use is probably the most obvious for the average user.
msg187631 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-23 12:15
-1
I see encoding as the process to go from text to bytes, and decoding the process to go from bytes to text, so (ab)using these terms for other kinds of conversions is not an option IMHO.

Anyway I think someone should write a PEP and list the possible options and their pros and cons, and then a decision can be taken on python-dev.

FTR in Python 2 you can use decode for bytes->text, text->text, bytes->bytes, and even text->bytes:
u'DEADBEEF'.decode('hex')
'\xde\xad\xbe\xef'
msg187634 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-23 12:42
transform/untransform has approval-in-principle, adding encode/decode to the type that doesn't have them has been explicitly (and repeatedly :) rejected.

(I don't know about anybody else, but at this point I have written code that assumes that if an object has an 'encode' method, calling it will get me a bytes, and vice versa with 'decode'...an assumption I know is not "safe", but that I feel is useful duck typing in the contexts in which I used it.)

Nick wants a PEP, other people have said a PEP isn't necessary.  What is certainly necessary is for someone to pick up the ball and run with it.
msg187636 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2013-04-23 12:54
I am not a native English speaker, but it seems that the common usage of encode/decode is wider than the restricted definition applied for Python 3.3:

Some examples:

* RFC 4648 specifies "Base16, Base32, and Base64 Data Encodings"
  http://tools.ietf.org/html/rfc4648

* About rot13: "the same code can be used for encoding and decoding"
  http://www.catb.org/~esr/jargon/html/R/rot13.html

* The Huffman coding is "an entropy encoding algorithm" (used for DEFLATE)
  http://en.wikipedia.org/wiki/Huffman_coding

* RFC 2616 lists (zlib's) deflate or gzip as "encoding transformations"
  http://tools.ietf.org/html/rfc2616#section-3.5


However, I acknowledge that there are valid reasons to choose a different verb too.
msg187638 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-23 12:59
While not strictly necessary, a PEP would certainly be useful and would help in reaching a consensus.  The PEP should provide a summary of the available options (transform/untransform, reintroducing encode/decode for bytes/str, maybe others), their intended behavior (e.g. is type(x.transform()) == type(x) always true?), and possible issues (e.g.  Should some transformations be limited to str or bytes?  Should rot13 work with both transform and untransform?).
Even if we all agreed on a solution, such a document would still be useful IMHO.
msg187644 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-23 13:46
+1 for someone stepping up to write a PEP on this if they would like to see the situation improved in 3.4.

transform/untransform has at least one core developer with an explicit -1 on the proposal at the moment (me).

We *definitely* need a generic object->object convenience API in the codecs module (codecs.decode, codecs.encode). I even accept that those two functions could be worthy of elevation to be new builtin functions.

I'm *far* from convinced that awkwardly named methods that only handle str->object, bytes->object and bytearray->object are a good idea. Should memoryview gain transform/untransform methods as well?

transform/untransform as proposed aren't even inverse operations, since they don't swap the valid input and output types (that is, transform is str/bytes/bytearray to arbitrary objects, while untransform is *also* str/bytes/bytearray to arbitrary objects. Inverses can't have a domain/range mismatch like that).

Those names are also ambiguous about which one corresponds to "encoding" and which to "decoding". encode() and decode(), whether as functions in the codecs module or as builtins, have no such issue.

Personally, the more I think about it, the more I'm in favour of adding encode and decode as builtin functions for 3.4. If you want arbitrary object->object conversions, use the builtins, if you want strict str->bytes or bytes/bytearray->str use the methods. Python 3 has been around long enough now, and Python 3.2 and 3.3 are sufficiently well known that I think we can add the full power builtins without people getting confused.
msg187649 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-23 14:41
I was visualizing transform/untransform as being restricted to buffertype->bytes and stringtype->string, which at least for binascii-type transforms is all the modules support.  After all, you don't get to choose what type of object you get back from encode or decode.

A more generalized transformation (encode/decode) utility is also interesting, but how many non-string non-bytes transformations do we actually support?
msg187651 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-23 14:55
If transform is a method, how do you plan to accept arbitrary buffer supporting types as input?

This is why I mentioned memoryview: it doesn't provide decode(), but there's no good reason you should have to copy the data from the view before decoding it. Similarly, you shouldn't have to make an unaltered copy before creating a compressed (or decompressed) copy.

With codecs.encode and codecs.decode as functions, supporting memoryview as an input for bytes->str decoding, binary->bytes encoding (e.g. gzip compression) and binary->bytes decoding (e.g. gzip decompression) is trivial. Ditto for array.array and anything else that supports the buffer protocol.

With transform/untransform as methods? No such luck.

And once you're using functions rather than methods, it's best to define the API as object -> object, and leave any type constraints up to the individual codecs (with the error handling improved to provide more context and a more meaningful exception type, as I described earlier in the thread)
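
A short sketch of that point, assuming the zlib_codec and hex_codec binary transforms are available: codecs.encode and codecs.decode take any object supporting the buffer protocol, so a memoryview slice or an array.array can be passed straight through without first copying it into a bytes object.

```python
import array
import codecs

payload = b"spam and eggs " * 32
view = memoryview(payload)[5:]  # a window onto payload, no copy made

# binary -> bytes encoding (compression) directly from the view
compressed = codecs.encode(view, "zlib_codec")
# binary -> bytes decoding (decompression), again from a view
restored = codecs.decode(memoryview(compressed), "zlib_codec")

# array.array also supports the buffer protocol
samples = array.array("B", [222, 173, 190, 239])
hexed = codecs.encode(samples, "hex_codec")
```

Here restored equals payload[5:] and hexed is b'deadbeef'; the type checks are whatever the underlying zlib and binascii calls enforce.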
msg187652 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-23 15:02
I agree with you.  transform/untransform are parallel to encode/decode, and I wouldn't expect them to exist on any type that didn't support either encode or decode.  They are convenience methods, just as encode/decode are.

I am also probably not invested enough in it to write the PEP :)
msg187653 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-04-23 15:42
str.decode() and bytes.encode() are not coming back.

Any proposal had better take into account the API design rule that the *type* of a method's return value should not depend on the *value* of one of the arguments.  (The Python 2 design failed this test, and that's why we changed it.)

It is however fine to let the return type depend on one of the argument *types*.  So e.g. bytes.transform(enc) -> bytes and str.transform(enc) -> str are fine.  And so are e.g. transform(bytes, enc) -> bytes and transform(str, enc) -> str.  But a transform() taking bytes that can return either str or bytes depending on the encoding name would be a problem.

Personally I don't think transformations are so important or ubiquitous so as to deserve being made new bytes/str methods.  I'd be happy with a convenience function, for example transform(input, codecname), that would have to be imported from somewhere (maybe the codecs module).

My guess is that in almost all cases where people are demanding to say e.g.

  x = y.transform('rot13')

the codec name is a fixed literal, and they are really after minimizing the number of imports.  Personally, disregarding the extra import line, I think

  x = rot13.transform(y)

looks better though.  Such custom APIs also give the API designer (of the transformation) more freedom to take additional optional parameters affecting the transformation, offer a set of variants, or a richer API.
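
A rough sketch of such a convenience function, built on codecs.encode. The name transform and its check are assumptions for illustration, not an existing API: the result type is tied to the input type (str in, str out; bytes-like in, bytes out), so it never depends on the codec name alone.

```python
import codecs


def transform(data, codecname):
    """Hypothetical helper: apply a same-type codec to str or bytes-like data."""
    # The return type is decided by the argument *type*, not by codecname.
    expected = str if isinstance(data, str) else bytes
    result = codecs.encode(data, codecname)
    if not isinstance(result, expected):
        raise TypeError("%r codec returned %s for %s input" % (
            codecname, type(result).__name__, type(data).__name__))
    return result
```

For example, transform("hello", "rot_13") gives 'uryyb' and transform(b"example", "hex_codec") gives b'6578616d706c65', while transform("hello", "utf-8") is rejected because a text encoding would turn str into bytes.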
msg187660 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-04-23 17:38
FWIW, I'm not interested in seeing this added anymore.
msg187668 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-04-23 19:26
consensus here appears to be "bad idea... don't do this."
msg187670 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-23 21:46
No, transform/untransform as methods are a bad idea, but these *codecs*
should definitely come back.

The minimal change needed for that to be feasible is to give errors raised
during encoding and decoding more context information (at least the codec
name and error mode, and switching to the right kind of error).

MAL also stated on python-dev that codecs.encode and codecs.decode already
exist, so it should just be a matter of documenting them properly.
msg187673 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-04-23 22:19
okay, but i don't personally find any of these to be good ideas as "codecs" given they don't have anything to do with translating between bytes<->unicode.
msg187676 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-23 23:07
The codecs module is generic, text encodings are just the most common use
case (hence the associated method API).
msg187695 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-04-24 11:45
I don't see any point in merely bringing the codecs back, without any convenience API to use them. If I need to do

  import codecs
  result = codecs.getencoder("base64").encode(data)

I don't think people would actually prefer this over

  import base64
  result = base64.encodebytes(data)

It's (IMO) only the convenience method (.encode) that made people love these codecs.
msg187696 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-24 12:20
IMHO it's also a documentation problem.  Once people figure out that they can't use encode/decode anymore, it's not immediately clear what they should do instead.  By reading the codecs docs[0] it's not obvious that it can be done with codecs.getencoder("...").encode/decode, so people waste time finding a solution, get annoyed, and blame Python 3 because it removed a simple way to use these codecs without making clear what should be used instead.
FWIW I don't care about having to do an extra import, but indeed something simpler than codecs.getencoder("...").encode/decode would be nice.

[0]: http://docs.python.org/3/library/codecs.html
msg187698 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-24 13:43
It turns out MAL added the convenience API I'm looking for back in 2004, it just didn't get documented, and is hidden behind the "from _codecs import *" call in the codecs.py source code:

    http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

So, all the way from 2.4 to 2.7 you can write:

  from codecs import encode
  result = encode(data, "base64")

It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:

>>> encode(b"example", "base64_codec")
b'ZXhhbXBsZQ==\n'
>>> decode(b"ZXhhbXBsZQ==\n", "base64_codec")
b'example'

Note that the convenience functions omit the extra checks that are part of the methods (although I admit the specific error here is rather quirky):

>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
    return (base64.decodebytes(input), len(input))
  File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not memoryview

I'm going to create some additional issues, so this one can return to just being about restoring the missing aliases.
msg187701 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-04-24 13:47
Just copying some details here about codecs.encode() and
codecs.decode() from python-dev:

"""
Just as reminder: we have the general purpose
encode()/decode() functions in the codecs module:

import codecs
r13 = codecs.encode('hello world', 'rot-13')

These interface directly to the codec interfaces, without
enforcing type restrictions. The codec defines the supported
input and output types.
"""

As Nick found, these aren't documented, which is a documentation
bug (I probably forgot to add documentation back then).
They have been in Python since 2004:

http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

These APIs are nice for general purpose codec work and
that's why I added them back in 2004.

For the codecs in question, it would still be nice to have
a more direct way to access them via methods on the types
that you typically use them with.
msg187702 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-24 13:53
> It works in 3.x as well, you just need to add the "_codec" to the end
> to account for the missing aliases:

FTR this is because of ff1261a14573 (see #10807).
msg187705 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-24 14:11
Issue 17827 covers adding documentation for codecs.encode and codecs.decode

Issue 17828 covers adding exception handling improvements for all encoding and decoding operations
msg187707 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-24 14:22
For me, the killer argument *against* a method based API is memoryview (and, equivalently, array.array). It should be possible to use those as inputs for the bytes->bytes codecs, and once you endorse codecs.encode and codecs.decode for that use case, it's hard to justify adding more exclusive methods to the already broad bytes and bytearray APIs (particularly given the problems with conveying direction of conversion unambiguously).

By contrast, I think "the codecs functions are generic while the str, bytes and bytearray methods are specific to text encodings" is something we can explain fairly easily, thus allowing the aliases mentioned in this issue to be restored for use with the codecs module functions. To avoid reintroducing the quirky errors described in issue 10807, the encoding and decoding error messages should first be improved as discussed in issue 17828.
msg187764 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-25 07:49
Also adding 17839 as a dependency, since part of the reason the base64 errors in particular are so cryptic is because the base64 module doesn't accept arbitrary PEP 3118 compliant objects as input.
msg187770 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-04-25 08:31
I also created issue 17841 to cover the fact that the 3.3 documentation incorrectly states that these aliases still exist, even though they were removed before 3.2 was released.
msg198845 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-02 15:08
With issue 17839 fixed, the error from invoking the base64 codec through the method API is now substantially more sensible:

>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoder did not return a str object (type=bytes)
msg198846 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-02 15:13
I just wanted to note something I realised in chatting to Armin Ronacher recently: in both Python 2.x and 3.x, the encode/decode method APIs are constrained by the text model, it's just that in 2.x that model was effectively basestring<->basestring, and thus still covered every codec in the standard library. This greatly limited the use cases for the codecs.encode/decode convenience functions, which is why the fact they were undocumented went unnoticed.

In 3.x, the changed text model meant the method API became limited to the Unicode codecs, making the function based API more important.
msg202130 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-04 13:21
For anyone interested, I have a patch up on issue 17828 that produces the following output for various codec usage errors:

>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types

>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)

>>> "hello".encode("rot_13")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types
msg202264 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-06 12:41
Providing the 2to3 fixers in issue 17823 now depends on this issue rather than the other way around (since not having to translate the names simplifies the fixer a bit).
msg202515 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-10 09:25
Issue 17823 is now closed, but not because it has been implemented. It turns out that the data driven nature of the incompatibility means it isn't really amenable to being detected and fixed automatically via 2to3.

Issue 19543 is a replacement proposal for the introduction of some additional codec related Py3k warnings in Python 2.7.7.
msg203124 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-17 07:41
Attached patch restores the aliases for the binary and text transforms, adds a test to ensure they exist and restores the "Aliases" column to the relevant tables in the documentation. It also updates the relevant section in the What's New document.

I also tweaked the wording in the docs to use the phrases "binary transform" and "text transform" for the affected tables and version added/changed notices.

Given the discussions on python-dev, the main condition that needs to be met before I commit this is for Victor to change his current -1 to a -0 or higher.
msg203378 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-19 14:25
Victor is still -1, so to Python 3.5 it goes.
msg203751 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-22 12:44
The 3.4 portion of issue 19619 has been addressed, so removing it as a dependency again.
msg203936 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-23 00:46
With issue 19619 resolved for Python 3.4 (the issue itself remains open awaiting a backport to 3.3), Victor has softened his stance on this topic and given the go ahead to restore the codec aliases: http://bugs.python.org/issue19619#msg203897

I'll be committing this shortly, after adjusting the patch to account for the issue 19619 changes to the tests and What's New.
msg203942 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-11-23 01:14
New changeset 5e960d2c2156 by Nick Coghlan in branch 'default':
Close #7475: Restore binary & text transform codecs
http://hg.python.org/cpython/rev/5e960d2c2156
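
A brief sketch of the resulting behaviour, assuming an interpreter that includes this changeset and the issue 19619 blacklist (Python 3.4 or later): the short aliases work through the codecs module functions, while the method API keeps rejecting non-text codecs.

```python
import codecs

encoded = codecs.encode(b"example", "base64")  # alias for base64_codec
decoded = codecs.decode(encoded, "base64")
rotated = codecs.encode("hello", "rot13")      # alias for rot_13

try:
    # The bytes.decode method is restricted to text encodings.
    b"ZXhhbXBsZQ==\n".decode("base64")
except LookupError as exc:
    method_error = exc
else:
    method_error = None
```

Here encoded is b'ZXhhbXBsZQ==\n', decoded is b'example', rotated is 'uryyb', and method_error holds the LookupError raised by the method call.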
msg203944 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-23 01:16
Note that I still plan to do a documentation-only PEP for 3.4, proposing some adjustments to the way the codecs module is documented, making binary and text transform defined terms in the glossary, etc.

I'll probably aim for beta 2 for that.
msg207283 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-04 13:34
Docstrings for new codecs mention bytes.transform() and bytes.untransform() which are nonexistent.
msg213502 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-03-14 00:55
New changeset d7950e916f20 by R David Murray in branch '3.3':
#7475: Remove references to '.transform' from transform codec docstrings.
http://hg.python.org/cpython/rev/d7950e916f20

New changeset 83d54ab5c696 by R David Murray in branch 'default':
Merge #7475: Remove references to '.transform' from transform codec docstrings.
http://hg.python.org/cpython/rev/83d54ab5c696
History
Date User Action Args
2022-04-11 14:56:55adminsetgithub: 51724
2014-03-14 00:55:23python-devsetmessages: + msg213502
2014-01-04 13:34:04serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg207283
2014-01-02 12:42:36jwilksetnosy: + jwilk
2013-11-23 01:16:23ncoghlansetmessages: + msg203944
2013-11-23 01:14:37python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg203942

resolution: fixed
stage: resolved
2013-11-23 00:46:51ncoghlansetassignee: ncoghlan
messages: + msg203936
versions: + Python 3.4, - Python 3.5
2013-11-22 12:44:25ncoghlansetdependencies: - Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()
messages: + msg203751
2013-11-21 13:35:20ncoghlansetdependencies: + Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()
2013-11-19 14:25:41ncoghlansetmessages: + msg203378
versions: + Python 3.5, - Python 3.4
2013-11-17 07:41:29ncoghlansetfiles: + issue7475_restore_codec_aliases_in_py34.diff

messages: + msg203124
2013-11-10 09:25:10ncoghlansetmessages: + msg202515
2013-11-10 09:22:10ncoghlanunlinkissue17823 dependencies
2013-11-06 12:41:41ncoghlansetdependencies: - 2to3 fixers for missing codecs
messages: + msg202264
2013-11-06 12:40:42ncoghlanlinkissue17823 dependencies
2013-11-04 13:21:33ncoghlansetmessages: + msg202130
2013-10-02 15:18:13ncoghlansetversions: - Python 2.7, Python 3.3
2013-10-02 15:17:00ncoghlansetmessages: - msg198847
2013-10-02 15:16:36ncoghlansetmessages: + msg198847
versions: + Python 2.7, Python 3.3
2013-10-02 15:13:49ncoghlansetmessages: + msg198846
2013-10-02 15:08:16ncoghlansetmessages: + msg198845
2013-05-02 22:46:38isoschizsetnosy: + isoschiz
2013-04-25 16:34:15gvanrossumsetnosy: - gvanrossum
2013-04-25 11:43:30serhiy.storchakasetdependencies: + Add link to alternatives for bytes-to-bytes codecs
2013-04-25 08:31:46ncoghlansetmessages: + msg187770
2013-04-25 07:53:34serhiy.storchakasetdependencies: + 2to3 fixers for missing codecs
2013-04-25 07:49:12ncoghlansetdependencies: + base64 module should use memoryview
messages: + msg187764
2013-04-24 14:22:38ncoghlansetdependencies: + More informative error handling when encoding and decoding
messages: + msg187707
2013-04-24 14:11:28ncoghlansetmessages: + msg187705
2013-04-24 13:53:35ezio.melottisetmessages: + msg187702
2013-04-24 13:47:10lemburgsetmessages: + msg187701
2013-04-24 13:43:13ncoghlansetmessages: + msg187698
2013-04-24 12:20:46ezio.melottisetmessages: + msg187696
2013-04-24 11:45:23loewissetmessages: + msg187695
2013-04-23 23:07:32ncoghlansetmessages: + msg187676
2013-04-23 22:19:41gregory.p.smithsetstatus: closed -> open
resolution: wont fix -> (no value)
messages: + msg187673

stage: resolved -> (no value)
2013-04-23 21:46:42ncoghlansetmessages: + msg187670
2013-04-23 19:26:47gregory.p.smithsetstatus: open -> closed
priority: high -> normal


nosy: + gregory.p.smith
messages: + msg187668
resolution: wont fix
stage: resolved
2013-04-23 17:38:31georg.brandlsetmessages: + msg187660
2013-04-23 15:42:31gvanrossumsetmessages: + msg187653
2013-04-23 15:02:13r.david.murraysetmessages: + msg187652
2013-04-23 14:55:55ncoghlansetmessages: + msg187651
2013-04-23 14:41:42r.david.murraysetmessages: + msg187649
2013-04-23 13:46:22ncoghlansetmessages: + msg187644
2013-04-23 12:59:21ezio.melottisetmessages: + msg187638
2013-04-23 12:54:04floxsetmessages: + msg187636
2013-04-23 12:42:27r.david.murraysetnosy: + r.david.murray
messages: + msg187634
2013-04-23 12:15:06ezio.melottisetmessages: + msg187631
2013-04-23 12:05:43floxsetmessages: + msg187630
2013-04-22 18:39:06pconnellsetnosy: + pconnell
2013-04-01 18:06:30floxsetnosy: + flox
2013-04-01 18:06:19floxsetnosy: - flox
2012-09-12 19:11:51uzumesetnosy: - uzume
2012-09-12 19:09:57uzumesetnosy: + uzume
messages: + msg170414
2012-08-25 07:52:33ncoghlansetpriority: release blocker -> high
2012-07-14 10:51:15ezio.melottisetnosy: + ezio.melotti
2012-07-14 07:36:42ncoghlansetmessages: + msg165435
2012-06-28 10:41:30ncoghlansetpriority: normal -> release blocker
messages: + msg164237
stage: commit review -> (no value)
2012-06-28 07:26:31ncoghlansetmessages: + msg164226
2012-06-28 07:13:02ncoghlansetmessages: + msg164224
versions: + Python 3.4, - Python 3.3
2012-02-19 04:16:27jceasetnosy: + jcea
2012-02-14 03:25:58ncoghlansetmessages: + msg153317
2012-02-13 21:17:55vstinnersetmessages: + msg153304
2012-02-13 21:11:44barrysetnosy: + barry
2011-12-14 10:51:53petri.lehtinensetnosy: + petri.lehtinen
messages: + msg149439
2011-12-14 10:48:42petri.lehtinenlinkissue13600 superseder
2011-10-20 01:53:08ncoghlansetmessages: + msg145998
2011-10-19 23:10:52vstinnersetmessages: + msg145991
2011-10-19 22:54:41ncoghlansetmessages: + msg145986
2011-10-19 22:34:48vstinnersetmessages: + msg145982
2011-10-19 22:12:37ncoghlansetmessages: + msg145980
2011-10-19 22:09:43ncoghlansetmessages: + msg145979
2011-10-19 11:58:38ncoghlansetmessages: + msg145900
2011-10-19 11:35:38ncoghlansetassignee: lemburg -> (no value)
messages: + msg145897
nosy: + ncoghlan
2011-10-17 13:38:20eric.araujosetmessages: + msg145693
2011-10-17 00:53:29vstinnersetmessages: + msg145656
2011-10-09 09:18:13eric.araujosetmessages: + msg145246
components: - Documentation, 2to3 (2.x to 3.x conversion tool)
2011-09-22 15:36:27cbensetnosy: + cben
2011-07-19 13:13:46eric.araujosetversions: + Python 3.3, - Python 3.2
2011-01-02 19:01:49vstinnersetnosy: lemburg, gvanrossum, loewis, georg.brandl, belopolsky, vstinner, benjamin.peterson, eric.araujo, ssbarnea, flox
messages: + msg125073
2010-12-30 01:53:47belopolskylinkissue3232 dependencies
2010-12-09 18:43:33belopolskysetstatus: closed -> open
type: enhancement
components: + Unicode
nosy: + gvanrossum
messages: + msg123693
resolution: fixed -> (no value)
stage: commit review
2010-12-06 11:49:37lemburgsetmessages: + msg123462
2010-12-05 19:12:13georg.brandlsetassignee: lemburg
messages: + msg123436
2010-12-05 19:04:43loewissetmessages: + msg123435
2010-12-03 08:46:28lemburgsetmessages: + msg123206
2010-12-03 01:40:10belopolskysetnosy: + belopolsky
messages: + msg123154
2010-12-02 18:08:08georg.brandlsetstatus: open -> closed
resolution: fixed
messages: + msg123090
2010-07-31 17:44:57floxlinkissue3532 superseder
2010-07-10 18:35:06loewissetmessages: + msg109905
2010-07-10 18:14:32eric.araujosetmessages: + msg109904
2010-07-10 17:07:40lemburgsetversions: - Python 3.1, Python 2.7
2010-07-10 17:06:57lemburgsetmessages: + msg109894
2010-07-10 15:36:30georg.brandlsetmessages: + msg109879
2010-07-10 15:36:19georg.brandlsetmessages: - msg109878
2010-07-10 15:36:07georg.brandlsetmessages: + msg109878
2010-07-10 15:24:32lemburgsetmessages: + msg109876
2010-07-10 14:24:54loewissetmessages: + msg109872
2010-06-14 15:35:05ssbarneasetnosy: + ssbarnea
messages: + msg107794
title: codecs missing: base64 bz2 hex zlib ... -> codecs missing: base64 bz2 hex zlib hex_codec ...
2010-06-04 14:12:06eric.araujosetmessages: + msg107057
2010-05-28 14:25:47eric.araujosetnosy: + eric.araujo
2010-05-28 14:17:52lemburgsetmessages: + msg106674
2010-05-28 13:48:54vstinnersetmessages: + msg106670
2010-05-28 13:45:56vstinnersetmessages: + msg106669
2010-05-28 13:18:57vstinnersetnosy: + vstinner
2010-05-20 20:33:01skip.montanarosetnosy: - skip.montanaro
2009-12-19 18:09:41georg.brandlsetassignee: georg.brandl -> (no value)
2009-12-19 18:09:28georg.brandlsetmessages: + msg96632
2009-12-14 10:30:11lemburgsetmessages: + msg96374
2009-12-12 19:25:17loewissetmessages: + msg96301
2009-12-12 15:44:22floxsetmessages: + msg96296
2009-12-12 15:40:27floxsetmessages: + msg96295
2009-12-11 23:09:08loewissetmessages: + msg96277
2009-12-11 17:05:47floxsetfiles: + issue7475_missing_codecs_py3k.diff
messages: + msg96265
2009-12-11 13:13:50lemburgsetmessages: + msg96253
2009-12-11 12:54:39benjamin.petersonsetmessages: + msg96251
2009-12-11 10:22:23floxsetnosy: lemburg, loewis, skip.montanaro, georg.brandl, benjamin.peterson, flox
messages: + msg96243
components: + Library (Lib)
2009-12-11 09:56:57lemburgsetmessages: + msg96242
2009-12-11 09:47:23lemburgsetresolution: not a bug -> (no value)
2009-12-11 09:46:55lemburgsetmessages: + msg96240
title: No hint about codecs removed: base64 bz2 hex zlib ... -> codecs missing: base64 bz2 hex zlib ...
2009-12-11 09:26:31floxsetfiles: + issue7475_warning.diff
keywords: + patch
2009-12-11 08:33:13floxsettitle: No hint about codecs removed : base64 bz2 hex zlib ... -> No hint about codecs removed: base64 bz2 hex zlib ...
2009-12-11 08:31:52floxsetversions: + Python 2.7
2009-12-11 08:31:17floxsetmessages: + msg96237
2009-12-11 08:21:39floxsetstatus: closed -> open
assignee: georg.brandl
components: + Documentation, 2to3 (2.x to 3.x conversion tool), - Library (Lib)
title: codecs missing: base64 bz2 hex zlib ... -> No hint about codecs removed : base64 bz2 hex zlib ...
nosy: + georg.brandl
messages: + msg96236
2009-12-11 02:09:19benjamin.petersonsetstatus: open -> closed
nosy: + benjamin.peterson
messages: + msg96232
2009-12-10 23:28:52loewissetmessages: + msg96228
2009-12-10 23:26:10lemburgsetstatus: closed -> open
messages: + msg96227
2009-12-10 23:25:03lemburgsetnosy: + lemburg
messages: + msg96226
2009-12-10 23:15:12loewissetstatus: open -> closed
nosy: + loewis
messages: + msg96223
resolution: not a bug
2009-12-10 22:52:04skip.montanarosetnosy: + skip.montanaro
2009-12-10 22:27:38floxcreate