codecs missing: base64 bz2 hex zlib hex_codec ... #51724

Closed
florentx mannequin opened this issue Dec 10, 2009 · 95 comments
Assignees
Labels
stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement

Comments

@florentx
Mannequin

florentx mannequin commented Dec 10, 2009

BPO 7475
Nosy @malemburg, @loewis, @warsaw, @birkenfeld, @gpshead, @jcea, @cben, @ncoghlan, @abalkin, @vstinner, @benjaminp, @jwilk, @ezio-melotti, @merwok, @bitdancer, @ssbarnea, @florentx, @akheron, @serhiy-storchaka, @phmc
Dependencies
  • bpo-17828: More informative error handling when encoding and decoding
  • bpo-17839: base64 module should use memoryview
  • bpo-17844: Add link to alternatives for bytes-to-bytes codecs

Files
  • issue7475_warning.diff: Patch for documentation and warnings in 2.7
  • issue7475_missing_codecs_py3k.diff: Patch, apply to trunk
  • issue7475_restore_codec_aliases_in_py34.diff: Patch to restore the transform aliases.

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/ncoghlan'
    closed_at = <Date 2013-11-23.01:14:37.682>
    created_at = <Date 2009-12-10.22:27:38.811>
    labels = ['type-feature', 'library', 'expert-unicode']
    title = 'codecs missing: base64 bz2 hex zlib hex_codec ...'
    updated_at = <Date 2014-03-14.00:55:23.273>
    user = 'https://github.com/florentx'

    bugs.python.org fields:

    activity = <Date 2014-03-14.00:55:23.273>
    actor = 'python-dev'
    assignee = 'ncoghlan'
    closed = True
    closed_date = <Date 2013-11-23.01:14:37.682>
    closer = 'python-dev'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2009-12-10.22:27:38.811>
    creator = 'flox'
    dependencies = ['17828', '17839', '17844']
    files = ['15523', '15526', '32663']
    hgrepos = []
    issue_num = 7475
    keywords = ['patch']
    message_count = 95.0
    messages = ['96218', '96223', '96226', '96227', '96228', '96232', '96236', '96237', '96240', '96242', '96243', '96251', '96253', '96265', '96277', '96295', '96296', '96301', '96374', '96632', '106669', '106670', '106674', '107057', '107794', '109872', '109876', '109879', '109894', '109904', '109905', '123090', '123154', '123206', '123435', '123436', '123462', '123693', '125073', '145246', '145656', '145693', '145897', '145900', '145979', '145980', '145982', '145986', '145991', '145998', '149439', '153304', '153317', '164224', '164226', '164237', '165435', '170414', '187630', '187631', '187634', '187636', '187638', '187644', '187649', '187651', '187652', '187653', '187660', '187668', '187670', '187673', '187676', '187695', '187696', '187698', '187701', '187702', '187705', '187707', '187764', '187770', '198845', '198846', '202130', '202264', '202515', '203124', '203378', '203751', '203936', '203942', '203944', '207283', '213502']
    nosy_count = 22.0
    nosy_names = ['lemburg', 'loewis', 'barry', 'georg.brandl', 'gregory.p.smith', 'jcea', 'cben', 'ncoghlan', 'belopolsky', 'vstinner', 'benjamin.peterson', 'jwilk', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'ssbarnea', 'flox', 'python-dev', 'petri.lehtinen', 'serhiy.storchaka', 'pconnell', 'isoschiz']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue7475'
    versions = ['Python 3.4']

    @florentx
    Mannequin Author

    florentx mannequin commented Dec 10, 2009

    AFAIK these codecs were not ported to Python 3.

    1. I found no hint in documentation on this matter.

    2. Is it possible to contribute some of them, or is there a good reason
      to look elsewhere?

    @florentx florentx mannequin added the stdlib Python modules in the Lib dir label Dec 10, 2009
    @loewis
    Mannequin

    loewis mannequin commented Dec 10, 2009

    These are not encodings, in that they don't convert characters to bytes.
    It was a mistake that they were integrated into the codecs interfaces in
    Python 2.x; this mistake is corrected in 3.x.

    @loewis loewis mannequin closed this as completed Dec 10, 2009
    @loewis loewis mannequin added the invalid label Dec 10, 2009
    @malemburg
    Member

    Martin v. Löwis wrote:

    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    >
    > These are not encodings, in that they don't convert characters to bytes.
    > It was a mistake that they were integrated into the codecs interfaces in
    > Python 2.x; this mistake is corrected in 3.x.

    Martin, I beg your pardon, but these codecs indeed implement valid
    encodings and the fact that these codecs were removed was a
    mistake.

    They should be readded to Python 3.x.

    Note that just because a codec doesn't convert between bytes
    and characters only, doesn't make it wrong in any way. The codec
    architecture in Python is designed to support same type encodings
    just as well as ones between bytes and characters.

    @malemburg
    Member

    Reopening the ticket.

    @malemburg malemburg reopened this Dec 10, 2009
    @loewis
    Mannequin

    loewis mannequin commented Dec 10, 2009

    It's not possible to add these codecs back. Bytes objects (correctly)
    don't have an encode method, and string objects (correctly) don't have a
    decode method. The codec architecture of Python 3.x just doesn't support
    this kind of application; the codec architecture of 2.x was flawed.

    @benjaminp
    Contributor

    I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal
    strictly with unicode -> bytes.

    @florentx
    Mannequin Author

    florentx mannequin commented Dec 11, 2009

    «Everything you thought you knew about binary data and Unicode
    has changed.»

    Reopening for the documentation part.

    This "mistake" deserves some words in the documentation:
    docs.python.org/dev/py3k/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

    And the conversion may be automated with 2to3, maybe.

    @florentx florentx mannequin added docs Documentation in the Doc dir topic-2to3 and removed stdlib Python modules in the Lib dir labels Dec 11, 2009
    @florentx florentx mannequin changed the title codecs missing: base64 bz2 hex zlib ... No hint about codecs removed : base64 bz2 hex zlib ... Dec 11, 2009
    @florentx florentx mannequin assigned birkenfeld Dec 11, 2009
    @florentx florentx mannequin reopened this Dec 11, 2009
    @florentx
    Mannequin Author

    florentx mannequin commented Dec 11, 2009

    Is it possible to add "DeprecationWarning" for these codecs
    when using "python -3" ?

    >>> {}.has_key('a')
    __main__:1: DeprecationWarning: dict.has_key() not supported in 3.x;
                use the in operator
    False
    >>> print `123`
    <stdin>:1: SyntaxWarning: backquote not supported in 3.x; use repr()
    123
    >>> 'abc'.encode('base64')
    'YWJj\n'

    @florentx florentx mannequin changed the title No hint about codecs removed : base64 bz2 hex zlib ... No hint about codecs removed: base64 bz2 hex zlib ... Dec 11, 2009
    @malemburg
    Member

    Martin v. Löwis wrote:

    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    >
    > It's not possible to add these codecs back. Bytes objects (correctly)
    > don't have an encode method, and string objects (correctly) don't have a
    > decode method. The codec architecture of Python 3.x just doesn't support
    > this kind of application; the codec architecture of 2.x was flawed.

    Of course it does support these kinds of codecs. The codec
    architecture hasn't changed between 2.x and 3.x, just the way
    a few methods work.

    All we agreed to is that unicode.encode() will only return bytes,
    while bytes.decode() will only return unicode. So the methods won't
    support same type conversions, because Guido didn't want to
    have methods that return different types based on the chosen
    parameter (the codec name in this case).

    However, you can still use codecs.encode() and codecs.decode()
    to work with codecs that return different combinations of
    types. I explicitly added that support back to 3.0.

    You can't argue that just because two methods don't support
    a certain type combination, the whole architecture doesn't
    support this anymore.

    Also note that codecs allow a much more far-reaching use
    than just through the unicode and bytes methods: you can
    use them as seamless wrappers for streams, subclass from
    them, use their methods directly, etc. etc.

    So your argument that just because the two methods don't
    support these codecs anymore is just not good enough
    to warrant their removal.
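
A minimal sketch of the point about type combinations, assuming a current Python 3 interpreter (3.4 or later) where these codecs are present in the standard library. The variable names are illustrative only; the codec names are the long ones (`base64_codec`, `rot_13`) so the example does not rely on any alias.

```python
import codecs

# bytes -> bytes: the codec, not the calling function, decides the types.
packed = codecs.encode(b"abc", "base64_codec")
unpacked = codecs.decode(packed, "base64_codec")
print(packed)    # b'YWJj\n'
print(unpacked)  # b'abc'

# str -> str: rot_13 is a text-to-text transform.
scrambled = codecs.encode("hello world", "rot_13")
print(scrambled)  # uryyb jbeyq

# str -> bytes: ordinary text encodings go through the same functions.
utf8 = codecs.encode("héllo", "utf-8")
print(utf8)  # b'h\xc3\xa9llo'
```

The same two functions cover all three cases because the type check lives in the `str.encode()` and `bytes.decode()` methods, not in the codec machinery.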

    @malemburg malemburg changed the title No hint about codecs removed: base64 bz2 hex zlib ... codecs missing: base64 bz2 hex zlib ... Dec 11, 2009
    @malemburg malemburg removed the invalid label Dec 11, 2009
    @malemburg
    Member

    Benjamin Peterson wrote:

    > Benjamin Peterson <benjamin@python.org> added the comment:
    >
    > I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal
    > strictly with unicode -> bytes.

    Sorry, Benjamin, but that's simply not true.

    Codecs can work with arbitrary types, it's just that the helper
    methods on unicode and bytes objects only support one combination
    of types in Python 3.x.

    codecs.encode()/.decode() provide access to all codecs, regardless
    of their supported type combinations and of course, you can use
    them directly via the codec registry, subclass from them, etc.

    @florentx
    Mannequin Author

    florentx mannequin commented Dec 11, 2009

    Thinking about it, I am +1 to reimplement the codecs.

    We could implement new methods to replace the old one.
    (similar to base64.encodebytes and base64.decodebytes)

    >>> b'abc'.encodebytes('base64')
    b'YWJj\n'
    >>> b'abc'.encodebytes('zlib').encodebytes('base64')
    b'eJxLTEoGAAJNASc=\n'
    >>> b'UHl0aG9u'.decodebytes('base64').decode('utf-8')
    'Python'
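
The `encodebytes`/`decodebytes` methods above are a proposal and do not exist on `bytes`. As a rough sketch of how the same chains can be written with the generic functions in the `codecs` module, assuming Python 3.4 or later (variable names are illustrative):

```python
import codecs

# Compress, then base64-encode, using the generic codec functions.
compressed = codecs.encode(b"abc", "zlib_codec")
armoured = codecs.encode(compressed, "base64_codec")

# Reverse the chain: base64-decode, then decompress.
restored = codecs.decode(codecs.decode(armoured, "base64_codec"), "zlib_codec")
assert restored == b"abc"

# The third example from the comment: base64 bytes -> bytes -> text.
text = codecs.decode(b"UHl0aG9u", "base64_codec").decode("utf-8")
print(text)  # Python
```

The final `.decode("utf-8")` is the ordinary method, since that step is a real bytes-to-text decoding.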

    @florentx florentx mannequin added the stdlib Python modules in the Lib dir label Dec 11, 2009
    @benjaminp
    Contributor

    2009/12/11 Marc-Andre Lemburg <report@bugs.python.org>:

    > codecs.encode()/.decode() provide access to all codecs, regardless
    > of their supported type combinations and of course, you can use
    > them directly via the codec registry, subclass from them, etc.

    Didn't you have a proposal for bytes.transform/untransform for
    operations like this?

    @malemburg
    Member

    Benjamin Peterson wrote:

    > Benjamin Peterson <benjamin@python.org> added the comment:
    >
    > 2009/12/11 Marc-Andre Lemburg <report@bugs.python.org>:
    > > codecs.encode()/.decode() provide access to all codecs, regardless
    > > of their supported type combinations and of course, you can use
    > > them directly via the codec registry, subclass from them, etc.
    >
    > Didn't you have a proposal for bytes.transform/untransform for
    > operations like this?

    Yes. At the time it was postponed, since I brought it up late
    in the 3.0 release process. Perhaps I should bring it up again.

    Note that those methods are just convenient helpers to access
    the codecs and as such only provide limited functionality.

    The full machinery itself is accessible via the codecs module and
    the code in the encodings package. Any decision to include a codec
    or not needs to be based on whether it fits the framework in those
    modules/packages, not the functionality we expose on unicode and
    bytes objects.

    @florentx
    Mannequin Author

    florentx mannequin commented Dec 11, 2009

    I've ported the codecs from Py2:
    base64, bytes_escape, bz2, hex, quopri, rot13, uu and zlib

    It's not a big deal. Basically:

    • StringIO.StringIO --> io.BytesIO
    • 'string_escape' --> 'bytes_escape'

    Will add documentation if we agree on the feature.
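
For readers unfamiliar with what such a codec involves, here is a minimal sketch of a bytes-to-bytes codec registered through the codec registry. It is not the attached patch: the codec name `demo_base64` and the functions `demo_encode`, `demo_decode` and `_search` are hypothetical names chosen for illustration, and only the stateless encode/decode pair is provided (no stream or incremental classes).

```python
import base64
import codecs


def demo_encode(data, errors="strict"):
    # Stateless codec functions return (output, number of input items consumed).
    return base64.encodebytes(data), len(data)


def demo_decode(data, errors="strict"):
    return base64.decodebytes(data), len(data)


def _search(name):
    # Hypothetical search function; returns None for names it does not know.
    if name == "demo_base64":
        return codecs.CodecInfo(
            encode=demo_encode, decode=demo_decode, name="demo_base64"
        )
    return None


codecs.register(_search)

encoded = codecs.encode(b"abc", "demo_base64")
decoded = codecs.decode(encoded, "demo_base64")
print(encoded)  # b'YWJj\n'
print(decoded)  # b'abc'
```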

    @loewis
    Mannequin

    loewis mannequin commented Dec 11, 2009

    > codecs.encode()/.decode() provide access to all codecs, regardless
    > of their supported type combinations and of course, you can use
    > them directly via the codec registry, subclass from them, etc.

    I presume that the OP didn't talk about codecs.encode, but about
    the methods on string objects. flox, can you clarify what precisely
    it is that you miss?

    @gpshead gpshead closed this as completed Apr 23, 2013
    @ncoghlan
    Contributor

    No, transform/untransform as methods are a bad idea, but these *codecs*
    should definitely come back.

    The minimal change needed for that to be feasible is to give errors raised
    during encoding and decoding more context information (at least the codec
    name and error mode, and switching to the right kind of error).

    MAL also stated on python-dev that codecs.encode and codecs.decode already
    exist, so it should just be a matter of documenting them properly.

    @gpshead
    Member

    gpshead commented Apr 23, 2013

    okay, but i don't personally find any of these to be good ideas as "codecs" given they don't have anything to do with translating between bytes<->unicode.

    @gpshead gpshead reopened this Apr 23, 2013
    @ncoghlan
    Contributor

    The codecs module is generic, text encodings are just the most common use
    case (hence the associated method API).

    @loewis
    Mannequin

    loewis mannequin commented Apr 24, 2013

    I don't see any point in merely bringing the codecs back, without any convenience API to use them. If I need to do

      import codecs
      result = codecs.getencoder("base64").encode(data)

    I don't think people would actually prefer this over

      import base64
      result = base64.encodebytes(data)

    It's (IMO) only the convenience method (.encode) that made people love these codecs.
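
For comparison, a runnable sketch of the two spellings, assuming Python 3.4 or later. One detail differs from the shorthand in the comment: `codecs.getencoder()` returns the codec's encode function itself, which is called directly and returns an `(output, length consumed)` tuple. The long codec name `base64_codec` is used so the example does not depend on the alias.

```python
import base64
import codecs

data = b"example"

# codecs.getencoder() returns the codec's stateless encode function,
# which is called directly and yields an (output, length consumed) tuple.
encoder = codecs.getencoder("base64_codec")
via_registry, consumed = encoder(data)

# The dedicated module does the same job in one call.
via_module = base64.encodebytes(data)

assert via_registry == via_module == b"ZXhhbXBsZQ==\n"
assert consumed == len(data)
```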

    @ezio-melotti
    Member

    IMHO it's also a documentation problem. Once people figure out that they can't use encode/decode anymore, it's not immediately clear what they should do instead. By reading the codecs docs it's not obvious that it can be done with codecs.getencoder("...").encode/decode, so people waste time finding a solution, get annoyed, and blame Python 3 because it removed a simple way to use these codecs without making clear what should be used instead.
    FWIW I don't care about having to do an extra import, but indeed something simpler than codecs.getencoder("...").encode/decode would be nice.

    @ncoghlan
    Contributor

    It turns out MAL added the convenience API I'm looking for back in 2004, it just didn't get documented, and is hidden behind the "from _codecs import *" call in the codecs.py source code:

    http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598
    

    So, all the way from 2.4 to 2.7 you can write:

      from codecs import encode
      result = encode(data, "base64")

    It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:

    >>> encode(b"example", "base64_codec")
    b'ZXhhbXBsZQ==\n'
    >>> decode(b"ZXhhbXBsZQ==\n", "base64_codec")
    b'example'

    Note that the convenience functions omit the extra checks that are part of the methods (although I admit the specific error here is rather quirky):

    >>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
        return (base64.decodebytes(input), len(input))
      File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
        raise TypeError("expected bytes, not %s" % s.__class__.__name__)
    TypeError: expected bytes, not memoryview

    I'm going to create some additional issues, so this one can return to just being about restoring the missing aliases.

    @malemburg
    Member

    Just copying some details here about codecs.encode() and
    codecs.decode() from python-dev:

    """
    Just as reminder: we have the general purpose
    encode()/decode() functions in the codecs module:

    import codecs
    r13 = codecs.encode('hello world', 'rot-13')

    These interface directly to the codec interfaces, without
    enforcing type restrictions. The codec defines the supported
    input and output types.
    """

    As Nick found, these aren't documented, which is a documentation
    bug (I probably forgot to add documentation back then).
    They have been in Python since 2004:

    http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

    These APIs are nice for general purpose codec work and
    that's why I added them back in 2004.

    For the codecs in question, it would still be nice to have
    a more direct way to access them via methods on the types
    that you typically use them with.

    @ezio-melotti
    Member

    > It works in 3.x as well, you just need to add the "_codec" to the end
    > to account for the missing aliases:

    FTR this is because of ff1261a14573 (see bpo-10807).

    @ncoghlan
    Contributor

    bpo-17827 covers adding documentation for codecs.encode and codecs.decode

    bpo-17828 covers adding exception handling improvements for all encoding and decoding operations

    @ncoghlan
    Contributor

    For me, the killer argument *against* a method based API is memoryview (and, equivalently, array.array). It should be possible to use those as inputs for the bytes->bytes codecs, and once you endorse codecs.encode and codecs.decode for that use case, it's hard to justify adding more exclusive methods to the already broad bytes and bytearray APIs (particularly given the problems with conveying direction of conversion unambiguously).

    By contrast, I think "the codecs functions are generic while the str, bytes and bytearray methods are specific to text encodings" is something we can explain fairly easily, thus allowing the aliases mentioned in this issue to be restored for use with the codecs module functions. To avoid reintroducing the quirky errors described in bpo-10807, the encoding and decoding error messages should first be improved as discussed in bpo-17828.
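
A small sketch of the memoryview argument, assuming Python 3.4 or later, where the base64 module accepts bytes-like objects (the change tracked in bpo-17839). Variable names are illustrative. The function API takes any of these inputs, whereas a method would have to be added to each type separately.

```python
import array
import codecs

payload = b"example"
view = memoryview(payload)
arr = array.array("B", payload)

# The same function call works for bytes, memoryview and array.array inputs.
from_bytes = codecs.encode(payload, "base64_codec")
from_view = codecs.encode(view, "base64_codec")
from_array = codecs.encode(arr, "base64_codec")

print(from_bytes)  # b'ZXhhbXBsZQ==\n'
assert from_bytes == from_view == from_array
```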

    @ncoghlan
    Contributor

    Also adding 17839 as a dependency, since part of the reason the base64 errors in particular are so cryptic is because the base64 module doesn't accept arbitrary PEP-3118 compliant objects as input.

    @ncoghlan
    Contributor

    I also created bpo-17841 to cover the fact that the 3.3 documentation incorrectly states that these aliases still exist, even though they were removed before 3.2 was released.

    @ncoghlan
    Contributor

    ncoghlan commented Oct 2, 2013

    With bpo-17839 fixed, the error from invoking the base64 codec through the method API is now substantially more sensible:

    >>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: decoder did not return a str object (type=bytes)

    @ncoghlan
    Contributor

    ncoghlan commented Oct 2, 2013

    I just wanted to note something I realised in chatting to Armin Ronacher recently: in both Python 2.x and 3.x, the encode/decode method APIs are constrained by the text model, it's just that in 2.x that model was effectively basestring<->basestring, and thus still covered every codec in the standard library. This greatly limited the use cases for the codecs.encode/decode convenience functions, which is why the fact they were undocumented went unnoticed.

    In 3.x, the changed text model meant the method API became limited to the Unicode codecs, making the function based API more important.
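
The split between the two APIs can be sketched as follows, assuming any Python 3 release. The exact exception type and message raised by the method call have changed between releases, so the sketch catches both `LookupError` and `TypeError` and does not depend on the wording.

```python
import codecs

blob = b"ZXhhbXBsZQ==\n"

# Method API: restricted to text encodings, so a bytes-to-bytes codec is rejected.
try:
    blob.decode("base64_codec")
except (LookupError, TypeError) as exc:
    method_error = exc
else:
    method_error = None

# Function API: no type restriction, the codec defines input and output types.
function_result = codecs.decode(blob, "base64_codec")

print(type(method_error).__name__)
print(function_result)  # b'example'
```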

    @ncoghlan
    Contributor

    ncoghlan commented Nov 4, 2013

    For anyone interested, I have a patch up on bpo-17828 that produces the following output for various codec usage errors:

    >>> import codecs
    >>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types
    
    >>> "hello".encode("bz2_codec")
    TypeError: 'str' does not support the buffer interface
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)
    
    >>> "hello".encode("rot_13")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types

    @ncoghlan
    Contributor

    ncoghlan commented Nov 6, 2013

    Providing the 2to3 fixers in bpo-17823 now depends on this issue rather than the other way around (since not having to translate the names simplifies the fixer a bit).

    @ncoghlan
    Contributor

    bpo-17823 is now closed, but not because it has been implemented. It turns out that the data driven nature of the incompatibility means it isn't really amenable to being detected and fixed automatically via 2to3.

    bpo-19543 is a replacement proposal for the introduction of some additional codec related Py3k warnings in Python 2.7.7.

    @ncoghlan
    Contributor

    Attached patch restores the aliases for the binary and text transforms, adds a test to ensure they exist and restores the "Aliases" column to the relevant tables in the documentation. It also updates the relevant section in the What's New document.

    I also tweaked the wording in the docs to use the phrases "binary transform" and "text transform" for the affected tables and version added/changed notices.

    Given the discussions on python-dev, the main condition that needs to be met before I commit this is for Victor to change his current -1 to a -0 or higher.

    @ncoghlan
    Contributor

    Victor is still -1, so to Python 3.5 it goes.

    @ncoghlan
    Contributor

    The 3.4 portion of bpo-19619 has been addressed, so removing it as a dependency again.

    @ncoghlan
    Contributor

    With bpo-19619 resolved for Python 3.4 (the issue itself remains open awaiting a backport to 3.3), Victor has softened his stance on this topic and given the go ahead to restore the codec aliases: http://bugs.python.org/issue19619#msg203897

    I'll be committing this shortly, after adjusting the patch to account for the bpo-19619 changes to the tests and What's New.

    @ncoghlan ncoghlan self-assigned this Nov 23, 2013
    @python-dev
    Mannequin

    python-dev mannequin commented Nov 23, 2013

    New changeset 5e960d2c2156 by Nick Coghlan in branch 'default':
    Close bpo-7475: Restore binary & text transform codecs
    http://hg.python.org/cpython/rev/5e960d2c2156
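
A quick check of the restored short names, assuming Python 3.4 or later (the release this changeset targeted). Only the `base64`, `hex` and `rot13` aliases are exercised here; variable names are illustrative.

```python
import codecs

# The short alias and the long codec name resolve to the same codec.
short = codecs.encode(b"example", "base64")
full = codecs.encode(b"example", "base64_codec")
assert short == full

hexed = codecs.encode(b"example", "hex")
rotated = codecs.encode("hello", "rot13")

print(short)    # b'ZXhhbXBsZQ==\n'
print(hexed)    # b'6578616d706c65'
print(rotated)  # uryyb
```

The aliases apply to the `codecs` module functions; the `str.encode()` and `bytes.decode()` methods remain limited to text encodings.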

    @python-dev python-dev mannequin closed this as completed Nov 23, 2013
    @ncoghlan
    Contributor

    Note that I still plan to do a documentation-only PEP for 3.4, proposing some adjustments to the way the codecs module is documented, making binary and text transform defined terms in the glossary, etc.

    I'll probably aim for beta 2 for that.

    @serhiy-storchaka
    Member

    Docstrings for new codecs mention bytes.transform() and bytes.untransform() which are nonexistent.

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 14, 2014

    New changeset d7950e916f20 by R David Murray in branch '3.3':
    bpo-7475: Remove references to '.transform' from transform codec docstrings.
    http://hg.python.org/cpython/rev/d7950e916f20

    New changeset 83d54ab5c696 by R David Murray in branch 'default':
    Merge bpo-7475: Remove references to '.transform' from transform codec docstrings.
    http://hg.python.org/cpython/rev/83d54ab5c696

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022