b64decode should accept strings or bytes #49019

Closed
beazley mannequin opened this issue Dec 29, 2008 · 14 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@beazley
Mannequin

beazley mannequin commented Dec 29, 2008

BPO 4769
Nosy @vstinner, @merwok, @bitdancer
Files
  • b64-decode-str-bytes-typeerror.txt: running python -m base64 with various options and inputs
  • base64_str.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-10-27.12:16:16.685>
    created_at = <Date 2008-12-29.17:35:52.476>
    labels = ['type-feature', 'library']
    title = 'b64decode should accept strings or bytes'
    updated_at = <Date 2010-10-27.12:16:16.684>
    user = 'https://bugs.python.org/beazley'

    bugs.python.org fields:

    activity = <Date 2010-10-27.12:16:16.684>
    actor = 'eric.araujo'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-10-27.12:16:16.685>
    closer = 'eric.araujo'
    components = ['Library (Lib)']
    creation = <Date 2008-12-29.17:35:52.476>
    creator = 'beazley'
    dependencies = []
    files = ['17412', '17463']
    hgrepos = []
    issue_num = 4769
    keywords = ['patch']
    message_count = 14.0
    messages = ['78466', '78468', '78554', '78746', '78747', '106127', '106266', '106314', '106315', '106477', '106486', '106488', '106832', '115690']
    nosy_count = 7.0
    nosy_names = ['beazley', 'vstinner', 'kawai', 'eric.araujo', 'r.david.murray', 'brotchie', 'meatballhat']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue4769'
    versions = ['Python 3.2']

    @beazley
    Mannequin Author

    beazley mannequin commented Dec 29, 2008

    The whole point of base64 encoding is to safely encode binary data into
    text characters. Thus, the base64.b64decode() function should equally
    accept text strings or binary strings as input. For example, there is a
    reasonable expectation that something like this should work:

    >>> import base64
    >>> x = 'SGVsbG8='
    >>> base64.b64decode(x)
    b'Hello'
    >>>

    In Python 3, however, you get this exception:

    >>> base64.b64decode(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/tmp/lib/python3.0/base64.py", line 80, in b64decode
        raise TypeError("expected bytes, not %s" % s.__class__.__name__)
    TypeError: expected bytes, not str
    >>> 

    I realize that there are encoding issues with Unicode strings, but
    base64 encodes everything into a subset of 7-bit ASCII characters. If the
    input to b64decode is a str, just do an encode('ascii') operation on it
    and proceed. If that fails, it wasn't valid Base64 to begin with.
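A minimal sketch of the behaviour proposed here, written as a wrapper rather than a change to the module. The helper name `b64decode_any` is hypothetical and not part of base64:

```python
import base64


def b64decode_any(data):
    """Decode Base64 given either str or bytes; always return bytes.

    Hypothetical helper illustrating the proposal; not a stdlib function.
    """
    if isinstance(data, str):
        # Valid Base64 text is pure ASCII, so a strict encode either
        # succeeds or raises UnicodeEncodeError for non-ASCII input.
        data = data.encode('ascii')
    return base64.b64decode(data)


print(b64decode_any('SGVsbG8='))   # b'Hello'
print(b64decode_any(b'SGVsbG8='))  # b'Hello'
```

Whatever the input type, the result is bytes, which is the invariant the proposal relies on.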

    I can't think of any real negative impact to making this change as long
    as the result is still always bytes. The main benefit is just
    simplifying the decoding process for end-users.

    See bpo-4768.

    @beazley beazley mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Dec 29, 2008
    @beazley
    Mannequin Author

    beazley mannequin commented Dec 29, 2008

    Note: This problem applies to all of the other decoders/encoders in the
    base64 module too (b16, b32, etc.)

    @beazley
    Mannequin Author

    beazley mannequin commented Dec 30, 2008

    One more followup. The quopri module (which is highly related to
    base64 in that quopri and base64 are often used together within MIME)
    does accept both unicode and byte strings when decoding. For example,
    this works:

    >>> import quopri
    >>> quopri.decodestring('Hello World')
    b'Hello World'
    >>> quopri.decodestring(b'Hello World')
    b'Hello World'
    >>>

    However, the quopri module, like base64, uses byte strings almost
    everywhere else. For example, encoding a byte string with quopri still
    produces bytes (just like base64):

    >>> quopri.encodestring(b'Hello World')
    b'Hello World'
    >>>

    @vstinner
    Member

    vstinner commented Jan 2, 2009

    About quoted printable, there are two implementations:

    • binascii.a2b_qp() (Modules/binascii.c): C implementation, uses
      PyArg_ParseTupleAndKeywords(args, kwargs, "s*|i", ...) to parse the
      data
    • quopri.decode() (Lib/quopri.py): Python implementation
      => quopri.decodestring() uses io.BytesIO() to parse the data

    But quopri.decodestring() reuses binascii.a2b_qp() if the binascii
    module is present. So the behaviour of quopri.decodestring() depends on
    the presence of the binascii module...

    • binascii present: accept bytes or unicode
    • missing binascii: accept only bytes!

    binascii.a2b_qp() encodes a unicode string to a UTF-8 byte string.
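A small sketch of the two entry points described here, using bytes input only, since that is the case both code paths are stated to accept. The sample data is made up for illustration:

```python
import binascii
import quopri

# "=20" is the quoted-printable escape for a space character.
encoded = b'Hello=20World'

# C implementation, called directly.
via_binascii = binascii.a2b_qp(encoded)

# Module-level helper, which delegates to binascii when it is available.
via_quopri = quopri.decodestring(encoded)

print(via_binascii)  # b'Hello World'
print(via_quopri)    # b'Hello World'
```

How str input is treated depends on which path is taken and on the Python version, so the sketch deliberately avoids it.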

    @vstinner
    Member

    vstinner commented Jan 2, 2009

    If the input to b64decode is a str, just do an encode('ascii')
    operation on it and proceed. If that fails, it wasn't valid
    Base64 to begin with.

    On a unicode encode error, should we raise a UnicodeEncodeError or a
    binascii.Error?

    And there is also the problem of base64.b64decode()
    alternate "characters". Should we accept non-ASCII alternate
    characters?
    base64.b64decode('01a\xfeb\xffcd', altchars=b'\xfe\xff')

    For the example, the result depends on the chosen charset:

    • ASCII (strict): encoding the input text raises a UnicodeEncodeError
    • ISO-8859-1 (ignore): works as expected
    • UTF-8 (strict): unexpected result

    The only valid choice is ASCII because ISO-8859-1 or UTF-8 will
    reintroduce bytes/character mixture.
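The three charset outcomes can be checked with a short sketch. It reuses the example string and altchars from this comment; the variable names are illustrative only:

```python
import base64

text = '01a\xfeb\xffcd'   # str containing U+00FE and U+00FF
altchars = b'\xfe\xff'    # bytes standing in for '+' and '/'

# Strict ASCII refuses the non-ASCII characters outright.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ISO-8859-1 maps each code point below 256 to the single byte of the
# same value, so the altchars bytes line up with the encoded text.
latin1 = text.encode('iso-8859-1')

# UTF-8 expands each non-ASCII character to two bytes, so the encoded
# text no longer contains the altchars bytes in the expected positions.
utf8 = text.encode('utf-8')

# With the ISO-8859-1 bytes, altchars are translated to '+' and '/'.
decoded = base64.b64decode(latin1, altchars=altchars)

print(ascii_ok)  # False
print(latin1)    # b'01a\xfeb\xffcd'
print(utf8)      # b'01a\xc3\xbeb\xc3\xbfcd'
```

Only the ISO-8859-1 bytes decode as the caller intended, which is exactly the bytes/character mixture the comment warns about.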

    @meatballhat
    Mannequin

    meatballhat mannequin commented May 20, 2010

    This appears to still be an issue in py3k. I've attached the command and output when running python3 -m base64 with various options and inputs. If there's consensus on a solution, I'd be happy to take a crack at making a patch.

    @vstinner
    Member

    Attached base64_main.patch fixes errors described in b64-decode-str-bytes-typeerror.txt.

    @meatballhat
    Mannequin

    meatballhat mannequin commented May 22, 2010

    @Haypo - what patch? :)

    @vstinner
    Member

    This one!

    @vstinner
    Member

    I committed base64_main.patch (+ tests): 3.2 (r81533) and 3.1 (r81534).

    @vstinner
    Member

    Accepting unicode strings is not "pure", but I agree that it's convenient. Here is a patch:

    • base64.b(16|32|64)encode and base64.encodebytes accept unicode strings
    • unicode is first encoded to utf-8 to get a byte string
    • Update the docstrings and the documentation
    • Fix tests

    @vstinner
    Member

    I committed base64_main.patch (+ tests): 3.2 (r81533) and 3.1 (r81534).

    Hum, the test fails on Windows: fixed by r81535 (3.2) and r81536 (3.1).

    @bitdancer
    Member

    The patch appears to be fixing the wrong functions. It is decode that needs to accept unicode. Encode should still be restricted to bytes.

    @bitdancer
    Member

    After thinking about it, I'm inclined to reject this and say that quopri should be fixed to reject string input to decode. On python-dev Guido opined that a kind of polymorphism in the stdlib was good (bytes in --> bytes out, string in --> string out). string in --> bytes out and bytes in --> string out was considered bad, to my understanding (except for unicode encode/decode, of course).

    As you say, all one has to do is encode the string as ascii to get the bytes to pass in. It is better, I think, to maintain the clear distinction between bytes and strings in the programmer's mind. That's what Python3 is all about, really.
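A brief sketch of the caller-side approach described here, where the str/bytes conversion is explicit at each boundary and the base64 calls themselves stay bytes in, bytes out. The variable names are illustrative:

```python
import base64

text_payload = 'SGVsbG8='  # Base64 received as text, e.g. from a JSON field

# Explicitly cross the text -> bytes boundary before decoding.
raw = base64.b64decode(text_payload.encode('ascii'))

# Going the other way: encode to Base64 bytes, then explicitly decode
# those ASCII bytes when a str is needed.
round_trip = base64.b64encode(raw).decode('ascii')

print(raw)         # b'Hello'
print(round_trip)  # SGVsbG8=
```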

    As for "the whole point of base64 is to safely encode binary data into text characters", that is not true. The point is to encode binary data into a subset of *ascii*, which is *not* text, it is bytes. The fact that this is also useful for transferring binary data through unicode is pretty much an unintended consequence of the way unicode is designed.

    @bitdancer bitdancer added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Sep 6, 2010
    @merwok merwok closed this as completed Oct 27, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022