classification
Title: base64.decode: linebreaks are not ignored
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.7, Python 3.6

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: gregory.p.smith, martin.panter, r.david.murray
Priority: normal
Keywords:

Created on 2018-01-03 23:35 by gregory.p.smith, last changed 2022-04-11 14:58 by admin.

Messages (3)
msg309449 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2018-01-03 23:35
I've tried reading the various RFCs on Base64 encoding, but I couldn't reach a firm conclusion from them.  Still, base64.decodebytes() and base64.decode() are inconsistent in how they handle line breaks that were used to wrap the encoded text.  Below is an example of what I'm talking about:

>>> import base64
>>> foo = base64.encodebytes(b'123456789')
>>> foo
b'MTIzNDU2Nzg5\n'
>>> foo = b'MTIzND\n' + b'U2Nzg5\n'
>>> foo
b'MTIzND\nU2Nzg5\n'
>>> base64.decodebytes(foo)
b'123456789'
>>> from io import BytesIO
>>> bytes_in = BytesIO(foo)
>>> bytes_out = BytesIO()
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/somewhere/lib/python3.6/base64.py", line 512, in decode
    s = binascii.a2b_base64(line)
binascii.Error: Incorrect padding
>>> bytes_in = BytesIO(base64.encodebytes(b'123456789'))
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
>>> bytes_out.getvalue()
b'123456789'

Obviously, I'd expect decodebytes() and decode() to either both accept or both reject the same input.

Thanks.

Oleg

via Oleg Sivokon on python-dev (who was having trouble getting bugs.python.org account creation to work)
msg309451 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-01-04 00:52
This reduces to the following:

>>> from binascii import a2b_base64 as f
>>> f(b'MTIzND\nU2Nzg5\n')
b'123456789'
>>> f(b'MTIzND\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Incorrect padding

That is, decode does its decoding line by line, whereas decodebytes passes the entire object to a2b_base64 as a single entity.  Apparently a2b_base64 looks at the padding for the entirety of what it is given, which I believe is in accordance with the RFC.  This means that decode is fundamentally broken per the RFC, and there is no obvious way to fix it without adding an incremental decoder to binascii.  And an incremental decoder probably belongs in codecs (assuming we ever resolved the transcode interface issue, I can't actually remember...).

Note that decode() will work as long as each line contains a whole number of base64 encoding units, that is, complete 4-character groups.
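
To illustrate the two points above, here is a minimal sketch. The helper name decode_whole is hypothetical and not part of the standard library; it simply reads the whole stream and hands it to binascii.a2b_base64 in one call, the same way decodebytes() treats its argument. It holds the entire input in memory, so it is a workaround under that assumption rather than a fix for base64.decode().

```python
import base64
import binascii
from io import BytesIO


def decode_whole(input, output):
    """Hypothetical helper: decode a base64 stream in a single pass.

    The entire payload goes to binascii.a2b_base64 at once, so line
    breaks may fall anywhere. The whole input is held in memory.
    """
    output.write(binascii.a2b_base64(input.read()))


# Every line holds complete 4-character groups, so the line-by-line
# decoding done by base64.decode() succeeds.
out = BytesIO()
base64.decode(BytesIO(b'MTIz\nNDU2Nzg5\n'), out)
aligned_result = out.getvalue()

# The break after 6 characters splits a group; decoding the input as a
# single entity still succeeds.
out = BytesIO()
decode_whole(BytesIO(b'MTIzND\nU2Nzg5\n'), out)
whole_result = out.getvalue()
```

Both aligned_result and whole_result come out as b'123456789'; only the second input triggers the error when passed to base64.decode().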
msg309454 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-01-04 03:44
I wrote an incremental base-64 decoder for the "codecs" module in Issue 27799, which you could use. It just does some preprocessing using a regular expression to pick four-character chunks before passing the data to a2b_base64. Or maybe implementing it properly in the "binascii" module is better.

Quickly reading RFC 2045, I saw it says "All line breaks or other characters not found in Table 1 [64 alphabet characters plus padding character] must be ignored by decoding software." So this is a real bug, although I think a base-64 encoder that triggers it would be rare.
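
The preprocessing idea can be sketched roughly as follows. This is not the code from Issue 27799; the function name decode_incremental and the chunk_size parameter are assumptions made for illustration. It drops every byte outside the base64 alphabet and the padding character, as the RFC 2045 passage quoted above requires, and only passes complete 4-character groups to a2b_base64, carrying any remainder over to the next chunk.

```python
import binascii
import re
from io import BytesIO

# Bytes outside the 64-character alphabet and the '=' padding character.
_NON_BASE64 = re.compile(rb'[^A-Za-z0-9+/=]')


def decode_incremental(input, output, chunk_size=8192):
    """Hypothetical sketch of an incremental base64 stream decoder."""
    pending = b''
    while True:
        chunk = input.read(chunk_size)
        if not chunk:
            break
        # Ignore line breaks and any other non-alphabet bytes.
        pending += _NON_BASE64.sub(b'', chunk)
        # Decode only complete 4-character groups; keep the remainder.
        usable = len(pending) - len(pending) % 4
        output.write(binascii.a2b_base64(pending[:usable]))
        pending = pending[usable:]
    if pending:
        # A trailing partial group is malformed input; a2b_base64
        # raises binascii.Error for it.
        output.write(binascii.a2b_base64(pending))


# A deliberately small chunk size forces groups to straddle reads.
out = BytesIO()
decode_incremental(BytesIO(b'MTIzND\nU2Nzg5\n'), out, chunk_size=5)
result = out.getvalue()
```

Here result is b'123456789', the same value decodebytes() returns for this input. The sketch does not attempt to handle several padded streams concatenated together.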

History
Date                 User            Action  Args
2022-04-11 14:58:56  admin           set     github: 76672
2018-01-04 03:44:16  martin.panter   set     nosy: + martin.panter; messages: + msg309454
2018-01-04 00:52:25  r.david.murray  set     nosy: + r.david.murray; messages: + msg309451
2018-01-03 23:35:25  gregory.p.smith create