Issue 32491: base64.decode: linebreaks are not ignored

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/76672

classification

Title:	base64.decode: linebreaks are not ignored
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.7, Python 3.6

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	gregory.p.smith, martin.panter, r.david.murray
Priority:	normal	Keywords:

Created on 2018-01-03 23:35 by gregory.p.smith, last changed 2022-04-11 14:58 by admin.

Messages (3)
msg309449 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2018-01-03 23:35
I've tried reading various RFCs around Base64 encoding, but I couldn't make ends meet. Yet there is an inconsistency between base64.decodebytes() and base64.decode() in how they handle linebreaks that were used to collate the encoded text. Below is an example of what I'm talking about:

>>> import base64
>>> foo = base64.encodebytes(b'123456789')
>>> foo
b'MTIzNDU2Nzg5\n'
>>> foo = b'MTIzND\n' + b'U2Nzg5\n'
>>> foo
b'MTIzND\nU2Nzg5\n'
>>> base64.decodebytes(foo)
b'123456789'
>>> from io import BytesIO
>>> bytes_in = BytesIO(foo)
>>> bytes_out = BytesIO()
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/somewhere/lib/python3.6/base64.py", line 512, in decode
    s = binascii.a2b_base64(line)
binascii.Error: Incorrect padding
>>> bytes_in = BytesIO(base64.encodebytes(b'123456789'))
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
>>> bytes_out.getvalue()
b'123456789'

Obviously, I'd expect decodebytes() and decode() both to either accept or to reject the same input.

Thanks.

Oleg

via Oleg Sivokon on python-dev (who was having trouble getting bugs.python.org account creation to work)
msg309451 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2018-01-04 00:52
This reduces to the following:

>>> from binascii import a2b_base64 as f
>>> f(b'MTIzND\nU2Nzg5\n')
b'123456789'
>>> f(b'MTIzND\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Incorrect padding

That is, decode does its decoding line by line, whereas decodebytes passes the entire object to a2b_base64 as a single entity. Apparently a2b_base64 looks at the padding for the entirety of what it is given, which I believe is in accordance with the RFC. This means that decode is fundamentally broken per the RFC, and there is no obvious way to fix it without adding an incremental decoder to binascii. And an incremental decoder probably belongs in codecs (assuming we ever resolved the transcode interface issue, I can't actually remember...).

Note that it will work as long as an "integral" number of base64 encoding units are in each line.
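
To illustrate the note about "integral" encoding units, here is a small sketch that feeds the same nine bytes to base64.decode() with the linebreak in two different places. The helper name decode_stream is made up for this example and is not part of the base64 module; the behaviour shown for the unaligned input is the one reported in this issue and may differ on interpreters where decode() has since been changed.

```python
import base64
import binascii
from io import BytesIO


def decode_stream(data):
    """Hypothetical helper: run base64.decode() over in-memory bytes."""
    out = BytesIO()
    base64.decode(BytesIO(data), out)
    return out.getvalue()


# Each line holds a multiple of four base64 characters (4 and 8),
# so every line can be decoded on its own.
aligned = b'MTIz\nNDU2Nzg5\n'
aligned_result = decode_stream(aligned)

# Same payload, but the break falls after six characters, in the
# middle of a four-character unit. On the versions discussed in this
# issue the line-by-line decode raises binascii.Error here.
unaligned = b'MTIzND\nU2Nzg5\n'
try:
    unaligned_result = decode_stream(unaligned)
except binascii.Error as exc:
    unaligned_result = exc

print(aligned_result)
print(repr(unaligned_result))
```

Passing the whole of either input to base64.decodebytes() in one call decodes it, which is the inconsistency the original report describes.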
msg309454 - (view)	Author: Martin Panter (martin.panter) *	Date: 2018-01-04 03:44
I wrote an incremental base-64 decoder for the "codecs" module in Issue 27799, which you could use. It just does some preprocessing using a regular expression to pick four-character chunks before passing the data to a2b_base64. Or maybe implementing it properly in the "binascii" module is better. Quickly reading RFC 2045, I saw it says "All line breaks or other characters not found in Table 1 [64 alphabet characters plus padding character] must be ignored by decoding software." So this is a real bug, although I think a base-64 encoder that triggers it would be rare.
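
As a rough illustration of the preprocessing idea (not the code from Issue 27799, which is not reproduced here), the sketch below strips every byte outside the base-64 alphabet and only hands a2b_base64 whole four-character units, carrying any remainder over to the next read. The function name decode_ignoring_linebreaks, the chunk_size parameter and the regular expression are assumptions made for this example. It also does not try to handle padding that appears in the middle of a stream.

```python
import binascii
import re
from io import BytesIO

# Anything outside the 64 alphabet characters plus "=" is dropped,
# following the RFC 2045 wording quoted above.
_NON_ALPHABET = re.compile(rb'[^A-Za-z0-9+/=]')


def decode_ignoring_linebreaks(infile, outfile, chunk_size=8192):
    """Hypothetical incremental decoder; infile and outfile are binary files."""
    pending = b''
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        pending += _NON_ALPHABET.sub(b'', chunk)
        # Decode only complete four-character units; keep the rest.
        usable = len(pending) - len(pending) % 4
        outfile.write(binascii.a2b_base64(pending[:usable]))
        pending = pending[usable:]
    if pending:
        # Leftover characters mean truncated input; a2b_base64 is left
        # to raise binascii.Error for it.
        outfile.write(binascii.a2b_base64(pending))


foo = b'MTIzND\nU2Nzg5\n'
bytes_out = BytesIO()
decode_ignoring_linebreaks(BytesIO(foo), bytes_out)
print(bytes_out.getvalue())
```

Reading fixed-size chunks instead of lines means the position of the linebreaks no longer matters, at the cost of one extra pass over each chunk for the filtering.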

History
Date	User	Action	Args
2022-04-11 14:58:56	admin	set	github: 76672
2018-01-04 03:44:16	martin.panter	set	nosy: + martin.panter messages: + msg309454
2018-01-04 00:52:25	r.david.murray	set	nosy: + r.david.murray messages: + msg309451
2018-01-03 23:35:25	gregory.p.smith	create