Issue 1753718: base64 "legacy" functions violate RFC 3548

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45194

classification

Title:	base64 "legacy" functions violate RFC 3548
Type:		Stage:	resolved
Components:	Documentation	Versions:	Python 3.6, Python 3.4, Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	Isobel Hooper, barry, docs@python, loewis, martin.panter, ncoghlan, python-dev, r.david.murray, serhiy.storchaka, slinkp
Priority:	normal	Keywords:	patch

Created on 2007-07-13 18:13 by slinkp, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue-1753718.patch	Isobel Hooper, 2015-12-05 14:54		review
issue-01753718.patch	r.david.murray, 2015-12-14 00:40		review
issue-01753718.patch	r.david.murray, 2015-12-14 16:41		review
issue-01753718.patch	r.david.murray, 2015-12-16 02:17		review
issue-01753718.patch	r.david.murray, 2015-12-18 03:03		review

Messages (25)
msg32496 - (view)	Author: Paul Winkler (slinkp) *	Date: 2007-07-13 18:13
(Python 2.5.1 and earlier)

Apologies for how long this is, but cleaning up history is hard.

There seems to be a lot of historical confusion around Base 64 encoding, and unfortunately the python library docs reflect that confusion.

The base64 module docs (at http://docs.python.org/lib/module-base64.html ) claim to implement RFC 3548, as seen at http://www.faqs.org/rfcs/rfc3548.html ... heck it's even in the page title. (I'll quickly note here that RFC 3548 has recently been obsoleted by RFC 4648, but for purposes of this bug report the two RFCs are the same so I'll just refer to 3548.)

But the "legacy" functions, encode() and encodestring(), add line feeds every 76 characters. That is a violation of RFC 3548, which specifically says "Implementations MUST NOT add line feeds to base-encoded data". RFC 4648 says the same thing.

Obviously we can't change behavior of legacy functions, but I strongly feel the docs should warn you about this violation.

What encode() and encodestring() actually implement is MIME base 64 encoding, as per RFC 2045 (see http://tools.ietf.org/html/rfc2045#section-6.8 ... obsoletes 1521, 1522, 1590). So base64.encodestring() is AFAICT functionally identical to email.base64mime.encodestring (tangent: someday we should consolidate those two functions into one implementation).

What RFC 3548 describes IS implemented by the "modern" interface such as base64.b64encode(), which does not split into lines.

There's also the lower-level binascii.b2a_base64() function which AFAICT correctly implements RFC 3548 (although it adds a newline at the end, which base64.b64encode() does not; it's not clear to me which is correct per RFC 3548.)

At one time, b2a_base64 DID split into lines, but that was fixed: http://sourceforge.net/tracker/index.php?func=detail&aid=473009&group_id=5470&atid=105470 . But unfortunately, its docs still mistakenly say "The length of data should be at most 57 to adhere to the base64 standard." which presumably refers to the old PEM spec that predates even MIME.

So I propose several doc changes:

1) Add a reference to RFC 3548 and/or 4648 to the binascii docs, and remove the cruft sentence about "at most 57".

2) Add a sentence to the base64 docstrings for encode() and encodestring() like this: "Newlines will be inserted in the output every 76 characters, as per RFC 2045 (MIME)."

3) Change the introductory text in the base64 docs to something like this:

"""There are two interfaces provided by this module. The modern interface supports encoding and decoding string objects using all three alphabets. The modern interface, per RFC 3548, does NOT break the output into multiple lines.

The legacy interface provides for encoding and decoding to and from file-like objects as well as strings, but only using the Base64 standard alphabet, and adds newlines every 76 characters as per RFC 2045 (MIME). Thus, the legacy interface is not compliant with the newer RFC 3548.

If you are encoding data for email attachments and wondering which to use, you can use the legacy functions but you should probably be using the email package instead.
"""
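
The difference described in this report can be seen with a short sketch. It assumes Python 3, where the legacy encodestring() is spelled base64.encodebytes() (encodestring() was a deprecated alias and was removed in Python 3.9). The 100-byte payload is arbitrary and chosen only so that the output needs more than one MIME line.

```python
import base64
import binascii

# Arbitrary payload (an assumption for illustration), long enough to wrap.
data = bytes(range(100))

# Legacy interface: MIME-style output (RFC 2045), a newline after every
# 76 output characters and after the final partial line.
legacy = base64.encodebytes(data)

# Modern interface: RFC 3548 / RFC 4648 style, no line breaks inserted.
modern = base64.b64encode(data)

# Lower-level helper: a single unbroken line, plus a trailing newline
# by default.
low_level = binascii.b2a_base64(data)

legacy_lines = legacy.splitlines()
print([len(line) for line in legacy_lines])   # lengths of the MIME lines
print(b"\n" in modern)                        # does the modern output wrap?
print(low_level == modern + b"\n")            # only the trailing newline differs
print(legacy.replace(b"\n", b"") == modern)   # same characters once unwrapped
```

Stripping the newlines from the legacy output gives exactly the modern output, which is consistent with the view that the two interfaces implement different standards over the same alphabet.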
msg32497 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-07-14 12:56
I would not say that the older functions violate RFC 4648. They just don't implement it, as they implement some other standard instead.
msg32498 - (view)	Author: Paul Winkler (slinkp) *	Date: 2007-07-16 14:44
ok then, how about this (last sentence of middle paragraph slightly modified):

"""There are two interfaces provided by this module. The modern interface supports encoding and decoding string objects using all three alphabets. The modern interface, per RFC 3548, does NOT break the output into multiple lines.

The legacy interface provides for encoding and decoding to and from file-like objects as well as strings, but only using the Base64 standard alphabet, and adds newlines every 76 characters as per RFC 2045 (MIME). Thus, the legacy interface does not implement the newer RFC 3548.

If you are encoding data for email attachments and wondering which to use, you can use the legacy functions but you should probably be using the email package instead.
"""
msg32499 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-07-16 21:06
Barry, as you implemented these new functions: can you fix this appropriately? If not, please unassign.
msg255959 - (view)	Author: Isobel Hooper (Isobel Hooper)	Date: 2015-12-05 14:54
Attached patch fixes library/base64.rst as requested, and adds a mention of RFC 3548 into the b2a_base64() docs in library/binascii.rst. I'm not sure I've made the changes against the right version of the docs - I think this might be against the 3.3 docs.
msg256020 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-06 18:29
See also the discussion in issue 25495. I will try to review both of these issues soon.
msg256353 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-14 00:40
I started tweaking this patch, and wound up going through the whole doc and fixing the references to 'byte string' and 'string' throughout, as well as making all the entries consistent in how they reference the function arguments and output (previously some did not reference the output at all, nor was it clear that the output is always bytes). I believe I also clarified some confusing wordings along the way. Since there are so many changes I need some eyes checking my work before I commit. Note that the primary motivation for this change (the incorrect claim that both interfaces supported the RFC) is not made by the 2.7 docs, and since those docs are very different now, I don't plan to touch them.
msg256357 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-12-14 02:45
Left some review comments. I left a comment about the original patch as well, because I didn’t notice the new patch in time :) Also, maybe we should say the input to the “legacy” MIME decode() function should be multiple lines, since it calls readline() with no line limit.
msg256384 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-14 15:57
How about we just make the docs more correct and say that input is read until readline() returns an empty bytes object? That should make it clear that a line-oriented file is expected.
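
A minimal sketch of that line-oriented behaviour, assuming Python 3 and using io.BytesIO as a stand-in for a binary file (the payload and variable names are made up for illustration):

```python
import base64
import io

# Hypothetical payload; 100 bytes encodes to two MIME lines.
payload = b"x" * 100

# The legacy decode() takes binary file objects, not bytes. It keeps
# calling readline() on the input until an empty bytes object comes back.
encoded_file = io.BytesIO(base64.encodebytes(payload))
decoded_file = io.BytesIO()
base64.decode(encoded_file, decoded_file)

print(decoded_file.getvalue() == payload)
print(encoded_file.readline())  # input is exhausted after decode()
```

Because each line is decoded as it is read, the input is expected to be line-oriented in the way encode() and encodebytes() produce it.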
msg256396 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-14 16:41
Updated patch.
msg256412 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-12-14 19:48
The change to readline() works well. Any thoughts regarding my other comments? In particular, altchars and ignorechars cannot be arbitrary bytes-like objects.
msg256414 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-14 20:56
I missed the other comments somehow. Will take a look soon.
msg256503 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-16 02:17
Updated patch that addresses most of the comments.
msg256634 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-18 02:51
The intent of the term "bytes-like object" is to make it possible to use it in documentation in the way I have used it here. That the buffer has a len is clearly discussed in the Buffer Protocol documentation, but of course that's only talking about the C level API. Perhaps what is needed is an addition to the bytes-like object description that clarifies that a bytes-like object is a Sequence that supports the buffer protocol? (So: "A Sequence object that supports the Buffer Protocol and...") Do we also need to clarify that the item size must be one byte? That would seem to me to be implicit in the name. I don't know if what ctypes produces is a bytes-like object in this sense, since I don't understand ctypes very well, but it sounds like it isn't. Trying to wrap it in a memoryview gives an error ('unsupported format <c'), so I suspect it is not. I'm adding Nick to nosy, since the commit log says he added the bytes-like object support to this module.
msg256635 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-18 03:01
The altchars 2 char limit is an assertion. That's a bug that should be dealt with separately. Either it should be turned into an error, or it should be dropped to match the docs. Probably the latter, since it is documented as OK and it might break code that is currently working in -O mode or on python2.
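
For reference, a small sketch of altchars used as documented, with exactly two bytes. It assumes Python 3; the two-byte input is picked only because it encodes to both of the characters that altchars replaces. What happens with an altchars value of any other length is deliberately not shown, since that is the behaviour under discussion.

```python
import base64

# Illustrative input whose encoding uses both '+' and '/'.
raw = b"\xfb\xff"

# Standard alphabet.
standard = base64.b64encode(raw)

# altchars supplies replacements for '+' and '/', in that order.
custom = base64.b64encode(raw, altchars=b"-_")

# Decoding needs the same altchars to map the characters back.
round_trip = base64.b64decode(custom, altchars=b"-_")

print(standard)
print(custom)
print(custom == base64.urlsafe_b64encode(raw))
print(round_trip == raw)
```

With b"-_" the result matches urlsafe_b64encode(), which uses that same pair of substitutions.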
msg256636 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-18 03:03
Fixed the spurious 'u'.
msg256637 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-18 03:08
"or on python2" should be "or ported from python2". Also note that Nick's commit message specifically mentions a test for multi-dimensional input, so the module does indeed conform to the current bytes-like object definition in that regard.
msg256781 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-12-21 02:56
I understood “bytes-like” to mean what is now defined in the glossary: what is accepted by the C-level y* format for the PyArg parsing functions. I tend to agree with <https://bugs.python.org/issue23756#msg248265> that it may not have been the best term to choose when you really mean “C-contiguous buffer”.

I am understanding that you take “bytes-like” to be a more specific thing. Perhaps we could instead have two distinct terms, say “C-contiguous buffer”, which is what FileIO.write() and PyArg supports, and “byte sequence”, perhaps implementing an API common to bytes() and memoryview(), which is easier to work with using native Python.

In general, I think ctypes and array.array produce my stricter kind of C-contiguous buffers. But since Issue 15944 native Python code can access these buffers by casting to a second memoryview:

>>> import ctypes
>>> c = ctypes.c_char(b"A")
>>> with memoryview(c) as array_view, array_view.cast("B") as byte_view:
...     print(repr(byte_view[0]))
...
65

Nick’s commit d90f25e1a705 mentions multi-dimensional input for the “modern” interface. That is not the problem. In <https://bugs.python.org/issue17839#msg198843> he decided to be less permissive for the “legacy” interface, which seems unnecessary to me.

Anyway, this is all rather off-topic. Apart from the bytes-like errors, the rest of the current patch is good. Even if you committed with those four errors, I can live with that. I think there are similar problems elsewhere in the documentation, HTTPConnection.request() over TLS for instance.
msg256783 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-21 06:19
The term "bytes-like object" is specifically designed for those situations where python used to say "X does not support the buffer interface" if passed something else. Which is the case here...now it says "a bytes-like object is required". I'm not sure if we fixed that everywhere, but I think we did. It is certainly true in the cases you cite, except that it turns out that ignorechars accepts an ASCII string. So, if there is any sort of remaining problem, it is a separate issue: my edits match the current error message behavior. I'll fix ignorechars and commit.
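
A rough illustration of the behaviour described here, assuming Python 3 and the modern b64encode() only (the input values are arbitrary). It shows a few objects that support the buffer protocol being accepted, and a str being rejected with a TypeError; the exact error wording is printed instead of asserted, since it may vary between versions.

```python
import array
import base64

expected = base64.b64encode(b"abc")

# Several objects that expose their contents through the buffer protocol.
inputs = [
    bytearray(b"abc"),
    memoryview(b"abc"),
    array.array("B", b"abc"),
]
results = [base64.b64encode(obj) for obj in inputs]
print(all(result == expected for result in results))

# A str does not support the buffer protocol, so it is rejected.
try:
    base64.b64encode("abc")
except TypeError as exc:
    str_error = exc
else:
    str_error = None
print(str_error)
```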
msg256949 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-12-24 02:20
New changeset 105bf5dd93b8 by R David Murray in branch '3.5':
#1753718: clarify RFC compliance and bytes/string argument types.
https://hg.python.org/cpython/rev/105bf5dd93b8

New changeset 92760d2edc9e by R David Murray in branch 'default':
Merge: #1753718: clarify RFC compliance and bytes/string argument types.
https://hg.python.org/cpython/rev/92760d2edc9e
msg256950 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-24 02:21
Thanks everyone.
msg256953 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-24 06:25
While we are here, maybe update the docstrings too?
msg256963 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-24 15:10
That would be a good idea, yes. I thought Martin was doing that as part of issue 22088, but now that I look at the patch I see he didn't. Martin, do you want to add it to that patch, or should I reopen this?
msg257390 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-01-03 01:00
I was waiting for you to finish here to avoid any new merge conflicts. Now that you have committed your patch, I will try and work on mine in the next few days, and I am happy to update the doc strings at the same time.
msg257941 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-01-10 21:31
Uploaded a Python 3 patch to Issue 22088 which includes the doc string changes.

History
Date	User	Action	Args
2022-04-11 14:56:25	admin	set	github: 45194
2016-01-10 21:31:03	martin.panter	set	messages: + msg257941
2016-01-03 01:00:32	martin.panter	set	messages: + msg257390
2015-12-24 15:10:35	r.david.murray	set	messages: + msg256963
2015-12-24 06:25:26	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg256953
2015-12-24 02:21:54	r.david.murray	set	status: open -> closed resolution: fixed messages: + msg256950 stage: commit review -> resolved
2015-12-24 02:20:23	python-dev	set	nosy: + python-dev messages: + msg256949
2015-12-21 06:19:48	r.david.murray	set	messages: + msg256783 stage: patch review -> commit review
2015-12-21 02:56:47	martin.panter	set	messages: + msg256781
2015-12-18 03:08:09	r.david.murray	set	messages: + msg256637
2015-12-18 03:03:54	r.david.murray	set	files: + issue-01753718.patch messages: + msg256636
2015-12-18 03:01:46	r.david.murray	set	messages: + msg256635
2015-12-18 02:51:10	r.david.murray	set	nosy: + ncoghlan messages: + msg256634
2015-12-16 02:17:52	r.david.murray	set	versions: - Python 2.7
2015-12-16 02:17:44	r.david.murray	set	files: + issue-01753718.patch messages: + msg256503
2015-12-14 20:56:58	r.david.murray	set	messages: + msg256414
2015-12-14 19:48:33	martin.panter	set	messages: + msg256412
2015-12-14 16:41:58	r.david.murray	set	files: + issue-01753718.patch messages: + msg256396
2015-12-14 15:57:47	r.david.murray	set	messages: + msg256384
2015-12-14 05:42:10	martin.panter	link	issue20782 dependencies
2015-12-14 02:45:02	martin.panter	set	nosy: + martin.panter messages: + msg256357
2015-12-14 00:40:47	r.david.murray	set	files: + issue-01753718.patch messages: + msg256353 stage: patch review
2015-12-06 18:29:59	r.david.murray	set	nosy: + r.david.murray messages: + msg256020 versions: + Python 3.4, Python 3.5, Python 3.6, - Python 3.1, Python 3.2
2015-12-05 14:54:16	Isobel Hooper	set	files: + issue-1753718.patch nosy: + Isobel Hooper messages: + msg255959 keywords: + patch
2010-09-17 18:27:12	BreamoreBoy	set	assignee: barry -> docs@python nosy: + docs@python versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2010-06-09 21:36:47	terry.reedy	set	versions: - Python 2.5
2008-01-06 12:32:48	christian.heimes	set	versions: + Python 2.6, Python 2.5
2007-07-13 18:13:54	slinkp	create