classification
Title: base64 "legacy" functions violate RFC 3548
Type: Stage: resolved
Components: Documentation Versions: Python 3.6, Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Isobel Hooper, barry, docs@python, loewis, martin.panter, ncoghlan, python-dev, r.david.murray, serhiy.storchaka, slinkp
Priority: normal Keywords: patch

Created on 2007-07-13 18:13 by slinkp, last changed 2016-01-10 21:31 by martin.panter. This issue is now closed.

Files
File name Uploaded Description Edit
issue-1753718.patch Isobel Hooper, 2015-12-05 14:54 review
issue-01753718.patch r.david.murray, 2015-12-14 00:40 review
issue-01753718.patch r.david.murray, 2015-12-14 16:41 review
issue-01753718.patch r.david.murray, 2015-12-16 02:17 review
issue-01753718.patch r.david.murray, 2015-12-18 03:03 review
Messages (25)
msg32496 - (view) Author: Paul Winkler (slinkp) * Date: 2007-07-13 18:13
(Python 2.5.1 and earlier)
Apologies for how long this is, but cleaning up history is hard.

There seems to be a lot of historical confusion around Base 64 encoding, and unfortunately the python library docs reflect that confusion.

The base64 module docs (at http://docs.python.org/lib/module-base64.html ) claim to implement RFC 3548, as seen at http://www.faqs.org/rfcs/rfc3548.html 
... heck it's even in the page title.
(I'll quickly note here that RFC 3548 has recently been obsoleted by RFC 4648, but for purposes of this bug report the two RFCs are the same so I'll just refer to 3548.)

But the "legacy" functions, encode() and encodestring() , add line feeds every 76 characters.  That is a violation of RFC 3548, which specifically says "Implementations MUST NOT add line feeds to base-encoded data". RFC 4648 says the same thing.
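The difference can be seen with a minimal sketch. It assumes Python 3, where the legacy function is spelled base64.encodebytes() (the encodestring() alias was removed in Python 3.9); the 100-byte sample input is arbitrary.

```python
import base64

data = b"x" * 100  # arbitrary sample payload

# Modern interface: one unbroken run of base64 characters.
modern = base64.b64encode(data)

# Legacy (MIME-style) interface: a newline after every 76 output
# characters, plus one after the final partial line.
legacy = base64.encodebytes(data)

line_lengths = [len(line) for line in legacy.splitlines()]
print(line_lengths)
print(b"\n" in modern)
```

Stripping the newlines from the legacy output yields the same characters that b64encode() produces, so the two differ only in line breaking.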

Obviously we can't change behavior of legacy functions, but I strongly feel the docs should warn you about this violation.

What encode() and encodestring() actually implement is MIME base 64 encoding, as per RFC 2045 (see http://tools.ietf.org/html/rfc2045#section-6.8 ... obsoletes 1521, 1522, 1590).
So base64.encodestring() is, AFAICT, functionally identical to email.base64mime.encodestring (tangent: someday we should consolidate those two functions into one implementation).
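A rough way to check that equivalence, as a sketch only: it assumes Python 3, where the two functions are named base64.encodebytes() and email.base64mime.body_encode() rather than the Python 2 names used above, and where the email version returns str instead of bytes.

```python
import base64
from email import base64mime

data = b"x" * 100  # arbitrary sample payload

from_base64 = base64.encodebytes(data)      # bytes, wrapped at 76 chars
from_email = base64mime.body_encode(data)   # str, default maxlinelen=76

# Compare after bringing both results to the same type.
same_output = from_base64.decode("ascii") == from_email
print(same_output)
```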

What RFC 3548 describes IS implemented by the "modern" interface such as base64.b64encode(), which does not split into lines.

There's also the lower-level binascii.b2a_base64() function which afaict correctly implements RFC 3548 (although it adds a newline at the end, which base64.b64encode() does not; it's not clear to me which is correct per RFC 3548.)
At one time, b2a_base64 DID split into lines, but that was fixed: http://sourceforge.net/tracker/index.php?func=detail&aid=473009&group_id=5470&atid=105470
Unfortunately, its docs still mistakenly say "The length of data should be at most 57 to adhere to the base64 standard", which presumably refers to the old PEM spec that predates even MIME.
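The trailing-newline difference between the two functions can be sketched as follows. The input is longer than 57 bytes on purpose, to show that no line splitting happens; the newline=False keyword is assumed to be available (it was added in Python 3.6).

```python
import base64
import binascii

data = b"x" * 100  # deliberately longer than 57 bytes

low_level = binascii.b2a_base64(data)
modern = base64.b64encode(data)

# b2a_base64() appends exactly one newline and never wraps the output.
newline_count = low_level.count(b"\n")

# With newline=False the two functions return identical bytes.
no_newline = binascii.b2a_base64(data, newline=False)
print(newline_count, low_level == modern + b"\n", no_newline == modern)
```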

So I propose several doc changes:

1) Add a reference to RFC 3548 and/or 4648 to the binascii docs, and remove the cruft sentence about "at most 57".

2) Add a sentence to the base64 docstrings for encode() and encodestring() like this:

"Newlines will be inserted in the output every 76 characters, as per RFC 2045 (MIME)."


3) Change the introductory text in the base64 docs to something like this:

"""There are two interfaces provided by this module. The modern interface supports encoding and decoding string objects using all three alphabets.  The modern interface, per RFC 3548, does NOT break the output into multiple lines. 

The legacy interface provides for encoding and decoding to and from file-like objects as well as strings, but only using the Base64 standard alphabet, and adds newlines every 76 characters as per RFC 2045 (MIME).
Thus, the legacy interface is not compliant with the newer RFC 3548.

If you are encoding data for email attachments and wondering which to use, you can use the legacy functions but you should probably be using the email package instead.
"""
msg32497 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-07-14 12:56
I would not say that the older functions violate RFC 4648. They just don't implement it, as they implement some other standard instead.
msg32498 - (view) Author: Paul Winkler (slinkp) * Date: 2007-07-16 14:44
ok then, how about this (last sentence of middle paragraph slightly modified):

"""There are two interfaces provided by this module. The modern interface
supports encoding and decoding string objects using all three alphabets.
The modern interface, per RFC 3548, does NOT break the output into multiple
lines.

The legacy interface provides for encoding and decoding to and from
file-like objects as well as strings, but only using the Base64 standard
alphabet, and adds newlines every 76 characters as per RFC 2045 (MIME).
Thus, the legacy interface does not implement the newer RFC 3548.

If you are encoding data for email attachments and wondering which to use,
you can use the legacy functions but you should probably be using the email
package instead.
"""
msg32499 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-07-16 21:06
Barry, as you implemented these new functions: can you fix this appropriately? If not, please unassign.
msg255959 - (view) Author: Isobel Hooper (Isobel Hooper) Date: 2015-12-05 14:54
Attached patch fixes library/base64.rst as requested, and adds a mention of RFC 3548 into the b2a_base64() docs in library/binascii.rst.

I'm not sure I've made the changes against the right version of the docs - I think this might be against the 3.3 docs.
msg256020 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-06 18:29
See also the discussion in issue 25495.  I will try to review both of these issues soon.
msg256353 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-14 00:40
I started tweaking this patch, and wound up going through the whole doc and fixing the references to 'byte string' and 'string' throughout, as well as making all the entries consistent in how they reference the function arguments and output (previously some did not reference the output at all, nor was it clear that the output is always bytes).  I believe I also clarified some confusing wordings along the way.

Since there are so many changes I need some eyes checking my work before I commit.

Note that the primary motivation for this change (the incorrect claim that both interfaces supported the RFC) is not made by the 2.7 docs, and since those docs are very different now, I don't plan to touch them.
msg256357 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-14 02:45
Left some review comments. I left a comment about the original patch as well, because I didn’t notice the new patch in time :)

Also, maybe we should say the input to the “legacy” MIME decode() function should be multiple lines, since it calls readline() with no line limit.
msg256384 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-14 15:57
How about we just make the docs more correct and say that input is read until readline() returns an empty bytes object?  That should make it clear that a line-oriented file is expected.
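A minimal sketch of that line-oriented behaviour, assuming Python 3 and using io.BytesIO objects as stand-ins for real binary files:

```python
import base64
import io

original = b"x" * 100  # arbitrary sample payload

# encodebytes() produces MIME-style output: several newline-terminated lines.
encoded_file = io.BytesIO(base64.encodebytes(original))
decoded_file = io.BytesIO()

# The legacy decode() reads the input line by line via readline() and
# stops when readline() returns an empty bytes object.
base64.decode(encoded_file, decoded_file)

round_tripped = decoded_file.getvalue()
print(round_tripped == original)
```

Because each line is decoded on its own, the input is expected to be broken into lines the way the legacy encode() writes them.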
msg256396 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-14 16:41
Updated patch.
msg256412 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-14 19:48
The change to readline() works well.

Any thoughts regarding my other comments? In particular, altchars and ignorechars cannot be arbitrary bytes-like objects.
msg256414 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-14 20:56
I missed the other comments somehow.  Will take a look soon.
msg256503 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-16 02:17
Updated patch that addresses most of the comments.
msg256634 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-18 02:51
The intent of the term "bytes-like object" is to make it possible to use it in documentation in the way I have used it here.  That the buffer has a len is clearly discussed in the Buffer Protocol documentation, but of course that's only talking about the C level API.  Perhaps what is needed is an addition to the bytes-like object description that clarifies that a bytes-like object is a Sequence that supports the buffer protocol?  (So: "A Sequence object that supports the Buffer Protocol and...")  Do we also need to clarify that the item size must be one byte?  That would seem to me to be implicit in the name.

I don't know if what ctypes produces is a bytes-like object in this sense, since I don't understand ctypes very well, but it sounds like it isn't.  Trying to wrap it in a memoryview gives an error ('unsupported format <c'), so I suspect it is not.

I'm adding Nick to nosy, since the commit log says he added the bytes-like object support to this module.
msg256635 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-18 03:01
The altchars 2 char limit is an assertion.  That's a bug that should be dealt with separately.  Either it should be turned into an error, or it should be dropped to match the docs.  Probably the latter, since it is documented as OK and it might break code that is currently working in -O mode or on python2.
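For reference, a sketch of the two-byte altchars usage that the limit is about. The sample bytes are chosen so that the standard encoding contains both "+" and "/". What happens with an altchars value of any other length is not shown, since it depends on the Python version and on whether assertions are enabled.

```python
import base64

data = b"\xfb\xff"  # encodes to both '+' and '/' in the standard alphabet

standard = base64.b64encode(data)

# altchars supplies replacements for '+' and '/', in that order.
custom = base64.b64encode(data, altchars=b"-_")

# The URL-safe helpers use the same '-' and '_' substitution.
urlsafe = base64.urlsafe_b64encode(data)

restored = base64.b64decode(custom, altchars=b"-_")
print(standard, custom, urlsafe, restored == data)
```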
msg256636 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-18 03:03
Fixed the spurious 'u'.
msg256637 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-18 03:08
"or on python2" should be "or ported from python2".

Also note that Nick's commit message specifically mentions a test for multi-dimensional input, so the module does indeed conform to the current bytes-like object definition in that regard.
msg256781 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-21 02:56
I understood “bytes-like” to mean what is now defined in the glossary: what is accepted by the C-level y* format for the PyArg parsing functions. I tend to agree with <https://bugs.python.org/issue23756#msg248265> that it may not have been the best term to choose when you really mean “C-contiguous buffer”.

I understand that you take “bytes-like” to be a more specific thing. Perhaps we could instead have two distinct terms, say “C-contiguous buffer”, which is what FileIO.write() and PyArg support, and “byte sequence”, perhaps implementing an API common to bytes() and memoryview(), which is easier to work with using native Python.

In general, I think ctypes and array.array produce my stricter kind of C-contiguous buffers. But since Issue 15944 native Python code can access these buffers by casting to a second memoryview:

>>> import ctypes
>>> c = ctypes.c_char(b"A")
>>> with memoryview(c) as array_view, array_view.cast("B") as byte_view:
...     print(repr(byte_view[0]))
... 
65

Nick’s commit d90f25e1a705 mentions multi-dimensional input for the “modern” interface. That is not the problem. In <https://bugs.python.org/issue17839#msg198843> he decided to be less permissive for the “legacy” interface, which seems unnecessary to me.

Anyway, this is all rather off-topic. Apart from the bytes-like errors, the rest of the current patch is good. Even if you committed with those four errors, I can live with that. I think there are similar problems elsewhere in the documentation, HTTPConnection.request() over TLS for instance.
msg256783 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-21 06:19
The term "bytes-like object" is specifically designed for those situations where python used to say "X does not support the buffer interface" if passed something else.  Which is the case here...now it says "a bytes-like object is required".  I'm not sure if we fixed that everywhere, but I think we did.  It is certainly true in the cases you cite, except that it turns out that ignorechars accepts an ASCII string.

So, if there is any sort of remaining problem, it is a separate issue: my edits match the current error message behavior.  I'll fix ignorechars and commit.
msg256949 - (view) Author: Roundup Robot (python-dev) Date: 2015-12-24 02:20
New changeset 105bf5dd93b8 by R David Murray in branch '3.5':
#1753718: clarify RFC compliance and bytes/string argument types.
https://hg.python.org/cpython/rev/105bf5dd93b8

New changeset 92760d2edc9e by R David Murray in branch 'default':
Merge: #1753718: clarify RFC compliance and bytes/string argument types.
https://hg.python.org/cpython/rev/92760d2edc9e
msg256950 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-24 02:21
Thanks everyone.
msg256953 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-24 06:25
While we are here, maybe update the docstrings too?
msg256963 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-24 15:10
That would be a good idea, yes.  I thought Martin was doing that as part of issue 22088, but now that I look at the patch I see he didn't.  Martin, do you want to add it to that patch, or should I reopen this?
msg257390 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-03 01:00
I was waiting for you to finish here to avoid any new merge conflicts. Now that you have committed your patch, I will try and work on mine in the next few days, and I am happy to update the doc strings at the same time.
msg257941 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-10 21:31
Uploaded a Python 3 patch to Issue 22088 which includes the doc string changes.
History
Date User Action Args
2016-01-10 21:31:03 martin.panter set messages: + msg257941
2016-01-03 01:00:32 martin.panter set messages: + msg257390
2015-12-24 15:10:35 r.david.murray set messages: + msg256963
2015-12-24 06:25:26 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg256953
2015-12-24 02:21:54 r.david.murray set status: open -> closed; resolution: fixed; messages: + msg256950; stage: commit review -> resolved
2015-12-24 02:20:23 python-dev set nosy: + python-dev; messages: + msg256949
2015-12-21 06:19:48 r.david.murray set messages: + msg256783; stage: patch review -> commit review
2015-12-21 02:56:47 martin.panter set messages: + msg256781
2015-12-18 03:08:09 r.david.murray set messages: + msg256637
2015-12-18 03:03:54 r.david.murray set files: + issue-01753718.patch; messages: + msg256636
2015-12-18 03:01:46 r.david.murray set messages: + msg256635
2015-12-18 02:51:10 r.david.murray set nosy: + ncoghlan; messages: + msg256634
2015-12-16 02:17:52 r.david.murray set versions: - Python 2.7
2015-12-16 02:17:44 r.david.murray set files: + issue-01753718.patch; messages: + msg256503
2015-12-14 20:56:58 r.david.murray set messages: + msg256414
2015-12-14 19:48:33 martin.panter set messages: + msg256412
2015-12-14 16:41:58 r.david.murray set files: + issue-01753718.patch; messages: + msg256396
2015-12-14 15:57:47 r.david.murray set messages: + msg256384
2015-12-14 05:42:10 martin.panter link issue20782 dependencies
2015-12-14 02:45:02 martin.panter set nosy: + martin.panter; messages: + msg256357
2015-12-14 00:40:47 r.david.murray set files: + issue-01753718.patch; messages: + msg256353; stage: patch review
2015-12-06 18:29:59 r.david.murray set nosy: + r.david.murray; messages: + msg256020; versions: + Python 3.4, Python 3.5, Python 3.6, - Python 3.1, Python 3.2
2015-12-05 14:54:16 Isobel Hooper set files: + issue-1753718.patch; nosy: + Isobel Hooper; messages: + msg255959; keywords: + patch
2010-09-17 18:27:12 BreamoreBoy set assignee: barry -> docs@python; nosy: + docs@python; versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2010-06-09 21:36:47 terry.reedy set versions: - Python 2.5
2008-01-06 12:32:48 christian.heimes set versions: + Python 2.6, Python 2.5
2007-07-13 18:13:54 slinkp create