classification
Title: binascii documentation incorrect
Type: Stage: resolved
Components: Documentation Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, georg.brandl, martin.panter, matrixise, mouse07410, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2015-10-27 23:10 by mouse07410, last changed 2015-12-14 01:30 by martin.panter. This issue is now closed.

Files
File name Uploaded Description
issue25495.patch matrixise, 2015-10-28 20:18
issue25495.base64.2.7.patch martin.panter, 2015-11-05 11:02 For 2.7
Messages (27)
msg253572 - (view) Author: Mouse (mouse07410) Date: 2015-10-27 23:10
binascii b2a_base64() documentation says:

The length of data should be at most 57 to adhere to the base64 standard.

This is incorrect, because there is no base64 standard that restricts the length of input data, especially to such a small value.

What RFC4648 (that superseded RFC3548 that your documentation still keeps referring to) actually says is that MIME enforces the limit of the OUTPUT LINE length at 76, but NOT of the entire output, and certainly not of the entire input.

Please correct the documentation, making it conformant with what the ACTUAL base64 standard says.

See https://en.wikipedia.org/wiki/Base64 and
https://tools.ietf.org/html/rfc4648

Thanks!
msg253577 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-10-28 00:52
I agree that the documentation is not optimal. To give you some background, binascii was primarily implemented to support the email module, and the standard it is referring to is in fact the MIME standard that references base64 (I believe at the time the independent base64 standard did not exist). The number 57 is derived from the fact that 57 * 4/3 = 76; that is, input data of length 57 will result in an encoded line that is equal to the maximum recommended line length. Thus, in encoding an email, the email package (originally; it no longer does this) passed the data in 57-byte chunks and appended the resulting lines to the output buffer.
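
The arithmetic above can be checked directly. This is only an illustrative sketch; the variable names are made up for the example, and the all-zero input is an arbitrary choice (any 57-byte input gives the same length).

```python
import binascii

chunk = bytes(57)  # 57 zero bytes; arbitrary sample input
line = binascii.b2a_base64(chunk)

# 57 input bytes form 19 groups of 3 bytes, and each group is encoded
# as 4 characters, so the encoded line has 19 * 4 = 76 characters.
encoded_length = len(line) - 1   # do not count the trailing newline
print(encoded_length)            # 76
print(line.endswith(b"\n"))      # True
```

Because 57 is a multiple of 3, the encoded line needs no "=" padding.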

So, this documentation is historically correct, but no longer particularly useful.  Suggested improvements are welcome.  

This state of affairs exists because the binascii module doesn't really have a current maintainer. Someday I'd love to have enough time to pick it up, since I maintain email and it is still used by email (and could be used better, with some module improvements).
msg253607 - (view) Author: Mouse (mouse07410) Date: 2015-10-28 16:10
Yes I know where this came from. :-)

Here is my proposed change.

Replace the statement 

The length of data should be at most 57 to adhere to the base64 standard.

with:

To be MIME-compliant, the Base64 output (as defined in RFC4648) should be broken into lines of at most 76 characters long. This post-processing of the output is the responsibility of the caller. Note that the original PEM context-transfer encoding limited line length to 64 characters.


Would this change be agreeable to you?
msg253625 - (view) Author: Stéphane Wirtel (matrixise) * (Python committer) Date: 2015-10-28 20:18
Here is a patch with the submitted description.
msg253627 - (view) Author: Mouse (mouse07410) Date: 2015-10-28 20:44
Thank you for the quick turn-around, and for taking care of this issue!
msg253710 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-30 03:04
Thanks for bringing this up, it has often bugged me.

My understanding, based on the original wording and places where this is used, is that the data is often pre-processed into 57-byte chunks, rather than post-processing it. Maybe it is worthwhile restoring that info. It should help understanding the presence of the newline parameter (or why a newline is always added in 3.5).

Also, the link between RFC 4648 and this function could be made even more explicit. Maybe move “as defined” into the first sentence, or change “the Base64 output” to “the function’s output”.
msg253744 - (view) Author: Mouse (mouse07410) Date: 2015-10-30 16:29
As far as I remember, the data was not "originally processed in 57-byte chunks". I've been around the first PEM and MIME standards and discussions (and code, though not in Python, which wasn't around then) to be in position to know. :)

Whether the user prefers to process data in chunks or not, is up to the user. 
Not to mention that PEM is long gone, and MIME also changed somewhat. 

The link between this function and RFC4648 can and should be more explicit, but I think just referring to it is enough. 

Do you have a recommendation for additional info to explain newline issues?

Yes, changing "Base64 output" to "function output" makes perfect sense.

Thanks!
msg253766 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-30 22:50
I was only referring to the original Python documentation and library. See the base64.encode() implementation for an example which does do this 57-byte pre-chunking. Simplified:

MAXLINESIZE = 76 # Excluding the CRLF
MAXBINSIZE = (MAXLINESIZE//4)*3  # 57
...
while True:
    s = input.read(MAXBINSIZE)
    if not s:
        break
    line = binascii.b2a_base64(s)
    output.write(line)
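
For readers who want to run the loop above, here is a self-contained sketch. The function name encode_stream, the in-memory streams, and the 200-byte sample input are assumptions made for illustration; they are not part of the base64 module.

```python
import binascii
import io

MAXLINESIZE = 76                      # excluding the line ending
MAXBINSIZE = (MAXLINESIZE // 4) * 3   # 57

def encode_stream(source, sink):
    """Hypothetical helper: encode source to sink, 57 input bytes per line."""
    while True:
        chunk = source.read(MAXBINSIZE)
        if not chunk:
            break
        # Each call returns one encoded line ending in b"\n".
        sink.write(binascii.b2a_base64(chunk))

data = bytes(range(200))              # arbitrary sample input
sink = io.BytesIO()
encode_stream(io.BytesIO(data), sink)
lines = sink.getvalue().splitlines()
print([len(line) for line in lines])  # [76, 76, 76, 40]
```

The last line is shorter because only 29 bytes remain after three full 57-byte chunks.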

Here’s my attempt to rewrite the doc (3.6 version):

'''
Convert binary data to the base 64 encoding defined in :rfc:`4648`. The return value includes a trailing newline ``b"\n"`` if *newline* is true.

To be MIME-compliant, base 64 output should be broken into lines at most 76 characters long. One way to do this is to call this function with 57-byte chunks and ``newline=True``. Also, the original PEM context-transfer encoding limited the line length to 64 characters.
'''

But if PEM is long gone as you say, perhaps we don’t need that last sentence?
msg253918 - (view) Author: Mouse (mouse07410) Date: 2015-11-02 14:33
1. I concede knowing nothing about the early Python library implementation, functionality, or even purpose.

2. I don't think it makes sense now to refer to PEM either. We'd be two decades too late for that (well, 27 years, to be precise :). See
 https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail

3. I don't think we are in position to tell programmers how to split a string of characters into 76-long chunks. Not to mention that the example you gave is likely to suffer in performance (just count those function calls), compared to other methods, and won't reflect well on the authors.
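
One of the "other methods" alluded to in point 3 might look like the sketch below: encode everything in a single call, then split the output. The helper name encode_and_wrap and its width parameter are hypothetical, and no performance claim is made here.

```python
import binascii

def encode_and_wrap(data, width=76):
    """Hypothetical helper: encode once, then cut the output into lines."""
    encoded = binascii.b2a_base64(data).rstrip(b"\n")
    return [encoded[i:i + width] for i in range(0, len(encoded), width)]

data = bytes(200)                     # arbitrary sample input
lines = encode_and_wrap(data)
print([len(line) for line in lines])  # [76, 76, 76, 40]
```

Joining the lines gives back exactly the single-call encoding, so the caller is free to pick any line length.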

Here's one possible doc version:

'''
Convert binary data to the base 64 encoding defined in :rfc:`4648`. The return value includes a trailing newline ``b"\n"`` if *newline* is true.

If the output is used as Base64 transfer encoding for MIME (:rfc:`2045`), base 64 output should be broken into lines at most 76 characters long to be compliant. The Base64 encoding standard does not limit the maximum encoded line length.
'''
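
As a small illustration of the *newline* behaviour described in the draft above, assuming Python 3.6 or later (where the keyword exists); the sample input is arbitrary:

```python
import binascii

data = b"any length of input"

with_newline = binascii.b2a_base64(data)                    # default: newline=True
without_newline = binascii.b2a_base64(data, newline=False)  # Python 3.6+

print(with_newline)
print(without_newline)
```

The two results differ only in the trailing b"\n".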
msg253920 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-02 14:56
Add a parenthetical "(57 bytes of the input per line)" and I'll be happy with that.
msg253921 - (view) Author: Mouse (mouse07410) Date: 2015-11-02 14:59
Let's not insinuate anything about the input. This is about what constraints on the OUTPUT MAY be there, not a tutorial from the '80s on how one might accomplish it.
msg253922 - (view) Author: Mouse (mouse07410) Date: 2015-11-02 15:00
And even those constraints depend on the use. E.g. X.509 does not have those.
msg253924 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-11-02 15:15
Please take a look at the Examples section of this:

   http://perldoc.perl.org/MIME/Base64.html

Looks kind of like Martin's suggestion :)
msg253935 - (view) Author: Mouse (mouse07410) Date: 2015-11-02 17:23
1. I am OK with the following text, modeled on the referenced Perldoc:

b2a_base64( $bytes, $eol );

Encode data by calling the encode_base64() function. The first argument is the byte string to encode. 

The second argument is optional, and provides the line-ending sequence to use. When it is given, the returned encoded string is broken into lines of no more than 76 characters each and it will end with $eol unless it is empty. Pass an empty string, or no second argument at all if you do not want the encoded string to be broken into lines.

2. I already had people telling me that "Python-3 doc prohibits input longer than 57 bytes, even though it doesn't currently enforce it". Please help put an end to the spread of this confusion.
msg253955 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-03 01:04
FWIW that Perl function looks like it does the line splitting for you. It mentions “chunks that are a multiple of 57 bytes”. The Python function does not do any line splitting. You have to use base64.encodebytes(), codecs.encode(encoding="base64") or perhaps something in the email package (or user code) for that.
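
A short comparison of the functions mentioned above, as a sketch with an arbitrary 100-byte sample input (Python 3):

```python
import base64
import binascii
import codecs

data = bytes(100)                          # arbitrary sample input

single = binascii.b2a_base64(data)         # no line splitting: one long line
wrapped = base64.encodebytes(data)         # split into lines of at most 76 characters
via_codec = codecs.encode(data, "base64")  # same wrapped form via the codec

print(single.count(b"\n"))                 # 1
print(wrapped.count(b"\n"))                # 2
print(via_codec == wrapped)                # True
```

Apart from the line breaks, all three carry the same encoded characters.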

I think we all agree that there is no hard limit of 57. I have avoided this function in the past due to the documentation. The question is whether the documentation should mention that number in a more accurate context, or not at all.

Personally I don’t see much harm in mentioning the 57-byte input chunking, as long as it is obvious it is not the only option. I don’t have a strong view; I am just trying to be conservative.
msg254011 - (view) Author: Mouse (mouse07410) Date: 2015-11-03 18:36
The harm in mentioning the 57-byte chunking is that so far it successfully confused people. 

The b2a_base64() function is not coupled to MIME. It has no constraints on either its input or its output. *IF* it is used by (in) a MIME application, then the caller may want to make its output RFC 2045-compliant, by whatever way he chooses. Giving (unwelcome) advice to a writer of one specific application is in my opinion completely out of scope here. The justification that it used to matter 25 years ago and therefore should be kept here doesn't make sense to me.

I strongly insist that this "chunking" thing does not belong, and must be removed.
msg254100 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-05 11:02
Perhaps we can focus on the Python 2 version where there is always a newline appended. Here is a possible patch.
msg254118 - (view) Author: Mouse (mouse07410) Date: 2015-11-05 16:50
Unfortunately, NO. The problem (and this bug report) is for Python-3 documentation, so trying to address it in Python-2 rather than in Python-3 does not make sense.

We seem to both understand and agree that there is no length limitation on b2a_base64() input, either recommended or enforced - contrary to what the current Python-3 documentation implies.

We understand that *if* the *output* of this function is intended for use in MIME (rather than X.509 or whatever else Base64 is good for), then the caller should do other things besides calling b2a_base64(), and in all likelihood the caller is already aware of that - after all, if he figured that he needs Base64 in his stuff, he probably knows something about what MIME standards say and require.

I repeat my original complaint: Python-3 documentation is buggy because it implies a restriction on the input that is not there. This reference should be removed from there because it confuses people. 
I've talked to those confused personally, so this is first-hand.

I refer you to the original msg253572 of this bug report.

If you want to write a MIME-in-Python tutorial, it is up to you - but b2a_base64() does not seem to be the right place for it.  
(And I'd rather see an X.509 tutorial if you're dead set on writing something besides strict plain b2a_base64() doc. :-)
msg254122 - (view) Author: Mouse (mouse07410) Date: 2015-11-05 16:56
To add: I do not understand your attachment to that 57 "...(exactly 57 bytes of input data per line)", and request that this parenthesized sentence is removed from your Python-2.7 doc patch. 

Please give the reader the benefit of the doubt, and allow that *if* he wants to repeatedly call b2a_base64() instead of splitting its output - the ability to compute (76 * 3 / 4) is within his skill level.
msg254141 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2015-11-05 20:58
issue25495.base64.2.7.patch looks good to me.  A similar patch can be adapted for 3.x.
msg254148 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-05 23:24
Mouse, I know you originally opened this against 3.5. Apart from the module description at the bottom, my patch should be valid for 3.5 also. The relevant wording is identical to 2.7.

I have resisted removing the magic number 57 for a couple of reasons. Reading existing code that uses this number may be harder. David said he would be happier with it kept. I believed we could solve your original complaint and explain why the number was really there at the same time. It helps explain how the function was originally to be used, and why the newline is appended.

Anyway, I think it is best if I let this go, and someone else pick it up.
msg254195 - (view) Author: Mouse (mouse07410) Date: 2015-11-06 13:51
> my patch should be valid for 3.5 also.
> The relevant wording is identical to 2.7.

OK.

> I have resisted removing the magic number 57 for a couple
> of reasons. Reading existing code that uses this number may
> be harder.

You expect to see "existing code that uses this number" in Python-3.5+? Interesting... (Care to point me at a couple of samples of such "existing" Python-3 code?) And you expect that the main info source for understanding the reason behind that "57" (assuming this function is invoked that way, as opposed to splitting the output :) would be the doc for this function, rather than the main program, or RFC 2045, or...? Fine.

> It helps explain how the function was originally to be used,
> and why the newline is appended.

Pardon me, but why do you think anybody would care...? There are tons of functions, old and new, with more new ones popping up fast enough. I'd really envy a person who has time to enjoy history of one minuscule function of an old (albeit still useful :) library.

OK. You think a history of this function should be documented - fine. I don't need it (and don't think anybody else wants to read it either), but it's not my doc or my decision.

Just get the darn bug fixed.
msg254820 - (view) Author: Mouse (mouse07410) Date: 2015-11-17 23:20
Status...?
msg256021 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-06 18:30
See also Issue 1753718.  I will try to review both of these issues soon.
msg256348 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-12-13 23:15
New changeset 7b137466e879 by R David Murray in branch '2.7':
#25495: Clarify b2a_base64 documentation vis 57 bytes.
https://hg.python.org/cpython/rev/7b137466e879

New changeset 3d5bf9bd15a3 by R David Murray in branch '3.4':
#25495: Clarify b2a_base64 documentation vis 57 bytes.
https://hg.python.org/cpython/rev/3d5bf9bd15a3

New changeset ea9951598bab by R David Murray in branch '3.5':
Merge: #25495: Clarify b2a_base64 documentation vis 57 bytes.
https://hg.python.org/cpython/rev/ea9951598bab

New changeset 35650db28afe by R David Murray in branch 'default':
Merge: #25495: Clarify b2a_base64 documentation vis 57 bytes.
https://hg.python.org/cpython/rev/35650db28afe
msg256349 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-13 23:16
I kept the 57 as part of an historical note explaining why the newline is added. I dropped that sentence in the 3.6 docs, where a keyword to control the appending of the newline has been added.
msg256354 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-14 01:30
Thanks for fixing this up David
History
Date User Action Args
2015-12-14 01:30:39 martin.panter set messages: + msg256354
2015-12-13 23:16:59 r.david.murray set status: open -> closed; resolution: fixed; messages: + msg256349; stage: patch review -> resolved
2015-12-13 23:15:07 python-dev set nosy: + python-dev; messages: + msg256348
2015-12-06 18:30:55 r.david.murray set messages: + msg256021
2015-11-17 23:20:01 mouse07410 set messages: + msg254820
2015-11-06 13:51:01 mouse07410 set messages: + msg254195
2015-11-05 23:24:19 martin.panter set messages: + msg254148
2015-11-05 20:58:42 georg.brandl set nosy: + georg.brandl; messages: + msg254141
2015-11-05 16:56:20 mouse07410 set messages: + msg254122
2015-11-05 16:50:59 mouse07410 set messages: + msg254118
2015-11-05 11:02:56 martin.panter set files: + issue25495.base64.2.7.patch; messages: + msg254100
2015-11-03 18:36:45 mouse07410 set messages: + msg254011
2015-11-03 01:04:28 martin.panter set messages: + msg253955
2015-11-02 17:23:24 mouse07410 set messages: + msg253935
2015-11-02 15:15:52 r.david.murray set messages: + msg253924
2015-11-02 15:00:46 mouse07410 set messages: + msg253922
2015-11-02 14:59:49 mouse07410 set messages: + msg253921
2015-11-02 14:56:20 r.david.murray set messages: + msg253920
2015-11-02 14:33:17 mouse07410 set messages: + msg253918
2015-10-30 22:50:38 martin.panter set messages: + msg253766
2015-10-30 16:29:08 mouse07410 set messages: + msg253744
2015-10-30 03:04:45 martin.panter set versions: + Python 2.7, Python 3.4, Python 3.6; nosy: + martin.panter; messages: + msg253710; stage: patch review
2015-10-28 20:44:03 mouse07410 set messages: + msg253627
2015-10-28 20:18:08 matrixise set files: + issue25495.patch; nosy: + matrixise; messages: + msg253625; keywords: + patch
2015-10-28 16:10:29 mouse07410 set messages: + msg253607
2015-10-28 00:52:04 r.david.murray set nosy: + r.david.murray; messages: + msg253577
2015-10-27 23:10:17 mouse07410 create