Message 32496 - Python tracker

Author	slinkp
Date	2007-07-13.18:13:54

Content
(Python 2.5.1 and earlier)
Apologies for how long this is, but cleaning up history is hard.

There seems to be a lot of historical confusion around Base64 encoding, and unfortunately the Python library docs reflect that confusion.

The base64 module docs (at http://docs.python.org/lib/module-base64.html ) claim that the module implements RFC 3548 (see http://www.faqs.org/rfcs/rfc3548.html ); heck, it's even in the page title.
(I'll quickly note here that RFC 3548 has recently been obsoleted by RFC 4648, but for the purposes of this bug report the two RFCs say the same thing, so I'll just refer to 3548.)

But the "legacy" functions, encode() and encodestring(), add line feeds every 76 characters. That is a violation of RFC 3548, which specifically says "Implementations MUST NOT add line feeds to base-encoded data". RFC 4648 says the same thing.

Obviously we can't change the behavior of the legacy functions, but I strongly feel the docs should warn you about this violation.

What encode() and encodestring() actually implement is MIME Base64 encoding, as per RFC 2045 (see http://tools.ietf.org/html/rfc2045#section-6.8 ; RFC 2045 obsoletes RFCs 1521, 1522, and 1590).
So base64.encodestring() is, AFAICT, functionally identical to email.base64mime.encodestring() (tangent: someday we should consolidate those two functions into one implementation).

What RFC 3548 describes IS implemented by the "modern" interface, e.g. base64.b64encode(), which does not split its output into lines.

There's also the lower-level binascii.b2a_base64() function, which AFAICT correctly implements RFC 3548 (although it adds a newline at the end, which base64.b64encode() does not; it's not clear to me which is correct per RFC 3548).
At one time, b2a_base64() DID split its output into lines, but that was fixed: http://sourceforge.net/tracker/index.php?func=detail&aid=473009&group_id=5470&atid=105470
Unfortunately, its docs still mistakenly say
"The length of data should be at most 57 to adhere to the base64 standard.", which presumably refers to the old PEM spec that predates even MIME.

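The b2a_base64() behavior described above can be checked with a sketch like the following. It assumes a current Python where b2a_base64() no longer wraps, and describe_b2a is a made-up helper name. The input is deliberately longer than 57 bytes.

```python
import base64
import binascii


def describe_b2a(data):
    """Summarize how binascii.b2a_base64() output relates to base64.b64encode()."""
    raw = binascii.b2a_base64(data)
    return {
        "newline_count": raw.count(b"\n"),
        "ends_with_newline": raw.endswith(b"\n"),
        # Apart from the single trailing newline, the two outputs should agree.
        "matches_b64encode": raw.rstrip(b"\n") == base64.b64encode(data),
    }


print(describe_b2a(b"x" * 100))
```

If this holds, the only difference from b64encode() is the one trailing newline, and there is no 57-byte limit on the input.
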
So I propose several doc changes:

1) Add a reference to RFC 3548 and/or 4648 to the binascii docs, and remove the cruft sentence about "at most 57".

2) Add a sentence to the base64 docstrings for encode() and encodestring() like this:

"Newlines will be inserted in the output every 76 characters, as per RFC 2045 (MIME)."


3) Change the introductory text in the base64 docs to something like this:

"""There are two interfaces provided by this module. The modern interface supports encoding and decoding string objects using all three alphabets.  The modern interface, per RFC 3548, does NOT break the output into multiple lines. 

The legacy interface provides for encoding and decoding to and from file-like objects as well as strings, but only using the Base64 standard alphabet, and adds newlines every 76 characters as per RFC 2045 (MIME).
Thus, the legacy interface is not compliant with the newer RFC 3548.

If you are encoding data for email attachments and wondering which to use, you can use the legacy functions, but you should probably be using the email package instead.
"""

History
Date	User	Action	Args
2007-08-23 14:58:34	admin	link	issue1753718 messages
2007-08-23 14:58:34	admin	create