classification
Title: get_payload(decode=True) eats last newline in base64 encoded payload
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: Joaquin.Cuenca.Abela, athomas, barry, esam, r.david.murray
Priority: normal Keywords: easy, patch

Created on 2009-10-15 18:06 by athomas, last changed 2010-03-08 08:14 by Joaquin.Cuenca.Abela. This issue is now closed.

Files
File name Uploaded Description Edit
test.eml athomas, 2009-10-15 18:06 Sample email
b64crlf.patch Joaquin.Cuenca.Abela, 2010-03-04 16:03
b64crlf-2.patch Joaquin.Cuenca.Abela, 2010-03-04 22:22
Messages (12)
msg94100 - (view) Author: Andreas Thomas (athomas) Date: 2009-10-15 18:06
The attachment of my sample email contains '123\n', but get_payload
(decode=True) returns '123'.
The source of this error is the function _bdecode from the email.utils 
module, which gets called for decoding base64 encoded payloads. Here 
the last \n gets stripped away, because (comment in the function):

# We can't quite use base64.encodestring() since it tacks on a "courtesy
# newline".  Blech!

I don't think that's true. At least not for Python 2.5.4 and upwards. 
Also note the "encodestring" instead of "decodestring".
In fact calling

base64.decodestring(part.get_payload())

returns the correct value, so I propose to remove this function 
completely and just use the standard decoding method.
Please note that base64.encodestring really tacks on a \n.
msg97570 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-11 01:47
The behavior you object to is in fact correct per RFC 2046, 5.5.1:

   The boundary delimiter MUST occur at the beginning of a line, i.e.,
   following a CRLF, and the initial CRLF is considered to be attached
   to the boundary delimiter line rather than part of the preceding
   part.

So in your example message your attatched file does not in fact have a final newline.  Note that you do not get a newline regardless of whether decode is true ('p' is the part extracted from your sample message):

>>> p.get_payload()
'MTIzCg=='
>>> p.get_payload(decode=True)
'123'
msg100338 - (view) Author: Joaquin Cuenca Abela (Joaquin.Cuenca.Abela) Date: 2010-03-03 11:22
Hi,

RFC 2046, 5.1.1 refers to the CRLF that happens just before the boundary. It says nothing about an encoded CRLF.

From Andreas example, if you have:

Content-Type: text/plain; name="test.txt"
Content-Transfer-Encoding: base64
Content-Disposition: inline; filename="test.txt"

MTIzCg==
--------------040103020004000509010404--

You are correctly eating the CRLF that exists between MTIzCg== and --------------040103020004000509010404--, because it's part of the boundary.

You are also eating the CRLF that is inside the base64 encoded text, and I agree with Andreas that this is incorrect. Will you please consider reopening this bug?
msg100343 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-03 14:08
Ah, I misunderstood, and did not see that the newline in question was inside the base64 string.  Thank you for pointing out my mistake.

Would either of you like to propose a patch, including a test case?

(I've removed 2.5 because it is in security-fix-only mode).
msg100394 - (view) Author: Joaquin Cuenca Abela (Joaquin.Cuenca.Abela) Date: 2010-03-04 16:03
Hi,

I've never before made a patch to Python, so take it with care.

A couple of comments, I reused a test where all the attachments contained an ending newline, except for the base64 one (conveniently...)

I think the comment in _bdecode that Andreas quoted before refers to base64.encodestring add an extra \n to de encoded string, but base64.decodestring can work with and without this extra \n:

>>> import base64
>>> base64.decodestring('MTIzCg==')
'123\n'
>>> base64.decodestring('MTIzCg==\n')
'123\n'

So it's not necessary to work around this in _bdecode. Let me know if it looks good to you.
msg100403 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-04 18:01
Thanks for working on this.

For the most part your patch looks fine.  Two comments: (1) it concerns me that by co-opting the existing test, we are no longer testing that decoding does not introduce a spurious newline :).  (2) I think we should add a comment in _bdecode that it used to do more and is retained now only for backward compatibility.

I don't particularly like proliferating test files, and I plan to fix that in email6, but for now I think you should just add a new test file based on msg_10, and a test that uses it.
msg100427 - (view) Author: Joaquin Cuenca Abela (Joaquin.Cuenca.Abela) Date: 2010-03-04 22:22
I added a new subpart to msg_10.txt, that keeps the previous test and also tests the new behavior. Let me know if it's ok like this or if you still prefer to create a different msg file for testing this.

Thanks,
msg100592 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-07 16:58
Your patch looks good, thank you.

I just realized that Barry isn't nosy on this issue.  I've checked, and the code in question dates back to email version 1.0...code of that long standing that exists specifically to implement the behavior we propose to change makes me wonder what we will break if we fix it.  Yet the current behavior seems clearly contrary to the RFCs to me.

Barry?  Any memory of why _bdecode exists and what will break if we fix it?

In any case I suspect that backporting this fix might be a bad idea, since existing code may be compensating for the current behavior and break if the behavior is fixed.
msg100596 - (view) Author: Joaquin Cuenca Abela (Joaquin.Cuenca.Abela) Date: 2010-03-07 18:04
Unfortunately the only way that I can see to reliably work around this is to bypass entirely get_payload, in this case fixing this bug will not affect people that do that negatively.

Some people may have more control over their attachments and they are unconditionally adding a \n to certain type of attachments. In this case they will get an extra \n after this patch.

If they are doing this, they certainly realized there was a bug, because this is only necessary for base64 encoded attachments and not for any other encoding. In this case I guess it's reasonable to expect that this bug will be fixed one day.

But you're right that the change is risky, an that we (at least I) are not sure why the code is the way it is.

Cheers,
msg100614 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2010-03-07 21:46
Thanks for adding me to the nosy list.  Yep, this code is pretty old so it doesn't surprise me that its implementation isn't quite right.  Of course, I hate get_payload(decode=True) anyway and hope that goes away in email 6.

Having said that, I don't think this can be changed in Python 2.6.  It's pretty common for people to have workarounds for the devil they know and I'd bet that changing this behavior in 2.6.x would break people's code.  I'm okay for doing the sensible thing in Python 2.7 (though we should probably bump the email package's micro version).
msg100626 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-08 02:23
OK, patch (with comment tweaks to refer to this issue), and email minor version bump, applied to trunk in r78778.  It turns out that bdecode was already deleted in email 5 in py3k.  I did port the test in r78780.

Thanks Joaquin Cuenca Abela, and you now have the interesting distinction of being the first name listed in Misc/ACKS (because we've been maintaining it in sorted order by last name....)
msg100631 - (view) Author: Joaquin Cuenca Abela (Joaquin.Cuenca.Abela) Date: 2010-03-08 08:14
Thanks David and Barry!

As much as it flatters my ego to be in the first place is MISC/ACK, I have to confess that my family name is "Cuenca Abela", Cuenca is not a middle name.

Cheers,
History
Date User Action Args
2010-03-08 08:14:47Joaquin.Cuenca.Abelasetmessages: + msg100631
2010-03-08 02:23:34r.david.murraysetstatus: open -> closed
versions: - Python 3.2
messages: + msg100626

resolution: fixed
stage: patch review -> resolved
2010-03-07 21:46:08barrysetmessages: + msg100614
2010-03-07 18:04:02Joaquin.Cuenca.Abelasetmessages: + msg100596
2010-03-07 16:58:17r.david.murraysetnosy: + barry

messages: + msg100592
versions: - Python 2.6, Python 2.5, Python 3.1
2010-03-04 22:22:46Joaquin.Cuenca.Abelasetfiles: + b64crlf-2.patch

messages: + msg100427
versions: + Python 2.5
2010-03-04 18:01:59r.david.murraysetmessages: + msg100403
stage: test needed -> patch review
2010-03-04 16:03:45Joaquin.Cuenca.Abelasetfiles: + b64crlf.patch
keywords: + patch
messages: + msg100394
2010-03-03 14:12:06r.david.murraysetversions: - Python 3.0
2010-03-03 14:08:55r.david.murraysetstatus: closed -> open

assignee: r.david.murray
title: get_payload(decode=True) eats last newline -> get_payload(decode=True) eats last newline in base64 encoded payload
keywords: + easy
resolution: not a bug -> (no value)
versions: - Python 2.5
messages: + msg100343
stage: resolved -> test needed
2010-03-03 11:22:56Joaquin.Cuenca.Abelasetnosy: + Joaquin.Cuenca.Abela
messages: + msg100338
2010-01-11 01:47:21r.david.murraysetstatus: open -> closed
priority: normal


nosy: + r.david.murray
messages: + msg97570
resolution: not a bug
stage: resolved
2009-10-16 06:47:02esamsetnosy: + esam
2009-10-15 18:06:47athomascreate