classification
Title: rfc822 long header continuation broken
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder: Provisional new email API: new policy implementing custom header objects
View: 12586
Assigned To: r.david.murray Nosy List: BreamoreBoy, barry, loewis, r.david.murray, richard
Priority: normal Keywords: patch

Created on 2002-01-16 01:31 by richard, last changed 2012-05-16 01:55 by r.david.murray. This issue is now closed.

Files
File name Uploaded
rfc822.diff gvanrossum, 2007-08-30 03:36
email_test.diff BreamoreBoy, 2010-08-17 20:23
Messages (15)
msg8766 - (view) Author: Richard Jones (richard) * (Python committer) Date: 2002-01-16 01:31
I don't believe this is fixed in 2.1.2 or 2.2, but
haven't checked.

The code in rfc822.Message.readheaders incorrectly
unfolds long message headers. The relevant information
from rfc2822 is in section 2.2.3. In short:

"""
The process of moving from this folded multiple-line
representation of a header field to its single line
representation is called "unfolding". Unfolding is
accomplished by simply removing any CRLF that is
immediately followed by WSP.  Each header field should
be treated in its unfolded form for further syntactic
and semantic evaluation.
"""

This means that the code in readheaders:

            if headerseen and line[0] in ' \t':
                # It's a continuation line.
                list.append(line)
                x = (self.dict[headerseen] + "\n " +
                     line.strip())
                self.dict[headerseen] = x.strip()
                continue

should be:

            if headerseen and line[0] in ' \t':
                # It's a continuation line.
                list.append(line)
                x = self.dict[headerseen] + line
                self.dict[headerseen] = x.strip()
                continue

i.e., no stripping of the leading whitespace and no
adding the newline.
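
For reference, the unfolding rule quoted above from RFC 2822 section 2.2.3 can be sketched as a standalone function. This is only an illustration, not code from rfc822 or the email package: the name unfold is hypothetical, and accepting a bare LF as well as CRLF is an assumption made because parsers often see text whose line endings were already normalized.

```python
import re


def unfold(raw_header: str) -> str:
    """Remove every line break that is immediately followed by a space or tab.

    Hypothetical helper: the RFC speaks of CRLF only; a bare LF is accepted
    here as an assumption.
    """
    # The lookahead keeps the whitespace character itself, so only the
    # line break is dropped, as the RFC text describes.
    return re.sub(r"\r?\n(?=[ \t])", "", raw_header)


folded = (
    "Subject: RE: [issue51] Mails being delayed\r\n"
    " [assignedto=stuartm;priority=medium]"
)
unfolded = unfold(folded)
print(unfolded)
```

The whitespace that introduced the continuation line survives, which is the difference from the readheaders code quoted above.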
msg8767 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-01-16 02:47
Richard, have you found a situation where it matters? I
thought that usually the next phase calls for normalizing
whitespace by squashing repeated spaces/tabs and removing
them from front and back.
msg8768 - (view) Author: Richard Jones (richard) * (Python committer) Date: 2002-01-16 12:12
Yes, we had someone submit a bug report on the roundup 
users mailing list because someone had sent a message to 
the roundup mail gateway which was split. The client was 
extra-specially broken, since it split in the middle of a 
word (which is not to spec), but the more general case of 
folding on whitespace will cause roundup problems since I 
hadn't expected there to be any newlines in the header.

I can modify roundup to strip out the newline, but it'd be 
nice to have rfc822.Message not put it in there...

msg8769 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-01-16 12:14
Even though it might not matter, I don't think changing it
would hurt, either, and the change brings it definitely
closer to following the word of RFC 2822. 

If no case is brought forward where it matters, fixing it
for 2.3 alone should be sufficient.
msg8770 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2002-04-15 17:28
There is some value in not unfolding long lines by default.
 FWIW, the email package also retains the line breaks for
such multi-line headers.  The advantage to retaining this is
that message input/output can be idempotent (i.e. you get
the same thing in as you get out).  This can be useful when
using the message to generate a hash value, and for other
user-friendly reasons.

That being said, there is also some use in providing a way
to return the unfolded line.  I don't see a lot of benefit
in adding such a feature to the rfc822 module, but I could
see adding it to the email package.  Specifically, I would
propose to add it to the email.Header.Header class, either
as a separate method (e.g. Header.unfold()) or as a default
argument to the Header.encode() method (e.g.
Header.encode(self, unfold=0)).

If we did the latter, then I'd change Header.__str__() to
call .encode(unfold=1).

Assigning to Ben to get his feedback.  Ben, feel free to
comment and re-assign this bug to me.
msg8771 - (view) Author: Richard Jones (richard) * (Python committer) Date: 2003-11-10 21:35
Hurm. This issue has been lost to the void, but it's as valid today as it 
ever was. I've just had another user of Roundup run into the same thing: 
 
 RE: [issue51] Mails being delayed 
[assignedto=stuartm;priority=medium]  
 
(that should be one long line) became 
 
 RE: [issue51] Mails being delayed [assignedto=stuartm;priority=me 
dium] 
 
when sent by Outlook. Note that the current code reconstructs that line 
as "me\ndium" which is about as wrong as it can get, as there's no way 
for my code to determine whether that *should* be "me dium" or 
"medium" since the other whitespace has been stripped (so just stripping 
out the newline, as my code currently does, doesn't help). 
 
I stand by my original post, requesting that the code be fixed as stated. 
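
The loss of information described here can be reproduced in isolation. The sketch below is not the rfc822 module (which no longer exists in Python 3); old_join and proposed_join are hypothetical stand-ins that apply only the string operations from the two snippets in msg8766 to one header value and one continuation line.

```python
def old_join(value: str, continuation: str) -> str:
    """Mimic the existing behaviour: strip the continuation, join with newline + space."""
    return (value + "\n " + continuation.strip()).strip()


def proposed_join(value: str, continuation: str) -> str:
    """Mimic the requested behaviour: append the continuation line unchanged."""
    return (value + continuation).strip()


# Two continuation lines that differ only in their leading whitespace.
one_space = " dium]\n"
tab_and_spaces = "\t  dium]\n"

# The old join maps both inputs to the same value, so the caller cannot
# recover what whitespace the sender actually used.
old_a = old_join("priority=me", one_space)
old_b = old_join("priority=me", tab_and_spaces)

# The proposed join keeps the original whitespace and adds no newline.
new_a = proposed_join("priority=me", one_space)
new_b = proposed_join("priority=me", tab_and_spaces)

print(repr(old_a), repr(old_b))
print(repr(new_a), repr(new_b))
```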
 
msg8772 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2003-11-10 22:04
Since this was never addressed in the email package either,
perhaps you'd like to bring it up in the email-sig?
msg8773 - (view) Author: Richard Jones (richard) * (Python committer) Date: 2003-11-10 22:28
OK, I've sent a message, but I don't have the time to sign up to the list. 
msg55454 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-08-30 02:14
Is this still an issue?  No activity since 2003-11.
msg55457 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-08-30 03:36
How about this patch? It basically does

  self.dict[headerseen] += line.rstrip()

which should be what the RFC prescribes, maintaining the invariant that
the dict values don't end in whitespace.

I haven't written a test for this behavior.
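
A minimal sketch of what that one-line change does, written as a standalone header reader rather than as the attached rfc822.diff. The function read_headers and its lower-cased keys are assumptions for illustration only; just the continuation branch reflects the line quoted above.

```python
def read_headers(lines):
    """Collect headers from an iterable of raw lines (hypothetical helper)."""
    headers = {}
    current = None  # lower-cased name of the header seen most recently
    for line in lines:
        if not line.strip():
            break  # a blank line ends the header block
        if current and line[0] in " \t":
            # Continuation line: keep its leading whitespace, drop only the
            # trailing line ending, so stored values never end in whitespace.
            headers[current] += line.rstrip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            current = None  # not a header line; ignore it in this sketch
            continue
        current = name.strip().lower()
        headers[current] = value.strip()
    return headers


message = [
    "Subject: RE: [issue51] Mails being delayed\r\n",
    " [assignedto=stuartm;priority=medium]\r\n",
    "To: roundup@example.com\r\n",
    "\r\n",
    "body text\r\n",
]
parsed = read_headers(message)
print(parsed["subject"])
```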
msg55524 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-08-31 02:40
Barry, what do you think of this patch?  How does the email package
handle this case?
msg71573 - (view) Author: Kenneth Arnold (kcarnold) Date: 2008-08-20 20:50
This issue still seems to be present in Python 2.5's email module.

feedparser.py line 444-445 says:

# XXX reconsider the joining of folded lines
lhdr = EMPTYSTRING.join(lastvalue)[:-1].rstrip('\r\n')

I think that should be something like:
lhdr = EMPTYSTRING.join(ln.rstrip('\r\n') for ln in lastvalue)[:-1])

The resulting headers still need a fair amount of massaging, though; why
not just use Header instances for the headers?
msg85676 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-06 23:08
The source lines mentioned in this issue have not been changed in trunk,
and the feedparser line has not been changed in py3k as of r71355
(rfc822 no longer exists, so I'm not sure if the replacement code in
py3k has the issue or not).
msg114158 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-17 20:23
Confirmed that this is still an issue in py3k: I added two extra test cases to TestParsers using the email subject line from msg8771 and got two failures.  I tried several variations of the patch from msg71573 (in the original the parentheses are unbalanced), which pushed the number of failures to over 80.  I've attached a patch against the unit test file; note that the comments will need changing.
msg160794 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-16 01:55
The email package no longer strips the leading whitespace.  It doesn't unfold the headers, but changing that at this stage has untenable consequences.

However, the email package in 3.3 will have a provisional API that will provide the correct unfolding of headers in an easy to use way.  So I'm closing this in favor of the issue where I will add that API.
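
As a hedged illustration of the difference described above, the following compares the default (compat32) policy with email.policy.default, which is assumed here to be the policy that exposes the new header objects in the Python version in use. The subject line is the one from msg8771; the variable names are arbitrary.

```python
import email
from email import policy

raw = (
    "Subject: RE: [issue51] Mails being delayed\n"
    " [assignedto=stuartm;priority=medium]\n"
    "\n"
    "body\n"
)

# Default policy (compat32): the folded value is returned as stored,
# line break included.
legacy = email.message_from_string(raw)
legacy_subject = legacy["Subject"]

# policy.default: the header object is a str subclass holding the unfolded value.
modern = email.message_from_string(raw, policy=policy.default)
modern_subject = str(modern["Subject"])

print(repr(legacy_subject))
print(repr(modern_subject))
```

Code that needs the exact on-the-wire form for round-tripping can keep using the default policy, while code like the Roundup mail gateway that wants a single logical line can opt into the newer one.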
History
Date User Action Args
2012-05-16 01:55:58 r.david.murray set status: open -> closed
superseder: Provisional new email API: new policy implementing custom header objects
resolution: fixed
messages: + msg160794
2011-03-13 22:52:43 r.david.murray set nosy: loewis, barry, richard, r.david.murray, BreamoreBoy
versions: + Python 3.3
2010-08-18 00:05:25 kcarnold set nosy: - kcarnold
2010-08-17 23:57:32 gvanrossum set nosy: - gvanrossum
2010-08-17 20:23:12 BreamoreBoy set files: + email_test.diff
versions: + Python 3.2, - Python 2.6, Python 3.0
nosy: + BreamoreBoy
messages: + msg114158
stage: test needed -> needs patch
2010-05-20 20:39:30 skip.montanaro set nosy: - skip.montanaro
2010-05-05 13:46:28 barry set assignee: barry -> r.david.murray
2009-04-06 23:08:49 r.david.murray set versions: + Python 3.1, Python 2.7, - Python 2.5, Python 2.4, Python 2.3
nosy: + r.david.murray
messages: + msg85676
type: behavior
stage: test needed
2008-08-20 20:50:12 kcarnold set nosy: + kcarnold
messages: + msg71573
2007-09-13 23:23:09 brett.cannon set keywords: + patch
2007-08-31 02:40:05 gvanrossum set messages: + msg55524
2007-08-30 03:36:58 gvanrossum set versions: + Python 2.6, Python 2.5, Python 2.4, Python 3.0
2007-08-30 03:36:08 gvanrossum set files: + rfc822.diff
messages: + msg55457
2007-08-30 02:14:06 skip.montanaro set nosy: + skip.montanaro
messages: + msg55454
2002-01-16 01:31:17 richard create