classification
Title: email parser incorrectly breaks headers with a CRLF at 8192
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: BreamoreBoy, anadelonbrin, barry, chris.eveleigh, graham_king, guettli, jdunck, kmtracey, pmoore, r.david.murray, tony_nelson
Priority: normal Keywords: patch

Created on 2006-09-09 23:41 by anadelonbrin, last changed 2010-07-17 01:42 by r.david.murray. This issue is now closed.

Files
File name Uploaded Description Edit
parser.py.patch anadelonbrin, 2006-09-09 23:41 Patch for lib/email/parser.py (SVN 10/09/06)
test_email.py.patch anadelonbrin, 2006-09-09 23:42 Test case.
msg_45.txt anadelonbrin, 2006-09-09 23:43 Data for test case.
feedparser_pushcr_pushlf.patch tony_nelson, 2009-04-02 19:40 FeedParser.feed() handle feed("\r") feed("\n") as one line
Messages (10)
msg51105 - (view) Author: Tony Meyer (anadelonbrin) Date: 2006-09-09 23:41
If a message has a CRLF as part of a header that starts
at 8191, the parser will incorrectly consider the
headers to finish at 8191 and the body to start at
8193, which leaves headers in the body of the message.

This problem occurs because the parser reads 8192
characters at a time.  If 8192 is a '\r' and 8193 a
'\n', then when the second block is parsed, it will
appear to be a blank line (i.e. header separator).

The simplest fix for this is to just read an extra
character if the last one read is a '\r'.  This appears
to work fine without breaking anything, although I
suppose that an alternative would be to change the
FeedParser to check whether the '\n' belonged with the
previous data.

A patch and test are attached, against SVN of 10/Sept/06.
msg51106 - (view) Author: Graham King (graham_king) Date: 2006-12-06 11:30
 I had exactly the same problem parsing a POST encoded as 'multipart/form-data' with lots of parts. The patch provided fixes it. I'm on ActivePython 2.4.3 so I applied the patch by hand.

 +1 for integrating this patch into Python.

 Thanks,
 Graham King.
msg51107 - (view) Author: Jeremy Dunck (jdunck) Date: 2006-12-06 17:54
We're hitting this bug as well.
It affects Django, which handles media upload over HTTP by parsing multipart with email.message_from_string.

pegasusnews.com, FWIW.
msg51108 - (view) Author: Paul Moore (pmoore) * Date: 2007-03-20 09:23
The documentation isn't clear, but I would say that the code is written on the basis that the file is opened in text or universal newline mode (so that CRLF will ve translated to \n). On that assumption, the patch is unnecessary, as CRLF will never appear in the data. (Actually, the existence of Utils.fix_eols makes me wonder...)

On the other hand, 99% of the time, the code will work fine if a CRLF file is opened in binary mode. This patch fixes one of the obscure corner cases, and (as far as I can see) does no real harm - I can't imagine a *real* mail file containing literal CR bytes which should be preserved!

The documentation should be updated to require that the file is in text or universal newline mode, and note that files with CRLF opened in binary mode could be processed incorrectly.

I'm probably +0 on applying this patch, on the basis that it makes the code more robust given the current lack of any documentation of the required file mode. If the documentation is patched as above, I'd be -0.

If the rest of the email module is known not to robust in the face of files with CRLF opened in binary mode, the documentation should definitely be changed, and that fact made very clear. (This patch becomes irrelevant then).

BTW, the test case looks wrong - surely the line 'self.assertNotEqual(data[8193], "\n")' should check for a \r? I assume it's asserting that the line after the break is not the end of the headers - if so, that line will be \r\n as the earlier translation changes LF to CRLF.

Summary: CRLF handling is a bit of a can of worms - the module documentation could do with being clearer in any case, but the patch is probably good to apply on the basis that it fixes a small corner case without causing harm elsewhere. The test should be fixed as above.
msg64581 - (view) Author: Thomas Guettler (guettli) Date: 2008-03-27 11:37
I was hit by this bug in Django. The ticket URL:

http://code.djangoproject.com/ticket/6256

It would be nice if this could be fixed.
msg64864 - (view) Author: Karen Tracey (kmtracey) Date: 2008-04-02 15:37
Opening the file in universal newline mode doesn't work for cases where
the 'file' contains multipart MIME data (eg. multipart/form-data) where
one of the included parts is binary data (eg. application/octet-stream).
 In that case, blind translation of CRLF to LF may corrupt the binary
data.  (Thanks to Thomas Guettler for pointing that out to me.)

FeedParser goes to considerable trouble to split on any conceivable line
boundary but retain whatever line boundary existed in the stream when
putting things back together.  (Look at BufferedSubFile's push() code in
feedparser.py.)  It was not written on the assumption that it would be
getting LFs only.  

The only code that knows enough to know which CRLFs are really line
breaks is the code that is breaking the stream up based on the boundary
markers -- that is the FeedParser code.  It isn't safe for the caller to
do any CRLF conversions before calling the Parser.  Therefore I believe
the fix needs to be made to the parser.py code, not the docs.

Two people that I know of independently re-discovered this bug in the
last couple of weeks (running Django), after I re-discovered it about
three months ago after Jeremy Dunck re-discovered it a year earlier,
three months after it was originally opened.  Maybe a corner case, but
it would be nice, since it is quite difficult for people to track down,
and the fix is so trivial, if the fix could be put in.
msg85254 - (view) Author: Tony Nelson (tony_nelson) Date: 2009-04-02 19:40
The OP's diagnosis of a buffer boundary problem is correct, but
incomplete.  The problem can be reproduced by calling feedparser
FeedParser.feed() directly, or as my patch test does, by calling
BufferedSubFile.push() directly.  The proper fix is for push() to treat
a last line ending in CR as a partial line, as it does if no part of a
line ending is present.  The OP's patch only works when FeedParser is
called through the old Parser interface.
msg97666 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-12 22:13
See also issue 1721862, which has a different test and patch.  This one seems simpler.
msg110499 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-16 21:13
There are repeated statements that this impacts on Django which I understand is high profile.  Can we find the resources to review the patches and get things moving, none of the attached patch files are that large?
msg110534 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-17 01:42
Committed Tony Nelson's fix to py3k in r82922, 3.1 in r72923, 2.7 in r82924, and 2.6 in r82925.
History
Date User Action Args
2010-07-17 01:42:04r.david.murraysetstatus: open -> closed
versions: + Python 2.6
messages: + msg110534

resolution: fixed
stage: patch review -> resolved
2010-07-16 21:13:53BreamoreBoysetnosy: + BreamoreBoy

messages: + msg110499
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2010-05-05 13:35:48barrysetassignee: barry -> r.david.murray
2010-01-12 22:13:36r.david.murraysetnosy: + r.david.murray
messages: + msg97666

type: behavior
stage: patch review
2010-01-12 22:12:15r.david.murraylinkissue1721862 superseder
2009-06-02 09:16:44amaury.forgeotdarclinkissue6153 superseder
2009-04-02 19:40:40tony_nelsonsetfiles: + feedparser_pushcr_pushlf.patch
nosy: + tony_nelson
messages: + msg85254

2008-07-03 14:52:10chris.eveleighsetnosy: + chris.eveleigh
2008-04-02 15:37:48kmtraceysetmessages: + msg64864
2008-03-27 11:37:05guettlisetnosy: + guettli
messages: + msg64581
2007-12-21 14:17:29kmtraceysetnosy: + kmtracey
2006-09-09 23:41:43anadelonbrincreate