Issue 1555570: email parser incorrectly breaks headers with a CRLF at 8192

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43959

classification

Title:	email parser incorrectly breaks headers with a CRLF at 8192
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.1, Python 3.2, Python 2.7, Python 2.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	r.david.murray	Nosy List:	BreamoreBoy, anadelonbrin, barry, chris.eveleigh, graham_king, guettli, jdunck, kmtracey, paul.moore, r.david.murray, tony_nelson
Priority:	normal	Keywords:	patch

Created on 2006-09-09 23:41 by anadelonbrin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
parser.py.patch	anadelonbrin, 2006-09-09 23:41	Patch for lib/email/parser.py (SVN 10/09/06)
test_email.py.patch	anadelonbrin, 2006-09-09 23:42	Test case.
msg_45.txt	anadelonbrin, 2006-09-09 23:43	Data for test case.
feedparser_pushcr_pushlf.patch	tony_nelson, 2009-04-02 19:40	FeedParser.feed() handle feed("\r") feed("\n") as one line

Messages (10)
msg51105 - (view)	Author: Tony Meyer (anadelonbrin)	Date: 2006-09-09 23:41
If a message has a CRLF as part of a header that starts at 8191, the parser will incorrectly consider the headers to finish at 8191 and the body to start at 8193, which leaves headers in the body of the message. This problem occurs because the parser reads 8192 characters at a time. If 8192 is a '\r' and 8193 a '\n', then when the second block is parsed, it will appear to be a blank line (i.e. header separator). The simplest fix for this is to just read an extra character if the last one read is a '\r'. This appears to work fine without breaking anything, although I suppose that an alternative would be to change the FeedParser to check whether the '\n' belonged with the previous data. A patch and test are attached, against SVN of 10/Sept/06.
msg51106 - (view)	Author: Graham King (graham_king)	Date: 2006-12-06 11:30
I had exactly the same problem parsing a POST encoded as 'multipart/form-data' with lots of parts. The patch provided fixes it. I'm on ActivePython 2.4.3 so I applied the patch by hand. +1 for integrating this patch into Python. Thanks, Graham King.
msg51107 - (view)	Author: Jeremy Dunck (jdunck)	Date: 2006-12-06 17:54
We're hitting this bug as well. It affects Django, which handles media upload over HTTP by parsing multipart with email.message_from_string. pegasusnews.com, FWIW.
msg51108 - (view)	Author: Paul Moore (paul.moore) *	Date: 2007-03-20 09:23
The documentation isn't clear, but I would say that the code is written on the basis that the file is opened in text or universal newline mode (so that CRLF will ve translated to \n). On that assumption, the patch is unnecessary, as CRLF will never appear in the data. (Actually, the existence of Utils.fix_eols makes me wonder...) On the other hand, 99% of the time, the code will work fine if a CRLF file is opened in binary mode. This patch fixes one of the obscure corner cases, and (as far as I can see) does no real harm - I can't imagine a real mail file containing literal CR bytes which should be preserved! The documentation should be updated to require that the file is in text or universal newline mode, and note that files with CRLF opened in binary mode could be processed incorrectly. I'm probably +0 on applying this patch, on the basis that it makes the code more robust given the current lack of any documentation of the required file mode. If the documentation is patched as above, I'd be -0. If the rest of the email module is known not to robust in the face of files with CRLF opened in binary mode, the documentation should definitely be changed, and that fact made very clear. (This patch becomes irrelevant then). BTW, the test case looks wrong - surely the line 'self.assertNotEqual(data[8193], "\n")' should check for a \r? I assume it's asserting that the line after the break is not the end of the headers - if so, that line will be \r\n as the earlier translation changes LF to CRLF. Summary: CRLF handling is a bit of a can of worms - the module documentation could do with being clearer in any case, but the patch is probably good to apply on the basis that it fixes a small corner case without causing harm elsewhere. The test should be fixed as above.
msg64581 - (view)	Author: Thomas Guettler (guettli) *	Date: 2008-03-27 11:37
I was hit by this bug in Django. The ticket URL: http://code.djangoproject.com/ticket/6256 It would be nice if this could be fixed.
msg64864 - (view)	Author: Karen Tracey (kmtracey)	Date: 2008-04-02 15:37
Opening the file in universal newline mode doesn't work for cases where the 'file' contains multipart MIME data (eg. multipart/form-data) where one of the included parts is binary data (eg. application/octet-stream). In that case, blind translation of CRLF to LF may corrupt the binary data. (Thanks to Thomas Guettler for pointing that out to me.) FeedParser goes to considerable trouble to split on any conceivable line boundary but retain whatever line boundary existed in the stream when putting things back together. (Look at BufferedSubFile's push() code in feedparser.py.) It was not written on the assumption that it would be getting LFs only. The only code that knows enough to know which CRLFs are really line breaks is the code that is breaking the stream up based on the boundary markers -- that is the FeedParser code. It isn't safe for the caller to do any CRLF conversions before calling the Parser. Therefore I believe the fix needs to be made to the parser.py code, not the docs. Two people that I know of independently re-discovered this bug in the last couple of weeks (running Django), after I re-discovered it about three months ago after Jeremy Dunck re-discovered it a year earlier, three months after it was originally opened. Maybe a corner case, but it would be nice, since it is quite difficult for people to track down, and the fix is so trivial, if the fix could be put in.
msg85254 - (view)	Author: Tony Nelson (tony_nelson)	Date: 2009-04-02 19:40
The OP's diagnosis of a buffer boundary problem is correct, but incomplete. The problem can be reproduced by calling feedparser FeedParser.feed() directly, or as my patch test does, by calling BufferedSubFile.push() directly. The proper fix is for push() to treat a last line ending in CR as a partial line, as it does if no part of a line ending is present. The OP's patch only works when FeedParser is called through the old Parser interface.
msg97666 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-01-12 22:13
See also issue 1721862, which has a different test and patch. This one seems simpler.
msg110499 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-16 21:13
There are repeated statements that this impacts on Django which I understand is high profile. Can we find the resources to review the patches and get things moving, none of the attached patch files are that large?
msg110534 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-07-17 01:42
Committed Tony Nelson's fix to py3k in r82922, 3.1 in r72923, 2.7 in r82924, and 2.6 in r82925.

History
Date	User	Action	Args
2022-04-11 14:56:20	admin	set	github: 43959
2010-07-17 01:42:04	r.david.murray	set	status: open -> closed versions: + Python 2.6 messages: + msg110534 resolution: fixed stage: patch review -> resolved
2010-07-16 21:13:53	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg110499 versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2010-05-05 13:35:48	barry	set	assignee: barry -> r.david.murray
2010-01-12 22:13:36	r.david.murray	set	nosy: + r.david.murray messages: + msg97666 type: behavior stage: patch review
2010-01-12 22:12:15	r.david.murray	link	issue1721862 superseder
2009-06-02 09:16:44	amaury.forgeotdarc	link	issue6153 superseder
2009-04-02 19:40:40	tony_nelson	set	files: + feedparser_pushcr_pushlf.patch nosy: + tony_nelson messages: + msg85254
2008-07-03 14:52:10	chris.eveleigh	set	nosy: + chris.eveleigh
2008-04-02 15:37:48	kmtracey	set	messages: + msg64864
2008-03-27 11:37:05	guettli	set	nosy: + guettli messages: + msg64581
2007-12-21 14:17:29	kmtracey	set	nosy: + kmtracey
2006-09-09 23:41:43	anadelonbrin	create