Issue 26686: email.parser stops parsing headers too soon when given a defective message.

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70873

classification

Title:	email.parser stops parsing headers too soon when given a defective message.
Type:	behavior	Stage:	patch review
Components:	email, Library (Lib)	Versions:	Python 3.6, Python 3.5, Python 2.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, martin.panter, msapiro, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2016-04-01 17:46 by msapiro, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
continue-header.patch	martin.panter, 2016-06-13 12:58		review

Messages (7)
msg262750 - (view)	Author: Mark Sapiro (msapiro) *	Date: 2016-04-01 17:46
Given an admittedly defective (the folded Content-Type: isn't indented) message part with the following headers/body ------------------------------- Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip" Content-Type: application/x-rar-compressed; x-unix-mode=0600; name="04EBD_xxxx.xxxx_A546BB.zip" Content-Transfer-Encoding: base64 UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6 ... ------------------------------- email.parser parses the headers as ------------------------------- Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip" Content-Type: application/x-rar-compressed; x-unix-mode=0600; ------------------------------- and the body as ------------------------------- name="04EBD_xxxx.xxxx_A546BB.zip" Content-Transfer-Encoding: base64 UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6 ... ------------------------------- and shows no defects. This is wrong. RFC5322 section 2.1 is clear that everything up to the first empty line is headers. Even the docstring in the email/parser.py module says "The header block is terminated either by the end of the string or by a blank line." Since the message is defective, it isn't clear what the correct result should be, but I think Headers: Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip" Content-Type: application/x-rar-compressed; x-unix-mode=0600; Content-Transfer-Encoding: base64 Body: UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6 ... Defects: name="04EBD_xxxx.xxxx_A546BB.zip" would be more appropriate. The problem is that the Content-Transfer-Encoding: base64 header is not in the headers so that get_payload(decode=True) doesn't decode the base64 encoded body making malware recognition difficult.
msg262751 - (view)	Author: Mark Sapiro (msapiro) *	Date: 2016-04-01 17:50
Added Python 2.7 to versions:
msg262776 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-04-01 22:10
Also see Issue 24363, basically the same bug in the HTTP parser, which (ab?)uses the email package to do most of the work. In that case, according to my note the faulty header field ends with: X-Frame-Options: SAMEORIGIN\r\n Set-Cookie: mb-CookieP=; HttpOnly; \r\n Secure\r\n Set-Cookie: mb-CookieP=; HttpOnly; Secure\r\n \r\n But in this case, perhaps because of the implications of dropping the “Secure” flag, people are asking that the faulty line be appended to the previous header field. IMO I don’t think that is super important though. An alternative would be to add it to the defect list, and then raise an exception or warning if any defects are detected.
msg268438 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-06-13 12:58
FWIW in the HTTP bug <https://bugs.python.org/issue24363#msg244676>, David said “when seeing a line that doesn't look like a header the error recovery is to treat that line as the beginning of the body (ie: assume the blank line is missing).” I have no experience with email and RFC 5322 header handling, but it does make more sense to me to handle this as a defect in the header section, _not_ a genuine transition to the body (same as desired for the HTTP case). Here is a patch that revives MalformedHeaderDefect (see Issue 14925), and continues parsing the rest of the header section instead of starting the body. But I am not sure how safe this change is. I did have to fix one unrelated set of tests (see headertest_msg in the Test8BitBytesHandling class) that did not include a blank line and was relying on the old behaviour.
msg272295 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-08-10 02:57
. It would be nice to get feedback if my patch is sensible, especially from people familiar with normal usage of the “email” module, as opposed with the usage by the HTTP module.
msg272332 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-08-10 13:42
I would prefer if we did lookahead to see if the subsequent line looks like a header. It's more complicated to do that, of course, and could still lead to false negatives. However, I think that would probably retain enough backward compatibility to be acceptable. It would also be sensible to make this a policy switch, and as I said elsewhere I'm fine with changing the defaults of the http policy even in 3.5. (The downside of that is that I'm sure there are bugs hiding in the new header parsing code, so actually using the http policy to parse http headers will doubtless "allow" us to find some of them.) Even more complicated, but a better heuristic: look ahead to the next blank line, up to some limit (5 lines?), and if you do find something that looks like a header, also make sure that none of the intermediate lines look like a MIME boundary. That still leaves the question of what to do with a source text that has non-header lines up to the next blank line (this applies to one line lookahead as well). Maybe see if there is more text after the blank line and if so assume the non-header is part of the header, otherwise not? Regardless, lookahead may be difficult to code. So an alternative that uses your approach, but triggered by a policy setting on http, would be acceptable backward compatibility wise. If we want to we could even make an internal http policy that is compat32 plus this new flag.
msg272381 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-08-10 22:57
Thanks David. Since I am more intersted in fixing this robustly for HTTP and similar protocols, I might focus on just Issue 24363. Either confine my changes to the existing HTTP (or new) policy and start using that, or just address this from the HTTP package.

History
Date	User	Action	Args
2022-04-11 14:58:29	admin	set	github: 70873
2016-08-12 12:37:54	martin.panter	unlink	issue24363 dependencies
2016-08-10 22:57:36	martin.panter	set	messages: + msg272381
2016-08-10 13:42:23	r.david.murray	set	messages: + msg272332
2016-08-10 02:57:46	martin.panter	set	messages: + msg272295 stage: patch review
2016-06-29 14:44:23	ppperry	set	title: email.parser stops parsing headers too soon. -> email.parser stops parsing headers too soon when given a defective message.
2016-06-14 09:01:07	martin.panter	link	issue24363 dependencies
2016-06-13 12:58:57	martin.panter	set	files: + continue-header.patch keywords: + patch messages: + msg268438 versions: + Python 3.6
2016-04-01 22:10:27	martin.panter	set	nosy: + martin.panter messages: + msg262776
2016-04-01 18:12:01	SilentGhost	set	nosy: + barry, r.david.murray components: + email versions: + Python 3.5, - Python 3.4
2016-04-01 17:50:49	msapiro	set	messages: + msg262751 versions: + Python 2.7
2016-04-01 17:46:30	msapiro	create