classification
Title: email.parser stops parsing headers too soon when given a defective message.
Type: behavior Stage: patch review
Components: email, Library (Lib) Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, martin.panter, msapiro, r.david.murray
Priority: normal Keywords: patch

Created on 2016-04-01 17:46 by msapiro, last changed 2016-08-10 22:57 by martin.panter.

Files
File name Uploaded Description Edit
continue-header.patch martin.panter, 2016-06-13 12:58 review
Messages (7)
msg262750 - (view) Author: Mark Sapiro (msapiro) * (Python triager) Date: 2016-04-01 17:46
Given an admittedly defective (the folded Content-Type: isn't indented) message part with the following headers/body

-------------------------------
Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip"
Content-Type: application/x-rar-compressed; x-unix-mode=0600;
name="04EBD_xxxx.xxxx_A546BB.zip"
Content-Transfer-Encoding: base64

UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw
ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6
...
-------------------------------

email.parser parses the headers as

-------------------------------
Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip"
Content-Type: application/x-rar-compressed; x-unix-mode=0600;
-------------------------------

and the body as

-------------------------------
name="04EBD_xxxx.xxxx_A546BB.zip"
Content-Transfer-Encoding: base64

UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw
ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6
...
-------------------------------

and shows no defects.

This is wrong. RFC5322 section 2.1 is clear that everything up to the first empty line is headers. Even the docstring in the email/parser.py module says "The header block is terminated either by the end of the string or by a blank line."

Since the message is defective, it isn't clear what the correct result should be, but I think

Headers:
Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip"
Content-Type: application/x-rar-compressed; x-unix-mode=0600;
Content-Transfer-Encoding: base64

Body:
UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw
ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6
...

Defects:
name="04EBD_xxxx.xxxx_A546BB.zip"

would be more appropriate. The problem is that the Content-Transfer-Encoding: base64 header is not in the headers so that get_payload(decode=True) doesn't decode the base64 encoded body making malware recognition difficult.
msg262751 - (view) Author: Mark Sapiro (msapiro) * (Python triager) Date: 2016-04-01 17:50
Added Python 2.7 to versions:
msg262776 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-01 22:10
Also see Issue 24363, basically the same bug in the HTTP parser, which (ab?)uses the email package to do most of the work. In that case, according to my note the faulty header field ends with:

X-Frame-Options: SAMEORIGIN\r\n
Set-Cookie: mb-CookieP=; HttpOnly; \r\n
Secure\r\n
Set-Cookie: mb-CookieP=; HttpOnly; Secure\r\n
\r\n

But in this case, perhaps because of the implications of dropping the “Secure” flag, people are asking that the faulty line be appended to the previous header field. IMO I don’t think that is super important though. An alternative would be to add it to the defect list, and then raise an exception or warning if any defects are detected.
msg268438 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-13 12:58
FWIW in the HTTP bug <https://bugs.python.org/issue24363#msg244676>, David said “when seeing a line that doesn't look like a header the error recovery is to treat that line as the beginning of the body (ie: assume the blank line is missing).” I have no experience with email and RFC 5322 header handling, but it does make more sense to me to handle this as a defect in the header section, _not_ a genuine transition to the body (same as desired for the HTTP case).

Here is a patch that revives MalformedHeaderDefect (see Issue 14925), and continues parsing the rest of the header section instead of starting the body. But I am not sure how safe this change is. I did have to fix one unrelated set of tests (see headertest_msg in the Test8BitBytesHandling class) that did not include a blank line and was relying on the old behaviour.
msg272295 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-10 02:57
.
It would be nice to get feedback if my patch is sensible, especially from people familiar with normal usage of the “email” module, as opposed with the usage by the HTTP module.
msg272332 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-08-10 13:42
I would prefer if we did lookahead to see if the subsequent line looks like a header.  It's more complicated to do that, of course, and could still lead to false negatives.  However, I think that would probably retain enough backward compatibility to be acceptable.  It would also be sensible to make this a policy switch, and as I said elsewhere I'm fine with changing the defaults of the http policy even in 3.5.  (The downside of *that* is that I'm sure there are bugs hiding in the new header parsing code, so actually using the http policy to parse http headers will doubtless "allow" us to find some of them.)

Even more complicated, but a better heuristic: look ahead to the next blank line, up to some limit (5 lines?), and if you do find something that looks like a header, also make sure that none of the intermediate lines look like a MIME boundary.   That still leaves the question of what to do with a source text that has non-header lines up to the next blank line (this applies to one line lookahead as well).  Maybe see if there is more text after the blank line and if so assume the non-header is part of the header, otherwise not?

Regardless, lookahead may be difficult to code.  So an alternative that uses your approach, but triggered by a policy setting on http, would be acceptable backward compatibility wise.  If we want to we could even make an internal http policy that is compat32 plus this new flag.
msg272381 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-10 22:57
Thanks David. Since I am more intersted in fixing this robustly for HTTP and similar protocols, I might focus on just Issue 24363. Either confine my changes to the existing HTTP (or new) policy and start using that, or just address this from the HTTP package.
History
Date User Action Args
2016-08-12 12:37:54martin.panterunlinkissue24363 dependencies
2016-08-10 22:57:36martin.pantersetmessages: + msg272381
2016-08-10 13:42:23r.david.murraysetmessages: + msg272332
2016-08-10 02:57:46martin.pantersetmessages: + msg272295
stage: patch review
2016-06-29 14:44:23ppperrysettitle: email.parser stops parsing headers too soon. -> email.parser stops parsing headers too soon when given a defective message.
2016-06-14 09:01:07martin.panterlinkissue24363 dependencies
2016-06-13 12:58:57martin.pantersetfiles: + continue-header.patch
keywords: + patch
messages: + msg268438

versions: + Python 3.6
2016-04-01 22:10:27martin.pantersetnosy: + martin.panter
messages: + msg262776
2016-04-01 18:12:01SilentGhostsetnosy: + barry, r.david.murray

components: + email
versions: + Python 3.5, - Python 3.4
2016-04-01 17:50:49msapirosetmessages: + msg262751
versions: + Python 2.7
2016-04-01 17:46:30msapirocreate