Title: email parser ignores inner multipart boundary when outer message duplicates it
Type: behavior
Stage: needs patch
Components: email
Versions: Python 3.6, Python 3.5, Python 2.7
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, forest, r.david.murray
Priority: normal
Keywords:

Created on 2015-11-25 00:04 by forest, last changed 2015-11-25 14:22 by r.david.murray.

Messages (6)
msg255309 - Author: Forest (forest) Date: 2015-11-25 00:04
When a multipart message erroneously defines a boundary string that conflicts with an inner message's boundary string, the parser ignores the (correct) inner message's boundary, and treats all matching boundary lines as if they belong to the (defective) outer message.

This file from the test_email suite demonstrates the problem:

Consequently, the inner multipart/alternative message is parsed with is_multipart() returning False and with a truncated payload.

Moreover, unit tests like test_same_boundary_inner_outer() expect to find the StartBoundaryNotFoundDefect defect on the inner message in that file, which seems wrong to me, since the inner message is not defective.  According to the RFCs, the outer message should have been generated with a boundary string that does not appear anywhere in its encoded body (including the inner message).  The outer message is therefore the defective one.
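
A minimal reproduction sketch, assuming a hand-written message rather than the test suite file: the string RAW and the boundary "SAME" are made up for illustration. It builds an outer multipart/mixed and an inner multipart/alternative that declare the same boundary, then prints what the parser reports for the inner part.

```python
import email
from email import errors

# Hypothetical message: the outer and inner multipart parts both
# declare boundary="SAME", mimicking the defect described above.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="SAME"\n'
    "\n"
    "--SAME\n"
    'Content-Type: multipart/alternative; boundary="SAME"\n'
    "\n"
    "--SAME\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--SAME\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--SAME--\n"
    "--SAME--\n"
)

msg = email.message_from_string(RAW)
inner = msg.get_payload(0)

# Report how the parser classified the inner multipart/alternative part.
print("inner content type:", inner.get_content_type())
print("inner is_multipart():", inner.is_multipart())
print("inner defects:", [type(d).__name__ for d in inner.defects])
print("outer part count:", len(msg.get_payload()))
```

If the behavior matches this report, the inner part is not treated as multipart and carries a StartBoundaryNotFoundDefect, while its would-be children show up as siblings under the outer message.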
msg255310 - Author: Forest (forest) Date: 2015-11-25 00:18
I thought at first that this might be deliberate behavior in order to comply with RFC 2046 section 5.1.2.

After carefully re-reading that section, I see that it is just making sure an outer message's boundary will still be recognized if an inner multipart message is missing its boundary markers (for example if the inner message was truncated).  It does not describe any circumstances under which the inner message's boundary markers should be ignored when they are present.
msg255312 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2015-11-25 00:39
Who is to say that the outer message is defective and not the inner one?  How can a parser decide which part belongs to which message?  It isn't an AI.

The whole message is defective, so all bets are off :)  The library can't successfully parse such a message, though it goes to significant pains to make sure it never generates one.
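
For illustration only, here is a sketch of the kind of check a generator needs before it commits to a boundary. The function name boundary_conflicts is hypothetical, and this is not the email package's actual implementation. It flags a candidate boundary if any line of the already-encoded body starts with "--" followed by that boundary.

```python
def boundary_conflicts(boundary, encoded_body):
    """Return True if any body line starts with the delimiter "--<boundary>".

    Hypothetical helper: covers both a delimiter on a line by itself
    and a delimiter that is the prefix of a longer line.
    """
    delimiter = "--" + boundary
    return any(line.startswith(delimiter) for line in encoded_body.splitlines())


# An inner multipart body that already uses the boundary "SAME".
inner_body = "--SAME\nContent-Type: text/plain\n\nhello\n--SAME--\n"

print(boundary_conflicts("SAME", inner_body))   # the outer part must not reuse it
print(boundary_conflicts("OUTER", inner_body))  # a distinct boundary is safe here
```

A generator would keep picking new candidates until this kind of check passes for the whole encoded body, nested parts included.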
msg255313 - Author: Forest (forest) Date: 2015-11-25 01:05
RFC 2046 says that the outer message is defective, since it uses a boundary delimiter that is quite obviously present inside one of the encapsulated parts:

"The boundary delimiter MUST NOT appear inside any of the encapsulated parts, on a line by itself or as the prefix of any line."
msg255314 - Author: Forest (forest) Date: 2015-11-25 01:08
> The library can't successfully parse such a message

It could successfully parse such a message, if it matched against inner message boundaries before outer message boundaries.  (One implementation would be to keep a list of all ancestor boundaries and traverse the list from most recent to least recent, but there might be more efficient ways to do it.)
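
A rough sketch of the ancestor-list idea, assuming the parser keeps a stack of the boundaries of all currently open multipart containers (outermost first). The function match_boundary and its return shape are hypothetical and not part of the email package. It walks the stack from most recent to least recent, so a duplicated boundary resolves to the innermost open container.

```python
def match_boundary(line, boundary_stack):
    """Find the innermost open boundary that this line delimits.

    boundary_stack lists the boundaries of open multipart containers,
    outermost first. Returns (depth, is_closing), or None if the line
    is not a delimiter for any open container.
    """
    text = line.rstrip()  # ignore the line ending and trailing whitespace
    # Traverse from the most recently opened container to the oldest.
    for depth in range(len(boundary_stack) - 1, -1, -1):
        delimiter = "--" + boundary_stack[depth]
        if text == delimiter:
            return depth, False
        if text == delimiter + "--":
            return depth, True
    return None


# Outer and inner containers erroneously share the boundary "SAME".
stack = ["SAME", "SAME"]
print(match_boundary("--SAME\n", stack))    # resolves to depth 1, the inner container
print(match_boundary("--SAME--\n", stack))  # closes the inner container first

# With distinct boundaries, an outer delimiter is still recognized.
print(match_boundary("--OUTER\n", ["OUTER", "INNER"]))
```

With this ordering, the outer container only sees a delimiter once the inner one has been closed and popped, which is the behavior argued for above. A truncated inner message with a distinct boundary still falls through to the outer match.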
msg255357 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2015-11-25 14:22
I am open to (and will review) a patch that applies simple heuristics to try to guess correctly about such messages, but only if it doesn't add too much complexity to the parser.  I'm not certain I would consider it for a bug fix release, but I'll postpone that decision until I review the patch. (The issue is: would it have the potential to break applications that are currently working?  I'm guessing not, but I tend to be cautious about such issues.)
Date                 User            Action  Args
2015-11-25 14:22:24  r.david.murray  set     stage: needs patch
                                             messages: + msg255357
                                             versions: + Python 3.5, Python 3.6, - Python 3.4
2015-11-25 01:08:51  forest          set     messages: + msg255314
2015-11-25 01:05:43  forest          set     messages: + msg255313
2015-11-25 00:39:29  r.david.murray  set     messages: + msg255312
2015-11-25 00:18:47  forest          set     messages: + msg255310
2015-11-25 00:04:36  forest          create