This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: email library could "recover" from bad mime boundary like (some?) email clients do
Type: enhancement Stage: needs patch
Components: email Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Fedele Mantuano, adepasquale, barry, maciej.szulik, r.david.murray
Priority: normal Keywords: patch

Created on 2016-05-12 15:41 by Fedele Mantuano, last changed 2022-04-11 14:58 by admin.

Files
File name | Uploaded | Description
mail | Fedele Mantuano, 2016-05-12 15:41 |
mail.json | Fedele Mantuano, 2016-05-12 16:44 |
issue27010-notuniqueboundary.patch | adepasquale, 2016-05-26 14:21 | NotUniqueBoundaryInMultipartDefect
Messages (17)
msg265413 - (view) Author: Fedele Mantuano (Fedele Mantuano) Date: 2016-05-12 15:41
We are receiving a lot of mail with attachments that the email library does not detect.
I also tested the Tika parser, and it has the same issue:

mail: http://pastebin.com/kSEJnzSa
mail parsed: http://pastebin.com/7HaVPcTq

I can read only these content types:
multipart/mixed
multipart/alternative
text/plain
text/html

There is no part with Content-Type: application/zip.

With a normal mail client I can read the attachment.

Where is the issue?
msg265416 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-12 16:26
When you say the attachment is not detected, what do you mean?  Which email library call do you expect to return the attachment, but doesn't?  Your 'parsed' pastebin isn't something the library produces, so I assume that's the Tika output.

(By the way, pastebin links are problematic in tracker issues, since they may expire.  Please paste directly into the issue, or attach files to the issue.)

Oh, wait.  Looking at the email I think I see the problem:

----------------------------------------
</BODY>
</HTML>

--51a14337d8625bb8ce4a5b1667f--

--51a14337d8625bb8ce4a5b1667f
<attachment content>
----------------------------------------

That line that ends with '--' signals the end of the last MIME part in the message.  So by RFC standards the remainder of the message is part of the 'epilogue'.  If you check msg.epilogue I think you'll find that it contains the raw text of the remainder of the message.
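
The epilogue behavior described above can be reproduced with a minimal sketch for Python 3. The message text and the boundary string BOUND are made up for illustration; they are not taken from the reporter's mail.

```python
import email

# Made-up sample: the closing delimiter "--BOUND--" appears before the last part.
RAW = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    '\n'
    '--BOUND\n'
    'Content-Type: text/plain\n'
    '\n'
    'body text\n'
    '--BOUND--\n'
    '\n'
    '--BOUND\n'
    'Content-Type: application/zip; name="a.zip"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'UEsDBA==\n'
)

msg = email.message_from_string(RAW)

# Only the parts that precede the closing delimiter are visible to walk().
visible_types = [part.get_content_type() for part in msg.walk()]
print(visible_types)

# Everything after the closing delimiter is kept as raw text in the epilogue.
print(msg.epilogue)
```

In this sketch the zip part never shows up in walk(); its headers and base64 body are still available as plain text in msg.epilogue.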

It is interesting that your email client treats it as an actual attachment.  It would be possible to have the email library recognize such out of place mime dividers and register it as an error.  I would review a patch for that if someone wants to propose one.

--David
msg265417 - (view) Author: Fedele Mantuano (Fedele Mantuano) Date: 2016-05-12 16:44
Hi David,

I use the email library to detect malicious attachments, like this:

import email

with open('mail') as fp:
    message = email.message_from_file(fp)
for part in message.walk():
    pass  # do something with each part

"Not detected" means that these attachments never show up in the for loop.

The Tika parser has the same problem (I have now attached the file).

I think no automated tool that uses the email library can extract and post-analyze these emails.
msg265419 - (view) Author: Fedele Mantuano (Fedele Mantuano) Date: 2016-05-12 16:55
I tested your hypothesis:


for i in message.walk():
    print(i.get_content_type())
    print("#################################################################")
    print(i.epilogue)

    
multipart/mixed
#################################################################

--31a14337d8625bb8ce4a5b1667f
Content-Type: application/zip; name="n.41056 0002 02 107413 del 11.05.2016.zip"
Content-Transfer-Encoding: base64
Content-ID: <008601d1ac89$01f7f760$0d00a8c0@D25LND1N>

UEsDBBQAAAAIALNQrEi/ST/WbSsBAABAAgAtAAAAbi40MTA0NiAwMDA0IDAyIDEwNzIwMyBk
ZWwgMTEuMDUuMjAxNi5wZGYuZXhl7FNnjExRGL1vDAZjZ1Zd0YbookeUIIgRYocdjBq9r766
GG2ZeJ7RrZroJXrvYtUhIUqIXhLEYMJisJJhnPPePjt6+CeZLzvn3nfv+c733XPvOjrNFVmE


So your hypothesis looks right to me.
msg265420 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-12 17:06
I'm going to change the title of this and see if anyone wants to propose a patch. It'll probably end up getting closed as not a bug if no one does for a while, though.
msg266372 - (view) Author: Andrea De Pasquale (adepasquale) * Date: 2016-05-25 16:42
Isn't this covered by the following test case?

Lib/test/test_email/test_defect_handling.py:18
msg266382 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-25 17:52
Yes.  The current behavior is not a bug; the question is whether we want to deal with that XXX comment in the test by detecting the duplicate and recognizing the "extra" MIME part.  The defect detection would remain.
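
For readers who want to see the current behavior with a duplicated boundary, here is a small Python 3 sketch. The sample message is made up (it only imitates the "inner and outer multipart share one boundary" situation discussed here), and the boundary name BOUND is an assumption of the example.

```python
import email

# Made-up sample: the nested multipart/alternative reuses the outer boundary.
RAW = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    '\n'
    '--BOUND\n'
    'Content-Type: multipart/alternative; boundary="BOUND"\n'
    '\n'
    '--BOUND\n'
    'Content-Type: text/plain\n'
    '\n'
    'text\n'
    '--BOUND\n'
    'Content-Type: text/html\n'
    '\n'
    '<html></html>\n'
    '--BOUND--\n'
    '\n'
    '--BOUND\n'
    'Content-Type: image/gif; name="xx.gif"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'R0lGODlh\n'
    '--BOUND--\n'
)

msg = email.message_from_string(RAW)
inner = msg.get_payload(0)

# Content types that walk() reports, in order.
seen_types = [part.get_content_type() for part in msg.walk()]
# Names of the defects registered on the nested multipart.
inner_defects = [type(d).__name__ for d in inner.defects]

print(seen_types)
print(inner_defects)
print('image/gif' in (msg.epilogue or ''))
```

In this sketch the nested multipart ends up with no subparts of its own and carries a defect, the text parts become siblings of it, and the image part is left in the outer epilogue.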
msg266438 - (view) Author: Andrea De Pasquale (adepasquale) * Date: 2016-05-26 14:21
How about the following patch? If it's different from what you had in mind, please let me know.
msg266440 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-26 15:28
Thanks for the patch.  I'll take a look at this during the PyCon sprints.
msg268558 - (view) Author: Andrea De Pasquale (adepasquale) * Date: 2016-06-14 14:04
Hello,
did you have a chance to look at my patch?
msg268909 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-06-20 16:05
Unfortunately no, things were too busy.  I'm hoping to have time to review email patches in the not too distant future, though.
msg271882 - (view) Author: Andrea De Pasquale (adepasquale) * Date: 2016-08-03 09:18
Ok thanks, please kindly let me know.
msg274878 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-07 20:14
Andrea: yes, your patch is different from what I had in mind.  The idea would be to recognize the "nested part with duplicate boundary", register the new defect, but produce a Message object with a structure that looked like this:

  multipart/mixed
    multipart/alternative
        text/plain
        text/html
    image/gif

What your patch produces is:

  multipart/mixed
    multipart/alternative
    text/plain
    text/html

which recognizes neither the nested multipart nor the final MIME part (which is the OP's goal).

In principle it should be possible to parse the nesting despite the bad boundary (other MIME parsers do it, as documented here), but I'm not sure how hard it will be to modify FeedParser to do it.  Looking at the code it seems like it shouldn't be that hard to make it work, but I haven't dug deeply enough to be sure.
msg275012 - (view) Author: Andrea De Pasquale (adepasquale) * Date: 2016-09-08 13:23
Yes you are right, my patch produces an RFC2046-compliant output and also registers the "not-unique-boundary" defect.
msg277114 - (view) Author: Andrea De Pasquale (adepasquale) * Date: 2016-09-21 09:14
To provide additional context, Microsoft has patched its Outlook client to be RFC2046-compliant. More details below:

http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3366
https://technet.microsoft.com/library/security/MS16-107
http://www.certego.net/en/news/badepilogue-the-perfect-evasion/
msg277136 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-21 13:07
Hmm.  Thanks for the links.  That[*] implies that "fixing" this would be *introducing* a security vulnerability...unless one was trying to implement a virus/spam scanner in Python.  So perhaps this should be controlled by a policy switch.
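
The policy switch suggested here does not exist in this thread; what the policy framework already offers is the raise_on_defect attribute, which lets a caller opt in to stricter handling. The Python 3 sketch below only illustrates that existing mechanism, using a made-up message in which the nested multipart reuses the outer boundary.

```python
import email
from email import errors, policy

# Made-up sample: the nested multipart reuses the outer boundary "BOUND".
RAW = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    '\n'
    '--BOUND\n'
    'Content-Type: multipart/alternative; boundary="BOUND"\n'
    '\n'
    '--BOUND\n'
    'Content-Type: text/plain\n'
    '\n'
    'text\n'
    '--BOUND--\n'
)

# Clone the default policy so that the first defect is raised instead of recorded.
strict = policy.default.clone(raise_on_defect=True)

try:
    email.message_from_string(RAW, policy=strict)
    outcome = 'parsed without defects'
except errors.MessageDefect as defect:
    outcome = type(defect).__name__

print(outcome)
```

A switch for recovering misplaced parts would presumably be selected the same way, by cloning a policy, but its name and semantics are not defined anywhere in this issue.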

[*] The third of those links is the most useful one to read.
msg277140 - (view) Author: Fedele Mantuano (Fedele Mantuano) Date: 2016-09-21 13:21
I developed a library that can extract that malformed email part, but to do so I had to rely on an unsuitable defect type, "StartBoundaryNotFoundDefect" (https://github.com/SpamScope/mail-parser/blob/develop/mailparser/__init__.py#L44).
With this patch, I could get the malformed email part with the correct defect.
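
As a rough illustration of how a scanner could get at such a part today, the Python 3 sketch below re-parses the epilogue of a multipart message. The helper name parts_hidden_in_epilogue and the sample message are hypothetical; this is neither the code of the library linked above nor the attached patch.

```python
import email

def parts_hidden_in_epilogue(msg):
    """Hypothetical helper: re-parse an epilogue that still contains boundary lines."""
    boundary = msg.get_boundary()
    epilogue = msg.epilogue or ''
    if not boundary or ('--' + boundary) not in epilogue:
        return []
    # Wrap the epilogue in a synthetic multipart header and parse it again.
    wrapper = f'Content-Type: multipart/mixed; boundary="{boundary}"\n\n{epilogue}'
    reparsed = email.message_from_string(wrapper)
    return [part for part in reparsed.walk() if not part.is_multipart()]

# Made-up sample: an attachment placed after the closing delimiter.
RAW = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    '\n'
    '--BOUND\n'
    'Content-Type: text/plain\n'
    '\n'
    'body text\n'
    '--BOUND--\n'
    '\n'
    '--BOUND\n'
    'Content-Type: application/zip; name="a.zip"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'UEsDBA==\n'
)

msg = email.message_from_string(RAW)
hidden = parts_hidden_in_epilogue(msg)
for part in hidden:
    print(part.get_content_type(), part.get_filename())
```

Given the security concern raised earlier in the thread, this kind of recovery is probably best kept as an explicit opt-in step in scanning tools rather than default parsing behavior.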
History
Date | User | Action | Args
2022-04-11 14:58:31 | admin | set | github: 71197
2016-09-21 13:21:01 | Fedele Mantuano | set | messages: + msg277140
2016-09-21 13:07:46 | r.david.murray | set | stage: patch review -> needs patch; messages: + msg277136; versions: + Python 3.7, - Python 3.6
2016-09-21 09:14:57 | adepasquale | set | messages: + msg277114
2016-09-08 13:23:33 | adepasquale | set | messages: + msg275012
2016-09-07 20:14:38 | r.david.murray | set | messages: + msg274878
2016-08-03 09:18:43 | adepasquale | set | messages: + msg271882
2016-06-20 16:05:13 | r.david.murray | set | messages: + msg268909
2016-06-14 14:04:15 | adepasquale | set | messages: + msg268558
2016-05-26 15:28:32 | r.david.murray | set | messages: + msg266440; stage: needs patch -> patch review
2016-05-26 14:21:52 | adepasquale | set | files: + issue27010-notuniqueboundary.patch; keywords: + patch; messages: + msg266438
2016-05-25 17:52:52 | r.david.murray | set | type: enhancement; messages: + msg266382; stage: needs patch
2016-05-25 16:42:06 | adepasquale | set | messages: + msg266372
2016-05-25 13:19:21 | adepasquale | set | nosy: + adepasquale
2016-05-13 20:52:36 | maciej.szulik | set | nosy: + maciej.szulik
2016-05-12 17:06:43 | r.david.murray | set | title: Attachments not detected from email library -> email library could "recover" from bad mime boundary like (some?) email clients do; messages: + msg265420; versions: + Python 3.6, - Python 2.7
2016-05-12 16:55:00 | Fedele Mantuano | set | messages: + msg265419
2016-05-12 16:44:31 | Fedele Mantuano | set | files: + mail.json; messages: + msg265417
2016-05-12 16:26:15 | r.david.murray | set | messages: + msg265416
2016-05-12 15:41:58 | Fedele Mantuano | create |