Issue 45626: Email part with content type message is multipart.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/89789

classification

Title:	Email part with content type message is multipart.
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	jdhowroyd
Priority:	normal	Keywords:

Created on 2021-10-27 12:57 by jdhowroyd, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
16155.eml	jdhowroyd, 2021-10-27 12:57	Example email

Messages (1)
msg405091

Author: John Howroyd (jdhowroyd)

Date: 2021-10-27 12:57

From the library documentation, it is an intended feature that an email part with content_maintype == "message" is treated as multipart. This does not seem to be compliant with the MIME specification, nor does it match the expected semantics. The attached email (from the DNC WikiLeaks collection) is a good example where this logic breaks down.

Code:
import email
import email.parser
import pathlib

pth = pathlib.Path("16155.eml")  # path to example
parser = email.parser.BytesParser()
with pth.open("rb") as fl:  # BytesParser.parse() expects a binary file
    eml = parser.parse(fl)
pts = list(eml.walk())  # every part, depth first
len(pts)  # returns 52

Here pts[0] is the root part of content_type 'multipart/report'.
Then pts[1] has content_type 'multipart/alternative', containing the 'text/plain' pts[2] and the 'text/html' pts[3] (which most email clients would consider the message of this email). All good so far.

The problem is pts[4], of content_type 'message/delivery-status', for which pts[4].is_multipart() returns True and which contains 46 subparts, as returned by pts[4].get_payload(): these are pts[5], ..., pts[50]. Finally, pts[51] has content_type 'text/rfc822-headers', which is fine.

Each of the subparts of pts[4] (pts[5:51]) has "" returned by pts[n].get_payload(), as its content is treated as headers, whereas pts[4].as_bytes() includes the header (and separating blank line) for that part; namely, b'Content-Type: message/delivery-status\n\n'.
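
The behaviour described above can be reproduced without the attached file. The following is a minimal sketch that assumes a small, hand-written delivery report; the addresses, the boundary string and the status fields are invented for illustration and are not taken from 16155.eml.

```python
from email.parser import BytesParser

# Hypothetical bounce message, written by hand for this illustration.
RAW = (
    b"From: mailer-daemon@example.com\n"
    b"To: sender@example.com\n"
    b"Subject: Delivery report\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/report; report-type=delivery-status; boundary="BOUND"\n'
    b"\n"
    b"--BOUND\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"Delivery failed.\n"
    b"\n"
    b"--BOUND\n"
    b"Content-Type: message/delivery-status\n"
    b"\n"
    b"Reporting-MTA: dns; mail.example.com\n"
    b"\n"
    b"Final-Recipient: rfc822; user@example.org\n"
    b"Action: failed\n"
    b"Status: 5.1.1\n"
    b"\n"
    b"--BOUND\n"
    b"Content-Type: text/rfc822-headers\n"
    b"\n"
    b"Subject: original\n"
    b"\n"
    b"--BOUND--\n"
)

msg = BytesParser().parsebytes(RAW)
parts = list(msg.walk())

# parts[0] is the multipart/report root, parts[1] the text/plain part,
# parts[2] the message/delivery-status part.
status = parts[2]
blocks = status.get_payload()  # one nested message per header block

print([p.get_content_type() for p in parts])
print(status.is_multipart(), len(blocks))
print([b.get_payload() for b in blocks])
```

In this sketch the two header blocks of the delivery-status body surface as two extra parts in walk(), each with an empty payload. That is the same pattern as pts[5:51] in the real email, only smaller.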

Looking at the raw file, and in particular the MIME boundary markers, it would seem to me that pts[4] should not be considered multipart, and that there is no indication that a part with content-type 'message/delivery-status' should, or even could, be considered an (RFC 822) email.

Moreover, as the main developer of a system to forensically analyse large (million+) corpora of emails, I find this behaviour undesirable even for parts of the specific content-type 'message/rfc822'; typically, these occur as part of bounce messages and have their content-disposition set to 'attachment'. As a developer, what would seem more natural, in the case that this behaviour is wanted, would be to test parts for the content-type 'message/rfc822' and pass the result of .get_payload(decode=True) to the bytes parser (for example via parser.parsebytes(), since parser.parse() expects a file object).
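
As a rough illustration of the opt-in style described above, here is a sketch of a caller-side walker that descends only into multipart/* containers and leaves message/* parts as leaves. The helper name walk_multipart_only and the sample message are hypothetical, not part of the email package. Because the current parser has already parsed the nested message, the explicit opt-in step is shown as get_payload(0) rather than as re-parsing raw bytes.

```python
from email.parser import BytesParser


def walk_multipart_only(msg):
    """Yield msg and its descendants, descending only into multipart/* parts.

    Hypothetical helper: message/* parts are yielded but not entered.
    """
    yield msg
    if msg.get_content_maintype() == "multipart" and msg.is_multipart():
        for sub in msg.get_payload():
            yield from walk_multipart_only(sub)


# Hand-written sample with an attached message/rfc822 part.
RAW = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="B"\n'
    b"\n"
    b"--B\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"See the attached message.\n"
    b"--B\n"
    b"Content-Type: message/rfc822\n"
    b"Content-Disposition: attachment\n"
    b"\n"
    b"Subject: inner\n"
    b"\n"
    b"Inner body.\n"
    b"--B--\n"
)

msg = BytesParser().parsebytes(RAW)

# Message.walk() also visits the message nested inside message/rfc822.
default_types = [p.get_content_type() for p in msg.walk()]

# The sketch stops at the message/rfc822 part itself.
leaf_types = [p.get_content_type() for p in walk_multipart_only(msg)]

# Explicit opt-in: fetch the contained email only where the caller wants it.
inner = [
    p.get_payload(0)
    for p in walk_multipart_only(msg)
    if p.get_content_type() == "message/rfc822"
]

print(default_types)
print(leaf_types)
print(inner[0]["Subject"])
```

With this approach the decision to look inside a contained email stays with the caller, which is the behaviour a forensic pipeline would probably want by default.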

I appreciate the need to support backwards compatibility, so presumably this would require an addition to email.policy to govern which parts should be treated as multipart.  I would be more than happy to submit a patch for this but fear it would be rejected out of hand (as the original intent is clearly to parse out contained emails).

History
Date	User	Action	Args
2022-04-11 14:59:51	admin	set	github: 89789
2021-10-27 12:57:54	jdhowroyd	create