Issue 45066: email parser fails to decode quoted-printable rfc822 message attachment

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/89229

classification

Title:	email parser fails to decode quoted-printable rfc822 message attachment
Type:	crash	Stage:
Components:	email	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	DiddiLeija, anarcat, barry, r.david.murray
Priority:	normal	Keywords:

Created on 2021-08-31 18:04 by anarcat, last changed 2022-04-11 14:59 by admin.

Messages (2)
msg400764	Author: anarcat (anarcat)	Date: 2021-08-31 18:04
If an email message has a message/rfc822 part and that part is quoted-printable encoded, Python freaks out. Consider this code:

import email.parser
import email.policy

# python 3.9.2 cannot decode this message, it fails with
# "email.errors.StartBoundaryNotFoundDefect"
mail = """Mime-Version: 1.0
Content-Type: multipart/report; boundary=aaaaaa
Content-Transfer-Encoding: 7bit

--aaaaaa
Content-Type: message/rfc822
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=3D"=3Dbbbbbb"

--=3Dbbbbbb
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=3Dutf-8

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=
x
--=3Dbbbbbb--
--aaaaaa--
"""
msg_abuse = email.parser.Parser(policy=email.policy.default + email.policy.strict).parsestr(mail)

That crashes with:

email.errors.StartBoundaryNotFoundDefect

This should normally work: the sub-message is valid, assuming you decode the content. But if you do not decode it, you end up in this bizarre situation, because the multipart boundary is probably taken to be something like `3D"=3Dbbbbbb"`, which never matches the `--=3Dbbbbbb` delimiter lines, so the code above raises that exception.
If you remove the quoted-printable part from the equation, the parser actually behaves:

import email.parser
import email.policy

# without the quoted-printable encoding, python 3.9.2 parses this message fine
mail = """Mime-Version: 1.0
Content-Type: multipart/report; boundary=aaaaaa
Content-Transfer-Encoding: 7bit

--aaaaaa
Content-Type: message/rfc822
Content-Disposition: inline

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="=bbbbbb"

--=bbbbbb
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--=bbbbbb--
--aaaaaa--
"""
msg_abuse = email.parser.Parser(policy=email.policy.default + email.policy.strict).parsestr(mail)

The above correctly parses the message.

This problem causes all sorts of weird issues. In one real-world example, the parser just stopped parsing headers inside the email, because long header lines (typical in Received headers) were broken up by quoted-printable soft line breaks. So it does not actually fail completely. Or, to be more accurate: by default (i.e. if you do not use strict), it does not crash and instead produces invalid data (e.g. a message without a Message-ID or From).

On most messages that are encoded this way, strict mode will actually fail with:

email.errors.MissingHeaderBodySeparatorDefect

because it stumbles upon a line that should be a header continuation but is instead treated as a full header line, so it's missing a colon (":").
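To illustrate the claim that the sub-message is valid once its content is decoded, here is a minimal sketch that applies the standard library's quopri.decodestring() to the inner message by hand and then parses the result. The variable names are illustrative, the long line of x's is shortened, and this is only a check of the decoded form, not a fix for the parser:

```python
import email
import email.policy
import quopri

# Quoted-printable body of the message/rfc822 part from the report,
# with the long "xxx..." line shortened for readability.
encoded_inner = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/alternative; boundary=3D"=3Dbbbbbb"\n'
    b"\n"
    b"--=3Dbbbbbb\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"Content-Type: text/plain; charset=3Dutf-8\n"
    b"\n"
    b"xxxx=\n"
    b"x\n"
    b"--=3Dbbbbbb--\n"
)

# Undo the content-transfer-encoding manually: "=3D" becomes "=" and the
# soft line break ("=" at end of line) joins the two lines of x's.
decoded_inner = quopri.decodestring(encoded_inner)

# Parse the decoded text as a standalone message.
inner = email.message_from_bytes(decoded_inner, policy=email.policy.default)

print(inner.get_boundary())
print(inner.is_multipart())
print(inner.get_payload()[0].get_content())
```

Once decoded, the boundary parameter and the delimiter lines agree again, which is why the second example in this message (the same content without the encoding) parses cleanly.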
msg400767	Author: anarcat (anarcat)	Date: 2021-08-31 18:23
Looking at email.feedparser.FeedParser._parse_gen(), it looks like this is going to be really hard to fix, because the parser just happily recurses into the sub-part without ever checking the CTE (Content-Transfer-Encoding). Decoding is typically only done in "get_payload()", which is obviously not called there because we're streaming the email in.

In general, support for quoted-printable as a CTE (defined in https://datatracker.ietf.org/doc/html/rfc2045#section-6.7) seems to be spotty at best. multipart/ parts will raise the (undocumented) exception InvalidMultipartContentTransferEncodingDefect if they encounter it, for example:

https://github.com/python/cpython/blob/3.9/Lib/email/feedparser.py#L322

So I'm not sure how to handle this. It's not clear to me either how to work around this problem at all... Is there a way to keep the parser from recursing like this?
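A small sketch of the two behaviours described in this message, using made-up sample messages rather than the one from the report: on a leaf part the CTE is only undone when get_payload(decode=True) is called, and a multipart part that declares a quoted-printable CTE gets a defect recorded (non-strict policy) instead of being decoded. The expectation that the defect ends up in .defects is based on the feedparser source linked above:

```python
import email
import email.policy

# Leaf part: parsing keeps the quoted-printable text as-is; decoding
# only happens on request via get_payload(decode=True).
leaf = email.message_from_string(
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=C3=A9\n",
    policy=email.policy.default,
)
raw_body = leaf.get_payload()                 # still encoded (str)
decoded_body = leaf.get_payload(decode=True)  # decoded (bytes)

# Multipart part declaring a quoted-printable CTE: with a non-strict
# policy the parser records a defect instead of raising.
multipart = email.message_from_string(
    "Content-Type: multipart/mixed; boundary=zzz\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "--zzz\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--zzz--\n",
    policy=email.policy.default,
)
defect_names = [type(d).__name__ for d in multipart.defects]

print(repr(raw_body))
print(decoded_body.decode("utf-8"))
print(defect_names)
```

With email.policy.strict the same defect would be raised as an exception rather than collected, which matches the strict-mode failures described in the first message.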

History
Date	User	Action	Args
2022-04-11 14:59:49	admin	set	github: 89229
2021-08-31 18:36:25	DiddiLeija	set	nosy: + DiddiLeija
2021-08-31 18:23:35	anarcat	set	messages: + msg400767
2021-08-31 18:04:32	anarcat	create