Issue 39071: [doc] email.parser.BytesParser - parse and parsebytes work not equivalent

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/83252

classification

Title:	[doc] email.parser.BytesParser - parse and parsebytes work not equivalent
Type:		Stage:
Components:	Documentation, email	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	barry, docs@python, iritkatriel, maxking, mkaiser, r.david.murray
Priority:	normal	Keywords:

Created on 2019-12-17 08:32 by mkaiser, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
test.eml	mkaiser, 2019-12-17 08:32	Test Mail with 2 mime parts (html and plain text)
test.py	mkaiser, 2019-12-17 08:32	Testscript for creating the hash values

Messages (5)
msg358533 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-17 08:32
I used email.parser.BytesParser for parsing mails. In one program I used parse, because the email was stored in a file. In a second program the email was stored in memory as a bytes object. I created hash values from each part and compared them, to check whether a part is already known to my programs. This works for attachments, but not for HTML and plain text parts.

Documentation for parsebytes:

    Similar to the parse() method, except it takes a bytes-like object instead of a file-like object. Calling this method on a bytes-like object is equivalent to wrapping bytes in a BytesIO instance first and calling parse().

When I read the documentation, I expected that both methods would produce the same output.

The test mail contains 2 MIME parts, one with HTML and one with plain text. The parse method with a file and the parse method with bytes data wrapped in a BytesIO produce the same hashes. The parsebytes method produces different hashes.

Output of my test program:

MD5 sums with parsebytes with bytes data
3f4ee7303378b62f723a8d958797507a
45c72465b931d32c7e700d2dd96f8383
------------------------
MD5 sums with parse and BytesIO with bytes data
fb0599d92750b72c25923139670e5127
9a54b64425b9003a9e6bf199ab6ba603
------------------------
MD5 sums with parse from file
fb0599d92750b72c25923139670e5127
9a54b64425b9003a9e6bf199ab6ba603

Is this expected behavior or is this an error?
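
The comparison described in this report can be approximated with the minimal sketch below. It does not reproduce the attached test.py or test.eml: the sample message, the boundary string, and the helper names `part_hashes` and `compare` are all hypothetical stand-ins. It hashes each leaf part once via `parsebytes` and once via `parse` on a `BytesIO` wrapper, for the same message with LF and with CRLF line endings. One plausible cause of a mismatch, not confirmed anywhere in this thread, is that `parse` reads through a text wrapper that may translate CRLF to LF, while `parsebytes` decodes the bytes directly.

```python
import hashlib
import io
from email.parser import BytesParser

# Hypothetical two-part message standing in for the attached test.eml.
SAMPLE_LF = (
    b"From: a@example.com\n"
    b"To: b@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\n'
    b"\n"
    b"--XYZ\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"\n"
    b"Hello\nWorld\n"
    b"--XYZ\n"
    b"Content-Type: text/html; charset=us-ascii\n"
    b"\n"
    b"<p>Hello</p>\n<p>World</p>\n"
    b"--XYZ--\n"
)
# Same message with CRLF line endings, as mail usually has on the wire.
SAMPLE_CRLF = SAMPLE_LF.replace(b"\n", b"\r\n")


def part_hashes(msg):
    """Return one MD5 hex digest per leaf (non-multipart) part."""
    return [
        hashlib.md5(part.get_payload(decode=True)).hexdigest()
        for part in msg.walk()
        if not part.is_multipart()
    ]


def compare(raw):
    """Hash the parts of `raw` via parsebytes() and via parse(BytesIO)."""
    via_parsebytes = part_hashes(BytesParser().parsebytes(raw))
    via_parse = part_hashes(BytesParser().parse(io.BytesIO(raw)))
    return via_parsebytes, via_parse


if __name__ == "__main__":
    for label, raw in (("LF", SAMPLE_LF), ("CRLF", SAMPLE_CRLF)):
        a, b = compare(raw)
        print(label, "identical" if a == b else "different")
```

If the two code paths disagree only for the CRLF variant, the difference is in line-ending handling rather than in MIME structure.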
msg358566 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-17 19:04
I think the best way is to fix the documentation. The reason is that when developers rely on the behavior of a function and that behavior is changed, a program may work incorrectly. Just think about forensic work: if a hash value was created with the "parsebytes" method and the behavior is later changed to match the "parse" method, the evidence can no longer be validated with the latest Python versions.

We could add a comment to the documentation. For example: "parsebytes parses the mail in a different way than parse, which may produce slightly different messages. If you rely on the same behavior for file and bytes-like objects, you can use the parse method with BytesIO."
msg358570 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-17 20:10
The problem is that you are starting with different inputs. Unicode strings and bytes are different things, and so parsing them can produce different results. The fact of the matter is that email messages are defined to be bytes, so parsing a Unicode string pretending it is an email message is just asking for errors anyway. The string parsing methods are really only provided for backward compatibility and historical reasons. I thought this was clear from the existing documentation, but clearly it isn't :)

I'll review a suggested doc change, but the thing to explain is not that parse and parsebytes might produce different results, but that parsing email from strings is not a good idea and will likely produce unexpected results for anything except the simplest non-MIME messages.

Note: the reason you got different checksums might have to do with line ends, depending on how you calculated the checksums. You should also consider using get_content rather than get_payload. get_payload has a weird legacy API that doesn't always do what you think it will, and that might be another source of checksum issues. But really, parsing a Unicode representation of a MIME message is just likely to be buggy.
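
The two suggestions in this message (watch the line ends, prefer get_content) could be combined roughly as follows. This is a sketch under stated assumptions, not a recommended forensic procedure: the sample message and the helper name `stable_part_hashes` are hypothetical, and normalising CRLF to LF before hashing is an illustrative design choice that changes the digest relative to the raw bytes. `get_content` is only available on messages parsed with a modern policy such as `email.policy.default`.

```python
import hashlib
import io
from email import policy
from email.parser import BytesParser

# Hypothetical CRLF-terminated two-part message used only for illustration.
SAMPLE = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"Hello\r\nWorld\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/html; charset=us-ascii\r\n"
    b"\r\n"
    b"<p>Hello</p>\r\n<p>World</p>\r\n"
    b"--XYZ--\r\n"
)


def stable_part_hashes(msg):
    """Hash each leaf part after normalising text line endings to LF."""
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        content = part.get_content()
        if isinstance(content, str):
            # Text parts: normalise line ends (an assumption of this sketch).
            content = content.replace("\r\n", "\n").encode("utf-8")
        elif not isinstance(content, bytes):
            # Anything else: fall back to the serialised part.
            content = part.as_bytes()
        digests.append(hashlib.md5(content).hexdigest())
    return digests


# policy.default yields EmailMessage objects, which provide get_content().
parser = BytesParser(policy=policy.default)
from_bytes = stable_part_hashes(parser.parsebytes(SAMPLE))
from_file = stable_part_hashes(parser.parse(io.BytesIO(SAMPLE)))
print(from_bytes == from_file)
```

Because text is normalised before hashing, the digests no longer depend on which parser entry point was used, at the cost of no longer matching hashes computed over the untouched payload bytes.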
msg358571 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-17 20:13
All of which isn't to discount that you might have found a bug, by the way, if you want to investigate further :)
msg408408 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-12-12 19:52
The relevant section in the docs is https://docs.python.org/3/library/email.parser.html#email.parser.Parser

It currently doesn't advise against using the text parser in any way. At the top of the page, the second paragraph says: "You can pass the parser a bytes, string or file object, and the parser will return to you the root EmailMessage instance of the object structure."

History
Date	User	Action	Args
2022-04-11 14:59:24	admin	set	github: 83252
2021-12-12 19:52:30	iritkatriel	set	assignee: docs@python components: + Documentation title: email.parser.BytesParser - parse and parsebytes work not equivalent -> [doc] email.parser.BytesParser - parse and parsebytes work not equivalent nosy: + iritkatriel, docs@python versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.5, Python 3.6, Python 3.7, Python 3.8 messages: + msg408408
2019-12-22 00:38:15	maxking	set	nosy: + maxking
2019-12-17 20:13:36	r.david.murray	set	messages: + msg358571
2019-12-17 20:10:29	r.david.murray	set	messages: + msg358570
2019-12-17 19:04:33	mkaiser	set	messages: + msg358566
2019-12-17 08:32:51	mkaiser	set	files: + test.py
2019-12-17 08:32:29	mkaiser	create