classification
Title: email.parser.BytesParser - parse and parsebytes work not equivalent
Type: Stage:
Components: email Versions: Python 3.8, Python 3.7, Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, maxking, mkaiser, r.david.murray
Priority: normal Keywords:

Created on 2019-12-17 08:32 by mkaiser, last changed 2019-12-22 00:38 by maxking.

Files
File name Uploaded Description Edit
test.eml mkaiser, 2019-12-17 08:32 Test Mail with 2 mime parts (html and plain text)
test.py mkaiser, 2019-12-17 08:32 Testscript for creating the hash values
Messages (4)
msg358533 - (view) Author: Manfred Kaiser (mkaiser) * Date: 2019-12-17 08:32
I used email.parser.BytesParser for parsing mails. 

In one program I used parse, because the email was stored in a file.
In a second program the email was held in memory as a bytes object.

I created hash values from each part and compared them to check whether a part is already known to my programs. This works for attachments, but not for HTML and plain-text parts.

Documentation for parsebytes:

Similar to the parse() method, except it takes a bytes-like object instead of a file-like object. Calling this method on a bytes-like object is equivalent to wrapping bytes in a BytesIO instance first and calling parse().

When I read the documentation, I expected both methods to produce the same output.

The test mail contains 2 MIME parts: one with HTML and one with plain text.

The parse method with a file and the parse method with bytes data wrapped in a BytesIO produce the same hashes. The parsebytes method produces different hashes.

Output of my test program:

MD5 sums with parsebytes with bytes data
3f4ee7303378b62f723a8d958797507a
45c72465b931d32c7e700d2dd96f8383
------------------------
MD5 sums with parse and BytesIO with bytes data
fb0599d92750b72c25923139670e5127
9a54b64425b9003a9e6bf199ab6ba603
------------------------
MD5 sums with parse from file
fb0599d92750b72c25923139670e5127
9a54b64425b9003a9e6bf199ab6ba603



Is this expected behavior, or is it a bug?
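
A minimal sketch of the comparison described above, for anyone who wants to reproduce it without the attachments. The message below is a hypothetical stand-in for test.eml, and the helper names (part_digests, compare) are made up for illustration; this is not the attached test.py. One possible explanation worth checking is line endings: in the CPython versions I have looked at, parse() appears to read through a text wrapper that translates CRLF to LF, while parsebytes() decodes the bytes directly. The sketch therefore runs the same message with CRLF and with LF line endings, with an optional normalization step.

```python
import hashlib
from email.parser import BytesParser
from io import BytesIO

# Hypothetical two-part message standing in for test.eml (not the real attachment).
LINES = [
    b"From: a@example.com",
    b"To: b@example.com",
    b"Subject: demo",
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/alternative; boundary="XYZ"',
    b"",
    b"--XYZ",
    b"Content-Type: text/plain; charset=us-ascii",
    b"",
    b"Hello",
    b"World",
    b"--XYZ",
    b"Content-Type: text/html; charset=us-ascii",
    b"",
    b"<p>Hello</p>",
    b"<p>World</p>",
    b"--XYZ--",
    b"",
]
RAW_CRLF = b"\r\n".join(LINES)  # line endings as typically found in stored mail
RAW_LF = b"\n".join(LINES)      # same message with Unix line endings


def part_digests(msg, normalize=False):
    """Return the MD5 hex digest of every non-multipart part, in walk() order."""
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own to hash
        payload = part.get_payload(decode=True)
        if normalize:
            # Assumption: CRLF vs. LF is the only difference we want to ignore.
            payload = payload.replace(b"\r\n", b"\n")
        digests.append(hashlib.md5(payload).hexdigest())
    return digests


def compare(raw, normalize=False):
    """Hash the parts once via parsebytes() and once via parse(BytesIO(...))."""
    via_parsebytes = part_digests(BytesParser().parsebytes(raw), normalize)
    via_parse = part_digests(BytesParser().parse(BytesIO(raw)), normalize)
    return via_parsebytes, via_parse


if __name__ == "__main__":
    for label, raw in (("CRLF", RAW_CRLF), ("LF", RAW_LF)):
        a, b = compare(raw)
        print(label, "parsebytes:", a)
        print(label, "parse     :", b)
        print(label, "equal?    :", a == b)
```

If the digests only differ for the CRLF input and agree once normalize=True is passed, line endings are the likely cause. That would also fit the observation that attachments hash identically, since base64 decoding ignores line breaks.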
msg358566 - (view) Author: Manfred Kaiser (mkaiser) * Date: 2019-12-17 19:04
I think the best way is to fix the documentation. If developers rely on the current behavior of a function and that behavior changes, their programs may stop working correctly.

Think about forensics: if a hash value was created with the "parsebytes" method and the behavior is later changed to match the "parse" method, the evidence can no longer be validated with the latest Python versions.

We could add a note to the documentation, for example: "parsebytes parses the mail differently from parse, which may produce slightly different messages. If you rely on the same behavior for file and bytes-like objects, you can use the parse method with BytesIO."
msg358570 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2019-12-17 20:10
The problem is that you are starting with different inputs. Unicode strings and bytes are different things, so parsing them can produce different results. The fact of the matter is that email messages are defined to be bytes, so parsing a Unicode string as if it were an email message is just asking for errors anyway. The string parsing methods are really only provided for backward compatibility and historical reasons.

I thought this was clear from the existing documentation, but clearly it isn't :) I'll review a suggested doc change, but the thing to explain is not that parse and parsebytes might produce different results, but that parsing email from strings is not a good idea and will likely produce unexpected results for anything except the simplest non-MIME messages.

Note: the reason you got different checksums might have to do with line endings, depending on how you calculated the checksums. You should also consider using get_content rather than get_payload. get_payload has a weird legacy API that doesn't always do what you think it will, and that might be another source of checksum issues. But really, parsing a Unicode representation of a MIME message is just likely to be buggy.
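
To illustrate the note above, here is a minimal sketch of hashing parts via get_content with line endings normalized first. It assumes the parser is created with policy=email.policy.default (get_content is only available on the newer message class), and it assumes that CRLF versus LF is the only difference to be ignored. The sample message and the content_digests helper are hypothetical.

```python
import hashlib
from email import policy
from email.parser import BytesParser
from io import BytesIO

# Hypothetical sample message with CRLF line endings.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Hello\r\nWorld\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<p>Hello</p>\r\n<p>World</p>\r\n"
    b"--XYZ--\r\n"
)


def content_digests(msg):
    """MD5 of each leaf part's content, with text line endings normalized."""
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # get_content() is not defined for multipart containers
        content = part.get_content()
        if isinstance(content, str):
            # Text parts come back as str: normalize newlines, then encode
            # with a fixed codec so the digest does not depend on the source.
            content = content.replace("\r\n", "\n").encode("utf-8")
        digests.append(hashlib.md5(content).hexdigest())
    return digests


parser = BytesParser(policy=policy.default)
from_bytes = content_digests(parser.parsebytes(RAW))
from_stream = content_digests(parser.parse(BytesIO(RAW)))

if __name__ == "__main__":
    print("parsebytes:", from_bytes)
    print("parse     :", from_stream)
```

Normalizing before hashing makes the digest independent of which parsing method produced the message, at the cost of no longer distinguishing two parts that differ only in their line endings.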
msg358571 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2019-12-17 20:13
All of which isn't to discount that you might have found a bug, by the way, if you want to investigate further :)
History
Date User Action Args
2019-12-22 00:38:15 maxking set nosy: + maxking
2019-12-17 20:13:36 r.david.murray set messages: + msg358571
2019-12-17 20:10:29 r.david.murray set messages: + msg358570
2019-12-17 19:04:33 mkaiser set messages: + msg358566
2019-12-17 08:32:51 mkaiser set files: + test.py
2019-12-17 08:32:29 mkaiser create