This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: str serialization of Message object may mutate the payload and CTE.
Type: behavior Stage: resolved
Components: email, Library (Lib) Versions: Python 3.3, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: barry, python-dev, r.david.murray, vajrasky
Priority: normal Keywords: easy, patch

Created on 2013-11-25 16:14 by r.david.murray, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
fix_serialization_message_object_mutation_v1.patch vajrasky, 2013-11-27 08:37 review
fix_serialization_message_object_mutation_v2.patch vajrasky, 2013-11-27 08:38 review
fix_serialization_message_object_mutation_v3.patch vajrasky, 2013-11-27 08:47 review
fix_serialization_message_object_mutation_v4.patch vajrasky, 2014-02-08 03:40 review
Messages (12)
msg204356 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-25 16:14
Currently the string generator will downcast 8bit body parts to 7bit by encoding them using a 7bit CTE.  This is good.  However, the way it does it is to modify the Message object accordingly.  This is not good.  Except for filling in required missing bits of data (such as MIME borders), serializing a Message should not mutate it.  (And, for the curious, I am the one who made this mistake when I wrote the code :)
msg204564 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-11-27 08:37
I come out with two patches.

The first patch using bytes generator to copy the message before mutating it in string generator, then restore it after printing the mutated message.
msg204565 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-11-27 08:38
The second patch copies the private variables of the message before mutating it in string generator, then restore the private variables after printing the mutated message.
msg204567 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-11-27 08:47
I got another inspiration.

The third patch uses copy.copy to copy the Message before mutating it in string generator, then restore it after printing the mutated message.
msg210534 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-07 19:27
While the copy solution seems reasonable, unfortunately it looks like the solution isn't that simple.

Your test method didn't start with 'test_', so it didn't get run.  Your fix doesn't fix the problem, since other parts of the code are holding on to a pointer to the original message object.  But even if I change your fix to only make a local copy and use it to get the transformed payload, that fix breaks other tests, since the *headers* aren't changed in the output, even if the payload is.  This is because of the odd way the generator accumulates its output, and I unfortunately don't have time to remember how it all works in order to craft a fix before the 3.4 alpha :(.
msg210586 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2014-02-08 03:40
> Your test method didn't start with 'test_', so it didn't get run.

Ah sorry about that. This is the updated patch. It fixed the problem and passed all test.
msg210593 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2014-02-08 08:14
Actually, I am thinking of this approach (cloning the message just after entering the flatten method):

diff -r b541ecd32115 Lib/email/generator.py
--- a/Lib/email/generator.py	Fri Feb 07 16:11:17 2014 -0800
+++ b/Lib/email/generator.py	Sat Feb 08 15:55:01 2014 +0800
@@ -67,7 +67,7 @@
         # Just delegate to the file object
         self._fp.write(s)
 
-    def flatten(self, msg, unixfrom=False, linesep=None):
+    def flatten(self, msg_obj, unixfrom=False, linesep=None):
         r"""Print the message object tree rooted at msg to the output file
         specified when the Generator instance was created.
 
@@ -86,6 +86,8 @@
         # from the msg, and _encoded_XXX constants for operating on data that
         # has already been converted (to bytes in the BytesGenerator) and
         # inserted into a temporary buffer.
+        import copy
+        msg = copy.copy(msg_obj)
         policy = msg.policy if self.policy is None else self.policy
         if linesep is not None:
             policy = policy.clone(linesep=linesep)

It works for this bug but apparently it fails one test.

======================================================================
FAIL: test_make_boundary (__main__.TestMessageAPI)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_email/test_email.py", line 199, in test_make_boundary
    'multipart/form-data; boundary="==')
AssertionError: 'multipart/form-data' != 'multipart/form-data; boundary="=='
- multipart/form-data
+ multipart/form-data; boundary="==

So somehow, according to the ultimate design of email library, msg can not leave flatten method unscratched. :)
msg210647 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-08 14:21
I'm guessing the problem is that copy is a shallow copy, and message objects can be deeply nested.  Doing a deepcopy might work, but as things are currently, that could cost a lot of memory, so I don't want to do that.
msg210661 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-08 15:36
Ah, it is because the boundary computation intentionally modifies the message object.  Which is a questionable design decision overall, but is now set in stone :(.
msg210665 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-08 16:51
New changeset 34fb36972f8d by R David Murray in branch '3.3':
#19772: Do not mutate message when downcoding to 7bit.
http://hg.python.org/cpython/rev/34fb36972f8d

New changeset 2e97d3500970 by R David Murray in branch 'default':
Merge #19772: Do not mutate message when downcoding to 7bit.
http://hg.python.org/cpython/rev/2e97d3500970
msg210666 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-08 17:09
Vajrasky: thanks for your work on this, it helped me find a reasonable (if ugly) solution.  However, the 'if' statement in your patch that checks specifically for the combination of old and new cte of 8bit and base64 puzzles me.  The same problem occurs here if the CTE we end up with is quoted-printable, which will be the case, for example, for latin-1.  The logical way to think abut the problem here (we don't want to modify the message) would be "if the CTE changed, we need to restore it".  So why check specifically for 8bit and base64?  The straightforward translation of "if the CTE changed, we need to restore it" would have been: 'if self.orig_cte != cte'.

In my solution I preferred to leave the original object unchanged, and re-copy it to re-change the headers when needed.  This is ugly, but it is more future proof against the down-coding code someday changing other headers or other characteristics of the message.  And I needed to use deepcopy, because copy would just copy a pointer to the _headers list, and therefore the original headers list would still get modified.  The deepcopy isn't that costly here, though, because we know that it is only called on a message that is not a multipart, since the copies are triggered only from _handle_text.  It's still not pretty, but I think it is the best we can do without completely restructuring the way the generator does its work.
msg210797 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2014-02-10 03:11
> So why check specifically for 8bit and base64?

I was in hurry when creating this patch. I think that was part of the debug code. This patch was created with the purpose to illuminate the culprit of this bug. And it served its purpose.

After uploading this patch, I was about to refine the patch. Then I started to think about.... shouldn't there be a more elegant way to solve this problem? Cloning the msg after entering the flatten message? It is supposed to be a clean and simple solution. Then I hit the part where the boundary computation intentionally modifies the message object. Then I remembered the email API had too much magics on it. >.<
History
Date User Action Args
2022-04-11 14:57:54adminsetgithub: 63971
2014-02-10 03:11:31vajraskysetmessages: + msg210797
2014-02-08 17:10:30r.david.murraysetstatus: open -> closed
resolution: fixed
stage: needs patch -> resolved
2014-02-08 17:09:39r.david.murraysetmessages: + msg210666
2014-02-08 16:51:47python-devsetnosy: + python-dev
messages: + msg210665
2014-02-08 15:36:46r.david.murraysetmessages: + msg210661
2014-02-08 14:21:53r.david.murraysetmessages: + msg210647
2014-02-08 08:14:43vajraskysetmessages: + msg210593
2014-02-08 03:40:30vajraskysetfiles: + fix_serialization_message_object_mutation_v4.patch

messages: + msg210586
2014-02-07 19:27:50r.david.murraysetmessages: + msg210534
2013-11-27 08:47:33vajraskysetfiles: + fix_serialization_message_object_mutation_v3.patch

messages: + msg204567
2013-11-27 08:38:54vajraskysetfiles: + fix_serialization_message_object_mutation_v2.patch

messages: + msg204565
2013-11-27 08:37:52vajraskysetfiles: + fix_serialization_message_object_mutation_v1.patch
keywords: + patch
messages: + msg204564
2013-11-25 16:14:11r.david.murraycreate