
Author r.david.murray
Recipients Arfrever, apollo13, barry, r.david.murray, vajrasky
Date 2013-11-20.19:20:51
Message-id <1384975252.98.0.993176952574.issue19063@psf.upfronthosting.co.za>
In-reply-to
Content
Vajrasky: thanks for taking a crack at this, but, well, there are a lot of subtleties involved here, due to the way the organic growth of the email package over many years has led to some really bad design issues.

It took me a lot of time to boot back up my understanding of how all this stuff hangs together (answer: badly).  After wandering down many blind alleys, the problem turns out to be yet one more disconnect in the model.  We previously fixed the issue where bad things would happen if set_payload was passed binary data.  That made the model more consistent, in that _payload was now a surrogateescaped string when the payload was specified as binary data.

But what the model *really* needs is that _payload *always* be an ascii+surrogateescape string, and never a full unicode string.  (Yeah, this is a sucky model...it ought to always be binary instead, but we are dealing with legacy code here.)
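As a rough illustration of what an "ascii+surrogateescape string" looks like, the sketch below uses only the standard codecs and does not touch the email package. The sample text and the variable names are made up for the example.

```python
# Sketch: arbitrary bytes carried inside a str via ASCII + surrogateescape.
raw = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# Each non-ASCII byte 0xNN becomes the lone surrogate U+DCNN.
as_text = raw.decode("ascii", "surrogateescape")

# The mapping is lossless: the same error handler restores the bytes.
round_tripped = as_text.encode("ascii", "surrogateescape")

# Collect the code points that stand in for escaped bytes.
escaped = [hex(ord(ch)) for ch in as_text if 0xDC80 <= ord(ch) <= 0xDCFF]
```

Here `escaped` holds the two surrogates standing in for the UTF-8 bytes of the accented character, and `round_tripped` equals `raw`.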

Currently it can be a unicode string.  If it is, set_charset turns it into an ascii-only string by encoding it with the qp or base64 CTE.  This is pretty much just by luck, though.

If you set body_encode to None, the encode_7or8bit encoder thinks the string is 7bit.  It calls get_payload(decode=True), which, because the model invariant was broken, returns a raw-unicode-escape string, which is a 7bit representation.  That doesn't affect the payload, but it does result in the wrong CTE being used.
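The misdetection can be sketched without the email package. The `looks_7bit` function below is a hypothetical stand-in for an "is this all ASCII?" check, not the actual encode_7or8bit code, and the sample text is an assumption. Note that raw-unicode-escape only produces pure ASCII for characters above U+00FF; Latin-1 characters come out as raw 8-bit bytes.

```python
def looks_7bit(data: bytes) -> bool:
    """Hypothetical stand-in for an 'is this payload pure ASCII?' check."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# Assumed sample text, chosen so every character is above U+00FF.
text = "\u5143\u6c17"

# raw-unicode-escape spells such characters as ASCII \uXXXX sequences.
escaped = text.encode("raw-unicode-escape")

# The bytes that would actually go on the wire for a utf-8 message.
real = text.encode("utf-8")

escaped_is_7bit = looks_7bit(escaped)  # escaped form passes the ASCII check
real_is_7bit = looks_7bit(real)        # the real bytes do not
```

The escaped form passes the check even though the text it stands for needs an 8bit (or qp/base64) transfer encoding.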

The fix is to fix the model invariant by turning a unicode string passed in to set_payload into an ascii+surrogateescape string with the escaped bytes being the unicode encoded to the output charset.
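The conversion described here can be sketched in two codec calls. `to_model_payload` is a hypothetical helper written for illustration, not the code in the attached patch, and it takes a plain codec name where the real package would work with a Charset and its output charset.

```python
def to_model_payload(text: str, charset: str) -> str:
    """Hypothetical helper: unicode text -> ascii+surrogateescape string."""
    # Encode to the bytes the message would carry in the output charset...
    wire_bytes = text.encode(charset)
    # ...then smuggle the non-ASCII bytes into a str as lone surrogates.
    return wire_bytes.decode("ascii", "surrogateescape")


payload = to_model_payload("caf\u00e9", "utf-8")

# Recovering the wire bytes later is a single encode call.
wire = payload.encode("ascii", "surrogateescape")

# Every character is either ASCII or an escaped byte, never "real" unicode.
invariant_holds = all(
    ord(ch) < 128 or 0xDC80 <= ord(ch) <= 0xDCFF for ch in payload
)
```

Pure-ASCII input passes through unchanged, so only payloads that actually contain non-ASCII text are affected.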

Unfortunately it is also possible to call set_payload without a charset, and *then* call set_charset.  To keep from breaking the code of anyone currently doing that, I had to allow a full unicode _payload, and detect it in set_charset.

My plan is to fix that in 3.4, causing a backward compatibility break because it will no longer be possible to call set_payload with a unicode string containing non-ascii if you don't also provide a character set.  I believe this is an acceptable break, since otherwise you *must* leave the model in an ambiguous state, and you have the possibility of "leaking" unicode characters out into your wire-format message, which would ultimately result in either an exception at serialization time or, worse, mojibake.
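The two failure modes named above can be reproduced with plain codecs. This is only a sketch of the general effect; the sample string and the choice of latin-1 as the reader's (wrong) charset are assumptions, and no email API is involved.

```python
# Assumed sample: a unicode string that "leaked" without a declared charset.
leaked = "caf\u00e9"

# Failure mode 1: an ASCII-only serializer raises on the non-ASCII character.
try:
    leaked.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# Failure mode 2: bytes written as utf-8 but read back as latin-1 are mojibake.
mojibake = leaked.encode("utf-8").decode("latin-1")
```

After running this, `failed` is true and `mojibake` shows two garbage characters where the single accented letter was.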

Patch attached.
History
Date                 User            Action  Args
2013-11-20 19:20:53  r.david.murray  set     recipients: + r.david.murray, barry, Arfrever, apollo13, vajrasky
2013-11-20 19:20:52  r.david.murray  set     messageid: <1384975252.98.0.993176952574.issue19063@psf.upfronthosting.co.za>
2013-11-20 19:20:52  r.david.murray  link    issue19063 messages
2013-11-20 19:20:51  r.david.murray  create