Message203521
Vajrasky: thanks for taking a crack at this, but, well, there are a lot of subtleties involved here, due to the way the organic growth of the email package over many years has led to some really bad design issues.
It took me a lot of time to boot back up my understanding of how all this stuff hangs together (answer: badly). After I wandered down many blind alleys, the problem turned out to be yet one more disconnect in the model. We previously fixed the issue where bad things would happen if set_payload was passed binary data. That made the model more consistent, in that _payload was now a surrogateescaped string when the payload was specified as binary data.
But what the model *really* needs is that _payload *always* be an ascii+surrogateescape string, and never a full unicode string. (Yeah, this is a sucky model...it ought to always be binary instead, but we are dealing with legacy code here.)
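To make "ascii+surrogateescape string" concrete, here is a minimal sketch using only the standard codec error handler. It shows the string form only and does not touch the email package itself; the variable names are just for illustration.

```python
# Raw bytes as they might appear in an 8-bit message body.
raw = "café".encode("utf-8")  # b'caf\xc3\xa9'

# Decoding as ascii with surrogateescape maps each non-ascii byte to a
# lone surrogate in the range U+DC80..U+DCFF instead of raising an error.
escaped = raw.decode("ascii", "surrogateescape")

# Encoding with the same error handler restores the original bytes exactly,
# which is what lets a str stand in for binary data.
restored = escaped.encode("ascii", "surrogateescape")
```

The point is that such a string carries bytes, not characters: the two surrogates at the end are placeholders for 0xC3 and 0xA9, not a representation of "é".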
Currently it can be a unicode string. If it is, set_charset turns it into an ascii-only string by encoding it with the qp or base64 CTE. This is pretty much just by luck, though.
If you set body_encoding to None, what happens is that the encode_7or8bit encoder thinks the string is 7bit, because it calls get_payload(decode=True), which, because the model invariant was broken, returns a raw-unicode-escape encoded string, which is a 7bit representation. That doesn't affect the payload, but it does result in the wrong CTE being used.
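A rough illustration of why a raw-unicode-escape result can fool a 7bit check. The looks_7bit helper is hypothetical and only mimics the general idea of "does this decode as ascii"; it is not the code in encode_7or8bit. The sample text is chosen to lie outside Latin-1, where the codec emits backslash-u escapes.

```python
def looks_7bit(data: bytes) -> bool:
    """Hypothetical stand-in for a 7bit test: True if data is pure ascii."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


text = "日本語"

# raw-unicode-escape writes these characters as literal \uXXXX sequences,
# so the resulting bytes are all ascii.
escaped = text.encode("raw-unicode-escape")

# The bytes that would actually go on the wire in utf-8 are not ascii.
real = text.encode("utf-8")

print(looks_7bit(escaped), looks_7bit(real))
```

The escaped form passes the check even though the real payload needs 8bit (or an encoded CTE), which matches the wrong-CTE symptom described above.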
The fix is to fix the model invariant by turning a unicode string passed in to set_payload into an ascii+surrogateescape string with the escaped bytes being the unicode encoded to the output charset.
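The transformation described here can be sketched roughly as follows. The function name to_model_payload is made up for illustration and the attached patch may well be structured differently; this only shows the encode-then-surrogateescape step on plain strings.

```python
def to_model_payload(payload: str, output_charset: str) -> str:
    """Hypothetical helper: turn a unicode payload into an
    ascii+surrogateescape string carrying the charset-encoded bytes."""
    wire_bytes = payload.encode(output_charset)
    return wire_bytes.decode("ascii", "surrogateescape")


stored = to_model_payload("naïve café", "utf-8")

# Recovering the wire bytes from the stored form is a plain re-encode.
wire = stored.encode("ascii", "surrogateescape")
```

With the payload stored this way, any later check that re-encodes it with surrogateescape sees the real utf-8 bytes and can tell that they are not 7bit.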
Unfortunately it is also possible to call set_payload without a charset, and *then* call set_charset. To keep from breaking the code of anyone currently doing that, I had to allow a full unicode _payload, and detect it in set_charset.
My plan is to fix that in 3.4, causing a backward compatibility break because it will no longer be possible to call set_payload with a unicode string containing non-ascii if you don't also provide a character set. I believe this is an acceptable break, since otherwise you *must* leave the model in an ambiguous state, and you risk "leaking" unicode characters out into your wire-format message, which would ultimately result in either an exception at serialization time or, worse, mojibake.
Patch attached.
Date | User | Action | Args
2013-11-20 19:20:53 | r.david.murray | set | recipients: + r.david.murray, barry, Arfrever, apollo13, vajrasky
2013-11-20 19:20:52 | r.david.murray | set | messageid: <1384975252.98.0.993176952574.issue19063@psf.upfronthosting.co.za>
2013-11-20 19:20:52 | r.david.murray | link | issue19063 messages
2013-11-20 19:20:51 | r.david.murray | create |