Message 203521 - Python tracker


Author	r.david.murray
Recipients	Arfrever, apollo13, barry, r.david.murray, vajrasky
Date	2013-11-20.19:20:51
Message-id	<1384975252.98.0.993176952574.issue19063@psf.upfronthosting.co.za>
In-reply-to

Content

Vajrasky: thanks for taking a crack at this, but there are a lot of subtleties involved here, because the organic growth of the email package over many years has led to some really bad design issues.

It took me a long time to rebuild my understanding of how all this stuff hangs together (answer: badly). After wandering down many blind alleys, I found that the problem is yet one more disconnect in the model. We previously fixed the issue where bad things would happen if set_payload was passed binary data. That made the model more consistent, in that _payload became a surrogateescaped string when the payload was specified as binary data.

But what the model *really* needs is that _payload *always* be an ascii+surrogateescape string, and never a full unicode string. (Yeah, this is a sucky model...it ought to always be binary instead, but we are dealing with legacy code here.)
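To make "ascii+surrogateescape string" concrete, here is a minimal sketch using only the standard codec machinery. The sample text is an arbitrary choice for illustration; nothing here is taken from the patch.

```python
# Bytes as they might arrive from the wire (UTF-8 encoded text).
raw = "caf\u00e9".encode("utf-8")

# Decode as ASCII with the surrogateescape handler: ASCII bytes become
# ordinary characters, and every non-ASCII byte 0xNN becomes the lone
# surrogate U+DCNN.
as_model_str = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler recovers the original bytes exactly,
# so the string is a lossless stand-in for binary data.
round_tripped = as_model_str.encode("ascii", "surrogateescape")
```

The point of the invariant is that such a string can always be turned back into the exact bytes, which a full unicode string cannot do without knowing a charset.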

Currently it can be a unicode string. If it is, set_charset turns it into an ASCII-only string by encoding it with the qp or base64 CTE. This works pretty much just by luck, though.

If you set body_encode to None, the encode_7or8bit encoder thinks the string is 7bit. It calls get_payload(decode=True), and because the model invariant is broken, that call returns a raw-unicode-escape encoding of the payload, which is a 7bit representation. That doesn't affect the payload, but it does result in the wrong CTE being used.
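A rough illustration of why the fallback fools a 7bit check. This is a sketch with the codecs alone, not the email package's code, and looks_7bit is a hypothetical helper. The sample uses code points above U+00FF, since raw-unicode-escape emits characters in the Latin-1 range as single bytes rather than as escapes.

```python
def looks_7bit(data: bytes) -> bool:
    """Hypothetical stand-in for a 7bit check: True if all bytes are ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


text = "\u4f60\u597d"  # non-ASCII text held as a full unicode string

# The fallback representation: backslash escapes, all ASCII bytes.
fallback = text.encode("raw-unicode-escape")

# What would actually go on the wire for a utf-8 message: 8bit bytes.
wire = text.encode("utf-8")

fallback_is_7bit = looks_7bit(fallback)
wire_is_7bit = looks_7bit(wire)
```

The check sees only the escaped form, so it reports 7bit even though the serialized body contains 8bit data.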

The fix is to restore the model invariant: a unicode string passed to set_payload is turned into an ascii+surrogateescape string whose escaped bytes are the unicode text encoded to the output charset.
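The transformation described above can be sketched as a standalone function. to_model_payload is a hypothetical name for illustration, not the function in the patch, and it takes the output charset as a plain codec name rather than a Charset object.

```python
def to_model_payload(text: str, output_charset: str) -> str:
    """Sketch: turn unicode text into an ascii+surrogateescape string.

    The text is first encoded to the output charset, then the resulting
    bytes are smuggled into a str via the surrogateescape handler.
    """
    encoded = text.encode(output_charset)
    return encoded.decode("ascii", "surrogateescape")


payload = to_model_payload("\u00e9", "utf-8")

# The escaped form converts back to the charset-encoded bytes.
wire_bytes = payload.encode("ascii", "surrogateescape")
```

With the payload stored this way, a later 7bit check sees the real 8bit bytes instead of an escaped form.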

Unfortunately, it is also possible to call set_payload without a charset, and *then* call set_charset. To avoid breaking the code of anyone currently doing that, I had to keep allowing a full unicode _payload and detect it in set_charset.

My plan is to fix that in 3.4. This is a backward compatibility break, because it will no longer be possible to call set_payload with a unicode string containing non-ascii characters unless you also provide a character set. I believe this is an acceptable break, since otherwise you *must* leave the model in an ambiguous state, and you have the possibility of "leaking" unicode characters out into your wire-format message, which would ultimately result in either an exception at serialization time or, worse, mojibake.
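For callers, the unambiguous pattern is to pass the charset together with the text. A minimal sketch using the legacy Message API follows. The sample text is arbitrary, and the behaviour described is an assumption about a current Python 3 rather than something stated in this message.

```python
from email.message import Message

msg = Message()

# Supplying the charset with the payload means the model never has to
# hold non-ASCII text without knowing how it will be encoded.
msg.set_payload("caf\u00e9", "utf-8")

# decode=True undoes the content transfer encoding and returns bytes
# in the message's charset.
body_bytes = msg.get_payload(decode=True)
charset = msg.get_content_charset()
```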

Patch attached.

History
Date	User	Action	Args
2013-11-20 19:20:53	r.david.murray	set	recipients: + r.david.murray, barry, Arfrever, apollo13, vajrasky
2013-11-20 19:20:52	r.david.murray	set	messageid: <1384975252.98.0.993176952574.issue19063@psf.upfronthosting.co.za>
2013-11-20 19:20:52	r.david.murray	link	issue19063 messages
2013-11-20 19:20:51	r.david.murray	create