classification
Title: Python 3.3.3 encodes emails containing non-ascii data as 7bit
Type: behavior Stage: committed/rejected
Components: email, Library (Lib) Versions: Python 3.4, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, apollo13, barry, python-dev, r.david.murray, vajrasky
Priority: normal Keywords: patch

Created on 2013-09-21 19:22 by apollo13, last changed 2014-02-07 18:06 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
fix_8bit_data_charset_none_set_payload.patch vajrasky, 2013-10-30 15:01 review
fix_8bit_data_charset_none_set_payload_v2.patch vajrasky, 2013-11-01 09:06 review
support_8bit_charset_cte.patch r.david.murray, 2013-11-21 16:06 review
support_8bit_charset_cte_v2.patch vajrasky, 2013-11-23 14:36 review
support_8bit_charset_cte_v3.patch vajrasky, 2013-11-23 14:51 review
support_8bit_charset_cte.patch r.david.murray, 2013-11-25 16:05 3.3 patch review
support_8bit_charset_cte.patch r.david.murray, 2013-11-25 16:05 3.4 patch review
support_8bit_charset_cte.patch r.david.murray, 2013-11-29 21:39 updated 3.4 patch review
Messages (24)
msg198217 - (view) Author: Florian Apolloner (apollo13) Date: 2013-09-21 19:22
Take the following example:

>>> from email.mime.nonmultipart import *
>>> from email.charset import *
>>> msg = MIMENonMultipart('text', 'plain')
>>> cs = Charset('utf-8')
>>> cs.body_encoding = None
>>> msg.set_payload('А Б В Г Д Е Ж Ѕ З И І К Л М Н О П.', cs)
>>> msg.as_string()
'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 7bit\n\nА Б В Г Д Е Ж Ѕ З И І К Л М Н О П.'

Till python 3.3.2 this created the email as 8bit, which is in line with what Thunderbird creates on my box. And if I understand the RFC correctly (https://tools.ietf.org/html/rfc2045#section-2.7) 7bit should only be used for ASCII. This issue was introduced by http://hg.python.org/cpython/rev/64e004737837 -- I changed orig = msg.get_payload(decode=True) back to it's original value (decode=False) in encode_7or8bit and get the "correct" behavior again. Running the tests didn't cause any errors (I hope I ran the correct one).
msg198220 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-09-21 19:31
There is definitely a bug here, but 8bit would also be wrong, since you are calling as_string.  It *should* be producing a 7bit CTE with a base64 encoded part in that case.
msg198221 - (view) Author: Florian Apolloner (apollo13) Date: 2013-09-21 19:36
Am I not explicitelly disabling base64 by setting body_encoding to None?
msg198222 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-09-21 19:39
You are, but you are also calling as_string.  Unicode can not handle 8bit data, therefore the email package must down-transform all data to 7bit when converting it to a string, just like a mail server trying to send to another mail server that can only handle 7bit would do.

If you used BytesGenerator to serialize the message, *then* you would expect to see 8bit utf-8 data in the body.

There are supposed to be tests for these cases, so I have to investigate why the tests are broken, as well :)
msg198226 - (view) Author: Florian Apolloner (apollo13) Date: 2013-09-21 22:05
Using BytesGenerator I get:
>>> fp = BytesIO()
>>> g = BytesGenerator(fp)
>>> msg = MIMENonMultipart('text', 'plain')
>>> msg.set_payload('А Б В Г Д Е Ж Ѕ З И І К Л М Н О П.', cs)
>>> g.flatten(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.3/email/generator.py", line 177, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.3/email/generator.py", line 203, in _dispatch
    meth(msg)
  File "/usr/lib/python3.3/email/generator.py", line 421, in _handle_text
    super(BytesGenerator,self)._handle_text(msg)
  File "/usr/lib/python3.3/email/generator.py", line 233, in _handle_text
    self._write_lines(payload)
  File "/usr/lib/python3.3/email/generator.py", line 158, in _write_lines
    self.write(laststripped)
  File "/usr/lib/python3.3/email/generator.py", line 395, in write
    self._fp.write(s.encode('ascii', 'surrogateescape'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u0410' in position 0: ordinal not in range(128)

But I might be using it wrong :/
msg199393 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-10 15:25
There is definitely a bug in set_payload here, and (obviously :) no test for that case (passing an 8bit charset to set_payload).
msg201741 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-10-30 15:01
Here is the preliminary patch to fix the problem. My patch produces 8bit for msg.as_string and msg.as_bytes for simplicity reason.

If msg.as_string should gives content-transfer-encoding 7bit with 8bit data but msg.as_bytes should gives content-transfer-encoding 8bit with 8bit data, I can modify the patch. But it doesn't feel right...
msg201770 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-30 19:16
msg.as_string should not be producing a CTE of 8bit.  I haven't looked at your patch so I don't know what you mean by having as_string produce 8bit data, but it can't be right :)

To clarify: as_string must produce valid unicode data, and therefore *cannot* have any 8bit data in it.  Unicode is not an 8bit channel (in SMTP terms) it is a 7bit channel (ie: restricted to ASCII).  The *decoded* version of the message can have non-ASCII in it, but as_string is producing an *encoded* version.  The CTE applies only to an encoded version.  

It gets a little confusing because in Python we are used to 'decoding' to unicode and 'encoding' to bytes, but in the email package we also sometimes 'encode' to ASCII but use unicode strings to store it.  If we didn't have to maintain backward compatibility it would probably be better to just drop as_string :)
msg201792 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-10-31 04:29
Okay, so for this case, what are the correct outputs for the cte and the message?

        from email.charset import Charset
        from email.message import Message

        cs = Charset('utf-8')
        cs.body_encoding = None # disable base64
        msg = Message()
        msg.set_payload('АБВ', cs)

        msg.as_string() ==>
           cte -> 7bit
           message -> АБВ or \\u0410\\u0411\\u0412 or \xd0\x90\xd0\x91\xd0\x92?

        msg.as_bytes() ==>
           cte -> 8bit
           message -> \\u0410\\u0411\\u0412 or \xd0\x90\xd0\x91\xd0\x92?
msg201793 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-31 05:38
cte base64 I think (see below).

Basically, set_payload should be putting the surrogateescape encoded utf-8 into the _payload (which it should now be doing), and probably calling set_charset.  The cte will at that point be 8bit, but when as_string calls Generator, it will get converted to 7bit clean by doing (I think) a base64 encode and emitting that as the CTE.  I have to look through the code to remind myself how it all works, which I haven't had time for yet, which is why I haven't tried to make a fix myself yet.

(Yes, this is poor design, but we are dealing with a long line of legacy code and API here...)
msg201794 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-10-31 06:13
So msg.as_string() =>
   cte -> base64
   message -> 0JDQkdCS # base64 encoded string

What about msg.as_bytes()? Should it be:
   cte -> 8bit
   message -> \\u0410\\u0411\\u0412 (raw-unicode-escape) or \xd0\x90\xd0\x91\xd0\x92 (utf-8)?

or message -> 0JDQkdCS (base64) as well?
msg201831 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-31 17:28
as_bytes should be producing the raw utf8 bytes with cte 8bit.
msg201883 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-11-01 09:06
Attached the *preliminary* patch to address R. David Murray's request.

It does not address the case where we send raw utf-8 bytes to payload. Maybe we should handle that in different ticket.

msg.set_payload(b'\xd0\x90\xd0\x91\xd0\x92') ==> chucks
msg203521 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-20 19:20
Vajrasky: thanks for taking a crack at this, but, well, there are a lot of subtleties involved here, due to the way the organic growth of the email package over many years has led to some really bad design issues.

It took me a lot of time to boot back up my understanding of how all this stuff hangs together (answer: badly).  After wandering down many blind alleys, the problem turns out to be yet one more disconnect in the model.  We previously fixed the issue where if set_payload was passed binary data bad things would happen.  That made the model more consistent, in that _payload was now a surrogateescaped string when the payload was specified as binary data.

But what the model *really* needs is that _payload *always* be an ascii+surrogateescape string, and never a full unicode string.  (Yeah, this is a sucky model...it ought to always be binary instead, but we are dealing with legacy code here.)

Currently it can be a unicode string.  If it is, set_charset turns it into an ascii only string by encoding it with the qp or base64 CTE.  This is pretty much just by luck, though.

If you set body_encode to None what happens is that the encode_7or8bit encoder thinks the string is 7bit because it does get_payload(decode=True) which, because the model invariant was broken, turns into a raw-unicode-escape string, which is a 7bit representation.  That doesn't affect the payload, but it does result in wrong CTE being used.

The fix is to fix the model invariant by turning a unicode string passed in to set_payload into an ascii+surrogateescape string with the escaped bytes being the unicode encoded to the output charset.

Unfortunately it is also possible to call set_payload without a charset, and *then* call set_charset.  To keep from breaking the code of anyone currently doing that, I had to allow a full unicode _payload, and detect it in set_charset.

My plan is to fix that in 3.4, causing a backward compatibility break because it will no longer be possible to call set_payload with a unicode string containing non-ascii if you don't also provide a character set.  I believe this is an acceptable break, since otherwise you *must* leave the model in an ambiguous state, and you have the possibility "leaking" unicode characters out into your wire-format message, which would ultimately result in either an exception at serialization time or, worse, mojibake.

Patch attached.
msg203648 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-11-21 15:58
No review link?
msg203650 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-21 16:06
Ah, I posted a git-diff 3.3 patch.  Let me repost it.
msg204033 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-11-23 14:36
R. David Murray, your patch fails with this situation:

from email.mime.nonmultipart import *
from email.charset import *
from email.message import Message
from io import BytesIO
from email.generator import BytesGenerator
msg = Message()
cs = Charset('utf-8')
cs.body_encoding = None
msg.set_payload('АБВ', cs)
msg.as_string()
fp = BytesIO()
g = BytesGenerator(fp)
g.flatten(msg)
print(fp.getvalue())

===> b'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: base64\n\n0JDQkdCS\n'

Apparently, there is a funky bug. If you never call msg.as_string(), the fp.get_value() will output correctly:

b'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 8bit\n\n\xd0\x90\xd0\x91\xd0\x92'

It turns out that msg.as_string() calls set_payload with different kind of charset!

Attached the patch to solve this problem. Not sure whether this is important or not to fix in 3.3. How many people call bytes generator after calling string generator?
msg204037 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-11-23 14:51
Simpler patch.
msg204045 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-23 16:16
Yes, I discovered this in testing, but I forgot to file a bug report for it.  It should be dealt with in a separate issue.  And yes, it should be fixed, since except for the documented on-demand filling out of missing pieces such as MIME borders, the model should not be changed by serializing it.

There are also some other issues that I discovered while running the more complete 3.4 tests, but I haven't had time to fix those yet.
msg204355 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-25 16:05
Updated patch for 3.3, and a new patch for 3.4.  In 3.4, set_payload raises an error if non-ascii-surrogateescape text is passed in as the argument (ie: there are non-ascii unicode characters in the string) and no charset is specified with which to encode them.  In an ideal world it would instead default to utf-8, but because there may exist code that passes in unicode and then encodes it later by calling set_charset, we can't go that route in 3.4.  In 3.5, after the few people who may be in that boat have fixed their code to include the charset on set_payload, we can in fact make it default to utf-8.

I had to fix a couple 3.4 test, one of which was no longer valid because I've now defined the previously undefined behavior of set_payload when passed unicode, and one of which was just wrong but I didn't notice because of this bug.
msg204757 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-29 21:39
3.4 patch updated to address Vajrasky's review comment.

I'll probably apply this tomorrow.
msg205929 - (view) Author: Roundup Robot (python-dev) Date: 2013-12-11 21:52
New changeset d842bc07d30b by R David Murray in branch '3.3':
#19063: partially fix set_payload handling of non-ASCII string input.
http://hg.python.org/cpython/rev/d842bc07d30b

New changeset 02cb48459b58 by R David Murray in branch 'default':
Null merge for #19063 (3.4 fix is different).
http://hg.python.org/cpython/rev/02cb48459b58

New changeset e20f98a8ed71 by R David Murray in branch 'default':
#19063: fix set_payload handling of non-ASCII string input.
http://hg.python.org/cpython/rev/e20f98a8ed71
msg205930 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-11 21:54
Well, it's a fair while after "tomorrow", but now it is committed.
msg210505 - (view) Author: Roundup Robot (python-dev) Date: 2014-02-07 18:06
New changeset 4daf3cec9419 by R David Murray in branch '3.3':
#19063: the unicode-in-set_payload problem isn't getting fixed in 3.4.
http://hg.python.org/cpython/rev/4daf3cec9419

New changeset f942f1eddfea by R David Murray in branch 'default':
#20531: Revert e20f98a8ed71, the 3.4 version of the #19063 fix.
http://hg.python.org/cpython/rev/f942f1eddfea

New changeset ef8aaace85ca by R David Murray in branch 'default':
#20531: Apply the 3.3 version of the #19063 fix.
http://hg.python.org/cpython/rev/ef8aaace85ca

New changeset aab7258a31d3 by R David Murray in branch 'default':
Merge: #19063: the unicode-in-set_payload problem isn't getting fixed in 3.4.
http://hg.python.org/cpython/rev/aab7258a31d3
History
Date User Action Args
2014-02-07 18:06:35python-devsetmessages: + msg210505
2013-12-11 21:54:34r.david.murraysetstatus: open -> closed
resolution: fixed
messages: + msg205930

stage: patch review -> committed/rejected
2013-12-11 21:52:36python-devsetnosy: + python-dev
messages: + msg205929
2013-11-29 21:39:48r.david.murraysetfiles: + support_8bit_charset_cte.patch

messages: + msg204757
2013-11-25 16:05:53r.david.murraysetfiles: + support_8bit_charset_cte.patch
2013-11-25 16:05:11r.david.murraysetfiles: + support_8bit_charset_cte.patch

messages: + msg204355
2013-11-23 16:16:11r.david.murraysetmessages: + msg204045
2013-11-23 14:51:23vajraskysetfiles: + support_8bit_charset_cte_v3.patch

messages: + msg204037
2013-11-23 14:36:02vajraskysetfiles: + support_8bit_charset_cte_v2.patch

messages: + msg204033
2013-11-21 16:06:56r.david.murraysetfiles: - support_8bit_charset_cte.patch
2013-11-21 16:06:32r.david.murraysetfiles: + support_8bit_charset_cte.patch

messages: + msg203650
2013-11-21 15:58:02vajraskysetmessages: + msg203648
2013-11-20 19:28:03r.david.murraysetfiles: + support_8bit_charset_cte.patch
2013-11-20 19:27:42r.david.murraysetfiles: - support_8bit_charset_cte.patch
2013-11-20 19:20:52r.david.murraysetfiles: + support_8bit_charset_cte.patch
versions: - Python 3.2
messages: + msg203521

type: behavior
stage: patch review
2013-11-01 09:06:46vajraskysetfiles: + fix_8bit_data_charset_none_set_payload_v2.patch

messages: + msg201883
2013-10-31 17:28:37r.david.murraysetmessages: + msg201831
2013-10-31 14:54:36Eric.Hanchrowsetnosy: - Eric.Hanchrow
2013-10-31 06:13:31vajraskysetmessages: + msg201794
2013-10-31 05:38:53r.david.murraysetmessages: + msg201793
2013-10-31 04:29:11vajraskysetmessages: + msg201792
2013-10-30 23:37:55Arfreversetmessages: - msg201782
2013-10-30 23:37:51Arfreversetmessages: - msg201781
2013-10-30 23:00:55Eric.Hanchrowsetmessages: + msg201782
2013-10-30 22:59:37Eric.Hanchrowsetnosy: + Eric.Hanchrow
messages: + msg201781
2013-10-30 19:16:43r.david.murraysetmessages: + msg201770
2013-10-30 15:01:15vajraskysetfiles: + fix_8bit_data_charset_none_set_payload.patch

nosy: + vajrasky
messages: + msg201741

keywords: + patch
2013-10-10 15:25:57r.david.murraysetmessages: + msg199393
2013-09-21 22:29:39Arfreversetnosy: + Arfrever
2013-09-21 22:05:22apollo13setmessages: + msg198226
2013-09-21 19:39:31r.david.murraysetmessages: + msg198222
2013-09-21 19:36:35apollo13setmessages: + msg198221
2013-09-21 19:31:37r.david.murraysetversions: + Python 3.2, Python 3.4
nosy: + barry, r.david.murray

messages: + msg198220

components: + email
2013-09-21 19:22:28apollo13create