Issue 19063: Python 3.3.3 encodes emails containing non-ascii data as 7bit

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63263

classification

Title:	Python 3.3.3 encodes emails containing non-ascii data as 7bit
Type:	behavior	Stage:	resolved
Components:	email, Library (Lib)	Versions:	Python 3.3, Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Arfrever, apollo13, barry, python-dev, r.david.murray, vajrasky
Priority:	normal	Keywords:	patch

Created on 2013-09-21 19:22 by apollo13, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
fix_8bit_data_charset_none_set_payload.patch	vajrasky, 2013-10-30 15:01		review
fix_8bit_data_charset_none_set_payload_v2.patch	vajrasky, 2013-11-01 09:06		review
support_8bit_charset_cte.patch	r.david.murray, 2013-11-21 16:06		review
support_8bit_charset_cte_v2.patch	vajrasky, 2013-11-23 14:36		review
support_8bit_charset_cte_v3.patch	vajrasky, 2013-11-23 14:51		review
support_8bit_charset_cte.patch	r.david.murray, 2013-11-25 16:05	3.3 patch	review
support_8bit_charset_cte.patch	r.david.murray, 2013-11-25 16:05	3.4 patch	review
support_8bit_charset_cte.patch	r.david.murray, 2013-11-29 21:39	updated 3.4 patch	review

Messages (24)
msg198217 - (view)	Author: Florian Apolloner (apollo13)	Date: 2013-09-21 19:22
Take the following example: >>> from email.mime.nonmultipart import * >>> from email.charset import * >>> msg = MIMENonMultipart('text', 'plain') >>> cs = Charset('utf-8') >>> cs.body_encoding = None >>> msg.set_payload('А Б В Г Д Е Ж Ѕ З И І К Л М Н О П.', cs) >>> msg.as_string() 'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 7bit\n\nА Б В Г Д Е Ж Ѕ З И І К Л М Н О П.' Till python 3.3.2 this created the email as 8bit, which is in line with what Thunderbird creates on my box. And if I understand the RFC correctly (https://tools.ietf.org/html/rfc2045#section-2.7) 7bit should only be used for ASCII. This issue was introduced by http://hg.python.org/cpython/rev/64e004737837 -- I changed orig = msg.get_payload(decode=True) back to it's original value (decode=False) in encode_7or8bit and get the "correct" behavior again. Running the tests didn't cause any errors (I hope I ran the correct one).
msg198220 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-09-21 19:31
There is definitely a bug here, but 8bit would also be wrong, since you are calling as_string. It should be producing a 7bit CTE with a base64 encoded part in that case.
msg198221 - (view)	Author: Florian Apolloner (apollo13)	Date: 2013-09-21 19:36
Am I not explicitelly disabling base64 by setting body_encoding to None?
msg198222 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-09-21 19:39
You are, but you are also calling as_string. Unicode can not handle 8bit data, therefore the email package must down-transform all data to 7bit when converting it to a string, just like a mail server trying to send to another mail server that can only handle 7bit would do. If you used BytesGenerator to serialize the message, then you would expect to see 8bit utf-8 data in the body. There are supposed to be tests for these cases, so I have to investigate why the tests are broken, as well :)
msg198226 - (view)	Author: Florian Apolloner (apollo13)	Date: 2013-09-21 22:05
Using BytesGenerator I get: >>> fp = BytesIO() >>> g = BytesGenerator(fp) >>> msg = MIMENonMultipart('text', 'plain') >>> msg.set_payload('А Б В Г Д Е Ж Ѕ З И І К Л М Н О П.', cs) >>> g.flatten(msg) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python3.3/email/generator.py", line 112, in flatten self._write(msg) File "/usr/lib/python3.3/email/generator.py", line 177, in _write self._dispatch(msg) File "/usr/lib/python3.3/email/generator.py", line 203, in _dispatch meth(msg) File "/usr/lib/python3.3/email/generator.py", line 421, in _handle_text super(BytesGenerator,self)._handle_text(msg) File "/usr/lib/python3.3/email/generator.py", line 233, in _handle_text self._write_lines(payload) File "/usr/lib/python3.3/email/generator.py", line 158, in _write_lines self.write(laststripped) File "/usr/lib/python3.3/email/generator.py", line 395, in write self._fp.write(s.encode('ascii', 'surrogateescape')) UnicodeEncodeError: 'ascii' codec can't encode character '\u0410' in position 0: ordinal not in range(128) But I might be using it wrong :/
msg199393 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-10-10 15:25
There is definitely a bug in set_payload here, and (obviously :) no test for that case (passing an 8bit charset to set_payload).
msg201741 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-10-30 15:01
Here is the preliminary patch to fix the problem. My patch produces 8bit for msg.as_string and msg.as_bytes for simplicity reason. If msg.as_string should gives content-transfer-encoding 7bit with 8bit data but msg.as_bytes should gives content-transfer-encoding 8bit with 8bit data, I can modify the patch. But it doesn't feel right...
msg201770 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-10-30 19:16
msg.as_string should not be producing a CTE of 8bit. I haven't looked at your patch so I don't know what you mean by having as_string produce 8bit data, but it can't be right :) To clarify: as_string must produce valid unicode data, and therefore cannot have any 8bit data in it. Unicode is not an 8bit channel (in SMTP terms) it is a 7bit channel (ie: restricted to ASCII). The decoded version of the message can have non-ASCII in it, but as_string is producing an encoded version. The CTE applies only to an encoded version. It gets a little confusing because in Python we are used to 'decoding' to unicode and 'encoding' to bytes, but in the email package we also sometimes 'encode' to ASCII but use unicode strings to store it. If we didn't have to maintain backward compatibility it would probably be better to just drop as_string :)
msg201792 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-10-31 04:29
Okay, so for this case, what are the correct outputs for the cte and the message? from email.charset import Charset from email.message import Message cs = Charset('utf-8') cs.body_encoding = None # disable base64 msg = Message() msg.set_payload('АБВ', cs) msg.as_string() ==> cte -> 7bit message -> АБВ or \\u0410\\u0411\\u0412 or \xd0\x90\xd0\x91\xd0\x92? msg.as_bytes() ==> cte -> 8bit message -> \\u0410\\u0411\\u0412 or \xd0\x90\xd0\x91\xd0\x92?
msg201793 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-10-31 05:38
cte base64 I think (see below). Basically, set_payload should be putting the surrogateescape encoded utf-8 into the _payload (which it should now be doing), and probably calling set_charset. The cte will at that point be 8bit, but when as_string calls Generator, it will get converted to 7bit clean by doing (I think) a base64 encode and emitting that as the CTE. I have to look through the code to remind myself how it all works, which I haven't had time for yet, which is why I haven't tried to make a fix myself yet. (Yes, this is poor design, but we are dealing with a long line of legacy code and API here...)
msg201794 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-10-31 06:13
So msg.as_string() => cte -> base64 message -> 0JDQkdCS # base64 encoded string What about msg.as_bytes()? Should it be: cte -> 8bit message -> \\u0410\\u0411\\u0412 (raw-unicode-escape) or \xd0\x90\xd0\x91\xd0\x92 (utf-8)? or message -> 0JDQkdCS (base64) as well?
msg201831 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-10-31 17:28
as_bytes should be producing the raw utf8 bytes with cte 8bit.
msg201883 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-11-01 09:06
Attached the preliminary patch to address R. David Murray's request. It does not address the case where we send raw utf-8 bytes to payload. Maybe we should handle that in different ticket. msg.set_payload(b'\xd0\x90\xd0\x91\xd0\x92') ==> chucks
msg203521 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-11-20 19:20
Vajrasky: thanks for taking a crack at this, but, well, there are a lot of subtleties involved here, due to the way the organic growth of the email package over many years has led to some really bad design issues. It took me a lot of time to boot back up my understanding of how all this stuff hangs together (answer: badly). After wandering down many blind alleys, the problem turns out to be yet one more disconnect in the model. We previously fixed the issue where if set_payload was passed binary data bad things would happen. That made the model more consistent, in that _payload was now a surrogateescaped string when the payload was specified as binary data. But what the model really needs is that _payload always be an ascii+surrogateescape string, and never a full unicode string. (Yeah, this is a sucky model...it ought to always be binary instead, but we are dealing with legacy code here.) Currently it can be a unicode string. If it is, set_charset turns it into an ascii only string by encoding it with the qp or base64 CTE. This is pretty much just by luck, though. If you set body_encode to None what happens is that the encode_7or8bit encoder thinks the string is 7bit because it does get_payload(decode=True) which, because the model invariant was broken, turns into a raw-unicode-escape string, which is a 7bit representation. That doesn't affect the payload, but it does result in wrong CTE being used. The fix is to fix the model invariant by turning a unicode string passed in to set_payload into an ascii+surrogateescape string with the escaped bytes being the unicode encoded to the output charset. Unfortunately it is also possible to call set_payload without a charset, and then call set_charset. To keep from breaking the code of anyone currently doing that, I had to allow a full unicode _payload, and detect it in set_charset. My plan is to fix that in 3.4, causing a backward compatibility break because it will no longer be possible to call set_payload with a unicode string containing non-ascii if you don't also provide a character set. I believe this is an acceptable break, since otherwise you must leave the model in an ambiguous state, and you have the possibility "leaking" unicode characters out into your wire-format message, which would ultimately result in either an exception at serialization time or, worse, mojibake. Patch attached.
msg203648 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-11-21 15:58
No review link?
msg203650 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-11-21 16:06
Ah, I posted a git-diff 3.3 patch. Let me repost it.
msg204033 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-11-23 14:36
R. David Murray, your patch fails with this situation: from email.mime.nonmultipart import * from email.charset import * from email.message import Message from io import BytesIO from email.generator import BytesGenerator msg = Message() cs = Charset('utf-8') cs.body_encoding = None msg.set_payload('АБВ', cs) msg.as_string() fp = BytesIO() g = BytesGenerator(fp) g.flatten(msg) print(fp.getvalue()) ===> b'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: base64\n\n0JDQkdCS\n' Apparently, there is a funky bug. If you never call msg.as_string(), the fp.get_value() will output correctly: b'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 8bit\n\n\xd0\x90\xd0\x91\xd0\x92' It turns out that msg.as_string() calls set_payload with different kind of charset! Attached the patch to solve this problem. Not sure whether this is important or not to fix in 3.3. How many people call bytes generator after calling string generator?
msg204037 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-11-23 14:51
Simpler patch.
msg204045 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-11-23 16:16
Yes, I discovered this in testing, but I forgot to file a bug report for it. It should be dealt with in a separate issue. And yes, it should be fixed, since except for the documented on-demand filling out of missing pieces such as MIME borders, the model should not be changed by serializing it. There are also some other issues that I discovered while running the more complete 3.4 tests, but I haven't had time to fix those yet.
msg204355 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-11-25 16:05
Updated patch for 3.3, and a new patch for 3.4. In 3.4, set_payload raises an error if non-ascii-surrogateescape text is passed in as the argument (ie: there are non-ascii unicode characters in the string) and no charset is specified with which to encode them. In an ideal world it would instead default to utf-8, but because there may exist code that passes in unicode and then encodes it later by calling set_charset, we can't go that route in 3.4. In 3.5, after the few people who may be in that boat have fixed their code to include the charset on set_payload, we can in fact make it default to utf-8. I had to fix a couple 3.4 test, one of which was no longer valid because I've now defined the previously undefined behavior of set_payload when passed unicode, and one of which was just wrong but I didn't notice because of this bug.
msg204757 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-11-29 21:39
3.4 patch updated to address Vajrasky's review comment. I'll probably apply this tomorrow.
msg205929 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-12-11 21:52
New changeset d842bc07d30b by R David Murray in branch '3.3': #19063: partially fix set_payload handling of non-ASCII string input. http://hg.python.org/cpython/rev/d842bc07d30b New changeset 02cb48459b58 by R David Murray in branch 'default': Null merge for #19063 (3.4 fix is different). http://hg.python.org/cpython/rev/02cb48459b58 New changeset e20f98a8ed71 by R David Murray in branch 'default': #19063: fix set_payload handling of non-ASCII string input. http://hg.python.org/cpython/rev/e20f98a8ed71
msg205930 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-12-11 21:54
Well, it's a fair while after "tomorrow", but now it is committed.
msg210505 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-02-07 18:06
New changeset 4daf3cec9419 by R David Murray in branch '3.3': #19063: the unicode-in-set_payload problem isn't getting fixed in 3.4. http://hg.python.org/cpython/rev/4daf3cec9419 New changeset f942f1eddfea by R David Murray in branch 'default': #20531: Revert e20f98a8ed71, the 3.4 version of the #19063 fix. http://hg.python.org/cpython/rev/f942f1eddfea New changeset ef8aaace85ca by R David Murray in branch 'default': #20531: Apply the 3.3 version of the #19063 fix. http://hg.python.org/cpython/rev/ef8aaace85ca New changeset aab7258a31d3 by R David Murray in branch 'default': Merge: #19063: the unicode-in-set_payload problem isn't getting fixed in 3.4. http://hg.python.org/cpython/rev/aab7258a31d3

History
Date	User	Action	Args
2022-04-11 14:57:51	admin	set	github: 63263
2014-02-07 18:06:35	python-dev	set	messages: + msg210505
2013-12-11 21:54:34	r.david.murray	set	status: open -> closed resolution: fixed messages: + msg205930 stage: patch review -> resolved
2013-12-11 21:52:36	python-dev	set	nosy: + python-dev messages: + msg205929
2013-11-29 21:39:48	r.david.murray	set	files: + support_8bit_charset_cte.patch messages: + msg204757
2013-11-25 16:05:53	r.david.murray	set	files: + support_8bit_charset_cte.patch
2013-11-25 16:05:11	r.david.murray	set	files: + support_8bit_charset_cte.patch messages: + msg204355
2013-11-23 16:16:11	r.david.murray	set	messages: + msg204045
2013-11-23 14:51:23	vajrasky	set	files: + support_8bit_charset_cte_v3.patch messages: + msg204037
2013-11-23 14:36:02	vajrasky	set	files: + support_8bit_charset_cte_v2.patch messages: + msg204033
2013-11-21 16:06:56	r.david.murray	set	files: - support_8bit_charset_cte.patch
2013-11-21 16:06:32	r.david.murray	set	files: + support_8bit_charset_cte.patch messages: + msg203650
2013-11-21 15:58:02	vajrasky	set	messages: + msg203648
2013-11-20 19:28:03	r.david.murray	set	files: + support_8bit_charset_cte.patch
2013-11-20 19:27:42	r.david.murray	set	files: - support_8bit_charset_cte.patch
2013-11-20 19:20:52	r.david.murray	set	files: + support_8bit_charset_cte.patch versions: - Python 3.2 messages: + msg203521 type: behavior stage: patch review
2013-11-01 09:06:46	vajrasky	set	files: + fix_8bit_data_charset_none_set_payload_v2.patch messages: + msg201883
2013-10-31 17:28:37	r.david.murray	set	messages: + msg201831
2013-10-31 14:54:36	Eric.Hanchrow	set	nosy: - Eric.Hanchrow
2013-10-31 06:13:31	vajrasky	set	messages: + msg201794
2013-10-31 05:38:53	r.david.murray	set	messages: + msg201793
2013-10-31 04:29:11	vajrasky	set	messages: + msg201792
2013-10-30 23:37:55	Arfrever	set	messages: - msg201782
2013-10-30 23:37:51	Arfrever	set	messages: - msg201781
2013-10-30 23:00:55	Eric.Hanchrow	set	messages: + msg201782
2013-10-30 22:59:37	Eric.Hanchrow	set	nosy: + Eric.Hanchrow messages: + msg201781
2013-10-30 19:16:43	r.david.murray	set	messages: + msg201770
2013-10-30 15:01:15	vajrasky	set	files: + fix_8bit_data_charset_none_set_payload.patch nosy: + vajrasky messages: + msg201741 keywords: + patch
2013-10-10 15:25:57	r.david.murray	set	messages: + msg199393
2013-09-21 22:29:39	Arfrever	set	nosy: + Arfrever
2013-09-21 22:05:22	apollo13	set	messages: + msg198226
2013-09-21 19:39:31	r.david.murray	set	messages: + msg198222
2013-09-21 19:36:35	apollo13	set	messages: + msg198221
2013-09-21 19:31:37	r.david.murray	set	versions: + Python 3.2, Python 3.4 nosy: + barry, r.david.murray messages: + msg198220 components: + email
2013-09-21 19:22:28	apollo13	create