msg198217 - (view) |
Author: Florian Apolloner (apollo13) |
Date: 2013-09-21 19:22 |
Take the following example:
>>> from email.mime.nonmultipart import *
>>> from email.charset import *
>>> msg = MIMENonMultipart('text', 'plain')
>>> cs = Charset('utf-8')
>>> cs.body_encoding = None
>>> msg.set_payload('А Б В Г Д Е Ж Ѕ З И І К Л М Н О П.', cs)
>>> msg.as_string()
'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 7bit\n\nА Б В Г Д Е Ж Ѕ З И І К Л М Н О П.'
Till python 3.3.2 this created the email as 8bit, which is in line with what Thunderbird creates on my box. And if I understand the RFC correctly (https://tools.ietf.org/html/rfc2045#section-2.7) 7bit should only be used for ASCII. This issue was introduced by http://hg.python.org/cpython/rev/64e004737837 -- I changed orig = msg.get_payload(decode=True) back to it's original value (decode=False) in encode_7or8bit and get the "correct" behavior again. Running the tests didn't cause any errors (I hope I ran the correct one).
|
msg198220 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-09-21 19:31 |
There is definitely a bug here, but 8bit would also be wrong, since you are calling as_string. It *should* be producing a 7bit CTE with a base64 encoded part in that case.
|
msg198221 - (view) |
Author: Florian Apolloner (apollo13) |
Date: 2013-09-21 19:36 |
Am I not explicitelly disabling base64 by setting body_encoding to None?
|
msg198222 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-09-21 19:39 |
You are, but you are also calling as_string. Unicode can not handle 8bit data, therefore the email package must down-transform all data to 7bit when converting it to a string, just like a mail server trying to send to another mail server that can only handle 7bit would do.
If you used BytesGenerator to serialize the message, *then* you would expect to see 8bit utf-8 data in the body.
There are supposed to be tests for these cases, so I have to investigate why the tests are broken, as well :)
|
msg198226 - (view) |
Author: Florian Apolloner (apollo13) |
Date: 2013-09-21 22:05 |
Using BytesGenerator I get:
>>> fp = BytesIO()
>>> g = BytesGenerator(fp)
>>> msg = MIMENonMultipart('text', 'plain')
>>> msg.set_payload('А Б В Г Д Е Ж Ѕ З И І К Л М Н О П.', cs)
>>> g.flatten(msg)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.3/email/generator.py", line 112, in flatten
self._write(msg)
File "/usr/lib/python3.3/email/generator.py", line 177, in _write
self._dispatch(msg)
File "/usr/lib/python3.3/email/generator.py", line 203, in _dispatch
meth(msg)
File "/usr/lib/python3.3/email/generator.py", line 421, in _handle_text
super(BytesGenerator,self)._handle_text(msg)
File "/usr/lib/python3.3/email/generator.py", line 233, in _handle_text
self._write_lines(payload)
File "/usr/lib/python3.3/email/generator.py", line 158, in _write_lines
self.write(laststripped)
File "/usr/lib/python3.3/email/generator.py", line 395, in write
self._fp.write(s.encode('ascii', 'surrogateescape'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u0410' in position 0: ordinal not in range(128)
But I might be using it wrong :/
|
msg199393 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-10-10 15:25 |
There is definitely a bug in set_payload here, and (obviously :) no test for that case (passing an 8bit charset to set_payload).
|
msg201741 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-10-30 15:01 |
Here is the preliminary patch to fix the problem. My patch produces 8bit for msg.as_string and msg.as_bytes for simplicity reason.
If msg.as_string should gives content-transfer-encoding 7bit with 8bit data but msg.as_bytes should gives content-transfer-encoding 8bit with 8bit data, I can modify the patch. But it doesn't feel right...
|
msg201770 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-10-30 19:16 |
msg.as_string should not be producing a CTE of 8bit. I haven't looked at your patch so I don't know what you mean by having as_string produce 8bit data, but it can't be right :)
To clarify: as_string must produce valid unicode data, and therefore *cannot* have any 8bit data in it. Unicode is not an 8bit channel (in SMTP terms) it is a 7bit channel (ie: restricted to ASCII). The *decoded* version of the message can have non-ASCII in it, but as_string is producing an *encoded* version. The CTE applies only to an encoded version.
It gets a little confusing because in Python we are used to 'decoding' to unicode and 'encoding' to bytes, but in the email package we also sometimes 'encode' to ASCII but use unicode strings to store it. If we didn't have to maintain backward compatibility it would probably be better to just drop as_string :)
|
msg201792 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-10-31 04:29 |
Okay, so for this case, what are the correct outputs for the cte and the message?
from email.charset import Charset
from email.message import Message
cs = Charset('utf-8')
cs.body_encoding = None # disable base64
msg = Message()
msg.set_payload('АБВ', cs)
msg.as_string() ==>
cte -> 7bit
message -> АБВ or \\u0410\\u0411\\u0412 or \xd0\x90\xd0\x91\xd0\x92?
msg.as_bytes() ==>
cte -> 8bit
message -> \\u0410\\u0411\\u0412 or \xd0\x90\xd0\x91\xd0\x92?
|
msg201793 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-10-31 05:38 |
cte base64 I think (see below).
Basically, set_payload should be putting the surrogateescape encoded utf-8 into the _payload (which it should now be doing), and probably calling set_charset. The cte will at that point be 8bit, but when as_string calls Generator, it will get converted to 7bit clean by doing (I think) a base64 encode and emitting that as the CTE. I have to look through the code to remind myself how it all works, which I haven't had time for yet, which is why I haven't tried to make a fix myself yet.
(Yes, this is poor design, but we are dealing with a long line of legacy code and API here...)
|
msg201794 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-10-31 06:13 |
So msg.as_string() =>
cte -> base64
message -> 0JDQkdCS # base64 encoded string
What about msg.as_bytes()? Should it be:
cte -> 8bit
message -> \\u0410\\u0411\\u0412 (raw-unicode-escape) or \xd0\x90\xd0\x91\xd0\x92 (utf-8)?
or message -> 0JDQkdCS (base64) as well?
|
msg201831 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-10-31 17:28 |
as_bytes should be producing the raw utf8 bytes with cte 8bit.
|
msg201883 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-11-01 09:06 |
Attached the *preliminary* patch to address R. David Murray's request.
It does not address the case where we send raw utf-8 bytes to payload. Maybe we should handle that in different ticket.
msg.set_payload(b'\xd0\x90\xd0\x91\xd0\x92') ==> chucks
|
msg203521 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-11-20 19:20 |
Vajrasky: thanks for taking a crack at this, but, well, there are a lot of subtleties involved here, due to the way the organic growth of the email package over many years has led to some really bad design issues.
It took me a lot of time to boot back up my understanding of how all this stuff hangs together (answer: badly). After wandering down many blind alleys, the problem turns out to be yet one more disconnect in the model. We previously fixed the issue where if set_payload was passed binary data bad things would happen. That made the model more consistent, in that _payload was now a surrogateescaped string when the payload was specified as binary data.
But what the model *really* needs is that _payload *always* be an ascii+surrogateescape string, and never a full unicode string. (Yeah, this is a sucky model...it ought to always be binary instead, but we are dealing with legacy code here.)
Currently it can be a unicode string. If it is, set_charset turns it into an ascii only string by encoding it with the qp or base64 CTE. This is pretty much just by luck, though.
If you set body_encode to None what happens is that the encode_7or8bit encoder thinks the string is 7bit because it does get_payload(decode=True) which, because the model invariant was broken, turns into a raw-unicode-escape string, which is a 7bit representation. That doesn't affect the payload, but it does result in wrong CTE being used.
The fix is to fix the model invariant by turning a unicode string passed in to set_payload into an ascii+surrogateescape string with the escaped bytes being the unicode encoded to the output charset.
Unfortunately it is also possible to call set_payload without a charset, and *then* call set_charset. To keep from breaking the code of anyone currently doing that, I had to allow a full unicode _payload, and detect it in set_charset.
My plan is to fix that in 3.4, causing a backward compatibility break because it will no longer be possible to call set_payload with a unicode string containing non-ascii if you don't also provide a character set. I believe this is an acceptable break, since otherwise you *must* leave the model in an ambiguous state, and you have the possibility "leaking" unicode characters out into your wire-format message, which would ultimately result in either an exception at serialization time or, worse, mojibake.
Patch attached.
|
msg203648 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-11-21 15:58 |
No review link?
|
msg203650 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-11-21 16:06 |
Ah, I posted a git-diff 3.3 patch. Let me repost it.
|
msg204033 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-11-23 14:36 |
R. David Murray, your patch fails with this situation:
from email.mime.nonmultipart import *
from email.charset import *
from email.message import Message
from io import BytesIO
from email.generator import BytesGenerator
msg = Message()
cs = Charset('utf-8')
cs.body_encoding = None
msg.set_payload('АБВ', cs)
msg.as_string()
fp = BytesIO()
g = BytesGenerator(fp)
g.flatten(msg)
print(fp.getvalue())
===> b'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: base64\n\n0JDQkdCS\n'
Apparently, there is a funky bug. If you never call msg.as_string(), the fp.get_value() will output correctly:
b'MIME-Version: 1.0\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 8bit\n\n\xd0\x90\xd0\x91\xd0\x92'
It turns out that msg.as_string() calls set_payload with different kind of charset!
Attached the patch to solve this problem. Not sure whether this is important or not to fix in 3.3. How many people call bytes generator after calling string generator?
|
msg204037 - (view) |
Author: Vajrasky Kok (vajrasky) * |
Date: 2013-11-23 14:51 |
Simpler patch.
|
msg204045 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-11-23 16:16 |
Yes, I discovered this in testing, but I forgot to file a bug report for it. It should be dealt with in a separate issue. And yes, it should be fixed, since except for the documented on-demand filling out of missing pieces such as MIME borders, the model should not be changed by serializing it.
There are also some other issues that I discovered while running the more complete 3.4 tests, but I haven't had time to fix those yet.
|
msg204355 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-11-25 16:05 |
Updated patch for 3.3, and a new patch for 3.4. In 3.4, set_payload raises an error if non-ascii-surrogateescape text is passed in as the argument (ie: there are non-ascii unicode characters in the string) and no charset is specified with which to encode them. In an ideal world it would instead default to utf-8, but because there may exist code that passes in unicode and then encodes it later by calling set_charset, we can't go that route in 3.4. In 3.5, after the few people who may be in that boat have fixed their code to include the charset on set_payload, we can in fact make it default to utf-8.
I had to fix a couple 3.4 test, one of which was no longer valid because I've now defined the previously undefined behavior of set_payload when passed unicode, and one of which was just wrong but I didn't notice because of this bug.
|
msg204757 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-11-29 21:39 |
3.4 patch updated to address Vajrasky's review comment.
I'll probably apply this tomorrow.
|
msg205929 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-12-11 21:52 |
New changeset d842bc07d30b by R David Murray in branch '3.3':
#19063: partially fix set_payload handling of non-ASCII string input.
http://hg.python.org/cpython/rev/d842bc07d30b
New changeset 02cb48459b58 by R David Murray in branch 'default':
Null merge for #19063 (3.4 fix is different).
http://hg.python.org/cpython/rev/02cb48459b58
New changeset e20f98a8ed71 by R David Murray in branch 'default':
#19063: fix set_payload handling of non-ASCII string input.
http://hg.python.org/cpython/rev/e20f98a8ed71
|
msg205930 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2013-12-11 21:54 |
Well, it's a fair while after "tomorrow", but now it is committed.
|
msg210505 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2014-02-07 18:06 |
New changeset 4daf3cec9419 by R David Murray in branch '3.3':
#19063: the unicode-in-set_payload problem isn't getting fixed in 3.4.
http://hg.python.org/cpython/rev/4daf3cec9419
New changeset f942f1eddfea by R David Murray in branch 'default':
#20531: Revert e20f98a8ed71, the 3.4 version of the #19063 fix.
http://hg.python.org/cpython/rev/f942f1eddfea
New changeset ef8aaace85ca by R David Murray in branch 'default':
#20531: Apply the 3.3 version of the #19063 fix.
http://hg.python.org/cpython/rev/ef8aaace85ca
New changeset aab7258a31d3 by R David Murray in branch 'default':
Merge: #19063: the unicode-in-set_payload problem isn't getting fixed in 3.4.
http://hg.python.org/cpython/rev/aab7258a31d3
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:51 | admin | set | github: 63263 |
2014-02-07 18:06:35 | python-dev | set | messages:
+ msg210505 |
2013-12-11 21:54:34 | r.david.murray | set | status: open -> closed resolution: fixed messages:
+ msg205930
stage: patch review -> resolved |
2013-12-11 21:52:36 | python-dev | set | nosy:
+ python-dev messages:
+ msg205929
|
2013-11-29 21:39:48 | r.david.murray | set | files:
+ support_8bit_charset_cte.patch
messages:
+ msg204757 |
2013-11-25 16:05:53 | r.david.murray | set | files:
+ support_8bit_charset_cte.patch |
2013-11-25 16:05:11 | r.david.murray | set | files:
+ support_8bit_charset_cte.patch
messages:
+ msg204355 |
2013-11-23 16:16:11 | r.david.murray | set | messages:
+ msg204045 |
2013-11-23 14:51:23 | vajrasky | set | files:
+ support_8bit_charset_cte_v3.patch
messages:
+ msg204037 |
2013-11-23 14:36:02 | vajrasky | set | files:
+ support_8bit_charset_cte_v2.patch
messages:
+ msg204033 |
2013-11-21 16:06:56 | r.david.murray | set | files:
- support_8bit_charset_cte.patch |
2013-11-21 16:06:32 | r.david.murray | set | files:
+ support_8bit_charset_cte.patch
messages:
+ msg203650 |
2013-11-21 15:58:02 | vajrasky | set | messages:
+ msg203648 |
2013-11-20 19:28:03 | r.david.murray | set | files:
+ support_8bit_charset_cte.patch |
2013-11-20 19:27:42 | r.david.murray | set | files:
- support_8bit_charset_cte.patch |
2013-11-20 19:20:52 | r.david.murray | set | files:
+ support_8bit_charset_cte.patch versions:
- Python 3.2 messages:
+ msg203521
type: behavior stage: patch review |
2013-11-01 09:06:46 | vajrasky | set | files:
+ fix_8bit_data_charset_none_set_payload_v2.patch
messages:
+ msg201883 |
2013-10-31 17:28:37 | r.david.murray | set | messages:
+ msg201831 |
2013-10-31 14:54:36 | Eric.Hanchrow | set | nosy:
- Eric.Hanchrow
|
2013-10-31 06:13:31 | vajrasky | set | messages:
+ msg201794 |
2013-10-31 05:38:53 | r.david.murray | set | messages:
+ msg201793 |
2013-10-31 04:29:11 | vajrasky | set | messages:
+ msg201792 |
2013-10-30 23:37:55 | Arfrever | set | messages:
- msg201782 |
2013-10-30 23:37:51 | Arfrever | set | messages:
- msg201781 |
2013-10-30 23:00:55 | Eric.Hanchrow | set | messages:
+ msg201782 |
2013-10-30 22:59:37 | Eric.Hanchrow | set | nosy:
+ Eric.Hanchrow messages:
+ msg201781
|
2013-10-30 19:16:43 | r.david.murray | set | messages:
+ msg201770 |
2013-10-30 15:01:15 | vajrasky | set | files:
+ fix_8bit_data_charset_none_set_payload.patch
nosy:
+ vajrasky messages:
+ msg201741
keywords:
+ patch |
2013-10-10 15:25:57 | r.david.murray | set | messages:
+ msg199393 |
2013-09-21 22:29:39 | Arfrever | set | nosy:
+ Arfrever
|
2013-09-21 22:05:22 | apollo13 | set | messages:
+ msg198226 |
2013-09-21 19:39:31 | r.david.murray | set | messages:
+ msg198222 |
2013-09-21 19:36:35 | apollo13 | set | messages:
+ msg198221 |
2013-09-21 19:31:37 | r.david.murray | set | versions:
+ Python 3.2, Python 3.4 nosy:
+ barry, r.david.murray
messages:
+ msg198220
components:
+ email |
2013-09-21 19:22:28 | apollo13 | create | |