classification
Title: "as_string" method in email's mime objects encode text segmentedly
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.6
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: dongying, r.david.murray
Priority: normal Keywords:

Created on 2010-03-04 08:24 by dongying, last changed 2010-03-05 14:14 by r.david.murray. This issue is now closed.

Files
File name Uploaded Description
test.py dongying, 2010-03-04 08:24 An example
utf8_test.py dongying, 2010-03-05 06:34 Another example
Messages (4)
msg100380 - Author: Dongying Zhang (dongying) Date: 2010-03-04 08:24
The as_string method of the MIME classes in the email.mime module uses base64 to encode the text, but it encodes the text in separate segments when the text contains non-ASCII characters and is of type unicode. This behavior confuses some email servers.
For example:
===================================================================
#-*- coding: utf-8 -*-
import base64
from email.mime.text import MIMEText
content = u'''Hello:
    这是一封测试邮件.
   Please remove this message after reading,
   and I hope this won't bother you for a long time
'''
m = MIMEText(content, 'plain', 'utf-8')
print m.as_string()
m = MIMEText(content.encode('utf-8'), 'plain', 'utf-8')
print m.as_string()
print base64.encodestring(content.encode('utf-8'))
===================================================================
The first as_string method gives:
-------------------------------------------------------------------
SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhpcyBtZXNzYWdlIGFmdGVyIA==
cmVhZGluZywKICAgYW5kIEkgaG9wZSB0aGlzIHdvbid0IGJvdGhlciB5b3UgZm9yIGEgbG9uZyB0
aW1lCg==
-------------------------------------------------------------------
Please notice the '==' padding at the end of the first line.
The output of the second as_string call, which matches the output of base64.encodestring, seems more appropriate:
-------------------------------------------------------------------
SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhp
cyBtZXNzYWdlIGFmdGVyIHJlYWRpbmcsCiAgIGFuZCBJIGhvcGUgdGhpcyB3b24ndCBib3RoZXIg
eW91IGZvciBhIGxvbmcgdGltZQo=
-------------------------------------------------------------------
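As a rough illustration (a Python 3 sketch, not part of the attached test files), the two outputs above can be compared by decoding each line on its own and checking where the '=' padding falls. The helper names `decode_lines` and `has_interior_padding` are made up for this example.

```python
import base64

# Output of the first as_string() call, copied from this message.
segmented = [
    "SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhpcyBtZXNzYWdlIGFmdGVyIA==",
    "cmVhZGluZywKICAgYW5kIEkgaG9wZSB0aGlzIHdvbid0IGJvdGhlciB5b3UgZm9yIGEgbG9uZyB0",
    "aW1lCg==",
]

# Output of the second as_string() call and of base64.encodestring.
contiguous = [
    "SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhp",
    "cyBtZXNzYWdlIGFmdGVyIHJlYWRpbmcsCiAgIGFuZCBJIGhvcGUgdGhpcyB3b24ndCBib3RoZXIg",
    "eW91IGZvciBhIGxvbmcgdGltZQo=",
]


def decode_lines(lines):
    """Decode each base64 line separately and join the resulting bytes."""
    return b"".join(base64.b64decode(line) for line in lines)


def has_interior_padding(lines):
    """Return True if '=' padding appears on any line other than the last."""
    return any("=" in line for line in lines[:-1])


print(has_interior_padding(segmented))    # True: padding before the final line
print(has_interior_padding(contiguous))   # False: padding only at the very end
print(decode_lines(segmented) == decode_lines(contiguous))  # True: same bytes
```

Decoded line by line, both bodies carry the same bytes. Only the segmented one has padding before the final line, which is what a decoder that treats '=' as the end of the data may trip over.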
msg100386 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-04 12:53
Using python 2.6.4, your first example gives me an error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-18: ordinal not in range(128)

while your second example works, as you indicated.

So, at the moment I cannot reproduce the bug.  Are you using something other than the python from python.org?
msg100454 - Author: Dongying Zhang (dongying) Date: 2010-03-05 06:34
Hello R. David Murray:

Thanks for looking into this.

The example I gave in the message and the one in the attached file are the same. You got the 'UnicodeEncodeError' because your system default encoding is ASCII. The encoding declaration at the top of the file doesn't help in this situation.

To work around this, you can add the following lines to the import section of the code.
=====================================================================
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
=====================================================================
Then it should work when executed directly or in a terminal (but not in IDLE).

You can try the new file I submitted. Thanks!
msg100479 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-03-05 14:14
We don't fully support setting defaultencoding to anything other than ASCII.  The test suite doesn't fully pass, for example, if defaultencoding is set to 'utf-8' in site.py.

But that aside, the documentation for MIMEText says: "No guessing or encoding is performed on the text data.".  In your first example you are passing it unicode, which is un-encoded.  It might be helpful if it threw a ValueError when passed unicode, but it isn't technically a bug that it doesn't, since it does throw an error if you haven't changed defaultencoding.  The behavior also can't be changed, since existing code may be depending on being able to pass ascii-only unicode strings in and having them auto-coerced to ascii.

Note that the cause of the problem is that the email transport encoder assumes its input is binary data and breaks it up into appropriately sized lines by counting bytes.  You've fed it a unicode string, which it winds up breaking up by *unicode* character count instead.  Each line is then passed to binascii.b2a_base64, which, given the non-standard defaultencoding, coerces it to utf-8.  The utf-8 form contains a number of bytes different from the original character count, and it is those bytes that are encoded in base64, giving you the uneven length lines in the final output.
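The ordering problem can be simulated in Python 3. This is only a sketch of the mechanism, not the email package's actual code: the function names are hypothetical, and the chunk size of 57 is an assumption (57 bytes is what fills one 76-character base64 line, which matches the line lengths in the reported output).

```python
import base64

CHUNK = 57  # assumed chunk size: 57 bytes -> one 76-character base64 line

text = (
    "Hello:\n"
    "    这是一封测试邮件.\n"
    "   Please remove this message after reading,\n"
    "   and I hope this won't bother you for a long time\n"
)


def split_then_encode(s):
    """Faulty order: split by character count, then UTF-8 encode each piece."""
    pieces = [s[i:i + CHUNK] for i in range(0, len(s), CHUNK)]
    return [base64.b64encode(p.encode("utf-8")).decode("ascii") for p in pieces]


def encode_then_split(s):
    """Intended order: UTF-8 encode first, then split by byte count."""
    data = s.encode("utf-8")
    pieces = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return [base64.b64encode(p).decode("ascii") for p in pieces]


segmented = split_then_encode(text)
contiguous = encode_then_split(text)

print("\n".join(segmented))
print()
print("\n".join(contiguous))
```

In the first variant the opening chunk holds 57 characters but 73 bytes, because each of the eight Chinese characters takes three bytes in utf-8. Since 73 is not a multiple of 3, that line picks up its own '==' padding and runs past 76 characters.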

In Python 3 this isn't a problem, since you can't accidentally mix up unicode and bytes in Python 3.
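A short Python 3 check of that point, as a sketch using the text from the original report: MIMEText is given a str together with a charset, and the payload decodes back to the utf-8 bytes, with padding only on the final line.

```python
from email.mime.text import MIMEText

content = (
    "Hello:\n"
    "    这是一封测试邮件.\n"
    "   Please remove this message after reading,\n"
    "   and I hope this won't bother you for a long time\n"
)

# In Python 3, str is always text; MIMEText encodes it with the given
# charset before the base64 transfer encoding is applied.
msg = MIMEText(content, "plain", "utf-8")
body_lines = msg.get_payload().splitlines()

print(msg["Content-Transfer-Encoding"])
print(any("=" in line for line in body_lines[:-1]))   # padding mid-body?
print(msg.get_payload(decode=True) == content.encode("utf-8"))

# Mixing bytes and text raises immediately instead of coercing silently.
try:
    content.encode("utf-8") + content
except TypeError as exc:
    print("TypeError:", exc)
```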
History
Date User Action Args
2010-03-05 14:14:05 r.david.murray set status: open -> closed, resolution: wont fix, messages: + msg100479, stage: test needed -> resolved
2010-03-05 06:34:14 dongying set files: + utf8_test.py, messages: + msg100454
2010-03-04 12:53:57 r.david.murray set priority: normal, nosy: + r.david.murray, messages: + msg100386, components: + Library (Lib), - IO, stage: test needed
2010-03-04 08:24:53 dongying create