Issue 8054: "as_string" method in email's mime objects encode text segmentedly

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/52302

classification

Title:	"as_string" method in email's mime objects encode text segmentedly
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 2.6

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	dongying, r.david.murray
Priority:	normal	Keywords:

Created on 2010-03-04 08:24 by dongying, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
test.py	dongying, 2010-03-04 08:24	An example
utf8_test.py	dongying, 2010-03-05 06:34	Another example

Messages (4)
msg100380	Author: Dongying Zhang (dongying)	Date: 2010-03-04 08:24
The as_string method of the MIME classes in the email.mime module encodes the text with base64, but it does so in segments when the text contains non-ASCII characters and is of type unicode. This behavior confuses some email servers. For example:

===================================================================
#-- coding: utf-8 --
import base64
from email.mime.text import MIMEText

content = u'''Hello:
    这是一封测试邮件.
   Please remove this message after reading,
   and I hope this won't bother you for a long time
'''

m = MIMEText(content, 'plain', 'utf-8')
print m.as_string()

m = MIMEText(content.encode('utf-8'), 'plain', 'utf-8')
print m.as_string()

print base64.encodestring(content.encode('utf-8'))
===================================================================

The first as_string call gives:

-------------------------------------------------------------------
SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhpcyBtZXNzYWdlIGFmdGVyIA==
cmVhZGluZywKICAgYW5kIEkgaG9wZSB0aGlzIHdvbid0IGJvdGhlciB5b3UgZm9yIGEgbG9uZyB0
aW1lCg==
-------------------------------------------------------------------

Please notice that there is a '==' at the end of the first line. The output of both the second as_string call and base64.encodestring may be more appropriate, which is:

-------------------------------------------------------------------
SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhp
cyBtZXNzYWdlIGFmdGVyIHJlYWRpbmcsCiAgIGFuZCBJIGhvcGUgdGhpcyB3b24ndCBib3RoZXIg
eW91IGZvciBhIGxvbmcgdGltZQo=
-------------------------------------------------------------------
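The difference between the two outputs quoted in this message can be checked by decoding them. The sketch below is written for Python 3 (the report itself uses Python 2.6) and only uses the base64 strings from the report; the variable names are illustrative. It decodes the first output one line at a time, decodes the second output as a single stream, and lists which lines of the first output end in padding.

```python
import base64

# Output of the first as_string() call, copied from the report.
segmented = [
    "SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhpcyBtZXNzYWdlIGFmdGVyIA==",
    "cmVhZGluZywKICAgYW5kIEkgaG9wZSB0aGlzIHdvbid0IGJvdGhlciB5b3UgZm9yIGEgbG9uZyB0",
    "aW1lCg==",
]

# Output of the second as_string() call and of base64.encodestring.
continuous = [
    "SGVsbG86CiAgICDov5nmmK/kuIDlsIHmtYvor5Xpgq7ku7YuCiAgIFBsZWFzZSByZW1vdmUgdGhp",
    "cyBtZXNzYWdlIGFmdGVyIHJlYWRpbmcsCiAgIGFuZCBJIGhvcGUgdGhpcyB3b24ndCBib3RoZXIg",
    "eW91IGZvciBhIGxvbmcgdGltZQo=",
]

# Each line of the segmented output is a complete base64 unit of its own,
# so it is decoded line by line here.
per_line = b"".join(base64.b64decode(line) for line in segmented)

# The continuous output is one base64 stream that was only wrapped at 76 columns.
whole = base64.b64decode("".join(continuous))

# Indexes of the lines in the segmented output that end in '=' padding.
padded_lines = [i for i, line in enumerate(segmented) if line.endswith("=")]

print(per_line == whole)
print(padded_lines)
print(whole.decode("utf-8"))
```

Both forms carry the same bytes. The segmented form differs only in that padding appears before the end of the body, which is the part the report says confuses some servers.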
msg100386	Author: R. David Murray (r.david.murray) *	Date: 2010-03-04 12:53
Using Python 2.6.4, your first example gives me an error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-18: ordinal not in range(128)

while your second example works, as you indicated. So, at the moment I cannot reproduce the bug. Are you using something other than the Python from python.org?
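The positions 11-18 in this error correspond to the eight Chinese characters in the example text. A minimal Python 3 sketch of that, assuming the whitespace of the example string is as recovered by decoding the base64 output quoted in the report:

```python
# Example text from the report. The indentation is an assumption recovered
# from the base64 output quoted there.
content = (
    "Hello:\n"
    "    这是一封测试邮件.\n"
    "   Please remove this message after reading,\n"
    "   and I hope this won't bother you for a long time\n"
)

error = None
try:
    # Explicit version of the implicit ASCII coercion that Python 2 performs
    # when a unicode object is passed where a byte string is expected.
    content.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
    # exc.end is exclusive, so the message reports exc.end - 1.
    print(exc.start, exc.end - 1, repr(content[exc.start:exc.end]))
```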
msg100454	Author: Dongying Zhang (dongying)	Date: 2010-03-05 06:34
Hello R. David Murray:

Thanks for your care. The example I gave in the message and the one in the file are the same. You got the 'UnicodeEncodeError' because your system default encoding is ascii. The encoding declaration at the top doesn't help in this situation. To solve this, you can add the following lines to the import part of the code:

=====================================================================
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
=====================================================================

Then it should work when executed directly or in a terminal (but not in IDLE). You can try the new file I submitted.

Thanks!
msg100479	Author: R. David Murray (r.david.murray) *	Date: 2010-03-05 14:14
We don't fully support setting defaultencoding to anything other than ASCII. The test suite doesn't fully pass, for example, if defaultencoding is set to 'utf-8' in site.py.

But that aside, the documentation for MIMEText says: "No guessing or encoding is performed on the text data.". In your first example you are passing it unicode, which is un-encoded. It might be helpful if it threw a ValueError when passed unicode, but it isn't technically a bug that it doesn't, since it does throw an error if you haven't changed defaultencoding. The behavior also can't be changed, since existing code may be depending on being able to pass ascii-only unicode strings in and having them auto-coerced to ascii.

Note that the cause of the problem is the fact that the email transport encoder is assuming that the input is binary data and is breaking it up into appropriately sized lines by counting bytes. You've fed it a unicode string, which it then winds up breaking up by unicode character count, then passing the lines to binascii.b2a_base64, which given the non-standard defaultencoding then coerces it to utf-8, which contains a number of bytes different from the original character count, which are then encoded in base64, giving you the uneven length lines in the final output.

In Python3 this isn't a problem, since you can't accidentally mix up unicode and bytes in Python3.
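The mechanism described in this message can be simulated in Python 3. This is a sketch under stated assumptions, not the email package's actual code: `encode_in_chunks` is a hypothetical helper, the chunk size of 57 is assumed from the 76-column base64 line length (76 * 3 // 4), the explicit `.encode("utf-8")` stands in for the implicit coercion under the changed default encoding, and the whitespace of the example string is recovered from the base64 output in the report.

```python
import binascii
from email.mime.text import MIMEText

# Example text from the report (indentation recovered from its base64 output).
content = (
    "Hello:\n"
    "    这是一封测试邮件.\n"
    "   Please remove this message after reading,\n"
    "   and I hope this won't bother you for a long time\n"
)

MAX_UNENCODED = 76 * 3 // 4  # 57 input units per 76-column base64 line


def encode_in_chunks(seq, to_bytes):
    """Hypothetical helper: slice seq every 57 units, base64 each slice."""
    lines = []
    for i in range(0, len(seq), MAX_UNENCODED):
        chunk = to_bytes(seq[i:i + MAX_UNENCODED])
        # b2a_base64 pads each chunk separately and appends a newline.
        lines.append(binascii.b2a_base64(chunk).decode("ascii"))
    return "".join(lines)


# Slicing by character count: the first 57 characters become 73 UTF-8 bytes,
# so the first line is longer than 76 columns and ends in '==' padding.
by_characters = encode_in_chunks(content, lambda s: s.encode("utf-8"))

# Slicing by byte count: every line except the last holds exactly 57 bytes.
by_bytes = encode_in_chunks(content.encode("utf-8"), bytes)

print(by_characters)
print(by_bytes)

# In Python 3, MIMEText encodes the str with the given charset itself,
# so the payload decodes back to the original bytes.
message = MIMEText(content, "plain", "utf-8")
print(message.get_payload(decode=True) == content.encode("utf-8"))
```

With this input, slicing by characters gives three lines with padding at the end of the first and third, and slicing by bytes gives three lines with padding only at the end of the last.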

History
Date	User	Action	Args
2022-04-11 14:56:58	admin	set	github: 52302
2010-03-05 14:14:05	r.david.murray	set	status: open -> closed resolution: wont fix messages: + msg100479 stage: test needed -> resolved
2010-03-05 06:34:14	dongying	set	files: + utf8_test.py messages: + msg100454
2010-03-04 12:53:57	r.david.murray	set	priority: normal nosy: + r.david.murray messages: + msg100386 components: + Library (Lib), - IO stage: test needed
2010-03-04 08:24:53	dongying	create