Issue 1974: email.MIMEText.MIMEText.as_string incorrectly folding long subject header

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46266

classification

Title:	email.MIMEText.MIMEText.as_string incorrectly folding long subject header
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 3.1, Python 2.7

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	barry	Nosy List:	aalbrecht, barry, cjw296, djc, gagern, nicdumz, salty-horse, splorgdar
Priority:	normal	Keywords:

Created on 2008-01-30 14:48 by cjw296, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
test.py	splorgdar, 2008-05-23 21:49	reproduction case
py_issue1974.diff	aalbrecht, 2008-06-22 12:06	Init Header class in test case without continuation_ws keyword

Messages (16)
msg61866 - (view)	Author: Chris Withers (cjw296) *	Date: 2008-01-30 14:48
Somewhere in email.MIMEText.MIMEText.as_string (I'm not sure where) long subject headers are folded using a newline followed by a tab: >>> from email.MIMEText import MIMEText >>> m = MIMEText('foo') >>> m['Subject']='AA '*40 >>> m.as_string() 'Content-Type: text/plain; charset="us-ascii"\nMIME-Version: 1.0\nContent-Transfer-Encoding: 7bit\nSubject: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA\n\tAA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA \n\nfoo' While RFC 822 section 3.1.1 doesn't forbid this type of folding, the current behaviour mis-renders in both Thunderbird and Outlook. Messages generated by Thunderbird and Outlook represent the above subject as follows: 'Content-Type: text/plain; charset="us-ascii"\nMIME-Version: 1.0\nContent-Transfer-Encoding: 7bit\nSubject: AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA\n AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA \n\nfoo' Could the email library be adjusted to do the same? Years of wondering why mails re-sent by mailman, mails from any number of python web systems and recent complaints from customers of mine of mis-rendered emails could be solved by doing so.
msg67270 - (view)	Author: Steve Baker (splorgdar)	Date: 2008-05-23 21:33
I'm also hitting this problem. I'm including a demo script. The minimal condition for me to reproduce this bug is that the subject contain at least 76 characters of any kind, followed by a space, followed by at least 3 characters of any kind.
msg68562 - (view)	Author: Andi Albrecht (aalbrecht) *	Date: 2008-06-22 12:06
For me this issue seems to be a duplicate of issue1645148. In my opinion the test case that checks if headers created by strings or Header instances are equal is incorrect. It shouldn't set the continuation whitespace explicitly when creating a Headers instance. I would expect that A) msg['aheader'] = 'a long string that breaks, but shortened here' and B) msg['aheader'] = Header('a long string that breaks, but shortened here') result in the same output. But for A) a Header instance is initialized in Generator._write_headers() with continuation_ws set to `\t` and for B) the default ' ' is used (see http://bugs.python.org/msg31102). I'm uploading a patch that modifies test_string_headerinst_eq() to what I think it should look like with this message. Of course, this test will fail at the moment, but maybe you agree that the Header instance in this test should be initialized without explicitly setting continuation_ws.
msg68568 - (view)	Author: Chris Withers (cjw296) *	Date: 2008-06-22 16:06
Andi, I'm in total agreement with you :-) (so if this bug could get fixed, both issues could get closed)
msg68630 - (view)	Author: Andi Albrecht (aalbrecht) *	Date: 2008-06-23 14:45
IMO, the best way to resolve this is to use the default continuation whitespace when a Header class is created from a string. But I'm pretty unsure about the default whitespace character. After having a look at a random set of mails in my inbox it seems that in most cases a tab is used for all headers except for the subject (excluding mails sent from Python scripts of course...). I would prefer '\t'; it's allowed and results in a cleaner output - but it won't solve the problem with some email clients which was the starting point for this issue (a short note in the docs should resolve this anyway... ;-) But I wonder if for some reason setting headers from strings intentionally differs from using the Header class? There are some tests that explicitly test both cases and expect different output. In TestLongHeaders it's test_long_unbreakable_lines_with_continuation() test_long_lines_with_different_header(self) test_long_received_header() [continuation_ws set to tab here] So, what I would do is * remove continuation_ws keyword from Generator._write_headers() * make '\t' the default cont. whitespace for headers (Header class) * update test cases and the documentation accordingly Do you think that this is the right way to go? Or would it break too much code outside stdlib if the default ws changes?
msg68634 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2008-06-23 14:59
I'm pretty convinced that this stuff is broken and needs serious attention. Unfortunately, I won't have time to address the email package before 2.6 and 3.0 get released. My current work lives at bzr+ssh://pythonbzr@code.python.org/python/users/barry/30email but it hasn't been updated for a while. I highly suggest taking this to the email-sig to see if anybody else has spare cycles to work in this for now.
msg68658 - (view)	Author: Chris Withers (cjw296) *	Date: 2008-06-23 23:06
Again, in total agreement with Andi EXCEPT: Please use ' ' instead of '\t' for the continuation character. It's the \t that gets mis-rendered by Outlook and Thunderbird (at the very least!) and since ' ' is also valid according to the RFC, we make life much easier for ourselves if we just stick to that...
msg68665 - (view)	Author: Andi Albrecht (aalbrecht) *	Date: 2008-06-24 04:33
FWIW, I've uploaded a patch to codereview: http://codereview.appspot.com/2407 It uses a space character as Chris suggested.
msg68750 - (view)	Author: Ori Avtalion (salty-horse) *	Date: 2008-06-25 19:42
I think there's been a little misinterpretation of the standard in the comments above. It's important to note that RFC 2822 basically defines folding as "adding a CRLF before an existing whitespace in the original message". See http://tools.ietf.org/html/rfc2822#section-2.2.3 It does not allow prepending folded lines with extra characters that were not in the original message such as '\t' or ' '. This is exactly what _encode_chunks does in header.py: joiner = NL + self._continuation_ws (Note that the email package docs and Header docstring use the word 'prepend' which is reflects the error in the code). With a correct implementation, why would I want to choice of which type of character to line-break on when folding? The whole notion of controlling the value of continuation_ws seems wrong. However, changing the default continuation_ws to ' ', as the patch suggests, will output syntactically correct headers in the majority of cases (due to other bugs that remove trailing whitespace and merge consecutive whitespace into one character). All in all, I agree with the change of the default continuation_ws due to its lucky side-effects, but as Barry hinted, the algorithm needs some serious work to really output valid headers. Some examples of the good and bad behaviors: >>> from email.Header import Header >>> l = ['<%d@dom.ain>' % i for i in range(8)] >>> # this turns out fine >>> Header(' '.join(l), continuation_ws=' ').encode() '<0@dom.ain> <1@dom.ain> <2@dom.ain> <3@dom.ain> <4@dom.ain> <5@dom.ain>\n <6@dom.ain> <7@dom.ain>' # This does not fold even though it should >>> Header('\t'.join(l), continuation_ws=' ').encode() '<0@dom.ain>\t<1@dom.ain>\t<2@dom.ain>\t<3@dom.ain>\t<4@dom.ain>\t<5@dom.ain>\t<6@dom.ain>\t<7@dom.ain>' # And here the 4-char whitespace is shrinked into one >>> Header(' '.join(l), continuation_ws=' ').encode() '<0@dom.ain> <1@dom.ain> <2@dom.ain> <3@dom.ain> <4@dom.ain> <5@dom.ain>\n <6@dom.ain> <7@dom.ain>'
msg68767 - (view)	Author: Chris Withers (cjw296) *	Date: 2008-06-26 08:01
Ori, I do agree with both you and Barry but is there any chance someone could make the one-character change to make the /t a space so we can stop seeing weirdness in common mail clients? Perhaps a separate issue could be raised to refactor this all correctly?
msg84601 - (view)	Author: Chris Withers (cjw296) *	Date: 2009-03-30 18:10
Barry and I talked about this and he's is now working on it. We're literally going to remove the \t folding substitution and have it do the default space folding substitution instead.
msg84698 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2009-03-30 22:43
So astoundingly, this change has deep implications. The upshot is that it's difficult to fix this so that headers look nice for e.g. Subject headers, but so that splitting and wrapping work as expected for e.g. machine readability of Received headers. After discussion with other sprinters, I'm committing a change to Python 2.7 to fix this, but I am not back porting the change to 2.6. I think we should Do It Right for Python 3.1 but this requires (IMO) API changes. r70772
msg84755 - (view)	Author: Chris Withers (cjw296) *	Date: 2009-03-31 07:16
It's probably worth noting that changing: from email.mime.text import MIMEText m = MIMEText('foo') m['subject']='something long' ...to: from email.header import Header m = MIMEText('foo') m['subject']=Header('something long') ...will do folding without the \t problem, even in Python 2.6 I guess the moral of the story is that all headers should really be header objects. I think Barry has some ideas on that ;-)
msg109370 - (view)	Author: Nicolas Dumazet (nicdumz)	Date: 2010-07-06 08:22
Hello folks. (stumbling on this bug with Python2.7 release, noting that a few Mercurial tests broke with 2.7) I have no problem whatsoever with the fix itself (you know emails better than me), but you broke backwards compatibility for email.generator.Generator here, without propagating the changes to previous versions. 1) It would have been nice to flag this somewhere in the release notes, and/or the documentation for email.generator.Generator. Finding the reason of the Mercurial failures was... a bit trickier than necessary :) 2) What durable solutions can you propose for those of us that have to support several Python versions and share a similar behaviour? In a perfect world, I would simply do this: class CompatibleHeader(email.Header.Header.__class__): """ Python2.7 introduces a backwards incompatible change (Python issue1974, r70772) in email.generaor.Generator code: pre-2.7 code passed "continuation_ws='\t'" to the Header constructor, and 2.7 removed this parameter. Default argument is continuation_ws=' ', which means that the behaviour is different in <2.7 and 2.7 We consider the 2.7 behaviour to be preferable, but need to have an unified behaviour for versions 2.4 to 2.7 """ def __init__(self, kw): # override continuation_ws kw['continuation_ws'] = ' ' email.Header.Header.__init__(self, kw) email.Header.Header.__class__ = CompatibleHeader and get over it, but you'll notice that Header is still an old-style class... so that's not possible. Any suggestions? Thanks!
msg109371 - (view)	Author: Chris Withers (cjw296) *	Date: 2010-07-06 08:30
Maybe an old-fashioned monkey-patch would be the way to go?
msg109376 - (view)	Author: Nicolas Dumazet (nicdumz)	Date: 2010-07-06 09:40
Sure, where was my head. So, a simple patch like this one: _oldheaderinit = email.Header.Header.__init__ def _unifiedheaderinit(self, args, kw): # override continuation_ws kw['continuation_ws'] = ' ' _oldheaderinit(self, args, **kw) email.Header.Header.__dict__['__init__'] = _unifiedheaderinit fixes the issue for us, and might be helpful to others.

History
Date	User	Action	Args
2022-04-11 14:56:30	admin	set	github: 46266
2010-07-06 09:40:57	nicdumz	set	messages: + msg109376
2010-07-06 08:30:03	cjw296	set	messages: + msg109371
2010-07-06 08:22:45	nicdumz	set	nosy: + djc, nicdumz messages: + msg109370
2009-03-31 07:16:01	cjw296	set	keywords: - patch messages: + msg84755
2009-03-30 22:43:41	barry	set	status: open -> closed messages: + msg84698 versions: - Python 2.6
2009-03-30 18:10:05	cjw296	set	resolution: accepted messages: + msg84601 versions: + Python 2.6, Python 3.1, Python 2.7, - Python 2.4
2009-03-22 15:14:18	gagern	set	nosy: + gagern
2008-06-26 08:01:52	cjw296	set	messages: + msg68767
2008-06-25 19:42:05	salty-horse	set	nosy: + salty-horse messages: + msg68750
2008-06-24 04:34:00	aalbrecht	set	messages: + msg68665
2008-06-23 23:06:35	cjw296	set	messages: + msg68658
2008-06-23 14:59:24	barry	set	messages: + msg68634
2008-06-23 14:45:33	aalbrecht	set	messages: + msg68630
2008-06-22 16:06:05	cjw296	set	messages: + msg68568
2008-06-22 12:06:53	aalbrecht	set	files: + py_issue1974.diff keywords: + patch messages: + msg68562 nosy: + aalbrecht
2008-05-23 21:49:11	splorgdar	set	files: - test
2008-05-23 21:49:02	splorgdar	set	files: + test.py
2008-05-23 21:33:33	splorgdar	set	files: + test nosy: + splorgdar messages: + msg67270
2008-02-01 15:52:13	gvanrossum	set	assignee: barry nosy: + barry
2008-01-30 14:48:04	cjw296	create