Issue 43493: EmailMessage mis-folding headers of a certain length

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87659

classification

Title:	EmailMessage mis-folding headers of a certain length
Type:	behavior	Stage:
Components:	email	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	andrei.avk, barry, mglover, r.david.murray
Priority:	normal	Keywords:

Created on 2021-03-14 20:46 by mglover, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
header_misfolding.py	mglover, 2021-03-14 20:46	test code showing incorrect behavior
foldfix.py	mglover, 2021-03-18 05:23

Messages (5)
msg388687	Author: Mike Glover (mglover)	Date: 2021-03-14 20:46
The attached file demonstrates the incorrect folding behavior I'm seeing. Header lines of a certain total length get folded after the colon following the header name, which is not valid under the RFC. Slightly longer or shorter lines are folded correctly.

Interestingly, the test file produces correct output on 3.5.2.

$ python --version
Python 3.8.5
$ sudo apt install python3
...
python3 is already the newest version (3.8.2-0ubuntu2).

(Yes, that difference has me scratching my head.)

And yes, I realize this is not the latest release of the 3.8 branch, but it is the latest available through apt on Ubuntu 20.04 LTS, and a search of the issue tracker and the release notes for all of 3.8.* turned up nothing applicable.
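The attached header_misfolding.py is not reproduced in this thread. A minimal sketch of the kind of check the report describes might look like the following. The filler word, the range of repeat counts, and the helper names fold_subject and folds_after_colon are assumptions made for illustration; only email.policy.default and its fold() method come from the standard library. Whether any count shows the symptom depends on the Python version in use.

```python
from email import policy

WORD = "abcdefgh"  # hypothetical filler word, not taken from the attachment


def fold_subject(count, word=WORD):
    """Fold a Subject made of `count` space-separated copies of `word`."""
    value = " ".join([word] * count)
    # Policy.fold() returns the header as it would be serialized,
    # including the trailing line separator.
    return policy.default.fold("Subject", value)


def folds_after_colon(folded):
    """True if the first physical line holds nothing but the header name."""
    return folded.splitlines()[0].strip() == "Subject:"


# Sweep a range of lengths and report which ones, if any, show the
# symptom on the interpreter running this script.
suspect = [n for n in range(1, 21) if folds_after_colon(fold_subject(n))]
print("repeat counts folded right after the colon:", suspect)
```

An empty list means no value in the sweep was pushed whole onto a continuation line on that interpreter.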
msg388989	Author: Mike Glover (mglover)	Date: 2021-03-18 05:23
Further research shows that email.parser.Parser is not handling the affected lines correctly -- the leading '\n ' is not being stripped from the header value. Attached is the (ugly, worksforme) function I'm using to work around this problem.
msg389014	Author: R. David Murray (r.david.murray) *	Date: 2021-03-18 13:12
Parsing and newlines have nothing to do with this bug, actually. I don't think your foldfix post-processing is going to do what you want in the general case.

The source of the bug here is in the folding algorithm in _header_value_parser. It has checks to see if the "text so far" will fit within the header width, and it starts a new line under various conditions. For example, if there is a single word after Subject: whose length is, say, 70, it would produce the effect you show, because the single word would fit without folding or encoding on a new line. I don't think this violates the RFC.

What your example shows makes it look like the folder is treating all of the text as if it were a single word, which is obviously wrong. It is supposed to break at spaces. You will note that if you increase the repeat count in your example to 16 it folds the line correctly. So the bug has something to do with the total text so far accumulated for the line being right in that window where it won't fit on the first line but does fit on a line by itself. This is obviously a bug in the folder, since it should be splitting that text if it isn't a single word, not moving it to a new line as a whole.

Note that this bug is still present on master.
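The "window" mentioned above can be estimated with simple arithmetic. The sketch below assumes the default max_line_length of 78, a first line of the form "Name: value", and a continuation line that starts with a single space; misfold_window is a hypothetical helper, not part of the email package.

```python
def misfold_window(name, maxlen=78):
    """Value lengths that overflow the first line but fit on a line alone.

    Assumes the first line is "<name>: <value>" and that a continuation
    line is one space followed by the value.
    """
    first_overflow = maxlen - len(name) - 2 + 1  # shortest value that no longer fits
    longest_alone = maxlen - 1                   # one column goes to the fold space
    return range(first_overflow, longest_alone + 1)


window = misfold_window("Subject")
print(window.start, window[-1])  # 70 77
```

Under those assumptions a Subject value of 70 to 77 characters is too long for the first line yet short enough to be moved to the next line in one piece, which matches the length of 70 used in the example above.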
msg396884	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-07-03 00:29
I've looked into this and it seems to be somewhat intentional, as can be seen in this test case for example:

test_headerregistry.py", line 1725, in test_fold_address_list
+ To: "Theodore H. Perfect" <yes@man.com>,
+  "My address is very long because my name is long" <foo@bar.com>,
+  "Only A. Friend" <no@yes.com>

Relevant code is here: https://github.com/python/cpython/blob/main/Lib/email/_header_value_parser.py#L2829-L2849

The logic goes like this:

tstr = header value
- try to add it to current line, if fits; continue to start of loop
- try to add it to next line, if fits; continue to start of loop
- split tstr; continue to start of loop

So as you can see from the test case, if the split happened before step 2, the name would be split over two lines, which is not ideal.

I tested splitting before step 2, which fixed this bug but failed 11 test cases, all of which deal with email address folding (To and From headers).

So, is this actually an issue worth fixing?

If the answer is yes, one option would be to special-case the Subject header and split it before going to step 2.

IOW, the block that starts here: https://github.com/python/cpython/blob/4bcef2bb48b3fd82011a89c1c716421b789f1442/Lib/email/_header_value_parser.py#L2835 would need to be moved after the next block that starts 6 lines below.
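The three steps above can be modelled with a small toy loop. This is a simplified sketch, not the code in _header_value_parser: it only knows about space-separated words, ignores encoding and address structure, and the names fold_toy and split_first are made up for illustration. Passing split_first=True models the reordering that was tested, where the split is tried before the "next line" step.

```python
def fold_toy(name, value, maxlen=78, split_first=False):
    """Toy model of the three-step loop (not the CPython implementation)."""
    lines = [name + ":"]
    pending = [" " + value]  # each unit carries its leading space
    while pending:
        tstr = pending.pop(0)
        words = tstr.split()
        # Step 1: add to the current line if it fits.
        if len(lines[-1]) + len(tstr) <= maxlen:
            lines[-1] += tstr
            continue
        # Reordered variant: split before trying a fresh line.
        if split_first and len(words) > 1:
            pending[:0] = [" " + w for w in words]
            continue
        # Step 2: put it on the next line if it fits there.
        # A single word that fits nowhere also lands here, unsplit.
        if len(tstr) <= maxlen or len(words) == 1:
            lines.append(tstr)
            continue
        # Step 3: split into words and go around the loop again.
        pending[:0] = [" " + w for w in words]
    return "\n".join(lines)


subject = " ".join(["word"] * 15)  # 74 characters, inside the 70-77 window
print(fold_toy("Subject", subject))
print(fold_toy("Subject", subject, split_first=True))
```

In the toy, the 74-character value is moved whole to the second line in the default order, while the reordered variant fills the first line before folding. Because the toy has no notion of quoted display names, it cannot show the address-folding regressions mentioned above.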
msg397049	Author: R. David Murray (r.david.murray) *	Date: 2021-07-06 17:16
Ah, yes, the problem is more subtle than I thought.

The design here is that we should be starting with the largest lexical unit, seeing if that fits on the current line, or a line by itself, and if so, using that, and if not, moving down to the next smaller lexical unit and trying again, until we are finally left with an unbreakable unit. For unstructured headers such as Subject the lexical units should be encoded words followed by blank-delimited words. I'm guessing the code is treating the collection of words it has accumulated as a unit in the above algorithm, and since it fits on a line by itself, it goes with that. So yeah, it's sort of intentional.

So the bug here is that in your step 2 we ideally want to be considering whether the last token on the current line is at the same lexical level as the token that precedes it... and if so, and if moving that token to the next line lets the remainder fit on the first line, we should do that.

Exactly how to implement that correctly is a good question... it's been too long since I wrote that code, and I may not have time to investigate it more deeply. If you come up with something based on my description of the intent above, I should be able to review it (though you might need to ping me directly to get my attention).
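The "largest lexical unit first" design described above can be illustrated with a toy that walks a nested list. The nesting used here (an address as a pair of display name and angle address, a Subject as a flat list of words) is an assumption chosen for the example and is not how _header_value_parser represents tokens; fold_units is a hypothetical name. The sketch does not attempt the "same lexical level" refinement proposed above.

```python
def fold_units(prefix, units, maxlen=78):
    """Fold by descending a tree of lexical units (an illustrative toy).

    `units` is a list whose items are either a str (unbreakable) or a
    list of smaller units.
    """
    lines = [prefix]

    def text(unit):
        return unit if isinstance(unit, str) else " ".join(text(u) for u in unit)

    def place(unit):
        chunk = " " + text(unit)
        if len(lines[-1]) + len(chunk) <= maxlen:
            lines[-1] += chunk      # fits on the current line
        elif len(chunk) <= maxlen or isinstance(unit, str):
            lines.append(chunk)     # fits on its own line, or cannot be split
        else:
            for smaller in unit:    # too big: retry one level down
                place(smaller)

    for unit in units:
        place(unit)
    return "\n".join(lines)


addresses = [
    ['"Theodore H. Perfect"', "<yes@man.com>,"],
    ['"My address is very long because my name is long"', "<foo@bar.com>,"],
    ['"Only A. Friend"', "<no@yes.com>"],
]
print(fold_units("To:", addresses))
print(fold_units("Subject:", ["word"] * 15))
```

With each address treated as one unit, the toy keeps every display name on a single line, as in the test case quoted earlier. With the Subject given as individual words rather than one accumulated run, the first line is filled before the fold.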

History
Date	User	Action	Args
2022-04-11 14:59:42	admin	set	github: 87659
2021-07-06 17:16:58	r.david.murray	set	messages: + msg397049
2021-07-03 00:29:19	andrei.avk	set	nosy: + andrei.avk messages: + msg396884
2021-03-18 13:12:07	r.david.murray	set	messages: + msg389014
2021-03-18 05:23:57	mglover	set	files: + foldfix.py messages: + msg388989
2021-03-14 20:46:29	mglover	create