Issue 39040: Wrong attachement filename when mail mime header was too long

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/83221

classification

Title:	Wrong attachement filename when mail mime header was too long
Type:	behavior	Stage:	patch review
Components:	email	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, maxking, miss-islington, mkaiser, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2019-12-13 16:59 by mkaiser, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description	Edit
testscript.py	mkaiser, 2019-12-13 18:29	Script to read filenames
error.eml	mkaiser, 2019-12-13 18:34	Mail with broken filename
original_mail_from_gmail.eml	mkaiser, 2019-12-13 18:52	downloaded mail from gmail (web interface). I only edited the the mail addresses for privacy reasons. Same problem with this mail

Pull Requests
URL	Status	Linked	Edit
PR 17590	closed	mkaiser, 2019-12-13 17:04
PR 17620	merged	maxking, 2019-12-16 02:51
PR 20504	merged	miss-islington, 2020-05-29 00:05
PR 20505	merged	miss-islington, 2020-05-29 00:05
PR 20506	merged	miss-islington, 2020-05-29 00:05

Messages (23)
msg358343 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-13 16:59
I'm working on a mailfilter in python and used the method "get_filename" of the "EmailMessage" class. In some cases a wrong filename was returned. The reason was, that the Content-Disposition Header had a line break and the following intention was interpreted as part of the filename. After fixing this bug, I was able to get the right filename. I had to change "linesep_splitter" in "email.policy" to match the intention. Old Value: linesep_splitter = re.compile(r'\n\|\r') New Value: linesep_splitter = re.compile(r'\n\s+\|\r\s+')
msg358345 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-13 17:38
Thanks for the report. Can you provide an example that reproduces the problem? Per the RFC, lines may be broken before whitespace in certain places in certain headers, but that does not make the whitespace go away. Only the crlf sequence is removed when unfolding the header, per the RFC, so your proposed fix is incorrect. I suspect your example header is invalid, and the question will then become is there some sort of Postel-style error recovery we can and want to do in the function that parses the content-disposition header.
msg358350 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-13 18:34
The original filename is "Schulbesuchsbestättigung.pdf", but when I use the method "get_filename" I got "Schulbesuchsbestättigung. pdf" I removed some headers from the mail for privacy reasons
msg358351 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-13 18:36
The mail was sent from the GMail web interface
msg358352 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-13 19:44
That header is completely non-RFC compliant. If gmail generated that header there is something very wrong in google-land :( The RFC compliant formatting for that header looks like this: Content-Disposition: attachment; filename=utf-8''Schulbesuchsbest%C3%A4ttigung.pdf You will note that this is nothing like encoded word format. Encoded words are not valid inside quoted strings, and quoted strings can't be used in mime header attributes if there are non-ascii characters involved. Nor can encoded words. Now, all that said, there is an obvious rule that can be followed to understand what that header is trying to convey, and the current parser already implements most of it (you will find comments about it in the parser, as well as defects being registered). So, a patch to _header_value_parser to fix the error recovery will be accepted. I've looked at the code to remind myself, but not deeply enough to be sure* where the changes need to be made. There are two possibilities I see off the bat (and both may need fixing): get_bare_quoted_string and get_parameter. Either one or both of those may be forgetting that whitespace between encoded words should be dropped.
msg358378 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-14 06:12
thanks for your response. I have found the RFC https://tools.ietf.org/html/rfc2184 Gmail creates wrong Headers, which are not rfc-compliant. The problem is, that many people are using gmail and emails, which were sent from Gmail might be wrong. How can we solve this problem? It is not a Python problem. We can create workarrounds. But in my opinion Google has to fix the bug.
msg358384 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-14 07:58
RFC-2184 was obsoleted by RFC-2231 (https://www.rfc-editor.org/rfc/rfc2231.html) There are also no quoted strings, like google uses.
msg358393 - (view)	Author: Manfred Kaiser (mkaiser) *	Date: 2019-12-14 14:53
as you mentioned, rfc-2047 forbidds encoded words in quoted strings. Source: https://tools.ietf.org/html/rfc2047 - Chapter 5/3 I have tested a few web mail clients and they have the same issue. According to the RFCs, this is not allowed, but I think it is widely used. Should we fix this problem?
msg358395 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-14 15:28
Yes, google should fix their bug. However, the python email package tries very hard to interpret even RFC-non-compliant emails when there is a way to do so. As I said, the package already tries to interpret headers such as google is generating, it's just that there is a bug in that interpretation: it is keeping the blank between then encoded words when it should not be. That bug can be fixed, in get_raw_encoded_word and/or get_parameter, in email._header_value_parser.
msg358396 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-14 15:31
And you are right that this is a very common bug in email programs. So common that I suspect the RFC folks will eventually have to accept it as a de-facto standard. So we do need to support it in the python email library.
msg358458 - (view)	Author: Abhilash Raj (maxking) *	Date: 2019-12-15 23:05
I tried to take a look at the code to see where the fix needs to be and I probably need some help. I looked at the parse tree for the header and it looks something like this: ContentDisposition([Token([ValueTerminal('attachment')]), ValueTerminal(';'), MimeParameters([Parameter([Attribute([CFWSList([WhiteSpaceTerminal(' ')]), ValueTerminal('filename')]), ValueTerminal('='), Value([QuotedString([BareQuotedString([EncodedWord([ValueTerminal('Schulbesuchsbestättigung.')]), WhiteSpaceTerminal(' '), EncodedWord([ValueTerminal('pdf')])])])])])])]) The offending piece of code, which seems to be working as designed is get_bare_quoted_string() in email/_header_value_parser.py. while value and value[0] != '"': if value[0] in WSP: token, value = get_fws(value) elif value[:2] == '=?': try: token, value = get_encoded_word(value) bare_quoted_string.defects.append(errors.InvalidHeaderDefect( "encoded word inside quoted string")) except errors.HeaderParseError: token, value = get_qcontent(value) else: token, value = get_qcontent(value) bare_quoted_string.append(token) It just loops and parses the values. We cannot ignore the FWS until we know that the atom before and after the FWS are encoded words. I can't seem to find a clean way to look-ahead (which can perhaps be used in get_parameters()) or look-back (which can be used after parsing the entire bare_quoted_string?) in the parse tree to delete the offending whitespace. Any example of such kind of parse-tree manipulation in the code base would be awesome!
msg358460 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-16 01:33
The example you want to look at is get_unstructured. That shows both lookback and modification of the parse tree to handle the whitespace between encoded words.
msg358463 - (view)	Author: Abhilash Raj (maxking) *	Date: 2019-12-16 02:53
Thanks for the pointer, David! I created a PR for the fix, would you be able to review it please?
msg358500 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-16 18:19
In general your solution looks good, just a few naming comments and an additional test request.
msg358530 - (view)	Author: Abhilash Raj (maxking) *	Date: 2019-12-17 03:30
Thanks David! I applied the fixes as per your comments, can you please take another look?
msg358573 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-17 20:32
One more tweak to the test and we'll be good to go.
msg358608 - (view)	Author: Abhilash Raj (maxking) *	Date: 2019-12-18 02:20
Sure, fixed as per your comments in the PR.
msg358853 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-12-24 18:17
I don't see the change to the test in the PR. Did you miss a push or is github doing something wonky with the review? (I haven't used github review in a while and I had forgetten how hard it is to use...)
msg358857 - (view)	Author: Abhilash Raj (maxking) *	Date: 2019-12-24 19:41
I double checked, there should be 4 commits in the PR and last 2 have the changes that you asked for in the test case and NEWS entry. Your previous comment will point at the old diff, you might have to look at the full diff here: https://github.com/python/cpython/pull/17620/files or if you want, this is the diff for the 2 commits with the changes you requested: 016ceb3ef00b3b940993d35d539ce63d68437d4f">https://github.com/python/cpython/pull/17620/files/bf2cb76009d72869d9df6550b473b5818ceab311..016ceb3ef00b3b940993d35d539ce63d68437d4f
msg370276 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2020-05-29 00:05
New changeset 21017ed904f734be9f195ae1274eb81426a9e776 by Abhilash Raj in branch 'master': bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620) https://github.com/python/cpython/commit/21017ed904f734be9f195ae1274eb81426a9e776
msg370298 - (view)	Author: miss-islington (miss-islington)	Date: 2020-05-29 11:43
New changeset a6ae02d7e91cfe63c9b65b803ae24a40d2864bc0 by Miss Islington (bot) in branch '3.9': bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620) https://github.com/python/cpython/commit/a6ae02d7e91cfe63c9b65b803ae24a40d2864bc0
msg370299 - (view)	Author: miss-islington (miss-islington)	Date: 2020-05-29 11:43
New changeset 6381ee077d3c69d2f947f7bf87d8ec76e0caf189 by Miss Islington (bot) in branch '3.8': bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620) https://github.com/python/cpython/commit/6381ee077d3c69d2f947f7bf87d8ec76e0caf189
msg370300 - (view)	Author: miss-islington (miss-islington)	Date: 2020-05-29 11:43
New changeset 5f977e09e8a29dbd5972ad79c4fd17a394d1857f by Miss Islington (bot) in branch '3.7': bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620) https://github.com/python/cpython/commit/5f977e09e8a29dbd5972ad79c4fd17a394d1857f

History
Date	User	Action	Args
2022-04-11 14:59:24	admin	set	github: 83221
2020-05-29 11:43:51	miss-islington	set	messages: + msg370300
2020-05-29 11:43:30	miss-islington	set	messages: + msg370299
2020-05-29 11:43:09	miss-islington	set	messages: + msg370298
2020-05-29 00:05:29	miss-islington	set	pull_requests: + pull_request19752
2020-05-29 00:05:22	miss-islington	set	pull_requests: + pull_request19751
2020-05-29 00:05:13	miss-islington	set	nosy: + miss-islington pull_requests: + pull_request19750
2020-05-29 00:05:04	r.david.murray	set	messages: + msg370276
2019-12-24 19:41:53	maxking	set	messages: + msg358857
2019-12-24 18:17:11	r.david.murray	set	messages: + msg358853
2019-12-18 02:20:21	maxking	set	messages: + msg358608
2019-12-17 20:32:29	r.david.murray	set	messages: + msg358573
2019-12-17 03:30:01	maxking	set	messages: + msg358530
2019-12-16 18:19:37	r.david.murray	set	messages: + msg358500
2019-12-16 02:53:25	maxking	set	messages: + msg358463
2019-12-16 02:51:30	maxking	set	pull_requests: + pull_request17090
2019-12-16 01:33:18	r.david.murray	set	messages: + msg358460
2019-12-15 23:05:17	maxking	set	messages: + msg358458 versions: + Python 3.9, - Python 3.5, Python 3.6
2019-12-14 15:38:35	xtreak	set	nosy: + maxking
2019-12-14 15:31:14	r.david.murray	set	messages: + msg358396
2019-12-14 15:28:49	r.david.murray	set	messages: + msg358395
2019-12-14 14:53:17	mkaiser	set	messages: + msg358393
2019-12-14 07:58:48	mkaiser	set	messages: + msg358384
2019-12-14 06:12:46	mkaiser	set	messages: + msg358378
2019-12-13 19:44:51	r.david.murray	set	messages: + msg358352
2019-12-13 18:52:06	mkaiser	set	files: + original_mail_from_gmail.eml
2019-12-13 18:36:11	mkaiser	set	messages: + msg358351
2019-12-13 18:34:12	mkaiser	set	files: + error.eml messages: + msg358350
2019-12-13 18:29:20	mkaiser	set	files: + testscript.py
2019-12-13 17:38:36	r.david.murray	set	messages: + msg358345
2019-12-13 17:08:59	mkaiser	set	title: Wrong filename in mail when mime header was too long -> Wrong attachement filename when mail mime header was too long
2019-12-13 17:08:35	mkaiser	set	title: Wrong filename in when mime header was too long -> Wrong filename in mail when mime header was too long
2019-12-13 17:04:32	mkaiser	set	keywords: + patch stage: patch review pull_requests: + pull_request17062
2019-12-13 16:59:51	mkaiser	create