Issue 40359: email.parse part.get_filename() fails to unwrap long attachment file names (legacy API)

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84539

classification

Title:	email.parse part.get_filename() fails to unwrap long attachment file names (legacy API)
Type:		Stage:
Components:	email	Versions:	Python 3.6

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, matt-davis, r.david.murray
Priority:	normal	Keywords:

Created on 2020-04-22 00:26 by matt-davis, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
mwe.py	matt-davis, 2020-04-22 00:26	python script with minimal working example

Messages (6)
msg366963 - (view)	Author: Matthew Davis (matt-davis)	Date: 2020-04-22 00:26
# Summary

When parsing emails with long attachment file names, part.get_filename() often returns a name that contains \n or \r\n. It should strip those characters out.

# Steps to reproduce

I have attached a minimal working example. The relevant part of the raw email is:

    --_004_D6CEDE1EBD6645898F5643C0C6878005examplecom_
    Content-Type: text/plain; name="an attachment with a very very very long super long file name which has many words and just keeps on going and going.txt"

# Expected output:

    attachments = ["an attachment with a very very very long super long file name which has many words and just keeps on going and going.txt"]

Maybe I'm reading the email RFC spec wrong. My interpretation is that the parser should do something like:

    raw = raw.replace('\r\n ', ' ').replace('\n ', ' ')

# Actual output

    attachments = ["an attachment with a very very very long super long file name which\n has many words and just keeps on going and going.txt"]

Note that I have seen other examples where the output includes \r\n, not just \n.

# Impact

I'm trying to write an email bot which saves attachments to a database, and also forwards on the emails. My bot thinks that the filename includes a line break. This inevitably causes failures in my subsequent code.

# Relevant links:

The RFC for the email spec is here: https://tools.ietf.org/html/rfc2822.html#section-2.2.3

This Stack Overflow answer seems relevant: https://stackoverflow.com/questions/3050298/parsing-email-with-python/3050374#3050374

Issue 3601 may be relevant, but doesn't seem exactly the same. It seems to be the reverse: constructing emails with long headers. My issue is parsing emails with long headers.
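For readers who have to stay on the legacy API, the unfolding the reporter describes can be sketched as a small post-processing step. This is only an illustration, not a fix in the email package: the helper name unfold_header_value, the RAW message, and the single-part layout are assumptions made for this example (the attached mwe.py is not reproduced here). The regular expression follows the RFC 2822 section 2.2.3 rule of removing a line break that is immediately followed by whitespace.

```python
import email
import re

# Hypothetical single-part message. The folded Content-Type header mirrors
# the one quoted in the report; the fold position is taken from the
# "Actual output" shown there.
RAW = (
    'Content-Type: text/plain; name="an attachment with a very very very long'
    ' super long file name which\n'
    ' has many words and just keeps on going and going.txt"\n'
    '\n'
    'body\n'
)


def unfold_header_value(value):
    """Remove each CRLF or LF that is immediately followed by a space or tab.

    Hypothetical helper, not part of the email package.
    """
    return re.sub(r'\r?\n(?=[ \t])', '', value)


# Legacy (compat32) parsing: no policy argument is passed.
msg = email.message_from_string(RAW)
legacy_name = msg.get_filename()  # per the report, this may still contain the fold
clean_name = unfold_header_value(legacy_name)
print(repr(clean_name))
```

Applying the helper to the value returned by get_filename(), rather than to the whole raw message, avoids touching line breaks in the message body.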
msg366965 - (view)	Author: Matthew Davis (matt-davis)	Date: 2020-04-22 01:20
Ah, whoops, I mistyped the relevant ticket. It's issue 36401 https://bugs.python.org/issue36041
msg366966 - (view)	Author: Matthew Davis (matt-davis)	Date: 2020-04-22 01:20
36041
msg367117 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2020-04-23 14:33
Yeah, that looks like a bug in the old API. If you try the new API, it does the right thing. To do that, import email.policy and make your message_from_string call:

    email.message_from_string(raw, policy=email.policy.default)

Note, however, that you really ought to be using message_from_bytes. Serialized email messages are bytes, not unicode, and using message_from_string will get you into other trouble.

I don't know if it is worth fixing the old API.
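A minimal sketch of that suggestion, combining message_from_bytes with policy=email.policy.default, might look like this. The RAW_BYTES message is an assumption made for the example (a simplified single-part message with CRLF line endings, not the reporter's attachment); the two library calls are the ones named in the message above.

```python
import email
import email.policy

# Hypothetical serialized message as bytes, with a folded Content-Type header.
RAW_BYTES = (
    b'Content-Type: text/plain; name="an attachment with a very very very long'
    b' super long file name which\r\n'
    b' has many words and just keeps on going and going.txt"\r\n'
    b'\r\n'
    b'body\r\n'
)

# Passing policy=email.policy.default selects the new API; without it,
# the legacy compat32 policy is used.
msg = email.message_from_bytes(RAW_BYTES, policy=email.policy.default)
filename = msg.get_filename()
print(repr(filename))
```

With the new policy the header value is unfolded when it is fetched, so the returned name should not need any further cleanup for line breaks.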
msg367493 - (view)	Author: Matthew Davis (matt-davis)	Date: 2020-04-28 04:26
Ah, yes, that workaround works. Thanks! So what's the exact status of this policy? It's called the default policy, but it's not used by default? If I download the latest version of Python, will this be parsed correctly without explicitly setting the policy? That is, is this still something that should be changed in the code? (Yes, I already use message_from_bytes in my real application. I just used message_from_string in the MWE, because I could only attach one file in this web page, so I embedded the email body as a string in the script.)
msg367523 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2020-04-28 12:27
As far as I know you currently still have to specify the policy. It was, yes, intended that 'default' become the actual default. I could have sworn there was an open issue for doing this, but I can't find it. I remember having a conversation with someone who said they were going to work on getting it done, but unfortunately I don't remember who :( I'm not very active in the python community currently so I can't really drive it, but it should definitely happen.

History
Date	User	Action	Args
2022-04-11 14:59:29	admin	set	github: 84539
2020-04-28 12:27:30	r.david.murray	set	messages: + msg367523
2020-04-28 04:26:56	matt-davis	set	messages: + msg367493
2020-04-23 14:33:02	r.david.murray	set	messages: + msg367117 title: email.parse part.get_filename() fails to unwrap long attachment file names -> email.parse part.get_filename() fails to unwrap long attachment file names (legacy API)
2020-04-22 01:20:37	matt-davis	set	messages: + msg366966
2020-04-22 01:20:05	matt-davis	set	messages: + msg366965
2020-04-22 00:26:27	matt-davis	create