
classification
Title: email.parse part.get_filename() fails to unwrap long attachment file names (legacy API)
Type:
Stage:
Components: email
Versions: Python 3.6

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, matt-davis, r.david.murray
Priority: normal
Keywords:

Created on 2020-04-22 00:26 by matt-davis, last changed 2022-04-11 14:59 by admin.

Files
File name: mwe.py
Uploaded: matt-davis, 2020-04-22 00:26
Description: python script with minimal working example
Messages (6)
msg366963 - Author: Matthew Davis (matt-davis) Date: 2020-04-22 00:26
# Summary

When parsing emails with long attachment file names, part.get_filename() often returns a file name that still contains \n or \r\n.
It should strip those characters out.

# Steps to reproduce

I have attached a minimal working example.

The relevant part of the raw email is:

--_004_D6CEDE1EBD6645898F5643C0C6878005examplecom_
Content-Type: text/plain;
	name="an attachment with a very very very long super long file name which has
 many words and just keeps on going and going.txt"

# Expected output:

attachments = ["an attachment with a very very very long super long file name which has many words and just keeps on going and going.txt"]

Maybe I'm reading the email RFC spec wrong. My interpretation is that the parser should do something like:

raw = raw.replace('\r\n ', ' ').replace('\n ', ' ')
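The one-liner above only handles a fold followed by a space. A slightly more general sketch of the unfolding rule in RFC 2822 section 2.2.3 (remove any CRLF that is immediately followed by whitespace) could look like the following. The helper name `unfold` is hypothetical, accepting a bare LF as well as CRLF is an assumption made for leniency, and this is not how the email package itself implements unfolding.

```python
import re

# A CRLF (or, leniently, a bare LF) immediately followed by a space or tab
# marks a folded header line. The lookahead keeps the whitespace itself.
_FOLD = re.compile(r"\r?\n(?=[ \t])")


def unfold(header_value: str) -> str:
    """Remove the line break of each fold, keeping the whitespace after it."""
    return _FOLD.sub("", header_value)


folded = 'name="a very long file name which has\r\n many words.txt"'
unfolded = unfold(folded)
print(unfolded)
```

Because only the line break is removed, the space or tab that started the continuation line survives as the word separator.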

# Actual output

attachments = ["an attachment with a very very very long super long file name which\n has many words and just keeps on going and going.txt"]

Note that I have seen other examples where the output includes \r\n, not just \n.

# Impact

I'm trying to write an email bot which saves attachments to a database, and also forwards on the emails.
My bot thinks that the filename includes a line break. This inevitably causes failures in my subsequent code.

# Relevant links:

The email RFC is here: https://tools.ietf.org/html/rfc2822.html#section-2.2.3

This Stack Overflow answer seems relevant: https://stackoverflow.com/questions/3050298/parsing-email-with-python/3050374#3050374

Issue 3601 may be relevant, but doesn't seem exactly the same. It seems to be the reverse, *constructing* emails with long headers. My issue is *parsing* emails with long headers.
msg366965 - Author: Matthew Davis (matt-davis) Date: 2020-04-22 01:20
Ah woops, I mistyped the relevant ticket.

It's issue 36401

https://bugs.python.org/issue36041
msg366966 - Author: Matthew Davis (matt-davis) Date: 2020-04-22 01:20
36041
msg367117 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2020-04-23 14:33
Yeah, that looks like a bug in the old API.  If you try the new API, it does the right thing.  To do that, import email.policy and make your message_from_string call:

  email.message_from_string(raw, policy=email.policy.default)

Note, however, that you really ought to be using message_from_bytes.  Serialized email messages are bytes, not unicode, and using message_from_string will get you into other trouble.

I don't know if it is worth fixing the old API.
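Putting both suggestions together, a minimal sketch of the workaround might look like this. The raw message below is a cut-down, single-part stand-in for the attachment part quoted in the report (an assumption for illustration, not the original mwe.py), parsed from bytes with the new API's policy.

```python
import email
import email.policy

# Single-part stand-in for the folded Content-Type header from the report.
raw = (
    b"Content-Type: text/plain;\r\n"
    b'\tname="an attachment with a very very very long super long file name which has\r\n'
    b' many words and just keeps on going and going.txt"\r\n'
    b"\r\n"
    b"body\r\n"
)

# Passing policy=email.policy.default selects the new API, which unfolds
# the header before the parameters are read.
msg = email.message_from_bytes(raw, policy=email.policy.default)
filename = msg.get_filename()
print(repr(filename))
```

In a real multipart message the same call would be made on each part, for example while iterating over msg.walk().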
msg367493 - Author: Matthew Davis (matt-davis) Date: 2020-04-28 04:26
Ah, yes that workaround works. Thanks!

So what's the exact status of this policy? It's called the default policy, but it's not used by default?

If I download the latest version of python, will this be parsed correctly without explicitly setting the policy?

i.e. Is this still something that should be changed in the code?

(Yes, I already use message_from_bytes in my real application. I just used message_from_string in the MWE, because I could only attach one file in this web page, so I embedded the email body as a string in the script.)
msg367523 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2020-04-28 12:27
As far as I know you currently still have to specify the policy.  It was, yes, intended that 'default' become the actual default.  I could have sworn there was an open issue for doing this, but I can't find it.  I remember having a conversation with someone who said they were going to work on getting it done, but unfortunately I don't remember who :(

I'm not very active in the python community currently so I can't really drive it, but it should definitely happen.
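To check which policy a given Python installation applies when none is specified, a short sketch like the following may help. It assumes the behaviour described in this thread still holds (compat32 is used unless a policy is passed explicitly); the sample message is made up for illustration.

```python
import email
import email.message
import email.policy

raw = b"Subject: test\r\n\r\nbody\r\n"

# No policy argument: the parser falls back to the legacy compat32 policy.
legacy = email.message_from_bytes(raw)

# Explicit policy: the new API, which returns an EmailMessage object.
modern = email.message_from_bytes(raw, policy=email.policy.default)

print(type(legacy).__name__, legacy.policy is email.policy.compat32)
print(type(modern).__name__, modern.policy is email.policy.default)
```

If the first line ever reports an EmailMessage, the default has changed and the explicit policy argument is no longer needed for this case.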
History
Date                 User            Action  Args
2022-04-11 14:59:29  admin           set     github: 84539
2020-04-28 12:27:30  r.david.murray  set     messages: + msg367523
2020-04-28 04:26:56  matt-davis      set     messages: + msg367493
2020-04-23 14:33:02  r.david.murray  set     messages: + msg367117
                                             title: email.parse part.get_filename() fails to unwrap long attachment file names -> email.parse part.get_filename() fails to unwrap long attachment file names (legacy API)
2020-04-22 01:20:37  matt-davis      set     messages: + msg366966
2020-04-22 01:20:05  matt-davis      set     messages: + msg366965
2020-04-22 00:26:27  matt-davis      create