Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong attachement filename when mail mime header was too long #83221

Closed
manfred-kaiser mannequin opened this issue Dec 13, 2019 · 27 comments
Closed

Wrong attachement filename when mail mime header was too long #83221

manfred-kaiser mannequin opened this issue Dec 13, 2019 · 27 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-email type-bug An unexpected behavior, bug, or error

Comments

@manfred-kaiser
Copy link
Mannequin

manfred-kaiser mannequin commented Dec 13, 2019

BPO 39040
Nosy @warsaw, @bitdancer, @maxking, @miss-islington, @manfred-kaiser
PRs
  • bpo-39040: added whitespaced to linesep_splitter in email.policy #17590
  • bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. #17620
  • [3.9] bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620) #20504
  • [3.8] bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620) #20505
  • [3.7] bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620) #20506
  • Files
  • testscript.py: Script to read filenames
  • error.eml: Mail with broken filename
  • original_mail_from_gmail.eml: downloaded mail from gmail (web interface). I only edited the the mail addresses for privacy reasons. Same problem with this mail
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2019-12-13.16:59:51.619>
    labels = ['3.8', 'type-bug', '3.7', 'expert-email', '3.9']
    title = 'Wrong attachement filename when mail mime header was too long'
    updated_at = <Date 2020-05-29.11:43:51.599>
    user = 'https://github.com/manfred-kaiser'

    bugs.python.org fields:

    activity = <Date 2020-05-29.11:43:51.599>
    actor = 'miss-islington'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['email']
    creation = <Date 2019-12-13.16:59:51.619>
    creator = 'mkaiser'
    dependencies = []
    files = ['48775', '48776', '48777']
    hgrepos = []
    issue_num = 39040
    keywords = ['patch']
    message_count = 23.0
    messages = ['358343', '358345', '358350', '358351', '358352', '358378', '358384', '358393', '358395', '358396', '358458', '358460', '358463', '358500', '358530', '358573', '358608', '358853', '358857', '370276', '370298', '370299', '370300']
    nosy_count = 5.0
    nosy_names = ['barry', 'r.david.murray', 'maxking', 'miss-islington', 'mkaiser']
    pr_nums = ['17590', '17620', '20504', '20505', '20506']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue39040'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @manfred-kaiser
    Copy link
    Mannequin Author

    manfred-kaiser mannequin commented Dec 13, 2019

    I'm working on a mailfilter in python and used the method "get_filename" of the "EmailMessage" class.

    In some cases a wrong filename was returned. The reason was, that the Content-Disposition Header had a line break and the following intention was interpreted as part of the filename.

    After fixing this bug, I was able to get the right filename.

    I had to change "linesep_splitter" in "email.policy" to match the intention.

    Old Value:

    linesep_splitter = re.compile(r'\n|\r')

    New Value:

    linesep_splitter = re.compile(r'\n\s+|\r\s+')

    @manfred-kaiser manfred-kaiser mannequin added 3.7 (EOL) end of life 3.8 only security fixes topic-email type-bug An unexpected behavior, bug, or error labels Dec 13, 2019
    @manfred-kaiser manfred-kaiser mannequin changed the title Wrong filename in when mime header was too long Wrong filename in mail when mime header was too long Dec 13, 2019
    @manfred-kaiser manfred-kaiser mannequin changed the title Wrong filename in when mime header was too long Wrong filename in mail when mime header was too long Dec 13, 2019
    @manfred-kaiser manfred-kaiser mannequin changed the title Wrong filename in mail when mime header was too long Wrong attachement filename when mail mime header was too long Dec 13, 2019
    @manfred-kaiser manfred-kaiser mannequin changed the title Wrong filename in mail when mime header was too long Wrong attachement filename when mail mime header was too long Dec 13, 2019
    @bitdancer
    Copy link
    Member

    Thanks for the report. Can you provide an example that reproduces the problem?

    Per the RFC, lines may be broken before whitespace in certain places in certain headers, but that does not make the whitespace go away. Only the crlf sequence is removed when unfolding the header, per the RFC, so your proposed fix is incorrect. I suspect your example header is invalid, and the question will then become is there some sort of Postel-style error recovery we can and want to do in the function that parses the content-disposition header.

    @manfred-kaiser
    Copy link
    Mannequin Author

    manfred-kaiser mannequin commented Dec 13, 2019

    The original filename is "Schulbesuchsbestättigung.pdf", but when I use the method "get_filename" I got "Schulbesuchsbestättigung. pdf"

    I removed some headers from the mail for privacy reasons

    @manfred-kaiser
    Copy link
    Mannequin Author

    manfred-kaiser mannequin commented Dec 13, 2019

    The mail was sent from the GMail web interface

    @bitdancer
    Copy link
    Member

    That header is *completely* non-RFC compliant. If gmail generated that header there is something very wrong in google-land :(

    The RFC compliant formatting for that header looks like this:

    Content-Disposition: attachment;
    filename*=utf-8''Schulbesuchsbest%C3%A4ttigung.pdf

    You will note that this is nothing like encoded word format. Encoded words are not valid inside quoted strings, and quoted strings can't be used in mime header attributes if there are non-ascii characters involved. Nor can encoded words.

    Now, all that said, there is an obvious rule that can be followed to understand what that header is trying to convey, and the current parser already implements most of it (you will find comments about it in the parser, as well as defects being registered). So, a patch to _header_value_parser to fix the error recovery will be accepted. I've looked at the code to remind myself, but not deeply enough to be *sure* where the changes need to be made. There are two possibilities I see off the bat (and both may need fixing): get_bare_quoted_string and get_parameter. Either one or both of those may be forgetting that whitespace between encoded words should be dropped.

    @manfred-kaiser
    Copy link
    Mannequin Author

    manfred-kaiser mannequin commented Dec 14, 2019

    thanks for your response. I have found the RFC https://tools.ietf.org/html/rfc2184

    Gmail creates wrong Headers, which are not rfc-compliant.
    The problem is, that many people are using gmail and emails, which were sent from Gmail might be wrong.

    How can we solve this problem? It is not a Python problem. We can create workarrounds. But in my opinion Google has to fix the bug.

    @manfred-kaiser
    Copy link
    Mannequin Author

    manfred-kaiser mannequin commented Dec 14, 2019

    RFC-2184 was obsoleted by RFC-2231 (https://www.rfc-editor.org/rfc/rfc2231.html)

    There are also no quoted strings, like google uses.

    @manfred-kaiser
    Copy link
    Mannequin Author

    manfred-kaiser mannequin commented Dec 14, 2019

    as you mentioned, rfc-2047 forbidds encoded words in quoted strings.

    Source: https://tools.ietf.org/html/rfc2047 - Chapter 5/3

    I have tested a few web mail clients and they have the same issue. According to the RFCs, this is not allowed, but I think it is widely used.

    Should we fix this problem?

    @bitdancer
    Copy link
    Member

    Yes, google should fix their bug. However, the python email package tries very hard to interpret even RFC-non-compliant emails when there is a way to do so. As I said, the package already tries to interpret headers such as google is generating, it's just that there is a bug in that interpretation: it is keeping the blank between then encoded words when it should not be. That bug can be fixed, in get_raw_encoded_word and/or get_parameter, in email._header_value_parser.

    @bitdancer
    Copy link
    Member

    And you are right that this is a very common bug in email programs. So common that I suspect the RFC folks will eventually have to accept it as a de-facto standard. So we do need to support it in the python email library.

    @maxking
    Copy link
    Contributor

    maxking commented Dec 15, 2019

    I tried to take a look at the code to see where the fix needs to be and I probably need some help.

    I looked at the parse tree for the header and it looks something like this:

    ContentDisposition([Token([ValueTerminal('attachment')]), ValueTerminal(';'), MimeParameters([Parameter([Attribute([CFWSList([WhiteSpaceTerminal(' ')]), ValueTerminal('filename')]), ValueTerminal('='), Value([QuotedString([BareQuotedString([EncodedWord([ValueTerminal('Schulbesuchsbestättigung.')]), WhiteSpaceTerminal(' '), EncodedWord([ValueTerminal('pdf')])])])])])])])

    The offending piece of code, which seems to be working as designed is get_bare_quoted_string() in email/_header_value_parser.py.

        while value and value[0] != '"':
            if value[0] in WSP:
                token, value = get_fws(value)
            elif value[:2] == '=?':
                try:
                    token, value = get_encoded_word(value)
                    bare_quoted_string.defects.append(errors.InvalidHeaderDefect(
                        "encoded word inside quoted string"))
                except errors.HeaderParseError:
                    token, value = get_qcontent(value)
            else:
                token, value = get_qcontent(value)
            bare_quoted_string.append(token)

    It just loops and parses the values. We cannot ignore the FWS until we know that the atom before and after the FWS are encoded words. I can't seem to find a clean way to look-ahead (which can perhaps be used in get_parameters()) or look-back (which can be used after parsing the entire bare_quoted_string?) in the parse tree to delete the offending whitespace.

    Any example of such kind of parse-tree manipulation in the code base would be awesome!

    @maxking maxking added 3.9 only security fixes labels Dec 15, 2019
    @bitdancer
    Copy link
    Member

    The example you want to look at is get_unstructured. That shows both lookback and modification of the parse tree to handle the whitespace between encoded words.

    @maxking
    Copy link
    Contributor

    maxking commented Dec 16, 2019

    Thanks for the pointer, David! I created a PR for the fix, would you be able to review it please?

    @bitdancer
    Copy link
    Member

    In general your solution looks good, just a few naming comments and an additional test request.

    @maxking
    Copy link
    Contributor

    maxking commented Dec 17, 2019

    Thanks David! I applied the fixes as per your comments, can you please take another look?

    @bitdancer
    Copy link
    Member

    One more tweak to the test and we'll be good to go.

    @maxking
    Copy link
    Contributor

    maxking commented Dec 18, 2019

    Sure, fixed as per your comments in the PR.

    @bitdancer
    Copy link
    Member

    I don't see the change to the test in the PR. Did you miss a push or is github doing something wonky with the review? (I haven't used github review in a while and I had forgetten how hard it is to use...)

    @maxking
    Copy link
    Contributor

    maxking commented Dec 24, 2019

    I double checked, there should be 4 commits in the PR and last 2 have the changes that you asked for in the test case and NEWS entry.

    Your previous comment will point at the old diff, you might have to look at the full diff here: https://github.com/python/cpython/pull/17620/files or if you want, this is the diff for the 2 commits with the changes you requested: https://github.com/python/cpython/pull/17620/files/bf2cb76009d72869d9df6550b473b5818ceab311..016ceb3ef00b3b940993d35d539ce63d68437d4f

    @bitdancer
    Copy link
    Member

    New changeset 21017ed by Abhilash Raj in branch 'master':
    bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620)
    21017ed

    @miss-islington
    Copy link
    Contributor

    New changeset a6ae02d by Miss Islington (bot) in branch '3.9':
    bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620)
    a6ae02d

    @miss-islington
    Copy link
    Contributor

    New changeset 6381ee0 by Miss Islington (bot) in branch '3.8':
    bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620)
    6381ee0

    @miss-islington
    Copy link
    Contributor

    New changeset 5f977e0 by Miss Islington (bot) in branch '3.7':
    bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (gh-17620)
    5f977e0

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @jikamens
    Copy link

    Strange that this hasn't been fixed when a fix was submitted literally years ago.

    @bitdancer
    Copy link
    Member

    It looks like it was and the issue just wasn't closed.

    @jikamens
    Copy link

    jikamens commented Jun 20, 2022

    @bitdancer It does not appear to have been fixed. I just ran into it with python 3.10.4:

    $ cat /tmp/testbug.py 
    #!/usr/bin/env python3
    
    import email
    
    message = email.message_from_string("""\
    MIME-Version: 1.0
    Content-Type: text/plain; name="This File Name Is
     Folded.pdf"
    Content-Disposition: attachment; filename="This File Name Is
     Folded.pdf"
    Content-Transfer-Encoding: 8bit
    
    Foo
    """)
    
    print(message.get_filename())
    $ python3 /tmp/testbug.py
    This File Name Is
     Folded.pdf
    $ python3 -V
    3.10.4
    

    @bitdancer
    Copy link
    Member

    That's a different bug, and it is in the legacy API. The new API handles it correctly:

    rdmurray@pydev:~/cpython/master[master]>cat temp.py
    #!/usr/bin/env python3
    
    import email, email.policy
    
    message = email.message_from_string("""\
    MIME-Version: 1.0
    Content-Type: text/plain; name="This File Name Is
     Folded.pdf"
    Content-Disposition: attachment; filename="This File Name Is
     Folded.pdf"
    Content-Transfer-Encoding: 8bit
    
    Foo
    """, policy=email.policy.default)
    
    print(message.get_filename())
    rdmurray@pydev:~/cpython/master[master]>./python temp.py
    This File Name Is Folded.pdf
    rdmurray@pydev:~/cpython/master[master]>./python -V
    Python 3.12.0a0
    rdmurray@pydev:~/cpython/master[master]>
    

    I don't think this is worth trying to fix in the legacy api; the reason it happens there is probably a result of some deeply embedded assumptions in that code and may not be easy to fix. (What would be worth fixing is making the new API truly be the default policy, but I don't currently have time to shepherd that process).

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-email type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants