classification
Title: email: get_boundary (and get_filename) invokes unquote twice
Type: behavior Stage:
Components: email Versions: Python 3.4, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Eric Lafontaine, barry, bpoaugust, r.david.murray
Priority: normal Keywords: patch

Created on 2016-12-12 12:54 by bpoaugust, last changed 2017-01-03 12:21 by vstinner.

Files
File name Uploaded Description Edit
issue_28945.patch Eric Lafontaine, 2016-12-22 15:38 utils.collapse_rfc2231_value doesn't unquote on this one review
Messages (15)
msg282991 - (view) Author: (bpoaugust) Date: 2016-12-12 12:54
get_boundary calls get_param('boundary') which unquotes the value.
It then calls utils.collapse_rfc2231_value which also calls unquote.

This causes problems for boundaries that have two sets of quotes.
For example, I have seen the following in the wild:

Content-Type: multipart/mixed; boundary="<<001-3e1dcd5a-119e>>"

Both "" and <> are treated as quotes by unquote.

The result is that both "< and >" are stripped off.
This means that the boundaries no longer match.
msg283651 - (view) Author: Eric Lafontaine (Eric Lafontaine) * Date: 2016-12-19 21:38
Hi all,

I believe this is the right behavior and what ever generated the boundary "<<>>" is the problem ; 


RFC  2046 page 22:
_____________________
The only mandatory global parameter for the "multipart" media type is the boundary parameter, which consists of 1 to 70 characters from a set of characters known to be very robust through mail gateways, and NOT ending with white space. (If a boundary delimiter line appears to end with white space, the white space must be presumed to have been added by a gateway, and must be deleted.)  It is formally specified by the following BNF:

     boundary := 0*69<bchars> bcharsnospace

     bchars := bcharsnospace / " "

     bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
                      "+" / "_" / "," / "-" / "." /
                      "/" / ":" / "=" / "?"
_____________________
In other words, the only valid boundaries characters are :
01234567890 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'()+_,-./:=?

Any other character should be removed to get the boundary right.  I believe the issue is that it wasn't removed in the first place.  It is a bug in my opinion, but the other way around :).

Funny thing is that the unquote function only remove the first&last character it sees... either '<' and the '"'...

def unquote(str):
    """Remove quotes from a string."""
    if len(str) > 1:
        if str.startswith('"') and str.endswith('"'):
            return str[1:-1].replace('\\\\', '\\').replace('\\"', '"')
        if str.startswith('<') and str.endswith('>'):
            return str[1:-1]
    return str

Now, if I modify unquote to only keep the list of character above, would I break something? Probably. 
I haven't found any other defining RFC about boundaries that tells me what was the format supported.  Can someone help me on that?

This is what the function should look like :

import string
def get_boundary(str):
    """ return the valid boundary parameter as per RFC 2046 page 22. """
    if len(str) > 1:
        import re
        return re.sub('[^'+
                      string.ascii_letters +
                      string.digits +
                      r""" '()+_,-./:=?]|="""
                      ,'',str
                      ).rstrip(' ')
    return str

import unittest

class boundary_tester(unittest.TestCase):
    def test_get_boundary(self):
        boundary1 = """ abc def gh< 123 >!@ %!%' """
        ref_boundary1 = """ abc def gh 123  '""" # this is the valid Boundary
        ret_value = get_boundary(boundary1)
        self.assertEqual(ret_value,ref_boundary1)

    def test_get_boundary2(self):
        boundary1 = ''.join((' ',string.printable))
        ref_boundary1 = ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'()+,-./:?_' # this is the valid Boundary
        ret_value = get_boundary(boundary1)
        self.assertEqual(ret_value,ref_boundary1)


I believe this should be added to the email.message.Message get_boundary function.  

Regards,
Eric Lafontaine
msg283656 - (view) Author: (bpoaugust) Date: 2016-12-20 00:20
I agree that strictly speaking the boundary is invalid.
However:
'Be strict in what you generate, be liberal in what you accept'
The mail package should never create such boundaries.
However it should process them if possible.

If the boundary definition is mangled by stripping out all the invalid characters, then it won't match the markers. So it won't solve the issue.

Whereas ensuring that only a single level of quotes is removed does fix the issue.

This is what works for me:

def get_boundary(self, failobj=None):
    missing = object()
    # get_param removes the outer quotes
    boundary = self.get_param('boundary', missing)
    if boundary is missing:
        return failobj
    # Don't unquote again in collapse_rfc2231_value
    if not isinstance(boundary, tuple) or len(boundary) != 3:
        return boundary
    # RFC 2046 says that boundaries may begin but not end in w/s
    return utils.collapse_rfc2231_value(boundary).rstrip()

I think the bug is actually in collapse_rfc2231_value - that should not do any unquoting, as the value will be passed to it already unquoted (at least in this case). However there might perhaps be some cases where collapse_rfc2231_value is called with a quoted value, so to fix the immediate problem it's safer to fix get_boundary. (I could have re-quoted the value instead, and let collapse_rfc2231_value do its thing.)

unquote is correct as it stands - it should only remove the outer quotes. There may be a need to quote strings that just happen to be enclosed in quote chars.
msg283678 - (view) Author: (bpoaugust) Date: 2016-12-20 10:38
It looks like a simpler alternative is to just change

boundary = self.get_param('boundary', missing)
to
boundary = self.get_param('boundary', missing, unquote=False)

and let collapse_rfc2231_value do the unquoting
msg283680 - (view) Author: (bpoaugust) Date: 2016-12-20 11:07
According to RFC822, a quoted-string should only be wrapped in double-quotes. 

So I'm not sure why unquote treats <> as quotes. If it did not, then again this issue would not arise.

However maybe utils.unquote is needed by other code that uses <>, so it cannot just be updated without further analysis.
msg283693 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-12-20 14:47
I'm pretty sure you are correct in your guess that the utility is used by other code that depends on the <> unquoting.

The new email policies (that is, the new header parsing code) probably do a better job of this, but get_param and friends haven't been wired up to use them when available yet.
msg283702 - (view) Author: Eric Lafontaine (Eric Lafontaine) * Date: 2016-12-20 16:21
Hi all,

I hate this proposition.  I feel it's a "victory" for the people who don't want to follow RFC standard and allow "triple"-quoting on things that aren't supposed to...

Now that my opinion is said, I made 2 test case that should be added to the test_email file of the test library to support the change :

_________________________________________________________
    def test_rfc2231_multiple_quote_boundary(self):
        m = '''\
Content-Type: multipart/alternative;
\tboundary*0*="<<This%20is%20even%20more%20";
\tboundary*1*="%2A%2A%2Afun%2A%2A%2A%20";
\tboundary*2="is it not.pdf>>"

'''
        msg = email.message_from_string(m)
        self.assertEqual(msg.get_boundary(),
                         '<<This is even more ***fun*** is it not.pdf>>')
    def test_rfc2231_multiple_quote_boundary2(self):
        m = '''\
Content-Type: multipart/alternative;
\tboundary="<<This is even more >>";

'''
        msg = email.message_from_string(m)
        self.assertEqual(msg.get_boundary(),
                         '<<This is even more >>')
_______________________________________________________

The problem however does lie within collapse function as you've mentionned (because you've taken the code from there haven't you? :P) :
def collapse_rfc2231_value(value, errors='replace',
                           fallback_charset='us-ascii'):
    if not isinstance(value, tuple) or len(value) != 3:
        return value                                     <=======removed unquote on value
        
 This doesn't seem to have broken any of the 1580 tests case of test_email (Python3.7), but I can't help but to feel weird about it.  Could someone explain why we were unquoting there? and why use those weird condition (not isintance(value,tuple) or len(value) !=3)....?
 
 Regards,
 Eric Lafontaine
msg283703 - (view) Author: Eric Lafontaine (Eric Lafontaine) * Date: 2016-12-20 16:28
N.B., I've tried to do the keyword parameter on the get_param inside the get_boundary as well as I though it was a good Idea as well, but it was breaking some test cases (multiple ones).  So I didn't pursue.

N.B.  I'm guessing (guessing) unquote must've been used for email adresses as well i.e. "Eric Lafontaine <eric.lafontaine1@gmail.com>"

I also didn't had the time to read all answer since yesterday, sorry for not adding it to my earlier answer.
msg283704 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-12-20 16:32
The RFC compliance battle in email was lost long ago.  We have to do our best to process what we get, and to produce only RFC correct output (unless retransmitting without change).  That is Postel's law, and you can't successfully process Internet email without following the first part.

I haven't looked at the proposed solution yet.  Not sure when I'll get time, since I should try to think and research the questions (why was it done the way it is currently done?), which will take quite a bit of time.
msg283850 - (view) Author: Eric Lafontaine (Eric Lafontaine) * Date: 2016-12-22 15:38
Hi,

I would like to have a bigger set of user testing this patch before it gets approve... I really don't know all the extent to which it's a good/bad idea.  

The proposition was to not do the unquote in the rfc2231 collapse method.  It doesn't break anything on my machine on all the email tests... but I can't feel safe.

I also added the 2 test case for future support (multiline boundary and single line boundary).  It's basically the code I pasted on my answer on the 2016-12-20.

Let me know what you think.  Is it there we should stop doing it?

Regards,
Eric Lafontaine
msg283851 - (view) Author: Eric Lafontaine (Eric Lafontaine) * Date: 2016-12-22 15:41
Hi,

I would like to thank you for keeping up with me.  I may not be easy at times, but please make me understand if it doesn't make sense :).

Thanks bpoaugust and David,
Eric Lafontaine
msg283852 - (view) Author: (bpoaugust) Date: 2016-12-22 17:16
I have just discovered the same problem with get_filename.
Not surprising as its code is basically the same as get_boundary.

Unix paths can contain anything, so it's not correct to remove special characters. [It's up to the receiving file system to decide how to deal with chars that are not valid for it; the original name must be passed unchanged]

If the quoting/unquoting is fixed for filenames, then it should be OK for the boundary as well.

I think collapse_rfc2231_value should assume that any unquoting has already been done, and should therefore not call utils.unquote at all.

The get_param() method by default unquotes both single strings and encoded triplets, so it's certainly the case that get_boundary and get_filename will pass an unquoted value to rfc2231, as will any other caller that calls get_param with the default parameters.
msg283853 - (view) Author: (bpoaugust) Date: 2016-12-22 17:25
Note: it's easy to create test e-mails with attachments using mutt.

echo test | mutt -s "test" -a files... -- user@domain

I did some testing with the following names:

<>
><
""
"<abc>"
<"abc">
>abc<
">abc<"
>"abc"<

There are no doubt other awkward names one can devise; in particular it would be useful to generate similar names with non-ASCII characters in them to force the use of value triples.
msg284294 - (view) Author: (bpoaugust) Date: 2016-12-29 21:00
This is actually a bug in collapse_rfc2231_value: issue29020
msg284502 - (view) Author: (bpoaugust) Date: 2017-01-02 20:57
The patch is incomplete.

There is another unquote call:

336 return unquote(text)
History
Date User Action Args
2017-01-03 12:21:15vstinnersettitle: get_boundary (and get_filename) invokes unquote twice -> email: get_boundary (and get_filename) invokes unquote twice
2017-01-02 20:57:20bpoaugustsetmessages: + msg284502
2016-12-29 21:00:53bpoaugustsetmessages: + msg284294
2016-12-22 17:25:02bpoaugustsetmessages: + msg283853
2016-12-22 17:16:31bpoaugustsetmessages: + msg283852
title: get_boundary invokes unquote twice -> get_boundary (and get_filename) invokes unquote twice
2016-12-22 15:41:20Eric Lafontainesetmessages: + msg283851
2016-12-22 15:38:41Eric Lafontainesetfiles: + issue_28945.patch
keywords: + patch
messages: + msg283850
2016-12-20 16:32:57r.david.murraysetmessages: + msg283704
2016-12-20 16:28:44Eric Lafontainesetmessages: + msg283703
2016-12-20 16:21:22Eric Lafontainesetmessages: + msg283702
2016-12-20 14:47:39r.david.murraysetmessages: + msg283693
2016-12-20 11:07:43bpoaugustsetmessages: + msg283680
2016-12-20 10:38:51bpoaugustsetmessages: + msg283678
2016-12-20 00:20:28bpoaugustsetmessages: + msg283656
2016-12-19 21:38:33Eric Lafontainesetnosy: + Eric Lafontaine
messages: + msg283651
2016-12-12 12:54:57bpoaugustcreate