Issue 28945: email: get_boundary (and get_filename) invokes unquote twice

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73131

classification

Title:	email: get_boundary (and get_filename) invokes unquote twice
Type:	behavior	Stage:
Components:	email	Versions:	Python 3.4, Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Eric Lafontaine, barry, bpoaugust, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2016-12-12 12:54 by bpoaugust, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
issue_28945.patch	Eric Lafontaine, 2016-12-22 15:38	utils.collapse_rfc2231_value doesn't unquote on this one	review

Messages (15)
msg282991 - (view)	Author: (bpoaugust)	Date: 2016-12-12 12:54
get_boundary calls get_param('boundary') which unquotes the value. It then calls utils.collapse_rfc2231_value which also calls unquote. This causes problems for boundaries that have two sets of quotes. For example, I have seen the following in the wild: Content-Type: multipart/mixed; boundary="<<001-3e1dcd5a-119e>>" Both "" and <> are treated as quotes by unquote. The result is that both "< and >" are stripped off. This means that the boundaries no longer match.
msg283651 - (view)	Author: Eric Lafontaine (Eric Lafontaine) *	Date: 2016-12-19 21:38
Hi all, I believe this is the right behavior and what ever generated the boundary "<<>>" is the problem ; RFC 2046 page 22: _____________________ The only mandatory global parameter for the "multipart" media type is the boundary parameter, which consists of 1 to 70 characters from a set of characters known to be very robust through mail gateways, and NOT ending with white space. (If a boundary delimiter line appears to end with white space, the white space must be presumed to have been added by a gateway, and must be deleted.) It is formally specified by the following BNF: boundary := 0*69<bchars> bcharsnospace bchars := bcharsnospace / " " bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" / "+" / "_" / "," / "-" / "." / "/" / ":" / "=" / "?" _____________________ In other words, the only valid boundaries characters are : 01234567890 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'()+_,-./:=? Any other character should be removed to get the boundary right. I believe the issue is that it wasn't removed in the first place. It is a bug in my opinion, but the other way around :). Funny thing is that the unquote function only remove the first&last character it sees... either '<' and the '"'... def unquote(str): """Remove quotes from a string.""" if len(str) > 1: if str.startswith('"') and str.endswith('"'): return str[1:-1].replace('\\\\', '\\').replace('\\"', '"') if str.startswith('<') and str.endswith('>'): return str[1:-1] return str Now, if I modify unquote to only keep the list of character above, would I break something? Probably. I haven't found any other defining RFC about boundaries that tells me what was the format supported. Can someone help me on that? This is what the function should look like : import string def get_boundary(str): """ return the valid boundary parameter as per RFC 2046 page 22. """ if len(str) > 1: import re return re.sub('[^'+ string.ascii_letters + string.digits + r""" '()+_,-./:=?]\|=""" ,'',str ).rstrip(' ') return str import unittest class boundary_tester(unittest.TestCase): def test_get_boundary(self): boundary1 = """ abc def gh< 123 >!@ %!%' """ ref_boundary1 = """ abc def gh 123 '""" # this is the valid Boundary ret_value = get_boundary(boundary1) self.assertEqual(ret_value,ref_boundary1) def test_get_boundary2(self): boundary1 = ''.join((' ',string.printable)) ref_boundary1 = ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'()+,-./:?_' # this is the valid Boundary ret_value = get_boundary(boundary1) self.assertEqual(ret_value,ref_boundary1) I believe this should be added to the email.message.Message get_boundary function. Regards, Eric Lafontaine
msg283656 - (view)	Author: (bpoaugust)	Date: 2016-12-20 00:20
I agree that strictly speaking the boundary is invalid. However: 'Be strict in what you generate, be liberal in what you accept' The mail package should never create such boundaries. However it should process them if possible. If the boundary definition is mangled by stripping out all the invalid characters, then it won't match the markers. So it won't solve the issue. Whereas ensuring that only a single level of quotes is removed does fix the issue. This is what works for me: def get_boundary(self, failobj=None): missing = object() # get_param removes the outer quotes boundary = self.get_param('boundary', missing) if boundary is missing: return failobj # Don't unquote again in collapse_rfc2231_value if not isinstance(boundary, tuple) or len(boundary) != 3: return boundary # RFC 2046 says that boundaries may begin but not end in w/s return utils.collapse_rfc2231_value(boundary).rstrip() I think the bug is actually in collapse_rfc2231_value - that should not do any unquoting, as the value will be passed to it already unquoted (at least in this case). However there might perhaps be some cases where collapse_rfc2231_value is called with a quoted value, so to fix the immediate problem it's safer to fix get_boundary. (I could have re-quoted the value instead, and let collapse_rfc2231_value do its thing.) unquote is correct as it stands - it should only remove the outer quotes. There may be a need to quote strings that just happen to be enclosed in quote chars.
msg283678 - (view)	Author: (bpoaugust)	Date: 2016-12-20 10:38
It looks like a simpler alternative is to just change boundary = self.get_param('boundary', missing) to boundary = self.get_param('boundary', missing, unquote=False) and let collapse_rfc2231_value do the unquoting
msg283680 - (view)	Author: (bpoaugust)	Date: 2016-12-20 11:07
According to RFC822, a quoted-string should only be wrapped in double-quotes. So I'm not sure why unquote treats <> as quotes. If it did not, then again this issue would not arise. However maybe utils.unquote is needed by other code that uses <>, so it cannot just be updated without further analysis.
msg283693 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-12-20 14:47
I'm pretty sure you are correct in your guess that the utility is used by other code that depends on the <> unquoting. The new email policies (that is, the new header parsing code) probably do a better job of this, but get_param and friends haven't been wired up to use them when available yet.
msg283702 - (view)	Author: Eric Lafontaine (Eric Lafontaine) *	Date: 2016-12-20 16:21
Hi all, I hate this proposition. I feel it's a "victory" for the people who don't want to follow RFC standard and allow "triple"-quoting on things that aren't supposed to... Now that my opinion is said, I made 2 test case that should be added to the test_email file of the test library to support the change : _________________________________________________________ def test_rfc2231_multiple_quote_boundary(self): m = '''\ Content-Type: multipart/alternative; \tboundary0="<<This%20is%20even%20more%20"; \tboundary1="%2A%2A%2Afun%2A%2A%2A%20"; \tboundary2="is it not.pdf>>" ''' msg = email.message_from_string(m) self.assertEqual(msg.get_boundary(), '<<This is even more fun* is it not.pdf>>') def test_rfc2231_multiple_quote_boundary2(self): m = '''\ Content-Type: multipart/alternative; \tboundary="<<This is even more >>"; ''' msg = email.message_from_string(m) self.assertEqual(msg.get_boundary(), '<<This is even more >>') _______________________________________________________ The problem however does lie within collapse function as you've mentionned (because you've taken the code from there haven't you? :P) : def collapse_rfc2231_value(value, errors='replace', fallback_charset='us-ascii'): if not isinstance(value, tuple) or len(value) != 3: return value <=======removed unquote on value This doesn't seem to have broken any of the 1580 tests case of test_email (Python3.7), but I can't help but to feel weird about it. Could someone explain why we were unquoting there? and why use those weird condition (not isintance(value,tuple) or len(value) !=3)....? Regards, Eric Lafontaine
msg283703 - (view)	Author: Eric Lafontaine (Eric Lafontaine) *	Date: 2016-12-20 16:28
N.B., I've tried to do the keyword parameter on the get_param inside the get_boundary as well as I though it was a good Idea as well, but it was breaking some test cases (multiple ones). So I didn't pursue. N.B. I'm guessing (guessing) unquote must've been used for email adresses as well i.e. "Eric Lafontaine <eric.lafontaine1@gmail.com>" I also didn't had the time to read all answer since yesterday, sorry for not adding it to my earlier answer.
msg283704 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-12-20 16:32
The RFC compliance battle in email was lost long ago. We have to do our best to process what we get, and to produce only RFC correct output (unless retransmitting without change). That is Postel's law, and you can't successfully process Internet email without following the first part. I haven't looked at the proposed solution yet. Not sure when I'll get time, since I should try to think and research the questions (why was it done the way it is currently done?), which will take quite a bit of time.
msg283850 - (view)	Author: Eric Lafontaine (Eric Lafontaine) *	Date: 2016-12-22 15:38
Hi, I would like to have a bigger set of user testing this patch before it gets approve... I really don't know all the extent to which it's a good/bad idea. The proposition was to not do the unquote in the rfc2231 collapse method. It doesn't break anything on my machine on all the email tests... but I can't feel safe. I also added the 2 test case for future support (multiline boundary and single line boundary). It's basically the code I pasted on my answer on the 2016-12-20. Let me know what you think. Is it there we should stop doing it? Regards, Eric Lafontaine
msg283851 - (view)	Author: Eric Lafontaine (Eric Lafontaine) *	Date: 2016-12-22 15:41
Hi, I would like to thank you for keeping up with me. I may not be easy at times, but please make me understand if it doesn't make sense :). Thanks bpoaugust and David, Eric Lafontaine
msg283852 - (view)	Author: (bpoaugust)	Date: 2016-12-22 17:16
I have just discovered the same problem with get_filename. Not surprising as its code is basically the same as get_boundary. Unix paths can contain anything, so it's not correct to remove special characters. [It's up to the receiving file system to decide how to deal with chars that are not valid for it; the original name must be passed unchanged] If the quoting/unquoting is fixed for filenames, then it should be OK for the boundary as well. I think collapse_rfc2231_value should assume that any unquoting has already been done, and should therefore not call utils.unquote at all. The get_param() method by default unquotes both single strings and encoded triplets, so it's certainly the case that get_boundary and get_filename will pass an unquoted value to rfc2231, as will any other caller that calls get_param with the default parameters.
msg283853 - (view)	Author: (bpoaugust)	Date: 2016-12-22 17:25
Note: it's easy to create test e-mails with attachments using mutt. echo test \| mutt -s "test" -a files... -- user@domain I did some testing with the following names: <> >< "" "<abc>" <"abc"> >abc< ">abc<" >"abc"< There are no doubt other awkward names one can devise; in particular it would be useful to generate similar names with non-ASCII characters in them to force the use of value triples.
msg284294 - (view)	Author: (bpoaugust)	Date: 2016-12-29 21:00
This is actually a bug in collapse_rfc2231_value: issue29020
msg284502 - (view)	Author: (bpoaugust)	Date: 2017-01-02 20:57
The patch is incomplete. There is another unquote call: 336 return unquote(text)

History
Date	User	Action	Args
2022-04-11 14:58:40	admin	set	github: 73131
2017-01-03 12:21:15	vstinner	set	title: get_boundary (and get_filename) invokes unquote twice -> email: get_boundary (and get_filename) invokes unquote twice
2017-01-02 20:57:20	bpoaugust	set	messages: + msg284502
2016-12-29 21:00:53	bpoaugust	set	messages: + msg284294
2016-12-22 17:25:02	bpoaugust	set	messages: + msg283853
2016-12-22 17:16:31	bpoaugust	set	messages: + msg283852 title: get_boundary invokes unquote twice -> get_boundary (and get_filename) invokes unquote twice
2016-12-22 15:41:20	Eric Lafontaine	set	messages: + msg283851
2016-12-22 15:38:41	Eric Lafontaine	set	files: + issue_28945.patch keywords: + patch messages: + msg283850
2016-12-20 16:32:57	r.david.murray	set	messages: + msg283704
2016-12-20 16:28:44	Eric Lafontaine	set	messages: + msg283703
2016-12-20 16:21:22	Eric Lafontaine	set	messages: + msg283702
2016-12-20 14:47:39	r.david.murray	set	messages: + msg283693
2016-12-20 11:07:43	bpoaugust	set	messages: + msg283680
2016-12-20 10:38:51	bpoaugust	set	messages: + msg283678
2016-12-20 00:20:28	bpoaugust	set	messages: + msg283656
2016-12-19 21:38:33	Eric Lafontaine	set	nosy: + Eric Lafontaine messages: + msg283651
2016-12-12 12:54:57	bpoaugust	create