classification
Title: cgi.FieldStorage should not call read_multi on files
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: cboos, jshields, mbordas, orsenthil, patrick.vrijlandt, r.david.murray, v+python
Priority: normal
Keywords:

Created on 2012-08-06 09:31 by patrick.vrijlandt, last changed 2017-03-13 17:58 by jshields.

Files
File name Uploaded Description
cgibug.py patrick.vrijlandt, 2012-08-11 13:03 Testscript (requires bottle) and exception traceback
test_cgi4.py patrick.vrijlandt, 2012-08-13 09:40
Messages (10)
msg167548 - Author: patrick vrijlandt (patrick.vrijlandt) Date: 2012-08-06 09:31
.mht is an archive format created by Microsoft IE 8 when saving a webpage. It is essentially a mime multipart message.

My problem occurred when I uploaded such a file to a cgi-based server. The posted data would be fed to cgi.FieldStorage. (I can't post the file unfortunately)

As it turns out, cgi.FieldStorage tries to recursively parse the postdata, thereby splitting up the uploaded file; this fails. However, this (automatic) recursive behaviour seems unwanted for an uploaded file.

My proposal is thus to adapt cgi.py (line number for Python 3.2), so that in FieldStorage.__init__, line 542, read_multi would not be invoked in this case.

Currently it says:

    elif ctype[:10] == 'multipart/':
        self.read_multi(environ, keep_blank_values, strict_parsing)

Change this to:

    elif ctype[:10] == 'multipart/' and not self.filename: 
        self.read_multi(environ, keep_blank_values, strict_parsing)

(I apologise for not submitting a test case. When I tried to create one, it turned out either very complicated or not easily recognizable as valid. Moreover, my server used third-party software (bottlepy.org: bottle.py).)
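To make the intent of the proposed condition concrete, here is a minimal standalone sketch. It does not import or patch cgi.py; `should_recurse` is a hypothetical helper name used only for illustration, and it simply mirrors the `elif` test quoted above.

```python
def should_recurse(ctype, filename):
    """Mirror the proposed test: recurse into multipart content
    only when the part is not an uploaded file (no filename)."""
    return ctype[:10] == 'multipart/' and not filename


# The outer form (no filename) would still be split into fields...
print(should_recurse('multipart/form-data', None))
# ...but an uploaded .mht file labelled multipart/related would be left alone.
print(should_recurse('multipart/related', 'test2.mht'))
```

With this rule, the form itself is still dissected into its fields, while any part that carries a filename is kept as a single opaque value regardless of its declared MIME type.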
msg167921 - Author: Glenn Linderman (v+python) * Date: 2012-08-10 22:33
So the issue you perceive is that a correctly MIME-typed .mht file has a MIME type of multipart/related -- but that for the purposes of uploading the file, you don't want to treat it as that MIME type, but rather as an opaque data file.

Just give it a different MIME type at the time of upload, like application/octet-stream. That is appropriate, if your application wants to treat the data as an opaque data stream.

But, you say, none of the browsers support user-specified or user-selectable MIME types, but rather they infer the MIME type from the file extension.  So that sounds like a bug in the browsers... but also gives an out... change the name of the file before uploading it.

The only bug I see here is your comment that the parsing fails.
msg167953 - Author: patrick vrijlandt (patrick.vrijlandt) Date: 2012-08-11 13:03
I would not know how to set the MIME-type of a file during upload. This is apparently set by the browser based on the filename (extension). Even (or: especially) if this is a bug in all the current browsers, python should provide the tools to adapt to this situation.

I could perhaps request the whole form to be "application/octet-stream", but the current "multipart/form-data" is appropriate for a form.

You are right about renaming. The innocent test file "test2.txt" can be uploaded, but the same file renamed to "test2.mht" causes an exception.

Below is a dump of the posted data (using Chrome in this case); attached a script (requiring bottle.py - www.bottlepy.org or PyPI) that demonstrates the problem.

There is no doubt that parsing fails; an exception cannot be the result of successful parsing. The input may be wrong, but python should offer the flexibility to handle wrong input.

Instead, are you sure it is appropriate to *automatically* dissect a file? It should be fairly easy to handle for the scripter if he really wants to dig deeper.

Headers

Origin: http://localhost:10080
Referer: http://localhost:10080/url-get
Content-Length: 349
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cache-Control: max-age=0
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.75 Safari/537.1
Host: localhost:10080
Accept-Encoding: gzip,deflate,sdch
Accept-Language: nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4,en-GB;q=0.2
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryBsBVBYDTxou89uBj

Body

------WebKitFormBoundaryBsBVBYDTxou89uBj
Content-Disposition: form-data; name="data"; filename="test2.mht"
Content-Type: multipart/related

# dit is een test
Dit is een regel
Dit is het einde.
#


------WebKitFormBoundaryBsBVBYDTxou89uBj
Content-Disposition: form-data; name="value"

abc123
------WebKitFormBoundaryBsBVBYDTxou89uBj--
msg168007 - Author: Glenn Linderman (v+python) * Date: 2012-08-11 20:34
I didn't call the current behaviour of browsers in assigning MIME types automatically based on file extension a bug; I would consider it more of a missing capability, an oversight due to the rareness of attempts to upload MHTML files. This is similar to the situation of email clients automatically choosing the Content-Disposition for attachments (which is just a recommendation) about whether to suggest they be displayed inline, or provided as attachments to be saved. Most automatically select a Content-Disposition based on their own capability to deal with an attachment of a particular MIME type, rather than the (unknown) capability of the email client of the ultimate recipient. I think in both cases, the default behavior works well enough for a large enough subset of cases, that there has been little demand for increased functionality, even though one can contrive reasonable sounding cases for that functionality.

As a point of discussion, my perception is that MHTML files have two uses: to email an image of a web page (something typically done implicitly by bundled email/web-browser client software, and not generally explicit in the creation of a standalone MHTML file), and to archive a web page for local reference. Neither of these uses involves uploading MHTML files to web sites, although saving a web page, and then attempting to email it to a friend as an attachment via a web mail client, might encounter the same difficulty you are having.

Another use I have heard discussed (but I've forgotten where, so have no references), is as a source for custom browsers to prepackage responses for particular WEB forms.  In that case, I think it would be the custom browser's responsibility to supply the MHTML file content as a response to the form request, rather than to supply it as an uploaded file, expecting the server to dissect it... 

I think it is obvious that my personal, first reaction is that the parsing problem should be fixed... if the MIME type states it is multipart, it should be dissected into its parts... and if that is not the desired behavior, then the MIME type should be different.  Email standards, the source of MIME type specifications, certainly use and support nested multipart dissection, although various email software performs it in various manners and to various levels. Naturally, if the content syntax of the multipart file is incorrect, it should produce an exception, the same as if the multipart content a (buggy) browser produced from an HTML form were syntactically incorrect.

Given that browsers lack any way to specify the MIME type (this is .mht, but treat it as application/octet-stream rather than multipart/related), it does seem that web server toolkits such as cgi.FieldStorage might want to offer an option or hook to allow an application to disable the otherwise automatic parsing of multipart/* files.

This is a rather murky area, indeed. Research into whether and how other web toolkits handle such a situation would be interesting in deciding how to proceed. While there is no need for Python to slavishly follow the lead of any other particular web toolkit, it would be interesting to know if any actually successfully parse such files, and it would be interesting to know if any ignore the MIME type for uploaded files, and it would be interesting to know if any support options for handling uploaded files with multipart/* MIME types.
msg168016 - Author: Glenn Linderman (v+python) * Date: 2012-08-12 02:01
I forgot to mention that the file you provided in your test doesn't look like a well-formed MHTML file, and so an exception would be expected in this case.
msg168021 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-12 04:49
I'd like to weigh in on this, but I need time to do research on the question first.  It may be a bit before I get that time.
msg168073 - Author: patrick vrijlandt (patrick.vrijlandt) Date: 2012-08-13 09:40
I must admit my use case is a hack, but the summary is: view a page on one computer, process it on another computer; like sending the page to a friend, with friend -> self and send -> upload.

I found one other victim in python (https://groups.google.com/d/topic/web2py/ixeUUWryZh0/discussion) but only an occasional reference to other languages; most posts relate to security issues with mht files.

My previous example only served to show that the mime-type is a necessary condition for the problem to occur; you are right that this input would be expected to throw an exception.

So I went on and created a complete testcase/example (attached). The PatchedFieldStorage class parses the mht file correctly into parts. However, the names of the parts are in "content-location" headers inside the mht file and get lost. Also the code is ugly.

Trying to better re-use existing code, as in ExperimentalFieldStorage, has not been successful so far: the MIME prologue is parsed as one of the parts, and the outer boundary is not respected, losing a data element "next to" the file. The print() calls show that the next line may be valuable (like a header) or not so much (like a boundary), but so far the class has no provision for look-ahead, I think.

email.message_from_binary_file correctly parses my mht-files; so a completely different approach might be to more rely on that package for parsing MIME encoded data.
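As a rough illustration of that alternative, the sketch below feeds a small form post to email.message_from_binary_file. The request bytes, the boundary strings ("outer", "inner"), the example.invalid URLs and the helper name `parse_upload` are all made up for this example; it is not the attached test script, and it only assumes the request's Content-Type header is prepended to the body before parsing.

```python
import email
import io
from email import policy

# Hypothetical request: Content-Type header followed by the form body.
RAW_REQUEST = (
    b'Content-Type: multipart/form-data; boundary="outer"\r\n'
    b'\r\n'
    b'--outer\r\n'
    b'Content-Disposition: form-data; name="data"; filename="page.mht"\r\n'
    b'Content-Type: multipart/related; boundary="inner"\r\n'
    b'\r\n'
    b'--inner\r\n'
    b'Content-Type: text/html\r\n'
    b'Content-Location: http://example.invalid/index.html\r\n'
    b'\r\n'
    b'<html><body>hello</body></html>\r\n'
    b'--inner\r\n'
    b'Content-Type: text/css\r\n'
    b'Content-Location: http://example.invalid/style.css\r\n'
    b'\r\n'
    b'body { color: black; }\r\n'
    b'--inner--\r\n'
    b'--outer\r\n'
    b'Content-Disposition: form-data; name="value"\r\n'
    b'\r\n'
    b'abc123\r\n'
    b'--outer--\r\n'
)


def parse_upload(raw):
    """Return (field name, filename, content type, nested Content-Locations)
    for every form field found in the raw request bytes."""
    msg = email.message_from_binary_file(io.BytesIO(raw), policy=policy.default)
    fields = []
    for part in msg.iter_parts():
        name = part.get_param('name', header='content-disposition')
        # iter_parts() yields nothing for a part that is not multipart.
        locations = [str(sub['Content-Location']) for sub in part.iter_parts()]
        fields.append((name, part.get_filename(), part.get_content_type(), locations))
    return fields


for field in parse_upload(RAW_REQUEST):
    print(field)
```

In this sketch the nested parts keep their Content-Location headers, and the "value" field next to the file is still found after the nested file ends.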
msg178750 - Author: Christian Boos (cboos) * Date: 2013-01-01 20:10
I think that reverting to a read_single() when the read_multi() fails could do the trick here. At least this approach seems to work for uploading .mht files. See also http://trac.edgewall.org/ticket/9880.
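The fallback idea can be sketched outside of cgi.py as follows. This is only an analogue built on the email package, not the actual read_multi()/read_single() code and not the Trac change; `read_part` is a hypothetical helper, and "parsing failed" is approximated here as "the parser recorded defects or did not produce sub-parts".

```python
import email
from email import policy


def read_part(raw_part):
    """Try to dissect a part as multipart; if that does not work out,
    fall back to treating its body as one opaque blob."""
    msg = email.message_from_bytes(raw_part, policy=policy.default)
    if (msg.get_content_maintype() == 'multipart'
            and msg.is_multipart() and not msg.defects):
        return ('multi', [sub.get_content_type() for sub in msg.iter_parts()])
    # Fallback: everything after the header block is kept untouched.
    header_end = raw_part.find(b'\r\n\r\n')
    body = raw_part[header_end + 4:] if header_end != -1 else b''
    return ('single', body)


# Labelled multipart/related but with no boundary, like the test2.mht dump.
broken = (
    b'Content-Disposition: form-data; name="data"; filename="test2.mht"\r\n'
    b'Content-Type: multipart/related\r\n'
    b'\r\n'
    b'# dit is een test\r\n'
    b'Dit is een regel\r\n'
)
kind, payload = read_part(broken)
print(kind)
```

The design choice is the same as in the suggestion above: a multipart label is treated as a hint, and data that cannot be dissected is still accepted as a plain file instead of raising.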
msg254852 - Author: mbordas (mbordas) Date: 2015-11-18 19:12
Was this ever addressed or resolved? I just ran into this bug, and it looks like there's a solution, but it was never fixed?
msg289545 - Author: Joshua Shields (jshields) * Date: 2017-03-13 17:58
I ran into this issue as well. I think it is something cgi.py will need to handle correctly when this type of file is uploaded from a browser's file input.
History
Date User Action Args
2017-03-13 17:58:31 jshields set nosy: + jshields; messages: + msg289545
2015-11-18 19:12:10 mbordas set nosy: + mbordas; messages: + msg254852; versions: + Python 2.7, - Python 3.2
2013-01-01 20:10:22 cboos set nosy: + cboos; messages: + msg178750
2012-08-13 09:40:12 patrick.vrijlandt set files: + test_cgi4.py; messages: + msg168073
2012-08-12 04:49:10 r.david.murray set messages: + msg168021
2012-08-12 02:01:02 v+python set messages: + msg168016
2012-08-11 20:34:05 v+python set messages: + msg168007
2012-08-11 13:03:47 patrick.vrijlandt set files: + cgibug.py; messages: + msg167953
2012-08-10 22:33:20 v+python set messages: + msg167921
2012-08-10 20:18:01 orsenthil set nosy: + orsenthil
2012-08-10 20:03:59 v+python set nosy: + v+python
2012-08-06 13:21:08 r.david.murray set nosy: + r.david.murray
2012-08-06 09:31:45 patrick.vrijlandt create