classification
Title: cgi memory usage
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: r.david.murray, v+python
Priority: normal Keywords:

Created on 2011-01-10 08:34 by v+python, last changed 2011-01-25 00:25 by v+python.

Messages (5)
msg125884 - (view) Author: Glenn Linderman (v+python) * Date: 2011-01-10 08:34
In attempting to review issue 4953, I discovered a conundrum in the handling of multipart/form-data.

cgi.py has claimed for some time (at least since 2.4) that it "handles" file storage for uploading large files.  I looked at the code in 2.6 that handles such, and it uses the rfc822.Message method, which parses headers from any object supporting readline().  In particular, it doesn't attempt to read message bodies, and there is code in cgi.py to perform that.

There is still code in 3.2 cgi.py to read message bodies, but... rfc822 has gone away, and been replaced with the email package.  Theoretically this is good, but the cgi FieldStorage read_multi method now parses the whole CGI input and then iteration parcels out items to FieldStorage instances.  There is a significant difference here: email reads everything into memory (if I understand it correctly).  That will never work to upload large or many files when combined with a Web server that launches CGI programs with memory limits.

I see several possible actions that could be taken:
1) Documentation.  While it is doubtful that anyone is using 3.x CGI, and this makes it more doubtful, the present code does not match the documentation: the documentation claims to handle file uploads as files rather than as in-memory blobs, but the current code does not do that.

2) If there is a method in the email package that corresponds to rfc822.Message, parsing only headers, I couldn't find it.  Perhaps it is possible to feed just headers to BytesFeedParser, and stop, and get the same sort of effect.  However, this is not the way the cgi.py presently is coded.  And if there is a better API, for parsing only headers, that is or could be exposed by email, that might be handy.

3) The 2.6 cgi.py does not claim to support nested multipart/ stuff, only one level.  I'm not sure if any present or planned web browsers use nested multipart/ stuff... I guess it would require a nested <form> tag? which is illegal HTML last I checked.  So perhaps the general logic flow of 2.6 cgi.py could be reinstated, with a technique to feed only headers to BytesFeedParser, together with reinstating the MIME body parsing in cgi.py, and this could make a solution that works.
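
A minimal sketch of the "feed just headers to BytesFeedParser, and stop" idea from items 2 and 3, under stated assumptions: the helper name read_headers_only and the sample part are hypothetical, and this is not the code that cgi.py uses. It reads lines from a binary stream up to and including the blank line that ends the headers, so the body stays unread in the stream for the caller to handle.

```python
import io
from email.parser import BytesFeedParser


def read_headers_only(stream):
    """Hypothetical helper: parse only the header block of one MIME part.

    Lines are fed to BytesFeedParser until the blank separator line is
    seen; nothing after that line is read from the stream.
    """
    parser = BytesFeedParser()
    while True:
        line = stream.readline()
        if not line:
            break  # stream ended before the headers did
        parser.feed(line)
        if line in (b"\r\n", b"\n"):
            break  # blank line: end of headers
    return parser.close()


# Hypothetical sample part, as it might appear inside multipart/form-data.
part = io.BytesIO(
    b'Content-Disposition: form-data; name="upload"; filename="a.txt"\r\n'
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"file body that should stay unread\r\n"
)
headers = read_headers_only(part)
remaining = part.read()  # the body is still available to the caller
```

The returned Message object carries only the headers; the caller can then copy the remaining stream to a file in bounded chunks instead of holding it in memory.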

I discovered this because I couldn't figure out where a bunch of the methods in cgi.py were called from, particularly read_lines_to_outerboundary and make_file.  They seemed to be called much too late in the process.  It wasn't until I looked back at 2.6 code that I could see that there was a transition from using rfc822 only for headers to using email for parsing the whole data stream, and that this was the cause of the documentation not seeming to match the code logic.  I have no idea if this problem is in 2.7, as I don't have it installed here for easy reference, and I'm personally much more interested in 3.2.
msg125888 - (view) Author: Glenn Linderman (v+python) * Date: 2011-01-10 09:45
Trying to code some of this, it would be handy if BytesFeedParser.feed would return a status, indicating if it has seen the end of the headers yet. But that would only work if it is parsing as it goes, rather than just buffering, with all the real parsing work being done at .close time.
msg125902 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-10 13:40
The email package does have a 'parse headers only' mode, but it doesn't do what you want, since it reads the remainder of the file and sets it as the payload of the single, un-nested Message object it returns.
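
A small sketch of the behavior described above, using the headersonly argument of BytesParser.parsebytes; the sample input and the boundary name XYZ are made up for illustration. The parser stops interpreting structure after the headers, but the rest of the input still ends up in memory as a string payload.

```python
from email.parser import BytesParser

# Hypothetical multipart/form-data input with a single field.
raw = (
    b'Content-Type: multipart/form-data; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="field"\r\n'
    b"\r\n"
    b"value\r\n"
    b"--XYZ--\r\n"
)

# headersonly=True skips body parsing but still reads the whole body.
msg = BytesParser().parsebytes(raw, headersonly=True)
payload = msg.get_payload()

print(msg.is_multipart())   # the parts were not split into sub-messages
print(type(payload))        # the unparsed remainder, held in memory
```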

Adding a flag to tell it to stop parsing instead of doing that will probably be fairly simple, but is a feature request.

However, I'm not clear on how that helps.  Doesn't FieldStorage also load everything into memory?

There's an open feature request for providing a way to use alternate backing stores for the bodies of message parts in the email package, which *would* address this issue.
msg125923 - (view) Author: Glenn Linderman (v+python) * Date: 2011-01-10 20:17
R. David said:
However, I'm not clear on how that helps.  Doesn't FieldStorage also load everything into memory?

I say:
FieldStorage in 2.x (for x <= 6, at least) copies incoming file data to a file, using limited size read/write operations.  Non-file data is buffered in memory.
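
As a rough illustration of "limited size read/write operations" (not the actual 2.x cgi.py code; the helper name spool_to_file and the chunk size are assumptions, and boundary detection is left out), a stream can be copied to a temporary file while holding at most one chunk in memory:

```python
import io
import tempfile


def spool_to_file(stream, chunk_size=8192):
    """Hypothetical helper: copy a binary stream to a temporary file.

    At most chunk_size bytes are held in memory at any time. A real
    multipart reader would also need to watch for the part boundary.
    """
    out = tempfile.TemporaryFile("w+b")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input
        out.write(chunk)
    out.seek(0)  # rewind so the caller can read the data back
    return out


# Stand-in for uploaded file data arriving on stdin.
upload = io.BytesIO(b"x" * 100_000)
spooled = spool_to_file(upload)
```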

In 3.x, FieldStorage doesn't work.  The code that is there, though, for multipart/ data, would call email to do all the parsing, which would happen to include file data, which always comes in as part of a multipart/ data stream.  This would prevent cgi from being used to accept large files in a limited environment.  Sadly, there is code in place that would then copy the memory buffers to files, and act like they were buffered... but process limits do not care that the memory usage is only temporary...
msg126968 - (view) Author: Glenn Linderman (v+python) * Date: 2011-01-25 00:25
Issue 4953 has somewhat resolved this issue by using email only for parsing headers (more like 2.x did).  So this issue could be closed, or could be left open to point out the required additional features needed from email before cgi.py can use it for handling body parts as well as headers.
History
Date                 User            Action  Args
2011-01-25 00:25:15  v+python        set     nosy: v+python, r.david.murray; messages: + msg126968
2011-01-10 20:17:18  v+python        set     nosy: v+python, r.david.murray; messages: + msg125923
2011-01-10 13:41:07  r.david.murray  set     nosy: v+python, r.david.murray; versions: - Python 3.1, Python 3.2
2011-01-10 13:40:55  r.david.murray  set     type: enhancement; messages: + msg125902; nosy: v+python, r.david.murray
2011-01-10 09:45:45  v+python        set     nosy: v+python, r.david.murray; messages: + msg125888
2011-01-10 08:34:34  v+python        create