classification
Title: cgi.py multipart/form-data
Type: performance Stage: patch review
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Chui.Tey, flox, hynek, ishimoto, orsenthil, pitrou, r.david.murray, rishi.maker.forum, serhiy.storchaka, teyc
Priority: normal Keywords: patch

Created on 2006-12-07 09:18 by teyc, last changed 2019-04-26 17:29 by BreamoreBoy.

Files
File name Uploaded Description Edit
issue1610654.patch rishi.maker.forum, 2014-10-13 08:49 review
issue1610654_1.patch rishi.maker.forum, 2014-10-15 00:45 review
issue1610654_2.patch rishi.maker.forum, 2014-11-04 13:50 review
issue1610654_3.patch rishi.maker.forum, 2014-11-08 14:44 review
issue1610654_4.patch rishi.maker.forum, 2014-12-09 14:44 review
issue1610654_5.patch rishi.maker.forum, 2014-12-09 14:53 review
Messages (21)
msg61046 - (view) Author: Chui Tey (teyc) * Date: 2006-12-07 09:18
Uploading large binary files using multipart/form-data can be very inefficient because LF character may occur too frequently, resulting in the read_line_to_outer_boundary looping too many times.

*** cgi.py.Py24	Thu Dec  7 18:46:13 2006
--- cgi.py	Thu Dec  7 16:38:04 2006
***************
*** 707,713 ****
          last = next + "--"
          delim = ""
          while 1:
!             line = self.fp.readline()
              if not line:
                  self.done = -1
                  break
--- 703,709 ----
          last = next + "--"
          delim = ""
          while 1:
!             line = self.fp_readline()
              if not line:
                  self.done = -1
                  break
***************
*** 729,734 ****
--- 730,753 ----
                  delim = ""
              self.__write(odelim + line)
  
+     def fp_readline(self):
+ 
+         tell   = self.fp.tell()
+         buffer = self.fp.read(1 << 17)
+         parts  = buffer.split("\n")
+         retlst = []
+         for part in parts:
+             if part.startswith("--"):
+                 if retlst:
+                     retval = "\n".join(retlst) + "\n"
+                 else:
+                     retval = part + "\n"
+                 self.fp.seek(tell + len(retval))
+                 return retval
+             else:
+                 retlst.append(part)
+         return buffer
+ 
      def skip_lines(self):
          """Internal: skip lines until outer boundary if defined."""
          if not self.outerboundary or self.done:


The patch reads the file in larger increments. For my test file of 138 Mb, it reduced parsing time from 168 seconds to 19 seconds.

#------------ test script --------------------
import cgi
import cgi
import os
import profile
import stat

def run():
    filename = 'body.txt'
    size = os.stat(filename)[stat.ST_SIZE]
    fp = open(filename,'rb')
    environ = {}
    environ["CONTENT_TYPE"]   = open('content_type.txt','rb').read()
    environ["REQUEST_METHOD"] = "POST"
    environ["CONTENT_LENGTH"] = str(size)

    fieldstorage = cgi.FieldStorage(fp, None, environ=environ)
    return fieldstorage

import hotshot, hotshot.stats
import time
if 1:
    t1 = time.time()
    prof = hotshot.Profile("bug1718.prof")
    # hotshot profiler will crash with the 
    # patch applied on windows xp
    #prof_results = prof.runcall(run)
    prof_results  = run()
    prof.close()
    t2 = time.time()
    print t2-t1
    if 0:
      for key in prof_results.keys():
        if len(prof_results[key].value)> 100:
            print key, prof_results[key].value[:80] + "..."
        else:
            print key, prof_results[key]

content_type.txt
----------------------------
multipart/form-data; boundary=----------ThIs_Is_tHe_bouNdaRY_$
msg110090 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-12 14:35
Chui Tey does this issue still apply?  If yes, could you please provide a patch according to the guidelines here.
python.org/dev/patches
msg114915 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-25 14:32
No reply to msg110090.
msg119535 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-25 02:24
I don't think it was appropriate to close this issue.
msg166021 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2012-07-21 13:14
It needs tests to demonstrate the issue in 3.x, and an updated patch.
msg178292 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-12-27 10:28
It would be great if someone could port this patch to Python 3.4 and verify its effectiveness.
msg222416 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-06 19:53
@Hynek could you port the patch as you've shown some interest in it?
msg222557 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2014-07-08 12:05
I would have long ago if I had any domain knowlege on this topic, but alas….
msg222591 - (view) Author: Chui Tey (Chui.Tey) Date: 2014-07-08 23:46
Hi,

I'm still available. There's a test case in the patch, would you like
me to separate that to another file? Would that help with assessing
it?

Best,

On 8 July 2014 22:05, Hynek Schlawack <report@bugs.python.org> wrote:
>
> Hynek Schlawack added the comment:
>
> I would have long ago if I had any domain knowlege on this topic, but alas….
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue1610654>
> _______________________________________
msg222952 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-07-13 16:35
To move this issue along we need someone to convert it into our standard patch format (a unified diff against either the 3.4 or the default branch, preferrably produced via an 'hg diff' command without the --git option), with the test included as a unit test added to the appropriate test file (tests/test_cgi.py).
msg229235 - (view) Author: Rishi (rishi.maker.forum) * Date: 2014-10-13 08:49
My observation is that a file with more than normal (exact numbers below) line-feed characters takes way too long. 

I tried porting the above patch to my default branch, but it has some boundary and CRLF/LF issues, but more importantly it relies on seeking the file-object, which in the real world is stdin for web browsers and hence is illegal in that environment.

I have attached a patch which is based on the same principle as Chui mentioned, ie reading a large buffer, but this patch does not deal with line feeds at all. It instead searches the entire boundary in a large buffer.

The cgi module file-object only relies on readline and read functionality - so I created a wrapper class around read and readline to introduce buffering (attached as patch).
 
When multipart boundaries are being searched, the patch fills a huge buffer, like in the original solution. It searches for the entire boundary and returns a large chunk of the payload in one call, rather than line by line.

To search, there are corner cases ( when boundary is overlapping between buffers) and CRLF issues. A boundary in itself could have repeating characters causing more search complexity. 
To overcome this, the patch uses simple regular exressions without any expanding or wild characters. If a boundary is not found, it returns the chunk - length of the buffer - CRLF prefixes, to ensure that no boundary is overlapping between two consecutive buffers. The expressions take care of CRLF issues. 

When read and readline are called, the patch looks for data in the buffer and returns appropriately.

There is a overall performance improvement in cases of large files, and very significant in case of files with very high number of LF characters.

To begin with I created a 20MB file with 20% of the file filled with LineFeeds. 

File - 20MB.bin
size - 20MB
description - file filled with 20% (~4MB) '\n'
Parse time with default cgi module - 53 seconds
Parse time with patch - 0.4s

This time increases linearly with the number of LFs for the default module.ie keeping the size same at 20MB and doubling the number of LFs to 40% would double the parse time. 

I tried with a normal large binary file that I found on my machine.
size: 88mb
description - binary executable on my machine,
              binary image has 140k lfs.
Parse time with default cgi module - 2.7s
Parse time with patch- 0.7s

I have tested with a few other files and noticed time is cut by atleast half for large files.


Note: 
These numbers are consitent over multiple observations.
I tested this using the script attached, and also on my localhost server.
The time taken is obtained by running the following code.

t1=time.time()
cProfile.run("fs = cgi.FieldStorage()")
print(str(len(fs['datafile'].value)))
t2 = time.time()
print(str(t2 - t1))

I have tried to keep the patch compatible with the current module. However I have introduced a ValueError excepiton in the module when boundary is very large ie. 1024 bytes. The RFC specifies the maximum length to be 70 bytes.
msg229262 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-13 18:59
Rishi, thanks for the patch. I was going to give a review but first I have to ask: is so much support code necessary for this?

Another approach would be to wrap self.fp in a io.BufferedReader (if it's not already buffered) and then use the peek() method to find the boundary without advancing the file pointer.
msg229265 - (view) Author: Rishi (rishi.maker.forum) * Date: 2014-10-13 19:34
Antoine, I will upload a patch that relies on BufferedReader. As you mentioned, it will get rid of supporting the buffer and reduce a lot of code.
The only issue is that it helps me to know if the current buffer is at EOF (the documentation of peek does not mention  guaranteeing Eof if buffer returned is less than what we expect), because patterns at EOF are different, but I think I can work around that.
msg229285 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-14 10:31
I doubt we can use io.BufferedReader or handmade buffering here. Current code doesn't read more bytes than necessary. Buffered reader will read ahead, and there is no way to return read bytes back to the stream in general case (an exception is seekable streams). It can be blocked in attempt to fill a buffer with unnecessary bytes.

I think that the user of the cgi module is responsible for wrapping a stream in io.BufferedReader (if it is acceptable), the cgi module shouldn't do this itself.

But we can implement special cases for buffered or seekable streams. However this optimization will not work in general case (e.g. for stdin or socket).
msg229389 - (view) Author: Rishi (rishi.maker.forum) * Date: 2014-10-15 00:45
I have recreated the patch(issue1610654_1.patch) and it performs more or less like the earlier patch

Serhiy,
I agree we cannot use handmade buffering here, without seeking ahead.
I believe, we can make optimizations for streams which are buffered and non-seekable.
Cgi modules default value for file object is the BufferedReader of sys.stdin, so the solution is fairly generic.

I have removed handmade buffering. Neither do I create a Buffered* object.
We rely on user to create the buffered object. The sys.stdin that cgi module has a decent buffer underneath that
works well on apache.

The patch attached does not seek, nor does it read ahead. It only looks ahead.
As Antoine suggests, it peeks the buffer and determines through a fast lookup if the buffer has a bounary or not.
It moves forward only if it is convinced that the current buffer is completely within the next boundary.


The issue is that the current implementation deals with lines and not chunks.
Even when a savy user wraps sys.stdin around a large BufferredReader there is little to no peformance improvement in 
the current implementation for large files in my observation. It does not solve the bug mentioned either.
The difference in extreme cases like Chui's is 53s against 0.7s and even otherwise for larger files the patch
is 3 times faster than the current implementation.
I have tested this on Apache2 server where the sys.stdin is buffered.
msg230170 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-28 19:24
Thanks for the updated patch. I'll take a look soon if no-one beats me to it.
msg230620 - (view) Author: Rishi (rishi.maker.forum) * Date: 2014-11-04 13:50
Patch updated from review comments. Also added a few corner test cases.
msg230860 - (view) Author: Rishi (rishi.maker.forum) * Date: 2014-11-08 14:44
Hi,
I have created a new patch with a small design change. The change is that in situations where I don't find the boundary instead of keeping the last x bytes in the buffer I simply drain the whole data and call a readline(). 
This seems like the right thing to do also. I managed to get rid of the two obfuscated helper functions keep_x_buffer and remove_x_buffer that I had and the code should look familiar (I hope) to a module owner.
This also helped me get rid of quite a few class member variables that I could move to locals of my main function(multi_read).
I still need to maintain an overlap, but only for the trailing CRLF boundary. 
Ran all the new and old tests and tested on apache with the ubuntu iso server image. Without the patch ubuntu iso server image took 93seconds .. with the patch it took 25seconds.
msg231554 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-23 11:03
New test fail with non-modified code. Either there is a bug in current code or tests are wrong.
msg232377 - (view) Author: Rishi (rishi.maker.forum) * Date: 2014-12-09 14:44
There is indeed a test failure that occurs without the patch. This is a new test I had added. 
The reason is that in the existing implementation, when a boundary does not exist, the implementation does not include the trailing CRLF, LF or for that matter CR as part of the payload. I think that is not correct. 

However, to keep this patch compatible with behavior of existing implementation I have updated the patch to strip a single CRLF, LR or CR from the payload if a boundary is not found.
msg232379 - (view) Author: Rishi (rishi.maker.forum) * Date: 2014-12-09 14:53
One of my comments shot the wrapped line limit. Also changed the test in question to check the lengths of the expected and actual buffer to checking the contents of the respective buffers.
History
Date User Action Args
2019-04-26 17:29:15BreamoreBoysetnosy: - BreamoreBoy
2014-12-09 14:53:05rishi.maker.forumsetfiles: + issue1610654_5.patch

messages: + msg232379
2014-12-09 14:44:04rishi.maker.forumsetfiles: + issue1610654_4.patch

messages: + msg232377
2014-11-23 11:03:05serhiy.storchakasetmessages: + msg231554
2014-11-22 16:11:09serhiy.storchakasetassignee: serhiy.storchaka
2014-11-08 14:44:28rishi.maker.forumsetfiles: + issue1610654_3.patch

messages: + msg230860
2014-11-04 13:50:48rishi.maker.forumsetfiles: + issue1610654_2.patch

messages: + msg230620
2014-10-28 19:24:24pitrousetmessages: + msg230170
stage: needs patch -> patch review
2014-10-15 00:45:52rishi.maker.forumsetfiles: + issue1610654_1.patch

messages: + msg229389
2014-10-14 10:31:21serhiy.storchakasetkeywords: - easy

messages: + msg229285
stage: patch review -> needs patch
2014-10-13 19:34:59rishi.maker.forumsetmessages: + msg229265
2014-10-13 18:59:28pitrousetversions: + Python 3.5, - Python 3.4
nosy: + serhiy.storchaka

messages: + msg229262

stage: needs patch -> patch review
2014-10-13 08:49:28rishi.maker.forumsetfiles: + issue1610654.patch

nosy: + rishi.maker.forum
messages: + msg229235

keywords: + patch
2014-07-13 16:35:17r.david.murraysetmessages: + msg222952
2014-07-08 23:46:29Chui.Teysetnosy: + Chui.Tey
messages: + msg222591
2014-07-08 12:05:24hyneksetmessages: + msg222557
2014-07-06 19:53:04BreamoreBoysetnosy: + BreamoreBoy
messages: + msg222416
2012-12-27 10:28:14hyneksetkeywords: + easy, - patch

stage: test needed -> needs patch
messages: + msg178292
versions: + Python 3.4, - Python 3.2, Python 3.3
2012-07-30 14:46:10ishimotosetnosy: + ishimoto
2012-07-21 13:42:41pitrousetnosy: + orsenthil
2012-07-21 13:14:52floxsetversions: + Python 3.3
nosy: + pitrou, hynek

messages: + msg166021

stage: patch review -> test needed
2010-10-25 02:24:47r.david.murraysetstatus: closed -> open
versions: + Python 3.2, - Python 3.1, Python 2.7
nosy: + r.david.murray, - BreamoreBoy
messages: + msg119535

resolution: wont fix ->
stage: test needed -> patch review
2010-08-27 03:10:20floxsetnosy: + flox
2010-08-25 14:32:16BreamoreBoysetstatus: open -> closed
resolution: wont fix
messages: + msg114915
2010-07-12 14:35:15BreamoreBoysetnosy: + BreamoreBoy
messages: + msg110090
2009-03-30 17:39:32ajaksu2setkeywords: + patch
stage: test needed
type: performance
components: + Library (Lib), - Interpreter Core
versions: + Python 3.1, Python 2.7
2006-12-07 09:18:18teyccreate