Issue 1610654: cgi.py multipart/form-data

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/44313

classification

Title:	cgi.py multipart/form-data
Type:	performance	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	Chui.Tey, flox, hynek, ishimoto, orsenthil, pitrou, r.david.murray, rishi.maker.forum, serhiy.storchaka, teyc
Priority:	normal	Keywords:	patch

Created on 2006-12-07 09:18 by teyc, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
issue1610654.patch	rishi.maker.forum, 2014-10-13 08:49		review
issue1610654_1.patch	rishi.maker.forum, 2014-10-15 00:45		review
issue1610654_2.patch	rishi.maker.forum, 2014-11-04 13:50		review
issue1610654_3.patch	rishi.maker.forum, 2014-11-08 14:44		review
issue1610654_4.patch	rishi.maker.forum, 2014-12-09 14:44		review
issue1610654_5.patch	rishi.maker.forum, 2014-12-09 14:53		review

Messages (21)
msg61046 - (view)	Author: Chui Tey (teyc) *	Date: 2006-12-07 09:18
Uploading large binary files using multipart/form-data can be very inefficient because LF character may occur too frequently, resulting in the read_line_to_outer_boundary looping too many times. * cgi.py.Py24 Thu Dec 7 18:46:13 2006 --- cgi.py Thu Dec 7 16:38:04 2006 *********** * 707,713 ** last = next + "--" delim = "" while 1: ! line = self.fp.readline() if not line: self.done = -1 break --- 703,709 ---- last = next + "--" delim = "" while 1: ! line = self.fp_readline() if not line: self.done = -1 break *********** * 729,734 **** --- 730,753 ---- delim = "" self.__write(odelim + line) + def fp_readline(self): + + tell = self.fp.tell() + buffer = self.fp.read(1 << 17) + parts = buffer.split("\n") + retlst = [] + for part in parts: + if part.startswith("--"): + if retlst: + retval = "\n".join(retlst) + "\n" + else: + retval = part + "\n" + self.fp.seek(tell + len(retval)) + return retval + else: + retlst.append(part) + return buffer + def skip_lines(self): """Internal: skip lines until outer boundary if defined.""" if not self.outerboundary or self.done: The patch reads the file in larger increments. For my test file of 138 Mb, it reduced parsing time from 168 seconds to 19 seconds. #------------ test script -------------------- import cgi import cgi import os import profile import stat def run(): filename = 'body.txt' size = os.stat(filename)[stat.ST_SIZE] fp = open(filename,'rb') environ = {} environ["CONTENT_TYPE"] = open('content_type.txt','rb').read() environ["REQUEST_METHOD"] = "POST" environ["CONTENT_LENGTH"] = str(size) fieldstorage = cgi.FieldStorage(fp, None, environ=environ) return fieldstorage import hotshot, hotshot.stats import time if 1: t1 = time.time() prof = hotshot.Profile("bug1718.prof") # hotshot profiler will crash with the # patch applied on windows xp #prof_results = prof.runcall(run) prof_results = run() prof.close() t2 = time.time() print t2-t1 if 0: for key in prof_results.keys(): if len(prof_results[key].value)> 100: print key, prof_results[key].value[:80] + "..." else: print key, prof_results[key] content_type.txt ---------------------------- multipart/form-data; boundary=----------ThIs_Is_tHe_bouNdaRY_$
msg110090 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-12 14:35
Chui Tey does this issue still apply? If yes, could you please provide a patch according to the guidelines here. python.org/dev/patches
msg114915 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-25 14:32
No reply to msg110090.
msg119535 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-10-25 02:24
I don't think it was appropriate to close this issue.
msg166021 - (view)	Author: Florent Xicluna (flox) *	Date: 2012-07-21 13:14
It needs tests to demonstrate the issue in 3.x, and an updated patch.
msg178292 - (view)	Author: Hynek Schlawack (hynek) *	Date: 2012-12-27 10:28
It would be great if someone could port this patch to Python 3.4 and verify its effectiveness.
msg222416 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2014-07-06 19:53
@Hynek could you port the patch as you've shown some interest in it?
msg222557 - (view)	Author: Hynek Schlawack (hynek) *	Date: 2014-07-08 12:05
I would have long ago if I had any domain knowlege on this topic, but alas….
msg222591 - (view)	Author: Chui Tey (Chui.Tey)	Date: 2014-07-08 23:46
Hi, I'm still available. There's a test case in the patch, would you like me to separate that to another file? Would that help with assessing it? Best, On 8 July 2014 22:05, Hynek Schlawack <report@bugs.python.org> wrote: > > Hynek Schlawack added the comment: > > I would have long ago if I had any domain knowlege on this topic, but alas…. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue1610654> > _______________________________________
msg222952 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-07-13 16:35
To move this issue along we need someone to convert it into our standard patch format (a unified diff against either the 3.4 or the default branch, preferrably produced via an 'hg diff' command without the --git option), with the test included as a unit test added to the appropriate test file (tests/test_cgi.py).
msg229235 - (view)	Author: Rishi (rishi.maker.forum) *	Date: 2014-10-13 08:49
My observation is that a file with more than normal (exact numbers below) line-feed characters takes way too long. I tried porting the above patch to my default branch, but it has some boundary and CRLF/LF issues, but more importantly it relies on seeking the file-object, which in the real world is stdin for web browsers and hence is illegal in that environment. I have attached a patch which is based on the same principle as Chui mentioned, ie reading a large buffer, but this patch does not deal with line feeds at all. It instead searches the entire boundary in a large buffer. The cgi module file-object only relies on readline and read functionality - so I created a wrapper class around read and readline to introduce buffering (attached as patch). When multipart boundaries are being searched, the patch fills a huge buffer, like in the original solution. It searches for the entire boundary and returns a large chunk of the payload in one call, rather than line by line. To search, there are corner cases ( when boundary is overlapping between buffers) and CRLF issues. A boundary in itself could have repeating characters causing more search complexity. To overcome this, the patch uses simple regular exressions without any expanding or wild characters. If a boundary is not found, it returns the chunk - length of the buffer - CRLF prefixes, to ensure that no boundary is overlapping between two consecutive buffers. The expressions take care of CRLF issues. When read and readline are called, the patch looks for data in the buffer and returns appropriately. There is a overall performance improvement in cases of large files, and very significant in case of files with very high number of LF characters. To begin with I created a 20MB file with 20% of the file filled with LineFeeds. File - 20MB.bin size - 20MB description - file filled with 20% (~4MB) '\n' Parse time with default cgi module - 53 seconds Parse time with patch - 0.4s This time increases linearly with the number of LFs for the default module.ie keeping the size same at 20MB and doubling the number of LFs to 40% would double the parse time. I tried with a normal large binary file that I found on my machine. size: 88mb description - binary executable on my machine, binary image has 140k lfs. Parse time with default cgi module - 2.7s Parse time with patch- 0.7s I have tested with a few other files and noticed time is cut by atleast half for large files. Note: These numbers are consitent over multiple observations. I tested this using the script attached, and also on my localhost server. The time taken is obtained by running the following code. t1=time.time() cProfile.run("fs = cgi.FieldStorage()") print(str(len(fs['datafile'].value))) t2 = time.time() print(str(t2 - t1)) I have tried to keep the patch compatible with the current module. However I have introduced a ValueError excepiton in the module when boundary is very large ie. 1024 bytes. The RFC specifies the maximum length to be 70 bytes.
msg229262 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-10-13 18:59
Rishi, thanks for the patch. I was going to give a review but first I have to ask: is so much support code necessary for this? Another approach would be to wrap self.fp in a io.BufferedReader (if it's not already buffered) and then use the peek() method to find the boundary without advancing the file pointer.
msg229265 - (view)	Author: Rishi (rishi.maker.forum) *	Date: 2014-10-13 19:34
Antoine, I will upload a patch that relies on BufferedReader. As you mentioned, it will get rid of supporting the buffer and reduce a lot of code. The only issue is that it helps me to know if the current buffer is at EOF (the documentation of peek does not mention guaranteeing Eof if buffer returned is less than what we expect), because patterns at EOF are different, but I think I can work around that.
msg229285 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-14 10:31
I doubt we can use io.BufferedReader or handmade buffering here. Current code doesn't read more bytes than necessary. Buffered reader will read ahead, and there is no way to return read bytes back to the stream in general case (an exception is seekable streams). It can be blocked in attempt to fill a buffer with unnecessary bytes. I think that the user of the cgi module is responsible for wrapping a stream in io.BufferedReader (if it is acceptable), the cgi module shouldn't do this itself. But we can implement special cases for buffered or seekable streams. However this optimization will not work in general case (e.g. for stdin or socket).
msg229389 - (view)	Author: Rishi (rishi.maker.forum) *	Date: 2014-10-15 00:45
I have recreated the patch(issue1610654_1.patch) and it performs more or less like the earlier patch Serhiy, I agree we cannot use handmade buffering here, without seeking ahead. I believe, we can make optimizations for streams which are buffered and non-seekable. Cgi modules default value for file object is the BufferedReader of sys.stdin, so the solution is fairly generic. I have removed handmade buffering. Neither do I create a Buffered* object. We rely on user to create the buffered object. The sys.stdin that cgi module has a decent buffer underneath that works well on apache. The patch attached does not seek, nor does it read ahead. It only looks ahead. As Antoine suggests, it peeks the buffer and determines through a fast lookup if the buffer has a bounary or not. It moves forward only if it is convinced that the current buffer is completely within the next boundary. The issue is that the current implementation deals with lines and not chunks. Even when a savy user wraps sys.stdin around a large BufferredReader there is little to no peformance improvement in the current implementation for large files in my observation. It does not solve the bug mentioned either. The difference in extreme cases like Chui's is 53s against 0.7s and even otherwise for larger files the patch is 3 times faster than the current implementation. I have tested this on Apache2 server where the sys.stdin is buffered.
msg230170 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-10-28 19:24
Thanks for the updated patch. I'll take a look soon if no-one beats me to it.
msg230620 - (view)	Author: Rishi (rishi.maker.forum) *	Date: 2014-11-04 13:50
Patch updated from review comments. Also added a few corner test cases.
msg230860 - (view)	Author: Rishi (rishi.maker.forum) *	Date: 2014-11-08 14:44
Hi, I have created a new patch with a small design change. The change is that in situations where I don't find the boundary instead of keeping the last x bytes in the buffer I simply drain the whole data and call a readline(). This seems like the right thing to do also. I managed to get rid of the two obfuscated helper functions keep_x_buffer and remove_x_buffer that I had and the code should look familiar (I hope) to a module owner. This also helped me get rid of quite a few class member variables that I could move to locals of my main function(multi_read). I still need to maintain an overlap, but only for the trailing CRLF boundary. Ran all the new and old tests and tested on apache with the ubuntu iso server image. Without the patch ubuntu iso server image took 93seconds .. with the patch it took 25seconds.
msg231554 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-23 11:03
New test fail with non-modified code. Either there is a bug in current code or tests are wrong.
msg232377 - (view)	Author: Rishi (rishi.maker.forum) *	Date: 2014-12-09 14:44
There is indeed a test failure that occurs without the patch. This is a new test I had added. The reason is that in the existing implementation, when a boundary does not exist, the implementation does not include the trailing CRLF, LF or for that matter CR as part of the payload. I think that is not correct. However, to keep this patch compatible with behavior of existing implementation I have updated the patch to strip a single CRLF, LR or CR from the payload if a boundary is not found.
msg232379 - (view)	Author: Rishi (rishi.maker.forum) *	Date: 2014-12-09 14:53
One of my comments shot the wrapped line limit. Also changed the test in question to check the lengths of the expected and actual buffer to checking the contents of the respective buffers.

History
Date	User	Action	Args
2022-04-11 14:56:21	admin	set	github: 44313
2019-04-26 17:29:15	BreamoreBoy	set	nosy: - BreamoreBoy
2014-12-09 14:53:05	rishi.maker.forum	set	files: + issue1610654_5.patch messages: + msg232379
2014-12-09 14:44:04	rishi.maker.forum	set	files: + issue1610654_4.patch messages: + msg232377
2014-11-23 11:03:05	serhiy.storchaka	set	messages: + msg231554
2014-11-22 16:11:09	serhiy.storchaka	set	assignee: serhiy.storchaka
2014-11-08 14:44:28	rishi.maker.forum	set	files: + issue1610654_3.patch messages: + msg230860
2014-11-04 13:50:48	rishi.maker.forum	set	files: + issue1610654_2.patch messages: + msg230620
2014-10-28 19:24:24	pitrou	set	messages: + msg230170 stage: needs patch -> patch review
2014-10-15 00:45:52	rishi.maker.forum	set	files: + issue1610654_1.patch messages: + msg229389
2014-10-14 10:31:21	serhiy.storchaka	set	keywords: - easy messages: + msg229285 stage: patch review -> needs patch
2014-10-13 19:34:59	rishi.maker.forum	set	messages: + msg229265
2014-10-13 18:59:28	pitrou	set	versions: + Python 3.5, - Python 3.4 nosy: + serhiy.storchaka messages: + msg229262 stage: needs patch -> patch review
2014-10-13 08:49:28	rishi.maker.forum	set	files: + issue1610654.patch nosy: + rishi.maker.forum messages: + msg229235 keywords: + patch
2014-07-13 16:35:17	r.david.murray	set	messages: + msg222952
2014-07-08 23:46:29	Chui.Tey	set	nosy: + Chui.Tey messages: + msg222591
2014-07-08 12:05:24	hynek	set	messages: + msg222557
2014-07-06 19:53:04	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg222416
2012-12-27 10:28:14	hynek	set	keywords: + easy, - patch stage: test needed -> needs patch messages: + msg178292 versions: + Python 3.4, - Python 3.2, Python 3.3
2012-07-30 14:46:10	ishimoto	set	nosy: + ishimoto
2012-07-21 13:42:41	pitrou	set	nosy: + orsenthil
2012-07-21 13:14:52	flox	set	versions: + Python 3.3 nosy: + pitrou, hynek messages: + msg166021 stage: patch review -> test needed
2010-10-25 02:24:47	r.david.murray	set	status: closed -> open versions: + Python 3.2, - Python 3.1, Python 2.7 nosy: + r.david.murray, - BreamoreBoy messages: + msg119535 resolution: wont fix -> stage: test needed -> patch review
2010-08-27 03:10:20	flox	set	nosy: + flox
2010-08-25 14:32:16	BreamoreBoy	set	status: open -> closed resolution: wont fix messages: + msg114915
2010-07-12 14:35:15	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg110090
2009-03-30 17:39:32	ajaksu2	set	keywords: + patch stage: test needed type: performance components: + Library (Lib), - Interpreter Core versions: + Python 3.1, Python 2.7
2006-12-07 09:18:18	teyc	create