
Author rishi.maker.forum
Recipients BreamoreBoy, Chui.Tey, flox, hynek, ishimoto, orsenthil, pitrou, r.david.murray, rishi.maker.forum, teyc
Date 2014-10-13.08:49:27
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1413190168.61.0.28664751365.issue1610654@psf.upfronthosting.co.za>
In-reply-to
Content
My observation is that a file with an unusually high number of line-feed characters (exact numbers below) takes far too long to parse.

I tried porting the above patch to my default branch, but it has some boundary and CRLF/LF issues; more importantly, it relies on seeking the file object, which in the real world is stdin (the request body from the browser), where seeking is not possible.

I have attached a patch based on the same principle Chui mentioned, i.e. reading a large buffer, but this patch does not deal with line feeds at all. Instead, it searches for the entire boundary within a large buffer.

The cgi module's file object only relies on read and readline functionality, so I created a wrapper class around read and readline to introduce buffering (attached as a patch).
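A minimal sketch of such a wrapper, assuming a non-seekable binary stream; the class name, buffer size and internals are illustrative, not the actual patch:

class BufferedStream:
    # Hypothetical sketch: wrap a non-seekable binary stream (e.g. the CGI
    # stdin) so read() and readline() are served from a read-ahead buffer.

    def __init__(self, fp, bufsize=1 << 16):
        self._fp = fp          # underlying non-seekable stream
        self._bufsize = bufsize
        self._buf = b""        # data read ahead but not yet consumed

    def _fill(self):
        # Top up the buffer with one large read; returns False at EOF.
        chunk = self._fp.read(self._bufsize)
        self._buf += chunk
        return bool(chunk)

    def read(self, size=-1):
        # Serve from the buffer first, refilling from the stream as needed.
        if size < 0:
            while self._fill():
                pass
            data, self._buf = self._buf, b""
            return data
        while len(self._buf) < size and self._fill():
            pass
        data, self._buf = self._buf[:size], self._buf[size:]
        return data

    def readline(self):
        # Refill until a newline appears or the stream is exhausted.
        while b"\n" not in self._buf and self._fill():
            pass
        nl = self._buf.find(b"\n")
        end = len(self._buf) if nl == -1 else nl + 1
        line, self._buf = self._buf[:end], self._buf[end:]
        return line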
 
When multipart boundaries are being searched for, the patch fills a large buffer, as in the original solution. It searches for the entire boundary and returns a large chunk of the payload in one call, rather than line by line.

Searching has corner cases (a boundary may straddle two consecutive buffers) and CRLF issues, and a boundary may itself contain repeated characters, which adds search complexity.
To handle this, the patch uses simple regular expressions without any wildcards or expanding constructs. If no boundary is found, it returns a chunk of the buffer minus the length of the boundary and any CRLF prefix, so that a boundary straddling two consecutive buffers cannot be missed. The expressions take care of the CRLF issues.
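The idea can be illustrated with the sketch below; the function name, return convention and the exact hold-back size are assumptions for illustration, not the patch itself:

import re

def scan_chunk(buf, boundary):
    # Illustrative sketch (not the attached patch): scan one buffered chunk
    # for the multipart boundary, tolerating CRLF or bare LF line endings.
    # The boundary is escaped so repeated or special characters in it cannot
    # act as regex operators; the pattern contains no wildcards.
    pattern = re.compile(rb"(?:\r\n|\n)--" + re.escape(boundary))
    m = pattern.search(buf)
    if m:
        # Payload ends where the (CR)LF preceding the boundary begins;
        # everything from the match onward is left for the caller.
        return buf[:m.start()], buf[m.start():], True
    # No boundary in this chunk: hold back enough bytes that a boundary
    # straddling two consecutive buffers cannot be missed.
    holdback = len(boundary) + 4   # "--" plus a possible CRLF prefix
    if len(buf) <= holdback:
        return b"", buf, False
    return buf[:-holdback], buf[-holdback:], False

The caller would prepend the held-back tail to the next buffer fill and repeat the search.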

When read and readline are called, the patch serves the data from its buffer as appropriate.

There is an overall performance improvement for large files, and a very significant one for files with a very high number of LF characters.

To begin with, I created a 20 MB file with 20% of its content filled with line feeds.

File - 20MB.bin
size - 20 MB
description - file with 20% (~4 MB) of its content being '\n'
Parse time with default cgi module - 53 seconds
Parse time with patch - 0.4 seconds

This time increases linearly with the number of LFs for the default module, i.e. keeping the size the same at 20 MB and doubling the number of LFs to 40% would double the parse time.

I also tried a normal large binary file that I found on my machine.
size - 88 MB
description - binary executable on my machine; the binary image has ~140k LFs
Parse time with default cgi module - 2.7 seconds
Parse time with patch - 0.7 seconds

I have tested with a few other files and noticed the time is cut by at least half for large files.


Note: 
These numbers are consistent over multiple observations.
I tested this using the script attached, and also on my localhost server.
The time taken is obtained by running the following code.

import cgi
import cProfile
import time

t1 = time.time()
cProfile.run("fs = cgi.FieldStorage()")   # parse the multipart form data from stdin
print(str(len(fs['datafile'].value)))     # size of the uploaded payload
t2 = time.time()
print(str(t2 - t1))                       # total elapsed time

I have tried to keep the patch compatible with the current module. However, I have introduced a ValueError exception in the module when the boundary is very large, i.e. longer than 1024 bytes. The RFC (RFC 2046) specifies the maximum boundary length to be 70 bytes.
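For illustration, the check amounts to something like the sketch below; the constant and function names are hypothetical, not necessarily what the patch uses:

MAX_BOUNDARY_LEN = 1024  # hard cap; RFC 2046 itself limits boundaries to 70 bytes

def check_boundary(boundary):
    # Reject absurdly long boundaries before attempting any buffered search.
    if len(boundary) > MAX_BOUNDARY_LEN:
        raise ValueError("multipart boundary longer than %d bytes"
                         % MAX_BOUNDARY_LEN)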
History
Date                 User               Action  Args
2014-10-13 08:49:29  rishi.maker.forum  set     recipients: + rishi.maker.forum, ishimoto, orsenthil, pitrou, teyc, r.david.murray, flox, BreamoreBoy, hynek, Chui.Tey
2014-10-13 08:49:28  rishi.maker.forum  set     messageid: <1413190168.61.0.28664751365.issue1610654@psf.upfronthosting.co.za>
2014-10-13 08:49:28  rishi.maker.forum  link    issue1610654 messages
2014-10-13 08:49:28  rishi.maker.forum  create