Classification
  Title: Speed up gzip.readline (~40%)
  Components: Library (Lib)

Process
  Status: closed    Resolution: accepted
  Assigned To: bob.ippolito    Nosy List: bob.ippolito, marumari
  Priority: normal    Keywords: patch

Created on 2005-09-04 17:53 by marumari, last changed 2006-05-22 16:47 by marumari. This issue is now closed.

Files
File name                 Uploaded                        Description
gzip.diff                 marumari, 2005-09-04 17:53      Patch to gzip.readline
gzip.txt                  marumari, 2005-09-04 17:57      Detailed description (since SF mangles the text)
gzip-1281707-r46074.diff  bob.ippolito, 2006-05-22 15:54  Revised patch
Messages (12)
msg48706 - Author: April King (marumari) Date: 2005-09-04 17:53
See bug 849046 for history. This patch passes both the regression test and the standard test. Hopefully the extra information below won't be too difficult to read. I can attach this info to the bug, if need be.

Fixed:
  - Add self.min_readsize to __init__. This follows the principle that lines are likely to be similar in length, so readline() doesn't start over from a minimum-length string on every call.
  - Rewrite the assignment of readsize and size at the beginning of the function. This eliminates almost all calls to min().
  - Change bufs to a string rather than an array. There is no point in using an array when all you do with it is "".join(bufs); string addition is used instead.
  - Remove extra assignments to bufs (in the return path).
  - Make readline() much more readable (loop reordering, more comments).

Recommendations:
  - Delete the _unread() function. It is used _only_ by readline(), and moving its functionality into readline() itself saves the function-call overhead. _unread() is only 3 lines long, and testing shows that removing it speeds readline() up by about 3%. Are there backwards-compatibility concerns?
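As a rough sketch of the patched strategy (this class and its helpers are illustrative stand-ins, not the actual patch — only the ideas mirror it):

```python
class LineReader:
    """Stand-in for the patched gzip.readline strategy:
    min_readsize persists across calls, and the line is accumulated
    in a plain string instead of a list that gets joined."""

    def __init__(self, data, min_readsize=100):
        self._data = data
        self._pos = 0
        self.min_readsize = min_readsize  # remembered between calls

    def read(self, size):
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk

    def _unread(self, buf):
        # Push unconsumed bytes back (the recommendation is to inline this).
        self._pos -= len(buf)

    def readline(self):
        readsize = self.min_readsize  # start at the longest line seen so far
        line = ""                     # string accumulation, no "".join()
        while True:
            chunk = self.read(readsize)
            i = chunk.find("\n")
            if i >= 0:                          # newline found: keep it,
                line += chunk[:i + 1]           # push the rest back
                self._unread(chunk[i + 1:])
                break
            line += chunk
            if len(chunk) < readsize:           # EOF
                break
            readsize *= 2                       # line longer than expected
        if readsize > self.min_readsize:
            self.min_readsize = readsize        # remember for next call
        return line
```

With similar-length lines, the first read() almost always contains the whole line, so the loop body runs once per line with no list append or join.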

Testing results:
test_append (__main__.TestGzip) ... ok
test_many_append (__main__.TestGzip) ... ok
test_mode (__main__.TestGzip) ... ok
test_read (__main__.TestGzip) ... ok
test_readline (__main__.TestGzip) ... ok
test_readlines (__main__.TestGzip) ... ok
test_seek_read (__main__.TestGzip) ... ok
test_seek_write (__main__.TestGzip) ... ok
test_write (__main__.TestGzip) ... ok

----------------------------------------------------------------------
Ran 9 tests in 0.331s

Regression tests:
python regrtest.py -g test_gzip.py
test_gzip
1 test OK.

---

Profiling results (performed on a common compressed log file, 200748 lines).

With patch...

         1213961 function calls in 12.188 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(close)
     1159    0.020    0.000    0.020    0.000 :0(crc32)
     1158    0.100    0.000    0.100    0.000 :0(decompress)
        1    0.000    0.000    0.000    0.000 :0(decompressobj)
   200774    0.812    0.000    0.812    0.000 :0(find)
   403865    0.902    0.000    0.902    0.000 :0(len)
     1183    0.000    0.000    0.000    0.000 :0(min)
        2    0.000    0.000    0.000    0.000 :0(ord)
     1173    0.000    0.000    0.000    0.000 :0(read)
       12    0.000    0.000    0.000    0.000 :0(seek)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
       18    0.000    0.000    0.000    0.000 :0(tell)
        2    0.000    0.000    0.000    0.000 :0(unpack)
        1    0.000    0.000   12.188   12.188 <string>:1(?)
        1    0.000    0.000    0.000    0.000 gzip_new.py:156(_init_read)
        1    0.000    0.000    0.000    0.000 gzip_new.py:160(_read_gzip_header)
        3    0.000    0.000    0.000    0.000 gzip_new.py:18(U32)
   200774    2.453    0.000    2.593    0.000 gzip_new.py:207(read)
   200749    2.894    0.000    3.796    0.000 gzip_new.py:239(_unread)
     1166    0.010    0.000    0.140    0.000 gzip_new.py:244(_read)
        1    0.000    0.000    0.000    0.000 gzip_new.py:27(LOWU32)
     1158    0.010    0.000    0.030    0.000 gzip_new.py:294(_add_read_data)
        1    0.000    0.000    0.000    0.000 gzip_new.py:300(_read_eof)
        1    0.000    0.000    0.000    0.000 gzip_new.py:314(close)
        1    0.000    0.000    0.000    0.000 gzip_new.py:327(__del__)
   200749    3.916    0.000   11.117    0.000 gzip_new.py:384(readline)
        2    0.000    0.000    0.000    0.000 gzip_new.py:39(read32)
        1    0.000    0.000    0.000    0.000 gzip_new.py:42(open)
        1    0.000    0.000    0.000    0.000 gzip_new.py:60(__init__)
        1    0.000    0.000   12.188   12.188 profile:0(gunzip_gzip_new_open())
        0    0.000             0.000          profile:0(profiler)
        1    1.071    1.071   12.188   12.188 test_gzip_speed.py:14(gunzip_gzip_new_open)

Without patch...

         2073328 function calls in 18.597 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   243820    0.735    0.000    0.735    0.000 :0(append)
        1    0.000    0.000    0.000    0.000 :0(close)
     1159    0.040    0.000    0.040    0.000 :0(crc32)
     1158    0.100    0.000    0.100    0.000 :0(decompress)
        1    0.000    0.000    0.000    0.000 :0(decompressobj)
   243820    0.960    0.000    0.960    0.000 :0(find)
   200749    0.801    0.000    0.801    0.000 :0(join)
   489958    1.330    0.000    1.330    0.000 :0(len)
   243820    0.791    0.000    0.791    0.000 :0(min)
        2    0.000    0.000    0.000    0.000 :0(ord)
     1173    0.030    0.000    0.030    0.000 :0(read)
        6    0.000    0.000    0.000    0.000 :0(seek)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        6    0.000    0.000    0.000    0.000 :0(tell)
        2    0.000    0.000    0.000    0.000 :0(unpack)
        1    0.000    0.000   18.597   18.597 <string>:1(?)
        1    0.000    0.000    0.000    0.000 gzip.py:154(_init_read)
        1    0.000    0.000    0.000    0.000 gzip.py:158(_read_gzip_header)
        3    0.000    0.000    0.000    0.000 gzip.py:18(U32)
   243820    2.711    0.000    2.921    0.000 gzip.py:205(read)
   200749    3.083    0.000    4.143    0.000 gzip.py:237(_unread)
     1160    0.010    0.000    0.210    0.000 gzip.py:242(_read)
        1    0.000    0.000    0.000    0.000 gzip.py:27(LOWU32)
     1158    0.030    0.000    0.070    0.000 gzip.py:292(_add_read_data)
        1    0.000    0.000    0.000    0.000 gzip.py:298(_read_eof)
        1    0.000    0.000    0.000    0.000 gzip.py:312(close)
        1    0.000    0.000    0.000    0.000 gzip.py:325(__del__)
   200749    6.934    0.000   17.555    0.000 gzip.py:379(readline)
        2    0.000    0.000    0.000    0.000 gzip.py:39(read32)
        1    0.000    0.000    0.000    0.000 gzip.py:42(open)
        1    0.000    0.000    0.000    0.000 gzip.py:59(__init__)
        1    0.000    0.000   18.597   18.597 profile:0(gunzip_gzip_open())
        0    0.000             0.000          profile:0(profiler)
        1    1.042    1.042   18.597   18.597 test_gzip_speed.py:7(gunzip_gzip_open)

Using popen + gunzip -c...

         200754 function calls in 4.338 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(popen)
   200749    3.578    0.000    3.578    0.000 :0(readline)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.240    0.240    4.338    4.338 <string>:1(?)
        1    0.000    0.000    4.338    4.338 profile:0(gunzip_popen())
        0    0.000             0.000          profile:0(profiler)
        1    0.520    0.520    4.098    4.098 test_gzip_speed.py:21(gunzip_popen)
msg48707 - Author: April King (marumari) Date: 2005-09-04 17:57
See attached text file for the detailed description (that's
much more readable).
msg48708 - Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 14:11
This patch is about a 30% win on Mac OS X i386 using this benchmark:
http://svn.python.org/view/sandbox/trunk/gzipbench/gzipbench.py

I'm going to look to see if there's any other low-hanging fruit in there before I commit.
msg48709 - Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 14:32
Applied in revision 46070
msg48710 - Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:12
I'm reopening this patch -- it seems that these changes have made parsing Apache-style log files MUCH slower (4x on some samples).
msg48711 - Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:29
It turns out the performance difference was due to some... interesting characteristics of that particular log file.

>>> import gzip
>>> lengths = [len(line) for line in gzip.GzipFile('TEST.LOG')]
>>> sum(lengths) / float(len(lengths))
45.60349675165147
>>> max(lengths)
117989
>>> min(lengths)
1

The str-style buffer in this particular example is going to fail miserably reading that one long line.
msg48712 - Author: April King (marumari) Date: 2006-05-22 15:39
Actually, the slow speed in that specific circumstance has nothing to do with the fact that it uses a string-style buffer, which should always be faster than what it used before (an array of strings that was constantly appended to).

The problem with that particular file is how the gzip.readline function auto-optimizes its read size:

if readsize > self.min_readsize:
  self.min_readsize = readsize

So it optimizes its read size to the length of the largest line that it has seen so far. The assumption is that gzipped files are generally going to be a bunch of lines of similar length, not wildly differing lengths.
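To make the failure mode concrete (an illustration using the line lengths measured above, not tracker code):

```python
# The patch's adaptation rule never shrinks min_readsize, so one long
# line permanently inflates every later read.
min_readsize = 100
for line_len in (45, 46, 117989, 44, 45):   # lengths like the log above
    readsize = max(min_readsize, line_len)  # what this call had to read
    if readsize > min_readsize:
        min_readsize = readsize             # the patch's adaptation rule

# Every subsequent ~45-byte line now triggers a ~118 KB read(), a
# find() over it, and a large _unread() slice.
assert min_readsize == 117989
```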
msg48713 - Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:54
The attached patch uses a strategy that provides the same 30%-ish performance boost for the benchmark, and also provides a small performance boost (about 9% or so) for the very strange log file.

The key is to use lists for buffering, and to never allow the default buffer size to grow too large (a starting point of around 512 seems to be a sweet spot). This avoids working with large strings more often than necessary.
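A sketch of that strategy (the ~512 starting size comes from the message above; the helper's shape and the 64 KB growth cap are assumptions, not the committed code):

```python
def readline_list_buffered(read, unread, start_size=512):
    """Buffer chunks in a list and join once; reset to a small read
    size on every call so one long line cannot inflate later reads."""
    bufs = []
    readsize = start_size
    while True:
        chunk = read(readsize)
        i = chunk.find("\n")
        if i >= 0:                           # newline found: keep it,
            bufs.append(chunk[:i + 1])       # push the rest back
            unread(chunk[i + 1:])
            break
        if not chunk:                        # EOF
            break
        bufs.append(chunk)
        readsize = min(readsize * 2, 65536)  # grow within this call only
    return "".join(bufs)                     # single join at the end


# Tiny in-memory source standing in for the decompressed stream:
class Source:
    def __init__(self, s):
        self.s, self.pos = s, 0
    def read(self, n):
        chunk = self.s[self.pos:self.pos + n]
        self.pos += len(chunk)
        return chunk
    def unread(self, buf):
        self.pos -= len(buf)
```

Because readsize resets to start_size on each call, a single 100 KB line costs a few extra read() calls for that line only, instead of taxing every line that follows.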
msg48714 - Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:59
Applied in revision 46075
msg48715 - Author: April King (marumari) Date: 2006-05-22 16:08
There was generally a 5-10% speed improvement from using a string. This is because the cost of recreating the string by appending was less than the cost of creating an array, appending to the array, and then joining it back together.

I would recommend trying to leave it as a string, but changing this:

if readsize > self.min_readsize:
  self.min_readsize = int(self.min_readsize * 1.25)

(or some other scaling factor)
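Sketching that suggestion (the 1.25 factor is from the snippet above; the loop and line lengths are illustrative), the remembered size would creep up instead of jumping to the longest line seen:

```python
# Illustration of the suggested change: scale min_readsize by a factor
# instead of setting it to the largest read so far.
min_readsize = 100
for needed in (45, 117989, 46, 46):          # one pathological line
    readsize = max(min_readsize, needed)     # bytes this call had to read
    if readsize > min_readsize:
        min_readsize = int(min_readsize * 1.25)  # creep up, don't jump

# The huge line only nudges the remembered size from 100 to 125 bytes,
# so later short lines are not penalized.
assert min_readsize == 125
```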
msg48716 - Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 16:40
Using a string is over 4x slower (at least on this platform) if the strings get large; that's not acceptable. Using a list is a compromise that provides good (but not optimal) performance when dealing with lines of arbitrary length.
msg48717 - Author: April King (marumari) Date: 2006-05-22 16:47
Okie dokie.  30% is still a welcome speedup.  :)
History
Date                 User      Action  Args
2005-09-04 17:53:43  marumari  create