

Classification
Title: Speed up gzip.readline (~40%)
Type:    Stage:
Components: Library (Lib)    Versions:

Process
Status: closed    Resolution: accepted
Dependencies:    Superseder:
Assigned To: bob.ippolito    Nosy List: april, bob.ippolito
Priority: normal    Keywords: patch

Created on 2005-09-04 17:53 by april, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name                   Uploaded                          Description
gzip.diff                   april, 2005-09-04 17:53           Patch to gzip.readline
gzip.txt                    april, 2005-09-04 17:57           Detailed description (since SF mangles the text)
gzip-1281707-r46074.diff    bob.ippolito, 2006-05-22 15:54    Revised patch
Messages (12)
msg48706 - (view) Author: April King (april) Date: 2005-09-04 17:53
See bug 849046 for history.  This patch passes both the regression test and
the standard test.  Hopefully the extra information below won't be too
difficult to read.  I can attach this info to the bug, if need be.

Fixed:
  - Add self.min_readsize to __init__.  This follows the principle that lines
    are likely to be about the same length, and avoids starting over from a
    minimum-length read on every call to readline().
  - Rewrite the assignment of readsize and size at the beginning of the
    function.  Eliminates almost all calls to min().
  - Change bufs to a string rather than an array.  There is no point in using
    an array when all you do with it is "".join(bufs); string addition is used
    instead.
  - Remove extra assignments to bufs (in return()).
  - Make readline() much more readable (loop reordering, more comments); a
    rough sketch of the resulting strategy follows this list.
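Here is a minimal sketch of that strategy (this is not the actual gzip.diff;
the wrapper class, its names, and the example input are purely illustrative):

import io

# Sketch: remember the longest line seen so far in min_readsize, build the
# line by string addition, and push any bytes read past the newline back for
# the next call (the role gzip's _unread() plays).
class LineReaderSketch:
    def __init__(self, raw, min_readsize=100):
        self.raw = raw                      # any object with a read(size) method
        self.min_readsize = min_readsize    # persists across readline() calls
        self._pushback = ""                 # stand-in for GzipFile._unread()

    def _read(self, size):
        # Serve pushed-back data first, topping it up from the raw stream.
        data, self._pushback = self._pushback[:size], self._pushback[size:]
        if len(data) < size:
            data += self.raw.read(size - len(data))
        return data

    def readline(self):
        readsize = self.min_readsize        # start at the learned size, not 100
        line = ""                           # plain string instead of a list of chunks
        while True:
            chunk = self._read(readsize)
            i = chunk.find("\n")
            if i >= 0:                      # newline found: keep it, push back the rest
                line += chunk[:i + 1]
                self._pushback = chunk[i + 1:] + self._pushback
                break
            line += chunk
            if len(chunk) < readsize:       # end of file
                break
            readsize *= 2                   # grow within this call only
        if len(line) > self.min_readsize:   # remember the longest line seen so far
            self.min_readsize = len(line)
        return line

# Example: later calls start from the learned read size.
reader = LineReaderSketch(io.StringIO("alpha\nbeta\ngamma\n"))
print([reader.readline() for _ in range(3)])   # ['alpha\n', 'beta\n', 'gamma\n']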

Recommendations:
  - Delete the _unread() function.  It is used _only_ by readline(), and
    moving its functionality into readline() itself saves the function call
    overhead.  _unread() is only 3 lines long.  Testing shows that removing it
    speeds readline() up by about 3%.  Backwards compatibility concerns?

Testing results:
test_append (__main__.TestGzip) ... ok
test_many_append (__main__.TestGzip) ... ok
test_mode (__main__.TestGzip) ... ok
test_read (__main__.TestGzip) ... ok
test_readline (__main__.TestGzip) ... ok
test_readlines (__main__.TestGzip) ... ok
test_seek_read (__main__.TestGzip) ... ok
test_seek_write (__main__.TestGzip) ... ok
test_write (__main__.TestGzip) ... ok

----------------------------------------------------------------------
Ran 9 tests in 0.331s

Regression tests:
python regrtest.py -g test_gzip.py
test_gzip
1 test OK.

---

Profiling Results (performed on a common compressed log
file - 200748 lines).

With patch...

         1213961 function calls in 12.188 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(close)
     1159    0.020    0.000    0.020    0.000 :0(crc32)
     1158    0.100    0.000    0.100    0.000 :0(decompress)
        1    0.000    0.000    0.000    0.000 :0(decompressobj)
   200774    0.812    0.000    0.812    0.000 :0(find)
   403865    0.902    0.000    0.902    0.000 :0(len)
     1183    0.000    0.000    0.000    0.000 :0(min)
        2    0.000    0.000    0.000    0.000 :0(ord)
     1173    0.000    0.000    0.000    0.000 :0(read)
       12    0.000    0.000    0.000    0.000 :0(seek)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
       18    0.000    0.000    0.000    0.000 :0(tell)
        2    0.000    0.000    0.000    0.000 :0(unpack)
        1    0.000    0.000   12.188   12.188 <string>:1(?)
        1    0.000    0.000    0.000    0.000 gzip_new.py:156(_init_read)
        1    0.000    0.000    0.000    0.000 gzip_new.py:160(_read_gzip_header)
        3    0.000    0.000    0.000    0.000 gzip_new.py:18(U32)
   200774    2.453    0.000    2.593    0.000 gzip_new.py:207(read)
   200749    2.894    0.000    3.796    0.000 gzip_new.py:239(_unread)
     1166    0.010    0.000    0.140    0.000 gzip_new.py:244(_read)
        1    0.000    0.000    0.000    0.000 gzip_new.py:27(LOWU32)
     1158    0.010    0.000    0.030    0.000 gzip_new.py:294(_add_read_data)
        1    0.000    0.000    0.000    0.000 gzip_new.py:300(_read_eof)
        1    0.000    0.000    0.000    0.000 gzip_new.py:314(close)
        1    0.000    0.000    0.000    0.000 gzip_new.py:327(__del__)
   200749    3.916    0.000   11.117    0.000 gzip_new.py:384(readline)
        2    0.000    0.000    0.000    0.000 gzip_new.py:39(read32)
        1    0.000    0.000    0.000    0.000 gzip_new.py:42(open)
        1    0.000    0.000    0.000    0.000 gzip_new.py:60(__init__)
        1    0.000    0.000   12.188   12.188 profile:0(gunzip_gzip_new_open())
        0    0.000             0.000          profile:0(profiler)
        1    1.071    1.071   12.188   12.188 test_gzip_speed.py:14(gunzip_gzip_new_open)

Without patch...

         2073328 function calls in 18.597 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   243820    0.735    0.000    0.735    0.000 :0(append)
        1    0.000    0.000    0.000    0.000 :0(close)
     1159    0.040    0.000    0.040    0.000 :0(crc32)
     1158    0.100    0.000    0.100    0.000 :0(decompress)
        1    0.000    0.000    0.000    0.000 :0(decompressobj)
   243820    0.960    0.000    0.960    0.000 :0(find)
   200749    0.801    0.000    0.801    0.000 :0(join)
   489958    1.330    0.000    1.330    0.000 :0(len)
   243820    0.791    0.000    0.791    0.000 :0(min)
        2    0.000    0.000    0.000    0.000 :0(ord)
     1173    0.030    0.000    0.030    0.000 :0(read)
        6    0.000    0.000    0.000    0.000 :0(seek)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        6    0.000    0.000    0.000    0.000 :0(tell)
        2    0.000    0.000    0.000    0.000 :0(unpack)
        1    0.000    0.000   18.597   18.597 <string>:1(?)
        1    0.000    0.000    0.000    0.000 gzip.py:154(_init_read)
        1    0.000    0.000    0.000    0.000 gzip.py:158(_read_gzip_header)
        3    0.000    0.000    0.000    0.000 gzip.py:18(U32)
   243820    2.711    0.000    2.921    0.000 gzip.py:205(read)
   200749    3.083    0.000    4.143    0.000 gzip.py:237(_unread)
     1160    0.010    0.000    0.210    0.000 gzip.py:242(_read)
        1    0.000    0.000    0.000    0.000 gzip.py:27(LOWU32)
     1158    0.030    0.000    0.070    0.000 gzip.py:292(_add_read_data)
        1    0.000    0.000    0.000    0.000 gzip.py:298(_read_eof)
        1    0.000    0.000    0.000    0.000 gzip.py:312(close)
        1    0.000    0.000    0.000    0.000 gzip.py:325(__del__)
   200749    6.934    0.000   17.555    0.000 gzip.py:379(readline)
        2    0.000    0.000    0.000    0.000 gzip.py:39(read32)
        1    0.000    0.000    0.000    0.000 gzip.py:42(open)
        1    0.000    0.000    0.000    0.000 gzip.py:59(__init__)
        1    0.000    0.000   18.597   18.597 profile:0(gunzip_gzip_open())
        0    0.000             0.000          profile:0(profiler)
        1    1.042    1.042   18.597   18.597 test_gzip_speed.py:7(gunzip_gzip_open)

Using popen + gunzip -c...

         200754 function calls in 4.338 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(popen)
   200749    3.578    0.000    3.578    0.000 :0(readline)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.240    0.240    4.338    4.338 <string>:1(?)
        1    0.000    0.000    4.338    4.338 profile:0(gunzip_popen())
        0    0.000             0.000          profile:0(profiler)
        1    0.520    0.520    4.098    4.098 test_gzip_speed.py:21(gunzip_popen)
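
For reference, the popen baseline presumably looks something like the sketch
below; test_gzip_speed.py itself is not attached, so everything except the
function name gunzip_popen (taken from the profile output above) is a guess:

import os

def gunzip_popen(path="TEST.LOG.gz"):       # hypothetical path
    # Stream the decompressed data through an external gunzip process; the
    # readline() calls on the pipe are the :0(readline) entries above.
    pipe = os.popen("gunzip -c " + path)
    count = 0
    while True:
        line = pipe.readline()
        if not line:
            break
        count += 1
    pipe.close()
    return count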
msg48707 - (view) Author: April King (april) Date: 2005-09-04 17:57
See attached text file for the detailed description (that's
much more readable).
msg48708 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 14:11
This patch is about a 30% win on Mac OS X i386 using this benchmark:
http://svn.python.org/view/sandbox/trunk/gzipbench/gzipbench.py

I'm going to look to see if there's any other low hanging fruit in there
before I commit.
msg48709 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 14:32
Applied in revision 46070
msg48710 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:12
I'm reopening this patch -- it seems that these changes have made parsing 
Apache style log files MUCH slower (4x on some samples).
msg48711 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:29
It turns out the performance difference was due to some... interesting
characteristics of that particular log file.

>>> import gzip
>>> lengths = [len(line) for line in gzip.GzipFile('TEST.LOG')]
>>> sum(lengths) / float(len(lengths))
45.60349675165147
>>> max(lengths)
117989
>>> min(lengths)
1

The str style buffer in this particular example is going to fail miserably 
reading that one long line.
msg48712 - (view) Author: April King (april) Date: 2006-05-22 15:39
Actually, the slow speed in that specific circumstance has nothing to do with
the fact that it uses a string-style buffer, which should always be faster
than what it used before (an array of strings that was constantly appended to).

The problem with that particular file is how the gzip.readline function
auto-optimizes its read size.

if readsize > self.min_readsize:
  self.min_readsize = readsize

So, it optimizes its read size to the length of the largest line it has seen
so far.  The assumption is that gzipped files are generally going to be a
bunch of lines of similar length, not wildly differing lengths.
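
A tiny illustration of that effect (the line lengths are made up apart from
the 117989-character outlier, and line length stands in for the final
readsize of each call): once the huge line has been seen, every later
readline() starts with a ~118 KB read even though the typical line is only
~46 characters.

min_readsize = 100
for line_length in [40, 52, 117989, 44, 46, 45]:    # hypothetical line lengths
    readsize = min_readsize        # size of the first read in this readline() call
    if line_length > min_readsize:
        min_readsize = line_length # the greedy update described above
    print(readsize)
# -> 100, 100, 100, 117989, 117989, 117989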
msg48713 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:54
The attached patch uses a strategy that provides the same 30%-ish performance
boost for the benchmark, and also provides a small performance boost (about 9%
or so) for the very strange log file.

The key is to use lists for buffering, and to never allow the default buffer
size to grow too large (a 512-ish starting point seems to be a sweet spot).
This defends against working with large strings more often than necessary.
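
A minimal sketch of that strategy, assuming it works roughly as described (the
authoritative version is the attached gzip-1281707-r46074.diff; the class, its
names, and the exact cap here are illustrative):

import io

class CappedLineReader:
    def __init__(self, read, unread, min_readsize=100):
        self._read = read                  # read(n): return up to n characters
        self._unread = unread              # unread(s): push s back onto the stream
        self.min_readsize = min_readsize   # persistent default chunk size

    def readline(self):
        readsize = self.min_readsize
        bufs = []                          # buffer chunks in a list, join once
        while True:
            chunk = self._read(readsize)
            i = chunk.find("\n")
            if i >= 0:                     # newline found: keep it, push back the rest
                bufs.append(chunk[:i + 1])
                self._unread(chunk[i + 1:])
                break
            bufs.append(chunk)
            if len(chunk) < readsize:      # end of file
                break
            readsize *= 2                  # grow within this call only
        # Let the persistent default grow, but never past ~512 bytes, so one
        # enormous line cannot lock in a huge starting read for every call.
        self.min_readsize = min(max(readsize, self.min_readsize), 512)
        return "".join(bufs)

# Example with a simple pushback buffer standing in for GzipFile's _unread():
pending = [""]
raw = io.StringIO("short\n" + "x" * 117989 + "\n" + "another\n")

def read(n):
    data, pending[0] = pending[0][:n], pending[0][n:]
    if len(data) < n:
        data += raw.read(n - len(data))
    return data

def unread(s):
    pending[0] = s + pending[0]

r = CappedLineReader(read, unread)
print([len(r.readline()) for _ in range(3)])   # [6, 117990, 8]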
msg48714 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 15:59
Applied in revision 46075
msg48715 - (view) Author: April King (april) Date: 2006-05-22 16:08
There was generally a 5-10% speed improvement from using a string.  This is
because the cost of recreating the string by appending was less than the cost
of creating an array, appending to the array, and then joining it back
together.

I would recommend trying to leave it as a string, but changing this:

if readsize > self.min_readsize:
  self.min_readsize = int(self.min_readsize * 1.25)

(or some kind of scaling factor)
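
An illustration with made-up numbers of what that scaling factor would do:
even after a pathologically long line, the remembered read size only creeps
up by 25% per call instead of jumping straight to the size needed for the
longest line seen.

min_readsize = 100
for readsize_this_call in [100, 118000, 100, 125, 156]:   # hypothetical final readsizes
    if readsize_this_call > min_readsize:
        min_readsize = int(min_readsize * 1.25)
    print(min_readsize)
# -> 100, 125, 125, 125, 156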
msg48716 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2006-05-22 16:40
Using a string is over 4x slower (at least on this platform) if the strings
get large; that's not acceptable.  Using a list is a compromise that provides
good (but not optimal) performance when dealing with lines of arbitrary length.
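
A rough micro-benchmark sketch of the trade-off being argued, comparing
repeated string addition against list-append-plus-join for a line that arrives
in many small chunks (timings will vary; in particular, modern CPython
optimizes in-place str concatenation on local variables, so the gap observed
on the 2006-era interpreter may not reproduce):

import timeit

chunks = ["x" * 100] * 1200          # a ~120 KB "line", like the 117989-char outlier

def build_with_str():
    buf = ""
    for c in chunks:
        buf += c                     # repeated string addition
    return buf

def build_with_list():
    bufs = []
    for c in chunks:
        bufs.append(c)               # append cheap pieces, join once at the end
    return "".join(bufs)

print(timeit.timeit(build_with_str, number=50))
print(timeit.timeit(build_with_list, number=50))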
msg48717 - (view) Author: April King (april) Date: 2006-05-22 16:47
Okie dokie.  30% is still a welcome speedup.  :)
History
Date                 User    Action    Args
2022-04-11 14:56:12  admin   set       github: 42342
2005-09-04 17:53:43  april   create