classification
Title: GZipFile.readline too slow
Type: performance Stage: committed/rejected
Components: Library (Lib) Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: asnakelover, brian.curtin, jackdied, nirai, pitrou
Priority: normal Keywords: patch

Created on 2009-12-10 16:11 by asnakelover, last changed 2010-01-03 22:40 by pitrou. This issue is now closed.

Files
File name Uploaded Description
gzip_7471_patch.diff nirai, 2009-12-13 09:38 Patch for Python 3.2
gzip_7471_io_patch.diff nirai, 2009-12-16 13:02 Patch for Python 2.7 using the io module.
gzip_7471_py27.diff nirai, 2009-12-20 07:34 Combined patch for Python 2.7
gzip_7471_py32.diff nirai, 2009-12-20 07:35 Combined patch for Python 3.2
Messages (27)
msg96204 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 16:11
It's not a big problem because we can just shell out to zcat, but compare
these two:

$ time python ./b.py >/dev/null

real    0m10.977s
user    0m7.128s
sys     0m0.888s
$ time python ./a.py >/dev/null

real    1m19.015s
user    1m18.185s
sys     0m0.072s

$ # Notice that the gzip module (a.py) had the benefit of the files
being in the disk cache by now, too...

$ cat a.py 
import gzip
import os

apt_cache_dir = "/var/cache/apt/apt-file"

for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    for line in f:
        print line

$ cat b.py 
import os
import subprocess
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"

for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    p = subprocess.Popen(["zcat", os.path.join(apt_cache_dir,
apt_cache_file)],
        stdout = subprocess.PIPE)
    f = StringIO(p.communicate()[0])
    assert p.returncode == 0 
    for line in f:
        print line

Also tried this one just for "completeness":

$ cat c.py 
import gzip
import os
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"

for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    f = StringIO(f.read())
    for line in f:
        print line

But after it had run (with some thrashing) for three and a half minutes, I
killed it.
msg96205 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 16:13
How does the following compare?

f = gzip.open(...)
s = f.read()
for line in s.splitlines():
    print line
msg96206 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 16:15
(GZipFile.readline() is implemented in pure Python, which explains why
it's rather slow)
msg96217 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 22:27
Hope this reply works right; the Python bug interface is a bit confusing
for this newbie - it doesn't say "Reply" anywhere. Sorry if it goes FUBAR.

I tried the splitlines() version you suggested, it thrashed my machine
so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
about 1 minute so I didn't lose anything important. About half a minute
later the machine came back to life. In other words: the splitlines
version used way, way too much memory - far worse even than making a
cStringIO from a GzipFile instance.read().

It's not just a GzipFile.readline() issue either, c.py calls .read() and
tries to turn the result into a cStringIO and that was the worst one of
my three previous tests. I'm going to look at this purely from a
consumer angle and not even look at gzip module source, from this angle
(a consumer), zcat out performs it by a factor of 10 when gzip module is
used with .readline() and by a good deal more when I try to read the
whole gzip file as a string to turn into a cStringIO to emulate as
closely as possible what happens with forking a zcat process. When I
tried to splitlines() it was even worse. This is probably a RAM issue,
but it just brings us back to - should gzip module eat so much ram when
shelling out to zcat uses far less?
msg96219 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 22:38
> I tried the splitlines() version you suggested, it thrashed my machine
> so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
> about 1 minute so I didn't lose anything important.

This sounds very weird. How much memory do you have, and how large are
the .gz files you are reading? What is your system exactly?
msg96220 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 22:54
The gz in question is 17 MB compressed and 247 MB uncompressed. When
forking zcat, the Python process uses between 250 and 260 MB with the
whole string in memory. Numbers for the gzip module aren't obtainable
except for readline(), which doesn't use much memory but is very slow;
other methods thrash the machine to death.

The machine has 300 MB free RAM from a total of 1024 MB.

It's not a new issue, I didn't find it when searching python bug
archives for some reason but google:

http://www.google.com/search?q=python+gzip+slow+zcat

It's an issue which people have stumbled into before, but nobody seems to
have made any headway (I expect they just forked and called zcat or gzip,
like I did...).

It's a minor point since it's so easy to work around - in fact I have my
IDS system completed already; thank you so much, Python, for being so RAD
friendly - yet as it stands the module is worthless to any user who has
the zcat/gzip C programs and knows they exist.
msg96221 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 23:03
> The gz in question is 17 MB compressed and 247 MB uncompressed. When
> forking zcat, the Python process uses between 250 and 260 MB with the
> whole string in memory. Numbers for the gzip module aren't obtainable
> except for readline(), which doesn't use much memory but is very slow;
> other methods thrash the machine to death.
> 
> The machine has 300 MB free RAM from a total of 1024 MB.

That would be the explanation. Reading the whole file at once and then
doing splitlines() on the result consumes twice the memory, since a list
of lines must be constructed while the original data is still around. If
you had more than 600 MB free RAM the splitlines() solution would
probably be adequate :-)

Doing repeated calls to splitlines() on chunks of limited size (say 1MB)
would probably be fast enough without using too much memory. It would be
a bit less trivial to implement though, and it seems you are ok with the
subprocess solution.
msg96222 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 23:06
Yes, subprocess works fine and was the quickest to implement and
probably the fastest to run too.

How can I put this without being an ass? Hell, I'm no good at diplomacy
- the gzip module blows chunks. If I can shell out to a standard Unix
util and it uses a tiny fraction of the memory to do the same job, the
module is inherently broken, no matter how pretty its code looks.

You can of course tell me to STFU - but look around the web I won't be
the first or last user to bemoan "import gzip"... :)
msg96234 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-11 07:25
> How can I put this without being an ass? Hell, I'm no good at diplomacy
> - the gzip module blows chunks. If I can shell out to a standard Unix
> util and it uses a tiny fraction of the memory to do the same job, the
> module is inherently broken, no matter how pretty its code looks.

Well, let's say it is suboptimal. But it's certainly ok if you don't
have any tight performance constraints.
In any case, it won't improve until someone proposes a patch :-)
msg96272 - (view) Author: Jack Diederich (jackdied) * (Python committer) Date: 2009-12-11 20:32
I tried passing a size to readline to see if increasing the chunk helps
(the test file was 120 MB with 700k lines).  For values of 1k-10k all runs
took around 30 seconds; with a value of 100 it took 80 seconds, and with a
value of 100k it ran for several minutes before I killed it.  The default
starts at 100 and quickly maxes out at 512, which seems to be a sweet spot
(thanks to whoever figured that out!).

I profiled it and function overhead seems to be the real killer.  30% of
the time is spent in readline().  The next() function does almost
nothing and consumes 1/4th the time of readline().  Ditto for read() and
_unread().  Even lowly len() consumes 1/3rd the time of readline()
because it is called over 2 million times.

There doesn't seem to be any way to speed this up without rewriting the
whole thing as a C module.  I'm closing the bug WONTFIX.
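For reference, the readline(size) experiment can be reproduced with a rough harness along these lines (the helper name and toy data are illustrative; the original 120 MB test file is not reproduced here):

```python
import gzip
import io
import timeit

def time_readline(data, size, repeat=3):
    """Compress `data` into an in-memory gzip stream, then time reading
    it back line by line with readline(size)."""
    blob = io.BytesIO()
    with gzip.GzipFile(fileobj=blob, mode="wb") as g:
        g.write(data)

    def run():
        blob.seek(0)
        f = gzip.GzipFile(fileobj=blob, mode="rb")
        while f.readline(size):
            pass

    return timeit.timeit(run, number=repeat)
```

Comparing e.g. time_readline(data, 100), time_readline(data, 512), and time_readline(data, 100 * 1024) on a large enough input shows the same general shape as the numbers above.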
msg96326 - (view) Author: Nir Aides (nirai) Date: 2009-12-13 09:38
First patch, please forgive long comment :)

I submit a small patch which speeds up readline() on my data set - a 
74MB (5MB .gz) log file with 600K lines.

The speedup is 350%.

Source of slowness is that (~20KB) extrabuf is allocated/deallocated in 
read() and _unread() with each call to readline().

In the patch read() returns a slice from extrabuf and defers 
manipulation of extrabuf to _read().

In the following, the first timeit() corresponds to reading extrabuf 
slices while the second timeit() corresponds to read() and _unread() as 
they are done today:

>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215

>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
5.843876838684082

Another speedup is achieved by doing a small shortcut in readline() for 
the typical case in which the entire line is already in extrabuf.

The patch only addresses the typical case of calling readline() with no 
arguments. It does not address other problems in readline() logic. In 
particular the current 512 chunk size is not a sweet spot. Regardless of 
the size argument passed to readline(), read() will continue to 
decompress just 1024 bytes with each call as the size of extrabuf swings 
around the target size argument as result of the interaction between 
_unread() and read().
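The core idea can be shown in miniature. This toy class is not the actual patch code; it just mirrors the patch's extrabuf/extrastart/extrasize scheme to show why read() becomes a cheap slice instead of slicing and re-concatenating the buffer on every call:

```python
import io

class SliceBufferReader:
    """Toy illustration of the extrabuf offset scheme."""

    def __init__(self, raw):
        self.raw = raw        # any object with .read(n) returning bytes
        self.extrabuf = b""   # decompressed-but-unread data
        self.extrastart = 0   # current offset into extrabuf
        self.extrasize = 0    # bytes remaining past extrastart

    def _fill(self, n):
        chunk = self.raw.read(n)
        if chunk:
            # Drop the consumed prefix once, then append the new data.
            self.extrabuf = self.extrabuf[self.extrastart:] + chunk
            self.extrastart = 0
            self.extrasize = len(self.extrabuf)

    def read(self, size):
        if self.extrasize < size:
            self._fill(max(1024, size))
        data = self.extrabuf[self.extrastart:self.extrastart + size]
        self.extrastart += len(data)
        self.extrasize -= len(data)
        return data
```

The buffer is only rebuilt inside _fill(); the hot path in read() is a single slice plus two integer updates, which is the source of the speedup described above.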
msg96409 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 21:57
Ah, a patch. Now we're talking :)
msg96412 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 22:09
The patch doesn't apply against the SVN trunk (some parts are rejected).
I suppose it was done against 2.6 or earlier, but those versions are in
bug fixing-only mode (which excludes performance improvements), so
you'll have to regenerate it against the SVN trunk.

(for info about our SVN repository, see
http://www.python.org/dev/faq/#how-do-i-get-a-checkout-of-the-repository-read-only-and-read-write
)
msg96413 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 22:11
Ah, my bad, I hadn't seen that the patch is for 3.2. Sorry for the
confusion.
msg96416 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 22:40
I confirm that the patch gives good speedups. It would be nice if there
was a comment explaining what extrabuf, extrastart and extrasize are.

In 3.x, a better but more involved approached would be to rewrite the
gzip module so as to take advantage of the standard IO classes
(especially BufferedReader / BufferedWriter), which are written in C and
are therefore much faster.
msg96487 - (view) Author: Nir Aides (nirai) Date: 2009-12-16 13:02
Right, using the io module makes GzipFile as fast as zcat.

I submit a new patch this time for Python 2.7, however, it is not a 
module rewrite, but again minimal refactoring.

GzipFile is now derived from io.BufferedRandom, and as a result the 
functionality of GzipFile has changed, since it used to behave differently 
from the io classes. For example, write() did not return the number of 
bytes written, etc... Also, I hacked it a little to make it work with a raw 
stream which is either readable or writable but not both. If that is 
unacceptable, or if you prefer a module rewrite, I don't mind discussing 
and evolving this further.
msg96537 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-17 22:07
Thanks for the new patch. The problem with inheriting from
BufferedRandom, though, is that if you call e.g. write() on a read-only
gzipfile, it will appear to succeed because the bytes are buffered
internally.

I think the solution would be to use delegation rather than inheritance.
Something like:

    def __init__(self, ...):
        if 'w' in mode:
            self.buf = BufferedWriter(...)
        for meth in ('read', 'write', etc.):
            setattr(self, meth, getattr(self.buf, meth))

It would also be nice to add some tests for the issues I mentioned
earlier (check that IOError is raised when reading a write-only file,
and vice-versa).

By the way, we can't apply this approach to 2.x since
BufferedWriter/BufferedRandom won't accept unicode arguments for write()
and friends, and that would break compatibility. For 3.x it is fine though.
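The suggested tests might be sketched like this, using in-memory streams (in current CPython, reading a write-only GzipFile and writing a read-only one both raise OSError, of which IOError is an alias; the helper name is illustrative):

```python
import gzip
import io

def check_mode_errors():
    """Return (read_on_write_raises, write_on_read_raises)."""
    buf = io.BytesIO()
    w = gzip.GzipFile(fileobj=buf, mode="wb")
    try:
        w.read(1)                 # must fail: file opened for writing
        read_on_write = False
    except OSError:
        read_on_write = True
    w.write(b"data")
    w.close()

    buf.seek(0)
    r = gzip.GzipFile(fileobj=buf, mode="rb")
    try:
        r.write(b"x")             # must fail: file opened for reading
        write_on_read = False
    except OSError:
        write_on_read = True
    r.close()
    return read_on_write, write_on_read
```

This is exactly the failure mode the delegation approach has to preserve: with naive inheritance from BufferedRandom, the write() above would appear to succeed because the bytes sit in the internal buffer.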
msg96538 - (view) Author: Nir Aides (nirai) Date: 2009-12-17 22:23
How about using the first patch with the slicing optimization and 
additionally enhancing GzipFile with the methods required to make it 
play nice as a raw stream to an io.BufferedReader object (readable(), 
writable(), readinto(), etc...).

This way we still get the 350% speed up and keep it fully backward 
compatible, and if someone needs the extra performance they can feed it 
into an io.BufferedReader object thus:

g = gzip.GzipFile(...)
r = io.BufferedReader(g)
for line in r:
    ...
msg96539 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-17 22:33
> How about using the first patch with the slicing optimization and 
> additionally enhancing GzipFile with the methods required to make it 
> play nice as a raw stream to an io.BufferedReader object (readable(), 
> writable(), readinto(), etc...).

That's fine with me.
msg96554 - (view) Author: Nir Aides (nirai) Date: 2009-12-18 08:47
Submitted a combined patch for Python 2.7. 
If it's good I'll send one for Python 3.2.
msg96565 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2009-12-18 14:36
In the test, should you verify that the correct data comes back from
io.BufferedReader? After the r.close() on line 90 of test_gzip.py,
adding something like "self.assertEqual("".join(lines), data1*50)" would
do the trick. 
Looks good.
msg96566 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-18 15:01
Two things:
- since it implements common IO operations, the GzipFile class could
inherit io.BufferedIOBase. It would also alleviate the need to
reimplement readinto(): BufferedIOBase has a default implementation
which should be sufficient.
- rather than `type(data) is memoryview`, the preferred idiom is
`isinstance(data, memoryview)`.
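The readinto() point is easy to demonstrate with a toy io.BufferedIOBase subclass (illustrative only, not the gzip patch): once read() is defined, the inherited readinto() works without being reimplemented.

```python
import io

class BytesReader(io.BufferedIOBase):
    """Minimal readable stream over a bytes object."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def readable(self):
        return True

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk
```

The default BufferedIOBase.readinto() is implemented in terms of read(), so the subclass gets it for free.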
msg96643 - (view) Author: Nir Aides (nirai) Date: 2009-12-19 19:31
isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and 
__enter__() raises ValueError with a different (generic) message. 

Should we keep the original GzipFile methods or prefer the implementation 
of io.BufferedIOBase?
msg96650 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-19 21:19
> isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and 
> __enter__() raises ValueError with a different (generic) message. 
> 
> Should we keep the original GzipFile methods or prefer the implementation 
> of io.BufferedIOBase?

It's fine to use the BufferedIOBase implementation. There's no reason to
call isatty() on or iterate over a closed file.
msg96662 - (view) Author: Nir Aides (nirai) Date: 2009-12-20 07:34
Uploaded updated patch for Python 2.7.
msg96663 - (view) Author: Nir Aides (nirai) Date: 2009-12-20 07:35
Uploaded patch for Python 3.2.
msg97182 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-03 22:40
The patches have been committed. Thank you!
History
Date User Action Args
2010-01-03 22:40:04pitrousetstatus: open -> closed
resolution: fixed
messages: + msg97182

stage: patch review -> committed/rejected
2009-12-20 07:35:33niraisetfiles: + gzip_7471_py32.diff

messages: + msg96663
2009-12-20 07:34:43niraisetfiles: + gzip_7471_py27.diff

messages: + msg96662
2009-12-20 07:33:01niraisetfiles: - gzip_7471_py27.diff
2009-12-19 21:19:48pitrousetmessages: + msg96650
2009-12-19 19:31:33niraisetmessages: + msg96643
2009-12-18 15:01:24pitrousetmessages: + msg96566
2009-12-18 14:36:01brian.curtinsetmessages: + msg96565
2009-12-18 08:47:24niraisetfiles: + gzip_7471_py27.diff

messages: + msg96554
2009-12-17 22:33:38pitrousetmessages: + msg96539
2009-12-17 22:23:33niraisetmessages: + msg96538
2009-12-17 22:07:13pitrousetmessages: + msg96537
2009-12-16 13:02:41niraisetfiles: + gzip_7471_io_patch.diff

messages: + msg96487
2009-12-14 22:40:53pitrousetmessages: + msg96416
2009-12-14 22:11:06pitrousetmessages: + msg96413
2009-12-14 22:09:04pitrousetmessages: + msg96412
2009-12-14 21:57:55pitrousetstatus: closed -> open
versions: + Python 2.7, Python 3.2, - Python 2.6
messages: + msg96409

resolution: wont fix -> (no value)
stage: patch review
2009-12-13 09:38:55niraisetfiles: + gzip_7471_patch.diff

nosy: + nirai
messages: + msg96326

keywords: + patch
2009-12-11 20:32:47jackdiedsetstatus: open -> closed

nosy: + jackdied
messages: + msg96272

resolution: wont fix
2009-12-11 07:25:12pitrousetmessages: + msg96234
2009-12-10 23:06:10asnakeloversetmessages: + msg96222
2009-12-10 23:03:21pitrousetmessages: + msg96221
2009-12-10 22:54:36asnakeloversetmessages: + msg96220
2009-12-10 22:38:06pitrousetmessages: + msg96219
2009-12-10 22:27:14asnakeloversetmessages: + msg96217
2009-12-10 20:50:24brian.curtinsetnosy: + brian.curtin
2009-12-10 16:15:49pitrousetpriority: normal

messages: + msg96206
title: gzip module too slow -> GZipFile.readline too slow
2009-12-10 16:13:44pitrousetnosy: + pitrou
messages: + msg96205
2009-12-10 16:11:23asnakelovercreate