GZipFile.readline too slow #51720

Closed
asnakelover mannequin opened this issue Dec 10, 2009 · 27 comments
Labels: performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments


asnakelover mannequin commented Dec 10, 2009

BPO 7471
Nosy @pitrou, @jackdied, @briancurtin
Files
  • gzip_7471_patch.diff: Patch for Python 3.2
  • gzip_7471_io_patch.diff: Patch for Python 2.7 using the io module.
  • gzip_7471_py27.diff: Combined patch for Python 2.7
  • gzip_7471_py32.diff: Combined patch for Python 3.2
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-01-03.22:40:04.182>
    created_at = <Date 2009-12-10.16:11:23.030>
    labels = ['library', 'performance']
    title = 'GZipFile.readline too slow'
    updated_at = <Date 2010-01-03.22:40:04.181>
    user = 'https://bugs.python.org/asnakelover'

    bugs.python.org fields:

    activity = <Date 2010-01-03.22:40:04.181>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-01-03.22:40:04.182>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2009-12-10.16:11:23.030>
    creator = 'asnakelover'
    dependencies = []
    files = ['15536', '15575', '15619', '15620']
    hgrepos = []
    issue_num = 7471
    keywords = ['patch']
    message_count = 27.0
    messages = ['96204', '96205', '96206', '96217', '96219', '96220', '96221', '96222', '96234', '96272', '96326', '96409', '96412', '96413', '96416', '96487', '96537', '96538', '96539', '96554', '96565', '96566', '96643', '96650', '96662', '96663', '97182']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'jackdied', 'nirai', 'brian.curtin', 'asnakelover']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue7471'
    versions = ['Python 2.7', 'Python 3.2']


    asnakelover mannequin commented Dec 10, 2009

    It's not a big problem because we can just shell out to zcat, but compare
    these two:

    $ time python ./b.py >/dev/null

    real 0m10.977s
    user 0m7.128s
    sys 0m0.888s
    $ time python ./a.py >/dev/null

    real 1m19.015s
    user 1m18.185s
    sys 0m0.072s

    $ # Notice that the gzip module (a.py) had the benefit of the files
    being in a disk cache by now, too...
    
    $ cat a.py 
    import gzip
    import os

    apt_cache_dir = "/var/cache/apt/apt-file"

    for apt_cache_file in os.listdir(apt_cache_dir):
        if not apt_cache_file.endswith(".gz"):
            continue
        f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
        for line in f:
            print line
    $ cat b.py 
    import os
    import subprocess
    from cStringIO import StringIO

    apt_cache_dir = "/var/cache/apt/apt-file"

    for apt_cache_file in os.listdir(apt_cache_dir):
        if not apt_cache_file.endswith(".gz"):
            continue
        p = subprocess.Popen(
            ["zcat", os.path.join(apt_cache_dir, apt_cache_file)],
            stdout=subprocess.PIPE)
        f = StringIO(p.communicate()[0])
        assert p.returncode == 0 
        for line in f:
            print line

    Also tried this one just for "completeness":

    $ cat c.py 
    import gzip
    import os
    from cStringIO import StringIO

    apt_cache_dir = "/var/cache/apt/apt-file"

    for apt_cache_file in os.listdir(apt_cache_dir):
        if not apt_cache_file.endswith(".gz"):
            continue
        f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
        f = StringIO(f.read())
        for line in f:
            print line

    But after it had run (with some thrashing) for 3 and a half minutes, I
    killed it.

    asnakelover added the stdlib and performance labels Dec 10, 2009

    pitrou commented Dec 10, 2009

    How does the following compare?

    f = gzip.open(...)
    s = f.read()
    for line in s.splitlines():
        print line


    pitrou commented Dec 10, 2009

    (GZipFile.readline() is implemented in pure Python, which explains why
    it's rather slow)

    pitrou changed the title from "gzip module too slow" to "GZipFile.readline too slow" Dec 10, 2009

    asnakelover mannequin commented Dec 10, 2009

    Hope this reply works right; the Python bug interface is a bit confusing
    for this newbie - it doesn't say "Reply" anywhere - so sorry if it goes FUBAR.

    I tried the splitlines() version you suggested; it thrashed my machine
    so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
    about 1 minute so I didn't lose anything important. About half a minute
    later the machine came back to life. In other words: the splitlines
    version used way, way too much memory - far worse even than making a
    cStringIO from a GzipFile instance.read().

    It's not just a GzipFile.readline() issue either: c.py calls .read() and
    tries to turn the result into a cStringIO, and that was the worst one of
    my three previous tests. I'm going to look at this purely from a
    consumer angle and not even look at the gzip module source. From this
    angle (a consumer's), zcat outperforms it by a factor of 10 when the
    gzip module is used with .readline(), and by a good deal more when I try
    to read the whole gzip file as a string to turn into a cStringIO, to
    emulate as closely as possible what happens with forking a zcat process.
    When I tried splitlines() it was even worse. This is probably a RAM
    issue, but it just brings us back to: should the gzip module eat so much
    RAM when shelling out to zcat uses far less?


    pitrou commented Dec 10, 2009

    > I tried the splitlines() version you suggested; it thrashed my machine
    > so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
    > about 1 minute so I didn't lose anything important.

    This sounds very weird. How much memory do you have, and how large are
    the .gz files you are reading? What is your system exactly?


    asnakelover mannequin commented Dec 10, 2009

    The gz in question is 17 MB compressed and 247 MB uncompressed. Calling
    zcat, the Python process uses between 250 and 260 MB with the whole
    string in memory (using zcat as a fork). Numbers for the gzip module
    aren't obtainable except for readline(), which doesn't use much memory
    but is very slow. Other methods thrash the machine to death.

    The machine has 300 MB free RAM from a total of 1024 MB.

    It's not a new issue, I didn't find it when searching python bug
    archives for some reason but google:

    http://www.google.com/search?q=python+gzip+slow+zcat

    It's an issue which people have stumbled into before, but nobody seems to
    have made any headway (I expect they just forked and called zcat or
    gzip, like I did...).

    It's a minor point since it's so easy to work around; in fact I have my
    IDS system completed already - thank you so much, Python, for being so
    RAD friendly - yet, as it stands, the module is worthless if a user has
    the zcat/gzip C programs and knows they exist.


    pitrou commented Dec 10, 2009

    > The gz in question is 17 MB compressed and 247 MB uncompressed. Calling
    > zcat, the Python process uses between 250 and 260 MB with the whole
    > string in memory (using zcat as a fork). Numbers for the gzip module
    > aren't obtainable except for readline(), which doesn't use much memory
    > but is very slow. Other methods thrash the machine to death.
    >
    > The machine has 300 MB free RAM from a total of 1024 MB.

    That would be the explanation. Reading the whole file at once and then
    doing splitlines() on the result consumes twice the memory, since a list
    of lines must be constructed while the original data is still around. If
    you had more than 600 MB free RAM the splitlines() solution would
    probably be adequate :-)

    Doing repeated calls to splitlines() on chunks of limited size (say 1MB)
    would probably be fast enough without using too much memory. It would be
    a bit less trivial to implement though, and it seems you are ok with the
    subprocess solution.
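
    A rough sketch of that chunked approach, purely for illustration (the 1 MB
    chunk size and the handling of a partial trailing line are assumptions, not
    taken from any patch here):

    import gzip

    CHUNK = 1024 * 1024  # read ~1 MB of decompressed data at a time

    def iter_lines(path):
        f = gzip.open(path)
        pending = ''
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            pending += data
            lines = pending.splitlines(True)   # keep the line endings
            # Hold back a possibly incomplete last line for the next chunk.
            if lines and not lines[-1].endswith('\n'):
                pending = lines.pop()
            else:
                pending = ''
            for line in lines:
                yield line
        if pending:
            yield pending
        f.close()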


    asnakelover mannequin commented Dec 10, 2009

    Yes, subprocess works fine and was the quickest to implement and
    probably the fastest to run too.

    How can I put this without being an ass? Hell, I'm no good at diplomacy

    • the gzip module blows chunks - if I can shell out to a standard Unix
      util and it uses a tiny fraction of the memory to do the same job, the
      module is inherently broken no matter how pretty its code looks.

    You can of course tell me to STFU - but look around the web; I won't be
    the first or last user to bemoan "import gzip"... :)


    pitrou commented Dec 11, 2009

    > How can I put this without being an ass? Hell, I'm no good at diplomacy
    >
    > • the gzip module blows chunks - if I can shell out to a standard Unix
    >   util and it uses a tiny fraction of the memory to do the same job, the
    >   module is inherently broken no matter how pretty its code looks.

    Well, let's say it is suboptimal. But it's certainly ok if you don't
    have any tight performance constraints.
    In any case, it won't improve until someone proposes a patch :-)

    jackdied (Contributor) commented:

    I tried passing a size to readline to see if increasing the chunk helps
    (the test file was 120 MB with 700k lines). For values of 1k-10k all took
    around 30 seconds, with a value of 100 it took 80 seconds, and with a
    value of 100k it ran for several minutes before I killed it. The default
    starts at 100 and quickly maxes out at 512, which seems to be a sweet spot
    (thanks to whoever figured that out!).

    I profiled it and function overhead seems to be the real killer. 30% of
    the time is spent in readline(). The next() function does almost
    nothing and consumes 1/4th the time of readline(). Ditto for read() and
    _unread(). Even lowly len() consumes 1/3rd the time of readline()
    because it is called over 2 million times.

    There doesn't seem to be any way to speed this up without rewriting the
    whole thing as a C module. I'm closing the bug as WONTFIX.


    nirai mannequin commented Dec 13, 2009

    First patch; please forgive the long comment :)

    I submit a small patch which speeds up readline() on my data set - a
    74MB (5MB .gz) log file with 600K lines.

    The speedup is 350%.

    The source of the slowness is that the (~20 KB) extrabuf is allocated and
    deallocated in read() and _unread() with each call to readline().

    In the patch read() returns a slice from extrabuf and defers
    manipulation of extrabuf to _read().

    In the following, the first timeit() corresponds to reading extrabuf
    slices while the second timeit() corresponds to read() and _unread() as
    they are done today:

    >>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
    0.25299811363220215

    >>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
    5.843876838684082

    Another speedup is achieved by doing a small shortcut in readline() for
    the typical case in which the entire line is already in extrabuf.

    The patch only addresses the typical case of calling readline() with no
    arguments. It does not address other problems in readline() logic. In
    particular, the current 512 chunk size is not a sweet spot. Regardless of
    the size argument passed to readline(), read() will continue to
    decompress just 1024 bytes with each call, as the size of extrabuf swings
    around the target size argument as a result of the interaction between
    _unread() and read().
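
    For illustration only, a minimal sketch of the slicing idea (the attribute
    names follow the extrabuf/extrastart/extrasize internals discussed in this
    thread; refilling the buffer via _read() when it runs short is elided, and
    the surrounding GzipFile class is assumed):

    def read(self, size):
        # Return a slice of the already-decompressed buffer instead of
        # slicing, re-joining and re-assigning it on every call.
        offset = self.offset - self.extrastart    # current position in extrabuf
        chunk = self.extrabuf[offset:offset + size]
        self.extrasize -= size
        self.offset += size
        return chunk                              # no _unread() round-trip needed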


    pitrou commented Dec 14, 2009

    Ah, a patch. Now we're talking :)

    pitrou reopened this Dec 14, 2009

    pitrou commented Dec 14, 2009

    The patch doesn't apply against the SVN trunk (some parts are rejected).
    I suppose it was done against 2.6 or earlier, but those versions are in
    bug fixing-only mode (which excludes performance improvements), so
    you'll have to regenerate it against the SVN trunk.

    (for info about our SVN repository, see
    http://www.python.org/dev/faq/#how-do-i-get-a-checkout-of-the-repository-read-only-and-read-write
    )


    pitrou commented Dec 14, 2009

    Ah, my bad, I hadn't seen that the patch is for 3.2. Sorry for the
    confusion.


    pitrou commented Dec 14, 2009

    I confirm that the patch gives good speedups. It would be nice if there
    was a comment explaining what extrabuf, extrastart and extrasize are.

    In 3.x, a better but more involved approach would be to rewrite the
    gzip module so as to take advantage of the standard IO classes
    (especially BufferedReader / BufferedWriter), which are written in C and
    are therefore much faster.


    nirai mannequin commented Dec 16, 2009

    Right, using the io module makes GzipFile as fast as zcat.

    I submit a new patch, this time for Python 2.7; however, it is not a
    module rewrite, but again minimal refactoring.

    GzipFile is now derived from io.BufferedRandom, and as a result the
    functionality of GzipFile has changed, since it used to behave differently
    from the io classes. For example, write() did not return the number of
    bytes written, etc. Also, I hacked it a little to make it work with a raw
    stream which is either readable or writable but not both. If this is
    unacceptable, or if you prefer a module rewrite, I don't mind discussing
    and evolving this further.


    pitrou commented Dec 17, 2009

    Thanks for the new patch. The problem with inheriting from
    BufferedRandom, though, is that if you call e.g. write() on a read-only
    gzipfile, it will appear to succeed because the bytes are buffered
    internally.

    I think the solution would be to use delegation rather than inheritance.
    Something like:

        def __init__(self, ...):
            if 'w' in mode:
                self.buf = BufferedWriter(...)
            for meth in ('read', 'write', etc.):
                setattr(self, meth, getattr(self.buf, meth))

    It would also be nice to add some tests for the issues I mentioned
    earlier (check that IOError is raised when reading a write-only file,
    and vice-versa).

    By the way, we can't apply this approach to 2.x since
    BufferedWriter/BufferedRandom won't accept unicode arguments for write()
    and friends, and that would break compatibility. For 3.x it is fine though.
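
    Such tests might look roughly like this (the file names and test-case names
    are placeholders, not taken from the patch):

    import gzip
    import unittest

    class TestGzipModeErrors(unittest.TestCase):
        def setUp(self):
            # Create a small gzip file so the read-only case has something to open.
            f = gzip.GzipFile("data.gz", "wb")
            f.write("some data\n")
            f.close()

        def test_read_on_write_only_file(self):
            # Reading a file opened for writing should raise IOError,
            # not silently return buffered data.
            f = gzip.GzipFile("data.gz", "wb")
            try:
                self.assertRaises(IOError, f.read)
            finally:
                f.close()

        def test_write_on_read_only_file(self):
            f = gzip.GzipFile("data.gz", "rb")
            try:
                self.assertRaises(IOError, f.write, "more data")
            finally:
                f.close()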


    nirai mannequin commented Dec 17, 2009

    How about using the first patch with the slicing optimization and
    additionally enhancing GzipFile with the methods required to make it
    play nice as a raw stream to an io.BufferedReader object (readable(),
    writable(), readinto(), etc...).

    This way we still get the 350% speed up and keep it fully backward
    compatible, and if someone needs the extra performance they can feed it
    into an io.BufferedReader object thus:

    g = gzip.GzipFile(...)
    r = io.BufferedReader(g)
    for line in r:
        ...


    pitrou commented Dec 17, 2009

    > How about using the first patch with the slicing optimization and
    > additionally enhancing GzipFile with the methods required to make it
    > play nice as a raw stream to an io.BufferedReader object (readable(),
    > writable(), readinto(), etc...).

    That's fine with me.


    nirai mannequin commented Dec 18, 2009

    Submitted a combined patch for Python 2.7.
    If it's good I'll send one for Python 3.2.

    briancurtin (Member) commented:

    In the test, should you verify that the correct data comes back from
    io.BufferedReader? After the r.close() on line 90 of test_gzip.py,
    adding something like "self.assertEqual("".join(lines), data1*50)" would
    do the trick.
    Looks good.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 18, 2009

    Two things:

    • since it implements common IO operations, the GzipFile class could
      inherit io.BufferedIOBase. It would also alleviate the need to
      reimplement readinto(): BufferedIOBase has a default implementation
      which should be sufficient.
    • rather than type(data) is memoryview, the preferred idiom is
      isinstance(data, memoryview).
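
    A minimal illustration of those two points, assuming mode constants that
    mirror the gzip module's internal READ/WRITE values (the class body is
    abbreviated and not taken from the patch):

    import io

    READ, WRITE = 1, 2   # assumed to mirror the gzip module's mode constants

    class GzipReaderWriterSketch(io.BufferedIOBase):
        # With io.BufferedIOBase as the base class, readinto() need not be
        # reimplemented: the base class provides one built on top of read().

        def __init__(self, mode):
            self.mode = mode

        def readable(self):
            return self.mode == READ

        def writable(self):
            return self.mode == WRITE

        def write(self, data):
            # Preferred over: type(data) is memoryview
            if isinstance(data, memoryview):
                data = data.tobytes()
            # ... compress `data` and write it to the underlying fileobj ...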


    nirai mannequin commented Dec 19, 2009

    isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and
    __enter__() raises ValueError with a different (generic) message.

    Should we keep the original GzipFile methods or prefer the implementation
    of io.BufferedIOBase?


    pitrou commented Dec 19, 2009

    > isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and
    > __enter__() raises ValueError with a different (generic) message.
    >
    > Should we keep the original GzipFile methods or prefer the implementation
    > of io.BufferedIOBase?

    It's fine to use the BufferedIOBase implementation. There's no reason to
    call isatty() on or iterate over a closed file.


    nirai mannequin commented Dec 20, 2009

    Uploaded updated patch for Python 2.7.


    nirai mannequin commented Dec 20, 2009

    Uploaded patch for Python 3.2.


    pitrou commented Jan 3, 2010

    The patches have been committed. Thank you!

    pitrou closed this as completed Jan 3, 2010
    ezio-melotti transferred this issue from another repository Apr 10, 2022