msg96204 |
Author: asnakelover (asnakelover) |
Date: 2009-12-10 16:11 |
It's not a big problem because we can just shell out to zcat, but
compare these two:
$ time python ./b.py >/dev/null
real 0m10.977s
user 0m7.128s
sys 0m0.888s
$ time python ./a.py >/dev/null
real 1m19.015s
user 1m18.185s
sys 0m0.072s
$ # Notice that the gzip module (a.py) had the benefit of the files being in the disk cache by now, too...
$ cat a.py
import gzip
import os

apt_cache_dir = "/var/cache/apt/apt-file"
for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    for line in f:
        print line
$ cat b.py
import os
import subprocess
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"
for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    p = subprocess.Popen(["zcat", os.path.join(apt_cache_dir, apt_cache_file)],
                         stdout=subprocess.PIPE)
    f = StringIO(p.communicate()[0])
    assert p.returncode == 0
    for line in f:
        print line
Also tried this one just for "completeness":
$ cat c.py
import gzip
import os
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"
for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    f = StringIO(f.read())
    for line in f:
        print line
But after it had run (with some thrashing) for three and a half minutes
I killed it.
|
msg96205 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-10 16:13 |
How does the following compare?
f = gzip.open(...)
s = f.read()
for line in s.splitlines():
    print line
|
msg96206 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-10 16:15 |
(GZipFile.readline() is implemented in pure Python, which explains why
it's rather slow)
|
msg96217 |
Author: asnakelover (asnakelover) |
Date: 2009-12-10 22:27 |
Hope this reply works right; the Python bug interface is a bit confusing
for this newbie - it doesn't say "Reply" anywhere. Sorry if it goes FUBAR.
I tried the splitlines() version you suggested; it thrashed my machine
so badly that I pressed alt+sysrq+f (which invokes the kernel oom_kill)
after about 1 minute, so I didn't lose anything important. About half a
minute later the machine came back to life. In other words: the
splitlines version used way, way too much memory - far worse even than
making a cStringIO from a GzipFile instance's .read().
It's not just a GzipFile.readline() issue either: c.py calls .read() and
tries to turn the result into a cStringIO, and that was the worst of my
three previous tests. I'm going to look at this purely from a consumer
angle and not even look at the gzip module source. From this angle, zcat
outperforms the module by a factor of 10 when the gzip module is used
with .readline(), and by a good deal more when I try to read the whole
gzip file as a string to turn into a cStringIO - emulating as closely as
possible what happens when forking a zcat process. When I tried
splitlines() it was even worse. This is probably a RAM issue, but it
just brings us back to the same question: should the gzip module eat so
much RAM when shelling out to zcat uses far less?
|
msg96219 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-10 22:38 |
> I tried the splitlines() version you suggested; it thrashed my machine
> so badly that I pressed alt+sysrq+f (which invokes the kernel oom_kill)
> after about 1 minute, so I didn't lose anything important.
This sounds very weird. How much memory do you have, and how large are
the .gz files you are reading? What is your system exactly?
|
msg96220 |
Author: asnakelover (asnakelover) |
Date: 2009-12-10 22:54 |
The gz in question is 17MB compressed and 247MB uncompressed. When
calling zcat, the Python process uses between 250 and 260MB with the
whole string in memory. Numbers for the gzip module aren't obtainable
except for readline(), which doesn't use much memory but is very slow.
Other methods thrash the machine to death.
The machine has 300MB free RAM out of a total of 1024MB.
It's not a new issue; I didn't find it when searching the Python bug
archives for some reason, but google:
http://www.google.com/search?q=python+gzip+slow+zcat
It's an issue which people have stumbled into before, but nobody seems
to have made any headway (I expect they just forked and called zcat or
gzip, like I did...).
It's a minor point since it's so easy to work around - in fact I have my
IDS system completed already, thank you so much Python for being so RAD
friendly - yet as it stands the module is worthless to a user who has
the zcat/gzip C programs and knows they exist.
|
msg96221 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-10 23:03 |
> The gz in question is 17MB compressed and 247MB uncompressed. When
> calling zcat, the Python process uses between 250 and 260MB with the
> whole string in memory. Numbers for the gzip module aren't obtainable
> except for readline(), which doesn't use much memory but is very slow.
> Other methods thrash the machine to death.
>
> The machine has 300MB free RAM out of a total of 1024MB.
That would be the explanation. Reading the whole file at once and then
doing splitlines() on the result consumes twice the memory, since a list
of lines must be constructed while the original data is still around. If
you had more than 600 MB free RAM the splitlines() solution would
probably be adequate :-)
Doing repeated calls to splitlines() on chunks of limited size (say 1MB)
would probably be fast enough without using too much memory. It would be
a bit less trivial to implement though, and it seems you are ok with the
subprocess solution.
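For illustration, a rough sketch of that chunked idea (the gzip_lines
helper is hypothetical, not part of the module):

import gzip

def gzip_lines(path, chunk_size=1024 * 1024):
    # Decompress in fixed-size (1MB) chunks and split out complete
    # lines, carrying any trailing partial line into the next chunk.
    f = gzip.open(path)
    leftover = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).splitlines(True)
        if lines[-1].endswith("\n"):
            leftover = ""
        else:
            leftover = lines.pop()
        for line in lines:
            yield line
    f.close()
    if leftover:
        yield leftover

Only one chunk plus at most one partial line is alive at a time, so peak
memory stays near chunk_size instead of near twice the file size.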
|
msg96222 |
Author: asnakelover (asnakelover) |
Date: 2009-12-10 23:06 |
Yes, subprocess works fine and was the quickest to implement and
probably the fastest to run too.
How can I put this without being an ass? Hell, I'm no good at diplomacy
- the gzip module blows chunks. If I can shell out to a standard unix
util and it uses a tiny fraction of the memory to do the same job, the
module is inherently broken, no matter how pretty its code looks.
You can of course tell me to STFU - but look around the web: I won't be
the first or last user to bemoan "import gzip"... :)
|
msg96234 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-11 07:25 |
> How can I put this without being an ass? Hell, I'm no good at diplomacy
> - the gzip module blows chunks. If I can shell out to a standard unix
> util and it uses a tiny fraction of the memory to do the same job, the
> module is inherently broken, no matter how pretty its code looks.
Well, let's say it is suboptimal. But it's certainly ok if you don't
have any tight performance constraints.
In any case, it won't improve until someone proposes a patch :-)
|
msg96272 |
Author: Jack Diederich (jackdied) * |
Date: 2009-12-11 20:32 |
I tried passing a size to readline() to see if increasing the chunk
helps (the test file was 120MB with 700K lines). For values of 1k-10k
all runs took around 30 seconds; with a value of 100 it took 80 seconds;
with a value of 100k it ran for several minutes before I killed it. The
default starts at 100 and quickly maxes out at 512, which seems to be a
sweet spot (thanks to whoever figured that out!).
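For reference, the benchmark loop was essentially of this shape (a
reconstruction, not the exact script used):

import gzip
import time

def time_readline(path, size):
    # Time one full pass over the file, reading with a size hint.
    f = gzip.open(path)
    start = time.time()
    while f.readline(size):
        pass
    f.close()
    return time.time() - start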
I profiled it and function overhead seems to be the real killer. 30% of
the time is spent in readline(). The next() function does almost nothing
and consumes a quarter of readline()'s time; ditto for read() and
_unread(). Even lowly len() consumes a third of readline()'s time,
because it is called over 2 million times.
There doesn't seem to be any way to speed this up without rewriting the
whole thing as a C module. I'm closing the bug WONTFIX.
|
msg96326 |
Author: Nir Aides (nirai) |
Date: 2009-12-13 09:38 |
First patch - please forgive the long comment :)
I submit a small patch which speeds up readline() on my data set - a
74MB (5MB .gz) log file with 600K lines. The speedup is 350%.
The source of the slowness is that the (~20KB) extrabuf is
allocated/deallocated in read() and _unread() with each call to
readline().
In the patch, read() returns a slice from extrabuf and defers
manipulation of extrabuf to _read().
In the following, the first timeit() corresponds to reading extrabuf
slices, while the second corresponds to read() and _unread() as they
are done today:
>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215
>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
5.843876838684082
Another speedup is achieved by a small shortcut in readline() for the
typical case in which the entire line is already in extrabuf.
The patch only addresses the typical case of calling readline() with no
arguments. It does not address other problems in readline()'s logic. In
particular, the current 512-byte chunk size is not a sweet spot:
regardless of the size argument passed to readline(), read() will
continue to decompress just 1024 bytes with each call, since the size
of extrabuf swings around the target size argument as a result of the
interaction between _unread() and read().
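For clarity, the shortcut looks roughly like this (simplified sketch;
extrabuf holds already-decompressed data, extrastart is the file offset
of its first byte, and offset is GzipFile's current position):

def readline(self, size=-1):
    if size < 0:
        # Fast path: a complete line is already in extrabuf, so
        # slice it out directly instead of going through read()
        # and _unread().
        offset = self.offset - self.extrastart
        i = self.extrabuf.find('\n', offset) + 1
        if i > 0:
            self.extrasize -= i - offset
            self.offset += i - offset
            return self.extrabuf[offset:i]
    # Otherwise fall back to the chunked read() loop.
    ...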
|
msg96409 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-14 21:57 |
Ah, a patch. Now we're talking :)
|
msg96412 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-14 22:09 |
The patch doesn't apply against the SVN trunk (some parts are rejected).
I suppose it was done against 2.6 or earlier, but those versions are in
bug fixing-only mode (which excludes performance improvements), so
you'll have to regenerate it against the SVN trunk.
(for info about our SVN repository, see
http://www.python.org/dev/faq/#how-do-i-get-a-checkout-of-the-repository-read-only-and-read-write
)
|
msg96413 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-14 22:11 |
Ah, my bad, I hadn't seen that the patch is for 3.2. Sorry for the
confusion.
|
msg96416 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-14 22:40 |
I confirm that the patch gives good speedups. It would be nice if there
was a comment explaining what extrabuf, extrastart and extrasize are.
In 3.x, a better but more involved approach would be to rewrite the
gzip module so as to take advantage of the standard IO classes
(especially BufferedReader / BufferedWriter), which are written in C and
are therefore much faster.
|
msg96487 |
Author: Nir Aides (nirai) |
Date: 2009-12-16 13:02 |
Right - using the io module makes GzipFile as fast as zcat.
I submit a new patch, this time for Python 2.7; it is not a module
rewrite, but again minimal refactoring.
GzipFile is now derived from io.BufferedRandom, and as a result the
functionality of GzipFile has changed, since it used to behave
differently from the io classes. For example, write() did not return
the number of bytes written, etc... Also, I hacked it a little to make
it work with a raw stream which is either readable or writable but not
both. If this is unacceptable, or if you prefer a module rewrite, I
don't mind discussing and evolving this further.
|
msg96537 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-17 22:07 |
Thanks for the new patch. The problem with inheriting from
BufferedRandom, though, is that if you call e.g. write() on a read-only
gzipfile, it will appear to succeed because the bytes are buffered
internally.
I think the solution would be to use delegation rather than inheritance.
Something like:
def __init__(self, ...):
    if 'w' in mode:
        self.buf = BufferedWriter(...)
    for meth in ('read', 'write', etc.):
        setattr(self, meth, getattr(self.buf, meth))
It would also be nice to add some tests for the issues I mentioned
earlier (check that IOError is raised when reading a write-only file,
and vice-versa).
By the way, we can't apply this approach to 2.x since
BufferedWriter/BufferedRandom won't accept unicode arguments for write()
and friends, and that would break compatibility. For 3.x it is fine though.
|
msg96538 |
Author: Nir Aides (nirai) |
Date: 2009-12-17 22:23 |
How about using the first patch with the slicing optimization, and
additionally enhancing GzipFile with the methods required to make it
play nice as a raw stream to an io.BufferedReader object (readable(),
writable(), readinto(), etc...)?
This way we still get the 350% speedup and keep it fully backward
compatible, and if someone needs the extra performance they can feed it
into an io.BufferedReader object thus:
g = gzip.GzipFile(...)
r = io.BufferedReader(g)
for line in r:
    ...
|
msg96539 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-17 22:33 |
> How about using the first patch with the slicing optimization and
> additionally enhancing GzipFile with the methods required to make it
> play nice as a raw stream to an io.BufferedReader object (readable(),
> writable(), readinto(), etc...).
That's fine with me.
|
msg96554 |
Author: Nir Aides (nirai) |
Date: 2009-12-18 08:47 |
Submitted a combined patch for Python 2.7.
If it's good I'll send one for Python 3.2.
|
msg96565 |
Author: Brian Curtin (brian.curtin) * |
Date: 2009-12-18 14:36 |
In the test, should you verify that the correct data comes back from
io.BufferedReader? After the r.close() on line 90 of test_gzip.py,
adding something like "self.assertEqual("".join(lines), data1*50)" would
do the trick.
Looks good.
|
msg96566 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-18 15:01 |
Two things:
- since it implements common IO operations, the GzipFile class could
inherit io.BufferedIOBase. It would also alleviate the need to
reimplement readinto(): BufferedIOBase has a default implementation
which should be sufficient.
- rather than `type(data) is memoryview`, the preferred idiom is
`isinstance(data, memoryview)`.
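For the first point, a minimal illustration of the idea (a sketch, not
the actual patch):

import io

class GzipReader(io.BufferedIOBase):
    # Sketch: implementing readable() and read() is enough here,
    # because BufferedIOBase's default readinto() is built on read().
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def readable(self):
        return True
    def read(self, size=-1):
        return self.fileobj.read(size)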
|
msg96643 |
Author: Nir Aides (nirai) |
Date: 2009-12-19 19:31 |
isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and
__enter__() raises ValueError with a different (generic) message.
Should we keep the original GzipFile methods or prefer the
implementations from io.BufferedIOBase?
|
msg96650 |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-19 21:19 |
> isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and
> __enter__() raises ValueError with a different (generic) message.
>
> Should we keep the original GzipFile methods or prefer the
> implementations from io.BufferedIOBase?
It's fine to use the BufferedIOBase implementation. There's no reason to
call isatty() on or iterate over a closed file.
|
msg96662 |
Author: Nir Aides (nirai) |
Date: 2009-12-20 07:34 |
Uploaded updated patch for Python 2.7.
|
msg96663 |
Author: Nir Aides (nirai) |
Date: 2009-12-20 07:35 |
Uploaded patch for Python 3.2.
|
msg97182 |
Author: Antoine Pitrou (pitrou) * |
Date: 2010-01-03 22:40 |
The patches have been committed. Thank you!
|
Date | User | Action | Args
2022-04-11 14:56:55 | admin | set | github: 51720
2010-01-03 22:40:04 | pitrou | set | status: open -> closed; resolution: fixed; messages: + msg97182; stage: patch review -> resolved
2009-12-20 07:35:33 | nirai | set | files: + gzip_7471_py32.diff; messages: + msg96663
2009-12-20 07:34:43 | nirai | set | files: + gzip_7471_py27.diff; messages: + msg96662
2009-12-20 07:33:01 | nirai | set | files: - gzip_7471_py27.diff
2009-12-19 21:19:48 | pitrou | set | messages: + msg96650
2009-12-19 19:31:33 | nirai | set | messages: + msg96643
2009-12-18 15:01:24 | pitrou | set | messages: + msg96566
2009-12-18 14:36:01 | brian.curtin | set | messages: + msg96565
2009-12-18 08:47:24 | nirai | set | files: + gzip_7471_py27.diff; messages: + msg96554
2009-12-17 22:33:38 | pitrou | set | messages: + msg96539
2009-12-17 22:23:33 | nirai | set | messages: + msg96538
2009-12-17 22:07:13 | pitrou | set | messages: + msg96537
2009-12-16 13:02:41 | nirai | set | files: + gzip_7471_io_patch.diff; messages: + msg96487
2009-12-14 22:40:53 | pitrou | set | messages: + msg96416
2009-12-14 22:11:06 | pitrou | set | messages: + msg96413
2009-12-14 22:09:04 | pitrou | set | messages: + msg96412
2009-12-14 21:57:55 | pitrou | set | status: closed -> open; versions: + Python 2.7, Python 3.2, - Python 2.6; messages: + msg96409; resolution: wont fix -> (no value); stage: patch review
2009-12-13 09:38:55 | nirai | set | files: + gzip_7471_patch.diff; nosy: + nirai; messages: + msg96326; keywords: + patch
2009-12-11 20:32:47 | jackdied | set | status: open -> closed; nosy: + jackdied; messages: + msg96272; resolution: wont fix
2009-12-11 07:25:12 | pitrou | set | messages: + msg96234
2009-12-10 23:06:10 | asnakelover | set | messages: + msg96222
2009-12-10 23:03:21 | pitrou | set | messages: + msg96221
2009-12-10 22:54:36 | asnakelover | set | messages: + msg96220
2009-12-10 22:38:06 | pitrou | set | messages: + msg96219
2009-12-10 22:27:14 | asnakelover | set | messages: + msg96217
2009-12-10 20:50:24 | brian.curtin | set | nosy: + brian.curtin
2009-12-10 16:15:49 | pitrou | set | priority: normal; messages: + msg96206; title: gzip module too slow -> GZipFile.readline too slow
2009-12-10 16:13:44 | pitrou | set | nosy: + pitrou; messages: + msg96205
2009-12-10 16:11:23 | asnakelover | create |