classification
Title: GZipFile.readline too slow
Type: performance Stage: committed/rejected
Components: Library (Lib) Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: asnakelover, brian.curtin, jackdied, nirai, pitrou
Priority: normal Keywords: patch

Created on 2009-12-10 16:11 by asnakelover, last changed 2010-01-03 22:40 by pitrou. This issue is now closed.

Files
File name Uploaded Description
gzip_7471_patch.diff nirai, 2009-12-13 09:38 Patch for Python 3.2
gzip_7471_io_patch.diff nirai, 2009-12-16 13:02 Patch for Python 2.7 using the io module.
gzip_7471_py27.diff nirai, 2009-12-20 07:34 Combined patch for Python 2.7
gzip_7471_py32.diff nirai, 2009-12-20 07:35 Combined patch for Python 3.2
Messages (27)
msg96204 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 16:11
It's not a big problem because we can just shell out to zcat, but compare
these two:

$ time python ./b.py >/dev/null

real    0m10.977s
user    0m7.128s
sys     0m0.888s
$ time python ./a.py >/dev/null

real    1m19.015s
user    1m18.185s
sys     0m0.072s

$ # Notice that the gzip module (a.py) had the benefit of the files
being in the disk cache by now, too...

$ cat a.py 
import gzip
import os

apt_cache_dir = "/var/cache/apt/apt-file"

for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    for line in f:
        print line

$ cat b.py 
import os
import subprocess
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"

for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    p = subprocess.Popen(["zcat", os.path.join(apt_cache_dir,
apt_cache_file)],
        stdout = subprocess.PIPE)
    f = StringIO(p.communicate()[0])
    assert p.returncode == 0 
    for line in f:
        print line

Also tried this one just for "completeness":

$ cat c.py 
import gzip
import os
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"

for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    f = StringIO(f.read())
    for line in f:
        print line

But after it had run (with some thrashing) for three and a half minutes, I
killed it.
msg96205 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 16:13
How does the following compare?

f = gzip.open(...)
s = f.read()
for line in s.splitlines():
    print line
msg96206 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 16:15
(GZipFile.readline() is implemented in pure Python, which explains why
it's rather slow)
msg96217 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 22:27
Hope this reply works right; the Python bug interface is a bit confusing
for this newbie - it doesn't say "Reply" anywhere. Sorry if it goes FUBAR.

I tried the splitlines() version you suggested, it thrashed my machine
so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
about 1 minute so I didn't lose anything important. About half a minute
later the machine came back to life. In other words: the splitlines
version used way, way too much memory - far worse even than making a
cStringIO from a GzipFile instance.read().

It's not just a GzipFile.readline() issue either, c.py calls .read() and
tries to turn the result into a cStringIO and that was the worst one of
my three previous tests. I'm going to look at this purely from a
consumer angle and not even look at gzip module source, from this angle
(a consumer), zcat out performs it by a factor of 10 when gzip module is
used with .readline() and by a good deal more when I try to read the
whole gzip file as a string to turn into a cStringIO to emulate as
closely as possible what happens with forking a zcat process. When I
tried to splitlines() it was even worse. This is probably a RAM issue,
but it just brings us back to - should gzip module eat so much ram when
shelling out to zcat uses far less?
msg96219 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 22:38
> I tried the splitlines() version you suggested, it thrashed my machine
> so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
> about 1 minute so I didn't lose anything important.

This sounds very weird. How much memory do you have, and how large are
the .gz files you are reading? What is your system exactly?
msg96220 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 22:54
The gz in question is 17 MB compressed and 247 MB uncompressed. When
forking zcat, the Python process uses between 250 and 260 MB with the
whole string in memory. Numbers for the gzip module aren't obtainable
except for readline(), which doesn't use much memory but is very slow;
other methods thrash the machine to death.

The machine has 300 MB free RAM from a total of 1024 MB.

It's not a new issue, I didn't find it when searching python bug
archives for some reason but google:

http://www.google.com/search?q=python+gzip+slow+zcat

It's an issue which people have stumbled into before, but nobody seems to
have made any headway (I expect they just forked and called zcat or gzip,
like I did...).

It's a minor point since it's so easy to work around - in fact I have my
IDS system completed already; thank you so much, Python, for being so RAD
friendly - yet as it stands the module is worthless to any user who has
the zcat/gzip C programs and knows they exist.
msg96221 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-10 23:03
> The gz in question is 17 MB compressed and 247 MB uncompressed. When
> forking zcat, the Python process uses between 250 and 260 MB with the
> whole string in memory. Numbers for the gzip module aren't obtainable
> except for readline(), which doesn't use much memory but is very slow;
> other methods thrash the machine to death.
> 
> The machine has 300 MB free RAM from a total of 1024 MB.

That would be the explanation. Reading the whole file at once and then
doing splitlines() on the result consumes twice the memory, since a list
of lines must be constructed while the original data is still around. If
you had more than 600 MB free RAM the splitlines() solution would
probably be adequate :-)

Doing repeated calls to splitlines() on chunks of limited size (say 1MB)
would probably be fast enough without using too much memory. It would be
a bit less trivial to implement though, and it seems you are ok with the
subprocess solution.
msg96222 - (view) Author: asnakelover (asnakelover) Date: 2009-12-10 23:06
Yes, subprocess works fine and was the quickest to implement and
probably the fastest to run too.

How can I put this without being an ass? Hell, I'm no good at diplomacy
- the gzip module blows chunks. If I can shell out to a standard Unix
util and it uses a tiny fraction of the memory to do the same job, the
module is inherently broken, no matter how pretty its code looks.

You can of course tell me to STFU - but look around the web I won't be
the first or last user to bemoan "import gzip"... :)
msg96234 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-11 07:25
> How can I put this without being an ass? Hell, I'm no good at diplomacy
> - the gzip module blows chunks. If I can shell out to a standard Unix
> util and it uses a tiny fraction of the memory to do the same job, the
> module is inherently broken, no matter how pretty its code looks.

Well, let's say it is suboptimal. But it's certainly ok if you don't
have any tight performance constraints.
In any case, it won't improve until someone proposes a patch :-)
msg96272 - (view) Author: Jack Diederich (jackdied) * (Python committer) Date: 2009-12-11 20:32
I tried passing a size to readline to see if increasing the chunk helps
(the test file was 120 MB with 700k lines).  For values of 1k-10k all runs
took around 30 seconds; with a value of 100 it took 80 seconds, and with a
value of 100k it ran for several minutes before I killed it.  The default
starts at 100 and quickly maxes out at 512, which seems to be a sweet spot
(thanks to whoever figured that out!).

I profiled it and function overhead seems to be the real killer.  30% of
the time is spent in readline().  The next() function does almost
nothing and consumes 1/4th the time of readline().  Ditto for read() and
_unread().  Even lowly len() consumes 1/3rd the time of readline()
because it is called over 2 million times.

There doesn't seem to be any way to speed this up without rewriting the
whole thing as a C module.  I'm closing the bug WONTFIX.
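For reference, the readline(size) experiment can be reproduced with a rough harness along these lines (the helper name and toy data are illustrative; the original 120 MB test file is not reproduced here):

```python
import gzip
import io
import timeit

def time_readline(data, size, repeat=3):
    """Compress `data` into an in-memory gzip stream, then time reading
    it back line by line with readline(size)."""
    blob = io.BytesIO()
    with gzip.GzipFile(fileobj=blob, mode="wb") as g:
        g.write(data)

    def run():
        blob.seek(0)
        f = gzip.GzipFile(fileobj=blob, mode="rb")
        while f.readline(size):
            pass

    return timeit.timeit(run, number=repeat)
```

Comparing e.g. time_readline(data, 100), time_readline(data, 512), and time_readline(data, 100 * 1024) on a large enough input shows the same general shape as the numbers above.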
msg96326 - (view) Author: Nir Aides (nirai) Date: 2009-12-13 09:38
First patch, please forgive long comment :)

I submit a small patch which speeds up readline() on my data set - a 
74MB (5MB .gz) log file with 600K lines.

The speedup is 350%.

Source of slowness is that (~20KB) extrabuf is allocated/deallocated in 
read() and _unread() with each call to readline().

In the patch read() returns a slice from extrabuf and defers 
manipulation of extrabuf to _read().

In the following, the first timeit() corresponds to reading extrabuf 
slices while the second timeit() corresponds to read() and _unread() as 
they are done today:

>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215

>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
5.843876838684082

Another speedup is achieved by doing a small shortcut in readline() for 
the typical case in which the entire line is already in extrabuf.

The patch only addresses the typical case of calling readline() with no 
arguments. It does not address other problems in readline() logic. In 
particular the current 512 chunk size is not a sweet spot. Regardless of 
the size argument passed to readline(), read() will continue to 
decompress just 1024 bytes with each call as the size of extrabuf swings 
around the target size argument as result of the interaction between 
_unread() and read().
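The core idea can be shown in miniature. This toy class is not the actual patch code; it just mirrors the patch's extrabuf/extrastart/extrasize scheme to show why read() becomes a cheap slice instead of slicing and re-concatenating the buffer on every call:

```python
import io

class SliceBufferReader:
    """Toy illustration of the extrabuf offset scheme."""

    def __init__(self, raw):
        self.raw = raw        # any object with .read(n) returning bytes
        self.extrabuf = b""   # decompressed-but-unread data
        self.extrastart = 0   # current offset into extrabuf
        self.extrasize = 0    # bytes remaining past extrastart

    def _fill(self, n):
        chunk = self.raw.read(n)
        if chunk:
            # Drop the consumed prefix once, then append the new data.
            self.extrabuf = self.extrabuf[self.extrastart:] + chunk
            self.extrastart = 0
            self.extrasize = len(self.extrabuf)

    def read(self, size):
        if self.extrasize < size:
            self._fill(max(1024, size))
        data = self.extrabuf[self.extrastart:self.extrastart + size]
        self.extrastart += len(data)
        self.extrasize -= len(data)
        return data
```

The buffer is only rebuilt inside _fill(); the hot path in read() is a single slice plus two integer updates, which is the source of the speedup described above.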
msg96409 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 21:57
Ah, a patch. Now we're talking :)
msg96412 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 22:09
The patch doesn't apply against the SVN trunk (some parts are rejected).
I suppose it was done against 2.6 or earlier, but those versions are in
bug fixing-only mode (which excludes performance improvements), so
you'll have to regenerate it against the SVN trunk.

(for info about our SVN repository, see
http://www.python.org/dev/faq/#how-do-i-get-a-checkout-of-the-repository-read-only-and-read-write
)
msg96413 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 22:11
Ah, my bad, I hadn't seen that the patch is for 3.2. Sorry for the
confusion.
msg96416 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 22:40
I confirm that the patch gives good speedups. It would be nice if there
was a comment explaining what extrabuf, extrastart and extrasize are.

In 3.x, a better but more involved approached would be to rewrite the
gzip module so as to take advantage of the standard IO classes
(especially BufferedReader / BufferedWriter), which are written in C and
are therefore much faster.
msg96487 - (view) Author: Nir Aides (nirai) Date: 2009-12-16 13:02
Right, using the io module makes GzipFile as fast as zcat.

I submit a new patch this time for Python 2.7, however, it is not a 
module rewrite, but again minimal refactoring.

GzipFile is now derived from io.BufferedRandom, and as a result the 
functionality of GzipFile has changed, since it used to behave differently 
from the io classes. For example, write() did not return the number of 
bytes written, etc... Also, I hacked it a little to make it work with a raw 
stream which is either readable or writable but not both. If that is 
unacceptable, or if you prefer a module rewrite, I don't mind discussing 
and evolving this further.
msg96537 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-17 22:07
Thanks for the new patch. The problem with inheriting from
BufferedRandom, though, is that if you call e.g. write() on a read-only
gzipfile, it will appear to succeed because the bytes are buffered
internally.

I think the solution would be to use delegation rather than inheritance.
Something like:

    def __init__(self, ...):
        if 'w' in mode:
            self.buf = BufferedWriter(...)
        for meth in ('read', 'write', etc.):
            setattr(self, meth, getattr(self.buf, meth))

It would also be nice to add some tests for the issues I mentioned
earlier (check that IOError is raised when reading a write-only file,
and vice-versa).

By the way, we can't apply this approach to 2.x since
BufferedWriter/BufferedRandom won't accept unicode arguments for write()
and friends, and that would break compatibility. For 3.x it is fine though.
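The suggested tests might be sketched like this, using in-memory streams (in current CPython, reading a write-only GzipFile and writing a read-only one both raise OSError, of which IOError is an alias; the helper name is illustrative):

```python
import gzip
import io

def check_mode_errors():
    """Return (read_on_write_raises, write_on_read_raises)."""
    buf = io.BytesIO()
    w = gzip.GzipFile(fileobj=buf, mode="wb")
    try:
        w.read(1)                 # must fail: file opened for writing
        read_on_write = False
    except OSError:
        read_on_write = True
    w.write(b"data")
    w.close()

    buf.seek(0)
    r = gzip.GzipFile(fileobj=buf, mode="rb")
    try:
        r.write(b"x")             # must fail: file opened for reading
        write_on_read = False
    except OSError:
        write_on_read = True
    r.close()
    return read_on_write, write_on_read
```

This is exactly the failure mode the delegation approach has to preserve: with naive inheritance from BufferedRandom, the write() above would appear to succeed because the bytes sit in the internal buffer.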
msg96538 - (view) Author: Nir Aides (nirai) Date: 2009-12-17 22:23
How about using the first patch with the slicing optimization and 
additionally enhancing GzipFile with the methods required to make it 
play nice as a raw stream to an io.BufferedReader object (readable(), 
writable(), readinto(), etc...).

This way we still get the 350% speed up and keep it fully backward 
compatible, and if someone needs the extra performance they can feed it 
into an io.BufferedReader object thus:

g = gzip.GzipFile(...)
r = io.BufferedReader(g)
for line in r:
    ...
msg96539 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-17 22:33
> How about using the first patch with the slicing optimization and 
> additionally enhancing GzipFile with the methods required to make it 
> play nice as a raw stream to an io.BufferedReader object (readable(), 
> writable(), readinto(), etc...).

That's fine with me.
msg96554 - (view) Author: Nir Aides (nirai) Date: 2009-12-18 08:47
Submitted a combined patch for Python 2.7. 
If it's good I'll send one for Python 3.2.
msg96565 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2009-12-18 14:36
In the test, should you verify that the correct data comes back from
io.BufferedReader? After the r.close() on line 90 of test_gzip.py,
adding something like "self.assertEqual("".join(lines), data1*50)" would
do the trick. 
Looks good.
msg96566 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-18 15:01
Two things:
- since it implements common IO operations, the GzipFile class could
inherit io.BufferedIOBase. It would also alleviate the need to
reimplement readinto(): BufferedIOBase has a default implementation
which should be sufficient.
- rather than `type(data) is memoryview`, the preferred idiom is
`isinstance(data, memoryview)`.
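The readinto() point is easy to demonstrate with a toy io.BufferedIOBase subclass (illustrative only, not the gzip patch): once read() is defined, the inherited readinto() works without being reimplemented.

```python
import io

class BytesReader(io.BufferedIOBase):
    """Minimal readable stream over a bytes object."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def readable(self):
        return True

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk
```

The default BufferedIOBase.readinto() is implemented in terms of read(), so the subclass gets it for free.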
msg96643 - (view) Author: Nir Aides (nirai) Date: 2009-12-19 19:31
isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and 
__enter__() raises ValueError with a different (generic) message. 

Should we keep the original GzipFile methods or prefer the implementation 
of io.BufferedIOBase?
msg96650 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-19 21:19
> isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and 
> __enter__() raises ValueError with a different (generic) message. 
> 
> Should we keep the original GzipFile methods or prefer the implementation 
> of io.BufferedIOBase?

It's fine to use the BufferedIOBase implementation. There's no reason to
call isatty() on or iterate over a closed file.
msg96662 - (view) Author: Nir Aides (nirai) Date: 2009-12-20 07:34
Uploaded updated patch for Python 2.7.
msg96663 - (view) Author: Nir Aides (nirai) Date: 2009-12-20 07:35
Uploaded patch for Python 3.2.
msg97182 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-03 22:40
The patches have been committed. Thank you!
History
Date User Action Args
2010-01-03 22:40:04pitrousetstatus: open -> closed
resolution: fixed
messages: + msg97182

stage: patch review -> committed/rejected
2009-12-20 07:35:33niraisetfiles: + gzip_7471_py32.diff

messages: + msg96663
2009-12-20 07:34:43niraisetfiles: + gzip_7471_py27.diff

messages: + msg96662
2009-12-20 07:33:01niraisetfiles: - gzip_7471_py27.diff
2009-12-19 21:19:48pitrousetmessages: + msg96650
2009-12-19 19:31:33niraisetmessages: + msg96643
2009-12-18 15:01:24pitrousetmessages: + msg96566
2009-12-18 14:36:01brian.curtinsetmessages: + msg96565
2009-12-18 08:47:24niraisetfiles: + gzip_7471_py27.diff

messages: + msg96554
2009-12-17 22:33:38pitrousetmessages: + msg96539
2009-12-17 22:23:33niraisetmessages: + msg96538
2009-12-17 22:07:13pitrousetmessages: + msg96537
2009-12-16 13:02:41niraisetfiles: + gzip_7471_io_patch.diff

messages: + msg96487
2009-12-14 22:40:53pitrousetmessages: + msg96416
2009-12-14 22:11:06pitrousetmessages: + msg96413
2009-12-14 22:09:04pitrousetmessages: + msg96412
2009-12-14 21:57:55pitrousetstatus: closed -> open
versions: + Python 2.7, Python 3.2, - Python 2.6
messages: + msg96409

resolution: wont fix -> (no value)
stage: patch review
2009-12-13 09:38:55niraisetfiles: + gzip_7471_patch.diff

nosy: + nirai
messages: + msg96326

keywords: + patch
2009-12-11 20:32:47jackdiedsetstatus: open -> closed

nosy: + jackdied
messages: + msg96272

resolution: wont fix
2009-12-11 07:25:12pitrousetmessages: + msg96234
2009-12-10 23:06:10asnakeloversetmessages: + msg96222
2009-12-10 23:03:21pitrousetmessages: + msg96221
2009-12-10 22:54:36asnakeloversetmessages: + msg96220
2009-12-10 22:38:06pitrousetmessages: + msg96219
2009-12-10 22:27:14asnakeloversetmessages: + msg96217
2009-12-10 20:50:24brian.curtinsetnosy: + brian.curtin
2009-12-10 16:15:49pitrousetpriority: normal

messages: + msg96206
title: gzip module too slow -> GZipFile.readline too slow
2009-12-10 16:13:44pitrousetnosy: + pitrou
messages: + msg96205
2009-12-10 16:11:23asnakelovercreate