This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Rather modest chunk size in gzip.GzipFile
Type: performance Stage: patch review
Components: Library (Lib) Versions: Python 3.5
process
Status: closed Resolution: out of date
Dependencies: Superseder: Limit decompressed data when reading from LZMAFile and BZ2File
View: 23529
Assigned To: Nosy List: editor-buzzfeed, ezio.melotti, martin.panter, nadeem.vawda, neologix, pitrou, serhiy.storchaka, skip.montanaro, tiwilliam
Priority: normal Keywords: easy, patch

Created on 2014-03-17 18:58 by skip.montanaro, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
gzipseek.py skip.montanaro, 2014-03-17 18:58
gzip.diff skip.montanaro, 2014-04-21 00:10 review
20962_benchmark.py tiwilliam, 2014-04-24 23:53
20962_default-buffer-size.patch tiwilliam, 2014-04-28 13:20 review
Messages (17)
msg213883 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-03-17 18:58
I've had the opportunity to use the seek() method of the gzip.GzipFile class for the first time in the past few days. Wondering why it seemed my processing times were so slow, I took a look at the code for seek() and read(). It seems like the chunk size for reading (1024 bytes) is rather small. I created a simple subclass that overrode just seek() and read(), then defined a CHUNK_SIZE to be 16 * 8192 bytes (the whole idea of compressing files is that they get large, right? seems like most of the time we will want to seek pretty far through the file).

Over a small subset of my inputs, I measured about a 2x decrease in run times, from about 54s to 26s. I ran using both gzip.GzipFile and my subclass several times, measuring the last four runs (two using the stdlib implementation, two using my subclass). I measured the total time of the run, the time to process each input record, and the time to execute just the seek() call for each record. The bulk of the per-record time was in the call to seek(), so by reducing that time, I sped up my run-times significantly.

I'm still using 2.7, but other than the usual 2.x->3.x changes, the code looks pretty much the same between 2.7 and (at least) 3.3, and the logic involving the read size doesn't seem to have changed at all.

I'll try to produce a patch if I have a few minutes, but in the meantime, I've attached my modified GzipFile class (produced against 2.7).
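
The attached gzipseek.py isn't reproduced on this page; a minimal sketch of the idea (a GzipFile subclass whose forward seeks consume the stream in larger chunks) might look like the following. The class name and the fallback for relative or negative seeks are illustrative, not taken from the attachment:

import gzip

_CHUNK_SIZE = 16 * 8192  # the larger chunk size described above; the stdlib used 1024


class BigChunkGzipFile(gzip.GzipFile):
    """Rough sketch: consume the stream in _CHUNK_SIZE blocks when seeking forward."""

    def seek(self, offset, whence=0):
        if whence != 0:
            # Fall back to the stock implementation for non-absolute seeks.
            return super().seek(offset, whence)
        if offset < self.tell():
            super().seek(0)  # negative seek: decompress again from the start
        remaining = offset - self.tell()
        while remaining > 0:
            data = self.read(min(_CHUNK_SIZE, remaining))
            if not data:  # hit EOF before reaching the requested offset
                break
            remaining -= len(data)
        return self.tell()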
msg216918 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-04-20 22:37
You should try with different chunk and file sizes and see what the best compromise is.  Tagging as "easy" in case someone wants to put together a small script to benchmark this (maybe it could even be added to http://hg.python.org/benchmarks/), or even a patch.
msg216927 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-04-21 00:10
Here's a straightforward patch. I didn't want to change the public API of the module, so just defined the chunk size with a leading underscore. Gzip tests continue to pass.
msg217141 - (view) Author: William Tisäter (tiwilliam) * Date: 2014-04-24 23:53
I played around with different file and chunk sizes using attached benchmark script.

After several test runs I think 1024 * 16 would be the biggest win without losing too many μs on small seeks. You can find my benchmark output here: https://gist.github.com/tiwilliam/11273483

My test data was generated with following commands:

dd if=/dev/random of=10K bs=1024 count=10
dd if=/dev/random of=1M bs=1024 count=1000
dd if=/dev/random of=5M bs=1024 count=5000
dd if=/dev/random of=100M bs=1024 count=100000
dd if=/dev/random of=1000M bs=1024 count=1000000
gzip 10K 1M 5M 100M 1000M
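
The attached 20962_benchmark.py isn't shown here; a hypothetical stand-in for this kind of measurement (the file names match the dd commands above, the chunk sizes and offset are arbitrary, and a chunked forward seek is emulated with plain read() calls) could look like:

import gzip
import timeit

FILES = ["10K.gz", "1M.gz", "5M.gz", "100M.gz", "1000M.gz"]
CHUNK_SIZES = [1024, 4 * 1024, 8 * 1024, 16 * 1024, 64 * 1024]

def time_chunked_seek(path, chunk_size, offset):
    def run():
        with gzip.GzipFile(path) as f:
            remaining = offset
            while remaining > 0:
                data = f.read(min(chunk_size, remaining))
                if not data:          # smaller files end before the offset
                    break
                remaining -= len(data)
    return min(timeit.repeat(run, number=1, repeat=3))

for path in FILES:
    for size in CHUNK_SIZES:
        print(path, size, "%.3fs" % time_chunked_seek(path, size, offset=10**6))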
msg217285 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 16:42
William, thanks for the benchmarks.

Unfortunately this type of benchmark depends on the hardware (disk, SSD, memory bandwidth, etc.).

So I'd suggest, instead of using a hardcoded value, to simply reuse io.DEFAULT_BUFFER_SIZE.
That way, if some day we decide to change it, all user code will benefit from the change.
msg217371 - (view) Author: William Tisäter (tiwilliam) * Date: 2014-04-28 13:20
That makes sense.

I proceeded and updated `Lib/gzip.py` to use `io.DEFAULT_BUFFER_SIZE` instead. This will change the existing behaviour in two ways:

* Start using 1024 * 8 as buffer size instead of 1024.

* Add one more kwarg (`buffer_size`) to `GzipFile.__init__()`.

PS. This is my first patch, tell me if I'm missing something.
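
The 20962_default-buffer-size.patch itself is a diff against Lib/gzip.py and isn't reproduced here; the shape of the second point above, sketched as a subclass rather than as the actual patch (only buffer_size and io.DEFAULT_BUFFER_SIZE come from the message, the rest is illustrative), would be roughly:

import io
import gzip

class GzipFileWithBufferSize(gzip.GzipFile):
    """Sketch of the API change only: accept buffer_size, defaulting to
    io.DEFAULT_BUFFER_SIZE, and use it wherever the old code hardcoded 1024
    (compare the seek() sketch earlier in this issue)."""

    def __init__(self, filename=None, mode=None, compresslevel=9,
                 fileobj=None, mtime=None,
                 buffer_size=io.DEFAULT_BUFFER_SIZE):
        super().__init__(filename, mode, compresslevel, fileobj, mtime)
        self._buffer_size = buffer_size  # consulted by read()/seek() instead of 1024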
msg217391 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-28 18:22
> So I'd suggest, instead of using a hardcoded value, to simply reuse io.DEFAULT_BUFFER_SIZE.
> That way, if some day we decide to change it, all user code will benefit from the change.

I don't think io.DEFAULT_BUFFER_SIZE makes much sense as a heuristic for the gzip module (or compressed files in general). Perhaps gzip should get its own DEFAULT_BUFFER_SIZE?
msg217393 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-28 18:52
> I don't think io.DEFAULT_BUFFER_SIZE makes much sense as a heuristic for the gzip module (or compressed files in general). Perhaps gzip should get its own DEFAULT_BUFFER_SIZE?

Do you mean from a namespace point of view, or from a performance point of view?
Because this size is used to read/write from the underlying file
object, so using the io default would make sense, no?

Sure, it might not be optimal for compressed files, but I guess that
the optimal value is a function of the compression-level block size and
many other factors which are just too varied to come up with a
reasonable heuristic.
msg217396 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-28 18:59
> Sure, it might not be optimal for compressed files, but I guess that
> the optimal value is a function of the compression-level block size and
> many other factors which are just too varied to come up with a
> reasonable heuristic.

Well, I think that compressed files in general would benefit from a
larger buffer size than plain binary I/O, but that's just a hunch.
msg217401 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-04-28 19:24
On Mon, Apr 28, 2014 at 1:59 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> Well, I think that compressed files in general would benefit from a
> larger buffer size than plain binary I/O, but that's just a hunch.

I agree. When writing my patch, my (perhaps specious) thinking went like this.

* We have a big-ass file, so we compress it.
* On average, when seeking to another point in that file, we probably
want to go a long way.
* It's possible that operating system read-ahead semantics will make
read performance relatively high.
* That would put more burden on the Python code to be efficient.
* Larger buffer sizes will reduce the amount of Python bytecode which
must be executed.

So, while a filesystem block size of 8192 bytes would represent some sort
of "optimal" chunk size, in practice I think operating system read-ahead
and post-read processing of the bytes read will tend to favor larger chunk
sizes. Hence my naive choice of 16k bytes for _CHUNK_SIZE in my patch.

Skip
msg217412 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-28 20:08
That could make sense, dunno.

Note that the bz2 module uses a hardcoded 8K value.

Note that the buffer size should probably be passed to the open() call.

Also, the allocation is quite peculiar: it uses an exponential buffer
size, starting at a tiny value:

    # Starts small, scales exponentially
    self.min_readsize = 100

In short, I think the overall buffering should be rewritten :-)
msg217413 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-04-28 20:13
On Mon, Apr 28, 2014 at 3:08 PM, Charles-François Natali
<report@bugs.python.org> wrote:
> In short, I think the overall buffering should be rewritten :-)

Perhaps so, but I think we should open a separate ticket for that
instead of instituting some feature creep here (no matter how
reasonable the concept or its changes would be).

S
msg217414 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-28 20:40
> Perhaps so, but I think we should open a separate ticket for that
> instead of instituting some feature creep here (no matter how
> reasonable the concept or its changes would be).

Agreed.

The patch looks good to me, so feel free to commit!
(FWIW, gzip apparently uses a fixed 32K read buffer.)
msg238166 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-03-16 00:03
See also the patch for Issue 23529, which changes over to using BufferedReader for GzipFile, BZ2File and LZMAFile. The current patch there also passes a buffer_size parameter through to BufferedReader, although it currently defaults to io.DEFAULT_BUFFER_SIZE.
msg240685 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-13 17:54
Martin, do you think this is still an issue or has it been fixed by the compression refactor?
msg241415 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-18 12:45
The gzip (as well as LZMA and bzip) modules should now use buffer and chunk sizes of 8 KiB (= io.DEFAULT_BUFFER_SIZE) for most read() and seek() type operations.

I have a patch that adds a buffer_size parameter to the three compression modules if anyone is interested. It may need a bit of work, e.g. adding the parameter to open(), mimicking the built-in open() function when buffer_size=0, etc.

I did a quick test of seeking 100 MB into a gzip file, using the original Python 3.4.3 module, the current code that uses 8 KiB chunk sizes, and then my patched code with various chunk sizes. It looks like 8 KiB is significantly better than the previous code. My tests are peaking at about 64 KiB, but I guess that depends on the computer (cache etc). Anyway, 8 KiB seems like a good compromise without hogging all the fast memory cache or whatever, so I suggest we close this bug.

Command line for timing looked like:

python -m timeit -s 'import gzip' \
    'gzip.GzipFile("100M.gz", buffer_size=8192).seek(int(100e6))'

Python 3.4.3: 10 loops, best of 3: 2.36 sec per loop
Currently (8 KiB chunking): 10 loops, best of 3: 693 msec per loop
buffer_size=1024: 10 loops, best of 3: 2.46 sec per loop
buffer_size=8192: 10 loops, best of 3: 677 msec per loop
buffer_size=16 * 1024: 10 loops, best of 3: 502 msec per loop
buffer_size=int(60e3): 10 loops, best of 3: 400 msec per loop
buffer_size=64 * 1024: 10 loops, best of 3: 398 msec per loop
buffer_size=int(80e3): 10 loops, best of 3: 406 msec per loop
buffer_size=16 * 8192: 10 loops, best of 3: 469 msec per loop
msg264034 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-23 00:33
Since there doesn’t seem to be much interest here any more, and the current code has changed and now uses 8 KiB buffering, I am closing this. Although in theory a buffer or chunk size parameter could still be added to the new code if there was a need.
History
Date                 User             Action  Args
2022-04-11 14:58:00  admin            set     github: 65161
2016-04-23 00:33:55  martin.panter    set     status: open -> closed
                                              versions: - Python 3.4
                                              superseder: Limit decompressed data when reading from LZMAFile and BZ2File
                                              messages: + msg264034
                                              resolution: out of date
2016-04-22 09:11:53  martin.panter    set     messages: - msg263972
2016-04-22 07:21:03  editor-buzzfeed  set     status: pending -> open
                                              nosy: + editor-buzzfeed
                                              messages: + msg263972
2015-04-18 12:45:50  martin.panter    set     status: open -> pending
                                              messages: + msg241415
2015-04-13 17:54:52  pitrou           set     messages: + msg240685
2015-03-16 00:03:23  martin.panter    set     nosy: + martin.panter
                                              messages: + msg238166
2014-04-28 20:40:54  neologix         set     messages: + msg217414
2014-04-28 20:13:29  skip.montanaro   set     messages: + msg217413
2014-04-28 20:08:46  neologix         set     messages: + msg217412
2014-04-28 19:24:26  skip.montanaro   set     messages: + msg217401
2014-04-28 18:59:42  pitrou           set     messages: + msg217396
2014-04-28 18:52:27  neologix         set     messages: + msg217393
2014-04-28 18:22:01  pitrou           set     nosy: + pitrou
                                              messages: + msg217391
2014-04-28 13:20:45  tiwilliam        set     files: + 20962_default-buffer-size.patch
                                              messages: + msg217371
2014-04-27 16:42:53  neologix         set     nosy: + neologix
                                              messages: + msg217285
2014-04-24 23:53:22  tiwilliam        set     files: + 20962_benchmark.py
                                              nosy: + tiwilliam
                                              messages: + msg217141
2014-04-21 17:56:18  pitrou           set     nosy: + nadeem.vawda, serhiy.storchaka
2014-04-21 00:10:18  skip.montanaro   set     files: + gzip.diff
                                              keywords: + patch
                                              messages: + msg216927
                                              stage: needs patch -> patch review
2014-04-20 22:37:28  ezio.melotti     set     nosy: + ezio.melotti
                                              messages: + msg216918
                                              keywords: + easy
                                              stage: needs patch
2014-03-17 18:58:46  skip.montanaro   create