Message241415
The gzip module (as well as the lzma and bz2 modules) should now use buffer and chunk sizes of 8 KiB (= io.DEFAULT_BUFFER_SIZE) for most read()- and seek()-type operations.
I have a patch that adds a buffer_size parameter to the three compression modules, if anyone is interested. It may need a bit of work, e.g. adding the parameter to open(), mimicking the built-in open() function when buffer_size=0, etc.
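The patch itself is not shown here, but the idea can be sketched as a GzipFile subclass whose forward seeks decompress and discard data in configurable chunks. The buffer_size parameter and the ChunkedGzipFile name below are hypothetical illustrations, not stdlib API:

```python
import gzip
import io

class ChunkedGzipFile(gzip.GzipFile):
    """Sketch only: approximates the proposed buffer_size parameter by
    overriding forward seeks in read mode. Not a real stdlib API."""

    def __init__(self, *args, buffer_size=io.DEFAULT_BUFFER_SIZE, **kwargs):
        super().__init__(*args, **kwargs)
        self._chunk_size = buffer_size

    def seek(self, offset, whence=io.SEEK_SET):
        if self.mode == gzip.READ and whence == io.SEEK_SET \
                and offset >= self.tell():
            # Forward seek in a compressed stream: decompress and
            # discard data in chunk-sized reads until we reach offset.
            remaining = offset - self.tell()
            while remaining > 0:
                if not self.read(min(self._chunk_size, remaining)):
                    break  # hit EOF before reaching the target offset
                remaining = offset - self.tell()
            return self.tell()
        return super().seek(offset, whence)
```

The chunk size trades memory against per-read overhead, which is exactly what the timings below explore.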
I did a quick test of seeking 100 MB into a gzip file, using the original Python 3.4.3 module, the current code that uses 8 KiB chunk sizes, and my patched code with various chunk sizes. 8 KiB is significantly better than the previous code. My tests peak at about 64 KiB, but I guess that depends on the computer (cache sizes etc.). Anyway, 8 KiB seems like a good compromise that avoids hogging the fast memory cache, so I suggest we close this bug.
The command line for the timings looked like:
python -m timeit -s 'import gzip' \
'gzip.GzipFile("100M.gz", buffer_size=8192).seek(int(100e6))'
Python 3.4.3:             10 loops, best of 3: 2.36 sec per loop
Current (8 KiB chunking): 10 loops, best of 3: 693 msec per loop
buffer_size=1024:         10 loops, best of 3: 2.46 sec per loop
buffer_size=8192:         10 loops, best of 3: 677 msec per loop
buffer_size=16 * 1024:    10 loops, best of 3: 502 msec per loop
buffer_size=int(60e3):    10 loops, best of 3: 400 msec per loop
buffer_size=64 * 1024:    10 loops, best of 3: 398 msec per loop
buffer_size=int(80e3):    10 loops, best of 3: 406 msec per loop
buffer_size=16 * 8192:    10 loops, best of 3: 469 msec per loop
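A scaled-down version of this benchmark can be run with the stock module (the proposed buffer_size parameter is not in the stdlib, so it is omitted here). The file name and 1 MB size below are placeholders chosen so the sketch runs quickly, not the original 100M.gz setup:

```python
import gzip
import os
import tempfile
import timeit

# Create a small throwaway gzip file to seek into.
path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(path, "wb") as f:
    f.write(b"\0" * int(1e6))  # 1 MB of zeros, compresses to almost nothing

# Each forward seek decompresses and discards 1 MB in fixed-size chunks
# (8 KiB in current CPython), which is what dominates the timings above.
elapsed = timeit.timeit(lambda: gzip.GzipFile(path).seek(int(1e6)), number=5)
print(f"5 forward seeks of 1 MB: {elapsed:.3f} sec")
```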
History:
Date                 | User          | Action | Args
2015-04-18 12:45:50  | martin.panter | set    | recipients: + martin.panter, skip.montanaro, pitrou, nadeem.vawda, ezio.melotti, neologix, serhiy.storchaka, tiwilliam
2015-04-18 12:45:50  | martin.panter | set    | messageid: <1429361150.04.0.8149135926.issue20962@psf.upfronthosting.co.za>
2015-04-18 12:45:50  | martin.panter | link   | issue20962 messages
2015-04-18 12:45:49  | martin.panter | create |