Message241415
The gzip module (as well as the lzma and bz2 modules) should now use buffer and chunk sizes of 8 KiB (= io.DEFAULT_BUFFER_SIZE) for most read()- and seek()-type operations.
I have a patch that adds a buffer_size parameter to the three compression modules, if anyone is interested. It may need a bit of work, e.g. adding the parameter to open(), mimicking the built-in open() function when buffer_size=0, etc.
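The patch itself is not shown here, but the idea can be sketched as a GzipFile subclass whose forward seeks decompress and discard data in configurable chunks. The buffer_size parameter and the ChunkedGzipFile name below are hypothetical illustrations, not stdlib API:

```python
import gzip
import io

class ChunkedGzipFile(gzip.GzipFile):
    """Sketch only: approximates the proposed buffer_size parameter by
    overriding forward seeks in read mode. Not a real stdlib API."""

    def __init__(self, *args, buffer_size=io.DEFAULT_BUFFER_SIZE, **kwargs):
        super().__init__(*args, **kwargs)
        self._chunk_size = buffer_size

    def seek(self, offset, whence=io.SEEK_SET):
        if self.mode == gzip.READ and whence == io.SEEK_SET \
                and offset >= self.tell():
            # Forward seek in a compressed stream: decompress and
            # discard data in chunk-sized reads until we reach offset.
            remaining = offset - self.tell()
            while remaining > 0:
                if not self.read(min(self._chunk_size, remaining)):
                    break  # hit EOF before reaching the target offset
                remaining = offset - self.tell()
            return self.tell()
        return super().seek(offset, whence)
```

The chunk size trades memory against per-read overhead, which is exactly what the timings below explore.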
I did a quick test of seeking 100 MB into a gzip file, using the original Python 3.4.3 module, the current code that uses 8 KiB chunk sizes, and my patched code with various chunk sizes. 8 KiB is significantly better than the previous code. My tests peak at about 64 KiB, but I guess that depends on the computer (cache sizes etc.). Anyway, 8 KiB seems like a good compromise that avoids hogging the fast memory cache, so I suggest we close this bug.
The command line for the timings looked like:
python -m timeit -s 'import gzip' \
'gzip.GzipFile("100M.gz", buffer_size=8192).seek(int(100e6))'
Python 3.4.3:             10 loops, best of 3: 2.36 sec per loop
Current (8 KiB chunking): 10 loops, best of 3: 693 msec per loop
buffer_size=1024:         10 loops, best of 3: 2.46 sec per loop
buffer_size=8192:         10 loops, best of 3: 677 msec per loop
buffer_size=16 * 1024:    10 loops, best of 3: 502 msec per loop
buffer_size=int(60e3):    10 loops, best of 3: 400 msec per loop
buffer_size=64 * 1024:    10 loops, best of 3: 398 msec per loop
buffer_size=int(80e3):    10 loops, best of 3: 406 msec per loop
buffer_size=16 * 8192:    10 loops, best of 3: 469 msec per loop
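A scaled-down version of this benchmark can be run with the stock module (the proposed buffer_size parameter is not in the stdlib, so it is omitted here). The file name and 1 MB size below are placeholders chosen so the sketch runs quickly, not the original 100M.gz setup:

```python
import gzip
import os
import tempfile
import timeit

# Create a small throwaway gzip file to seek into.
path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(path, "wb") as f:
    f.write(b"\0" * int(1e6))  # 1 MB of zeros, compresses to almost nothing

# Each forward seek decompresses and discards 1 MB in fixed-size chunks
# (8 KiB in current CPython), which is what dominates the timings above.
elapsed = timeit.timeit(lambda: gzip.GzipFile(path).seek(int(1e6)), number=5)
print(f"5 forward seeks of 1 MB: {elapsed:.3f} sec")
```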
History:
Date                 | User          | Action | Args
2015-04-18 12:45:50  | martin.panter | set    | recipients: + martin.panter, skip.montanaro, pitrou, nadeem.vawda, ezio.melotti, neologix, serhiy.storchaka, tiwilliam
2015-04-18 12:45:50  | martin.panter | set    | messageid: <1429361150.04.0.8149135926.issue20962@psf.upfronthosting.co.za>
2015-04-18 12:45:50  | martin.panter | link   | issue20962 messages
2015-04-18 12:45:49  | martin.panter | create |