
Author rhpvorderman
Recipients rhpvorderman, serhiy.storchaka
Date 2021-11-24.11:10:50
Message-id <1637752250.79.0.629261568834.issue45509@roundup.psfhosted.org>
In-reply-to
Content
I have found that using the timeit module provides more precise measurements:

For a simple gzip header (as returned by gzip.compress, or by zlib.compress with wbits=31):
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'


For a gzip header with FNAME (as written by the gzip command-line tool and by Python's GzipFile):
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x08j\x1a\x9ea\x02\xffcompressable_file\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with all flags set:
./python -m timeit -s 'import gzip, io; data = b"\x1f\x8b\x08\x1f\x00\x00\x00\x00\x00\xff\x05\x00extraname\x00comment\x00\xe9T"; from gzip import _read_gzip_header' '_read_gzip_header(io.BytesIO(data))'


Since performance is most critical for in-memory compression and decompression, I have now optimized for the case where no flags are set.
before (current main): 500000 loops, best of 5: 469 nsec per loop
after (PR): 1000000 loops, best of 5: 390 nsec per loop

For the most common case of only FNAME set:
before: 200000 loops, best of 5: 1.48 usec per loop
after: 200000 loops, best of 5: 1.45 usec per loop

For the case where FHCRC is set:
before: 200000 loops, best of 5: 1.62 usec per loop
after: 100000 loops, best of 5: 2.43 usec per loop
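
For readers who want the relative changes, this is a small sketch that turns the timings quoted above into percentages. The helper name percent_change is hypothetical; the numbers are copied from the measurements in this message.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative means faster)."""
    return (after - before) / before * 100


# Timings from this message, all converted to nanoseconds per loop.
results = {
    "no flags": (469, 390),
    "FNAME only": (1480, 1450),
    "FHCRC set": (1620, 2430),
}

for case, (before, after) in results.items():
    print(f"{case}: {percent_change(before, after):+.1f}%")
```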


So this PR is now a clear win for decompressing anything that has been compressed with gzip.compress, and it is neutral for normal file decompression, where only FNAME is set. There is a performance cost for correctly checking the header when FHCRC is set, but that is expected, and it is better than the alternative of not checking it.
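
A comparable measurement can also be scripted through the timeit API instead of the command line. The following is only a sketch: it times the public gzip.decompress function rather than the private _read_gzip_header, so it includes decompression overhead and its numbers are not directly comparable to the ones above. The helper name best_nsec_per_loop is hypothetical.

```python
import gzip
import timeit

# gzip.compress on an in-memory buffer is assumed to write a header with no
# flags set, which is the case the optimization targets.
payload = gzip.compress(b"")


def best_nsec_per_loop(func, number=10_000, repeat=5):
    """Best-of-`repeat` time per call in nanoseconds, like `python -m timeit`."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number * 1e9


result = best_nsec_per_loop(lambda: gzip.decompress(payload))
print(f"gzip.decompress on an empty payload: {result:.0f} nsec per loop")
```

Taking the minimum of several repeats mirrors the "best of 5" figure that the timeit command line reports.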