This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Title: Gzip header corruption not properly checked.
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: rhpvorderman, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2021-10-18 10:41 by rhpvorderman, last changed 2022-04-11 14:59 by admin.

Files
Uploaded by rhpvorderman, 2021-11-23 06:12
Uploaded by rhpvorderman, 2021-11-24 11:10
Pull Requests
URL Status Linked
PR 29028 open rhpvorderman, 2021-10-18 11:48
Messages (7)
msg404174 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-10-18 10:41
The following headers are currently allowed while being wrong:

- Headers with FCOMMENT flag set, but with incomplete or missing COMMENT bytes.
- Headers with FNAME flag set, but with incomplete or missing NAME bytes
- Headers with FHCRC set, the crc is read, but not checked.
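To illustrate the FHCRC case: per RFC 1952, the FHCRC field stores the low 16 bits of the CRC32 of all preceding header bytes. The sketch below (helper name and structure are mine, not taken from the patch) builds a minimal gzip member with FHCRC set; zlib's own inflate verifies the header CRC, whereas gzip's header parser read the two bytes without checking them:

```python
import struct
import zlib

def gzip_member_with_fhcrc(payload: bytes, corrupt_hcrc: bool = False) -> bytes:
    # Fixed 10-byte header: magic, CM=deflate, FLG=FHCRC, MTIME=0, XFL=0, OS=unknown.
    header = b"\x1f\x8b\x08\x02\x00\x00\x00\x00\x00\xff"
    hcrc = zlib.crc32(header) & 0xFFFF   # low 16 bits of the header CRC32
    if corrupt_hcrc:
        hcrc ^= 0x1                      # flip a bit to simulate corruption
    comp = zlib.compressobj(wbits=-15)   # raw DEFLATE, no wrapper
    body = comp.compress(payload) + comp.flush()
    trailer = struct.pack("<LL", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + struct.pack("<H", hcrc) + body + trailer

# zlib (wbits=31) checks the header CRC and accepts the intact member...
good = gzip_member_with_fhcrc(b"hello")
assert zlib.decompress(good, wbits=31) == b"hello"

# ...and rejects the member whose header CRC is wrong.
bad = gzip_member_with_fhcrc(b"hello", corrupt_hcrc=True)
try:
    zlib.decompress(bad, wbits=31)
except zlib.error:
    pass
```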
msg405783 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-05 10:38
Bump. This bug allows corrupted gzip files to be processed without error, so I am bumping this issue in the hope that someone will review the PR.
msg406747 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-22 08:35
msg406753 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-11-22 09:25
I think it is a good idea, although we may get reports about regressions in 3.11 when Python starts to reject GZIP files that were successfully read before.

But before merging this I want to know:

1. How much overhead does it add for reading files with the FHCRC flag clear?
2. How much overhead does it add for reading files with the FHCRC flag set?
3. Do GZIP files created by other tools have the FHCRC flag set by default?
msg406758 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-22 10:44
1. Quite a lot.

I tested it for the two most common use cases:
import statistics
import timeit

# Setup for an empty payload written by GzipFile; its header carries FNAME.
WITH_FNAME = """
import io
from gzip import GzipFile, decompress
fileobj = io.BytesIO()
g = GzipFile(fileobj=fileobj, mode='wb', filename='compressable_file')
g.write(b'')
g.close()
data = fileobj.getvalue()
"""

# Setup for an empty payload from zlib with wbits=31; its header has no flags.
WITH_NOFLAGS = """
import zlib
from gzip import decompress
data = zlib.compress(b'', wbits=31)
"""

def benchmark(name, setup, loops=10000, runs=10):
    results = [timeit.timeit("decompress(data)", setup, number=loops)
               for _ in range(runs)]
    # Convert total seconds per run into microseconds per call.
    results = [(result / loops) * 1_000_000 for result in results]
    print(f"average: {round(statistics.mean(results), 2)}, "
          f"range: {round(min(results), 2)}-{round(max(results), 2)} "
          f"stdev: {round(statistics.stdev(results), 2)}")

if __name__ == "__main__":
    benchmark("with_fname", WITH_FNAME)
    benchmark("with_noflags", WITH_NOFLAGS)


Before (current main):
average: 3.27, range: 3.21-3.36 stdev: 0.05
average: 3.24, range: 3.14-3.37 stdev: 0.07

After (PR 29028):
average: 4.98, range: 4.85-5.14 stdev: 0.1
average: 4.87, range: 4.69-5.05 stdev: 0.1

That is a dramatic increase in overhead. (Granted, the decompressed data is empty, but still.)

2. I haven't tested this yet, but the regression is already quite unacceptable.

3. Not that I know of. But if a tool does set it, it is safe to assume its users care about it. Nevertheless, this is a bit of an edge case.
msg406815 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-23 06:12
I improved the performance of the patch, added the file used for benchmarking, and now also test the FHCRC changes.

The benchmark tests headers with different flags, concatenated to a DEFLATE block with no data and a gzip trailer. The data is fed to gzip.decompress. Please note that this is the *worst-case* performance overhead: when there is actual data to decompress, the relative overhead shrinks, and it shrinks further when GzipFile is used.

BEFORE (Current main branch):
$ ./python 
average: 3.01, range: 2.9-4.79 stdev: 0.19
average: 2.99, range: 2.93-3.04 stdev: 0.02
All flags (incl FHCRC)
average: 3.13, range: 3.05-3.16 stdev: 0.02

After (bpo-45509 PR):
average: 3.09, range: 3.01-4.63 stdev: 0.16
average: 3.1, range: 3.03-3.38 stdev: 0.04
All flags (incl FHCRC)
average: 4.09, range: 4.05-4.49 stdev: 0.04

An increase of 0.1 microseconds in the most common use cases, roughly 3%. But the FNAME field is now correctly checked for truncation.

With FHCRC set, the overhead increases by 33%. That is worth it, because the header is now actually checked, as it should be.
msg406918 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-24 11:10
I have found that the timeit module's command-line interface provides more precise measurements:

For a simple gzip header (as returned by gzip.compress, or zlib.compress with wbits=31):
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with FNAME (as written by gzip itself and by Python's GzipFile):
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x08j\x1a\x9ea\x02\xffcompressable_file\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with all flags set:
./python -m timeit -s 'import gzip, io; data = b"\x1f\x8b\x08\x1f\x00\x00\x00\x00\x00\xff\x05\x00extraname\x00comment\x00\xe9T"; from gzip import _read_gzip_header' '_read_gzip_header(io.BytesIO(data))'

Since performance is most critical for in-memory compression and decompression, I have now optimized for the no-flags case.
Before (current main): 500000 loops, best of 5: 469 nsec per loop
after (PR): 1000000 loops, best of 5: 390 nsec per loop

For the most common case of only FNAME set:
before: 200000 loops, best of 5: 1.48 usec per loop
after: 200000 loops, best of 5: 1.45 usec per loop

For the case where FHCRC is set:
before: 200000 loops, best of 5: 1.62 usec per loop
after: 100000 loops, best of 5: 2.43 usec per loop

So this PR is now a clear win when decompressing anything compressed with gzip.compress, and neutral for normal file decompression. There is a performance cost associated with correctly checking the header, but that is expected, and better than the alternative of not checking it.
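The check the PR adds can be sketched roughly as follows (a simplified reimplementation following RFC 1952, not the actual patch): accumulate every header byte as it is read, and when FHCRC is set, compare the stored CRC16 against the low 16 bits of the CRC32 of those bytes, rejecting truncated fields along the way.

```python
import io
import struct
import zlib

# Header flag bits, per RFC 1952.
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 0x01, 0x02, 0x04, 0x08, 0x10

def read_checked_header(fp):
    # Accumulate every consumed header byte in `raw` so the CRC16 can be verified.
    raw = fp.read(10)
    if len(raw) < 10 or raw[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip file")
    if raw[2] != 8:
        raise ValueError("unknown compression method")
    flags = raw[3]
    if flags & FEXTRA:
        xlen_raw = fp.read(2)
        if len(xlen_raw) < 2:
            raise EOFError("truncated FEXTRA length")
        xlen, = struct.unpack("<H", xlen_raw)
        extra = fp.read(xlen)
        if len(extra) < xlen:
            raise EOFError("truncated FEXTRA field")
        raw += xlen_raw + extra
    for flag in (FNAME, FCOMMENT):  # zero-terminated strings, in this order
        if flags & flag:
            while True:
                c = fp.read(1)
                if not c:
                    raise EOFError("truncated header string")
                raw += c
                if c == b"\x00":
                    break
    if flags & FHCRC:
        stored = fp.read(2)
        if len(stored) < 2:
            raise EOFError("truncated FHCRC field")
        if struct.unpack("<H", stored)[0] != (zlib.crc32(raw) & 0xFFFF):
            raise ValueError("gzip header CRC mismatch")

# A header with FNAME and FHCRC set, plus its correct CRC16, passes:
hdr = b"\x1f\x8b\x08" + bytes([FNAME | FHCRC]) + b"\x00\x00\x00\x00\x00\xff" + b"name\x00"
read_checked_header(io.BytesIO(hdr + struct.pack("<H", zlib.crc32(hdr) & 0xFFFF)))
```

With a corrupted CRC16, the same header raises ValueError instead of being silently accepted.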
Date User Action Args
2022-04-11 14:59:51 admin set github: 89672
2021-11-24 11:10:50 rhpvorderman set files: +; messages: + msg406918
2021-11-23 06:12:51 rhpvorderman set files: +; messages: + msg406815
2021-11-22 10:44:01 rhpvorderman set messages: + msg406758
2021-11-22 09:25:43 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg406753
2021-11-22 08:35:24 rhpvorderman set messages: + msg406747
2021-11-05 10:38:27 rhpvorderman set messages: + msg405783
2021-10-18 11:48:39 rhpvorderman set keywords: + patch; stage: patch review; pull_requests: + pull_request27300
2021-10-18 10:41:07 rhpvorderman create