classification
Title: Gzip header corruption not properly checked.
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: rhpvorderman, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2021-10-18 10:41 by rhpvorderman, last changed 2021-11-24 11:10 by rhpvorderman.

Files
File name Uploaded Description Edit
benchmark_gzip_read_header.py rhpvorderman, 2021-11-23 06:12
benchmark_gzip_read_header.py rhpvorderman, 2021-11-24 11:10
Pull Requests
URL Status Linked Edit
PR 29028 open rhpvorderman, 2021-10-18 11:48
Messages (7)
msg404174 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-10-18 10:41
The following headers are currently allowed while being wrong:

- Headers with FCOMMENT flag set, but with incomplete or missing COMMENT bytes.
- Headers with FNAME flag set, but with incomplete or missing NAME bytes
- Headers with FHCRC set, the crc is read, but not checked.
msg405783 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-05 10:38
Bump. This is a bug that allows corrupted gzip files to be processed  without error. Therefore I bump this issue in the hopes someone will review the PR.
msg406747 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-22 08:35
Ping
msg406753 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-11-22 09:25
I think it is good idea, although we may get reports about regressions in 3.11 when Python will start to reject GZIP files which was successfully read before.

But before merging this I want to know:

1. How much overhead does it add for reading files with the FHCRC flag clear?
2. How much overhead does it add for reading files with the FHCRC flag set?
3. Are GZIP files created by other tools has the FHCRC flag by default?
msg406758 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-22 10:44
1. Quite a lot

I tested it for the two most common use case. 
import timeit
import statistics

WITH_FNAME = """
from gzip import GzipFile, decompress
import io
fileobj = io.BytesIO()
g = GzipFile(fileobj=fileobj, mode='wb', filename='compressable_file')
g.write(b'')
g.close()
data=fileobj.getvalue()
"""
WITH_NO_FLAGS = """
from gzip import decompress
import zlib
data = zlib.compress(b'', wbits=31)
"""

def benchmark(name, setup, loops=10000, runs=10):
    print(f"{name}")
    results = [timeit.timeit("decompress(data)", setup, number=loops) for _ in range(runs)]
    # Calculate microseconds
    results = [(result / loops) * 1_000_000 for result in results]
    print(f"average: {round(statistics.mean(results), 2)}, "
          f"range: {round(min(results), 2)}-{round(max(results),2)} "
          f"stdev: {round(statistics.stdev(results),2)}")


if __name__ == "__main__":
    benchmark("with_fname", WITH_FNAME)
    benchmark("with_noflags", WITH_FNAME)

BEFORE:

with_fname
average: 3.27, range: 3.21-3.36 stdev: 0.05
with_noflags
average: 3.24, range: 3.14-3.37 stdev: 0.07

AFTER:
with_fname
average: 4.98, range: 4.85-5.14 stdev: 0.1
with_noflags
average: 4.87, range: 4.69-5.05 stdev: 0.1

That is a dramatic increase in overhead. (Okay the decompressed data is empty, but still)

2. Haven't tested this yet. But the regression is quite unacceptable already.

3. Not that I know of. But if it is set, it is safe to assume they care. Nevertheless this is a bit of an edge-case.
msg406815 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-23 06:12
I increased the performance of the patch. I added the file used for benchmarking. I also test the FHCRC changes now.

The benchmark tests headers with different flags concatenated to a DEFLATE block with no data and a gzip trailer. The data is fed to gzip.decompress. Please note that this is the *worst-case* performance overhead. When there is actual data to decompress the overhead will get less. When GzipFile is used the overhead will get less as well.

BEFORE (Current main branch):
$ ./python benchmark_gzip_read_header.py 
with_fname
average: 3.01, range: 2.9-4.79 stdev: 0.19
with_noflags
average: 2.99, range: 2.93-3.04 stdev: 0.02
All flags (incl FHCRC)
average: 3.13, range: 3.05-3.16 stdev: 0.02


After (bpo-45509 PR):
with_fname
average: 3.09, range: 3.01-4.63 stdev: 0.16
with_noflags
average: 3.1, range: 3.03-3.38 stdev: 0.04
All flags (incl FHCRC)
average: 4.09, range: 4.05-4.49 stdev: 0.04

An increase of .1 microsecond in the most common use cases. Roughly 3%. But now the FNAME field is correctly checked for truncation.

With the FHCRC the overhead is increased by 33%. But this is worth it, because the header is now actually checked. As it should.
msg406918 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-24 11:10
I have found that using the timeit module provides more precise measurements:

For a simple gzip header. (As returned by gzip.compress or zlib.compress with wbits=31)
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'


For a gzip header with FNAME. (Returned by gzip itself and by Python's GzipFile)
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x08j\x1a\x9ea\x02\xffcompressable_file\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with all flags set:
./python -m timeit -s 'import gzip, io; data = b"\x1f\x8b\x08\x1f\x00\x00\x00\x00\x00\xff\x05\x00extraname\x00comment\x00\xe9T"; from gzip import _read_gzip_header' '_read_gzip_header(io.BytesIO(data))'


Since performance is most critical for in-memory compression and decompression, I now optimized for no flags.
Before (current main): 500000 loops, best of 5: 469 nsec per loop
after (PR): 1000000 loops, best of 5: 390 nsec per loop

For the most common case of only FNAME set:
before: 200000 loops, best of 5: 1.48 usec per loop
after: 200000 loops, best of 5: 1.45 usec per loop

For the case where FCHRC is set:
before: 200000 loops, best of 5: 1.62 usec per loop
after: 100000 loops, best of 5: 2.43 usec per loop


So this PR is now a clear win for decompressing anything that has been compressed with gzip.compress. It is neutral for normal file decompression. There is a performance cost associated with correctly checking the header, but that is expected. It is better than the alternative of not checking it.
History
Date User Action Args
2021-11-24 11:10:50rhpvordermansetfiles: + benchmark_gzip_read_header.py

messages: + msg406918
2021-11-23 06:12:51rhpvordermansetfiles: + benchmark_gzip_read_header.py

messages: + msg406815
2021-11-22 10:44:01rhpvordermansetmessages: + msg406758
2021-11-22 09:25:43serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg406753
2021-11-22 08:35:24rhpvordermansetmessages: + msg406747
2021-11-05 10:38:27rhpvordermansetmessages: + msg405783
2021-10-18 11:48:39rhpvordermansetkeywords: + patch
stage: patch review
pull_requests: + pull_request27300
2021-10-18 10:41:07rhpvordermancreate