Issue 45509: Gzip header corruption not properly checked.

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/89672

classification

Title:	Gzip header corruption not properly checked.
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	rhpvorderman, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2021-10-18 10:41 by rhpvorderman, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description	Edit
benchmark_gzip_read_header.py	rhpvorderman, 2021-11-23 06:12
benchmark_gzip_read_header.py	rhpvorderman, 2021-11-24 11:10

Pull Requests
URL	Status	Linked	Edit
PR 29028	open	rhpvorderman, 2021-10-18 11:48

Messages (7)
msg404174 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-10-18 10:41
The following headers are currently allowed while being wrong: - Headers with FCOMMENT flag set, but with incomplete or missing COMMENT bytes. - Headers with FNAME flag set, but with incomplete or missing NAME bytes - Headers with FHCRC set, the crc is read, but not checked.
msg405783 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-11-05 10:38
Bump. This is a bug that allows corrupted gzip files to be processed without error. Therefore I bump this issue in the hopes someone will review the PR.
msg406747 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-11-22 08:35
Ping
msg406753 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-11-22 09:25
I think it is good idea, although we may get reports about regressions in 3.11 when Python will start to reject GZIP files which was successfully read before. But before merging this I want to know: 1. How much overhead does it add for reading files with the FHCRC flag clear? 2. How much overhead does it add for reading files with the FHCRC flag set? 3. Are GZIP files created by other tools has the FHCRC flag by default?
msg406758 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-11-22 10:44
1. Quite a lot I tested it for the two most common use case. import timeit import statistics WITH_FNAME = """ from gzip import GzipFile, decompress import io fileobj = io.BytesIO() g = GzipFile(fileobj=fileobj, mode='wb', filename='compressable_file') g.write(b'') g.close() data=fileobj.getvalue() """ WITH_NO_FLAGS = """ from gzip import decompress import zlib data = zlib.compress(b'', wbits=31) """ def benchmark(name, setup, loops=10000, runs=10): print(f"{name}") results = [timeit.timeit("decompress(data)", setup, number=loops) for _ in range(runs)] # Calculate microseconds results = [(result / loops) * 1_000_000 for result in results] print(f"average: {round(statistics.mean(results), 2)}, " f"range: {round(min(results), 2)}-{round(max(results),2)} " f"stdev: {round(statistics.stdev(results),2)}") if __name__ == "__main__": benchmark("with_fname", WITH_FNAME) benchmark("with_noflags", WITH_FNAME) BEFORE: with_fname average: 3.27, range: 3.21-3.36 stdev: 0.05 with_noflags average: 3.24, range: 3.14-3.37 stdev: 0.07 AFTER: with_fname average: 4.98, range: 4.85-5.14 stdev: 0.1 with_noflags average: 4.87, range: 4.69-5.05 stdev: 0.1 That is a dramatic increase in overhead. (Okay the decompressed data is empty, but still) 2. Haven't tested this yet. But the regression is quite unacceptable already. 3. Not that I know of. But if it is set, it is safe to assume they care. Nevertheless this is a bit of an edge-case.
msg406815 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-11-23 06:12
I increased the performance of the patch. I added the file used for benchmarking. I also test the FHCRC changes now. The benchmark tests headers with different flags concatenated to a DEFLATE block with no data and a gzip trailer. The data is fed to gzip.decompress. Please note that this is the worst-case performance overhead. When there is actual data to decompress the overhead will get less. When GzipFile is used the overhead will get less as well. BEFORE (Current main branch): $ ./python benchmark_gzip_read_header.py with_fname average: 3.01, range: 2.9-4.79 stdev: 0.19 with_noflags average: 2.99, range: 2.93-3.04 stdev: 0.02 All flags (incl FHCRC) average: 3.13, range: 3.05-3.16 stdev: 0.02 After (bpo-45509 PR): with_fname average: 3.09, range: 3.01-4.63 stdev: 0.16 with_noflags average: 3.1, range: 3.03-3.38 stdev: 0.04 All flags (incl FHCRC) average: 4.09, range: 4.05-4.49 stdev: 0.04 An increase of .1 microsecond in the most common use cases. Roughly 3%. But now the FNAME field is correctly checked for truncation. With the FHCRC the overhead is increased by 33%. But this is worth it, because the header is now actually checked. As it should.
msg406918 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-11-24 11:10
I have found that using the timeit module provides more precise measurements: For a simple gzip header. (As returned by gzip.compress or zlib.compress with wbits=31) ./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))' For a gzip header with FNAME. (Returned by gzip itself and by Python's GzipFile) ./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x08j\x1a\x9ea\x02\xffcompressable_file\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))' For a gzip header with all flags set: ./python -m timeit -s 'import gzip, io; data = b"\x1f\x8b\x08\x1f\x00\x00\x00\x00\x00\xff\x05\x00extraname\x00comment\x00\xe9T"; from gzip import _read_gzip_header' '_read_gzip_header(io.BytesIO(data))' Since performance is most critical for in-memory compression and decompression, I now optimized for no flags. Before (current main): 500000 loops, best of 5: 469 nsec per loop after (PR): 1000000 loops, best of 5: 390 nsec per loop For the most common case of only FNAME set: before: 200000 loops, best of 5: 1.48 usec per loop after: 200000 loops, best of 5: 1.45 usec per loop For the case where FCHRC is set: before: 200000 loops, best of 5: 1.62 usec per loop after: 100000 loops, best of 5: 2.43 usec per loop So this PR is now a clear win for decompressing anything that has been compressed with gzip.compress. It is neutral for normal file decompression. There is a performance cost associated with correctly checking the header, but that is expected. It is better than the alternative of not checking it.

History
Date	User	Action	Args
2022-04-11 14:59:51	admin	set	github: 89672
2021-11-24 11:10:50	rhpvorderman	set	files: + benchmark_gzip_read_header.py messages: + msg406918
2021-11-23 06:12:51	rhpvorderman	set	files: + benchmark_gzip_read_header.py messages: + msg406815
2021-11-22 10:44:01	rhpvorderman	set	messages: + msg406758
2021-11-22 09:25:43	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg406753
2021-11-22 08:35:24	rhpvorderman	set	messages: + msg406747
2021-11-05 10:38:27	rhpvorderman	set	messages: + msg405783
2021-10-18 11:48:39	rhpvorderman	set	keywords: + patch stage: patch review pull_requests: + pull_request27300
2021-10-18 10:41:07	rhpvorderman	create