Issue 24301: gzip module failing to decompress valid compressed file

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68489

classification

Title:	gzip module failing to decompress valid compressed file
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Ericg, iritkatriel, martin.panter, nczeczulin, ned.deily, rhpvorderman
Priority:	normal	Keywords:	patch

Created on 2015-05-27 15:59 by Ericg, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 29847	open	rhpvorderman, 2021-11-29 15:28

Messages (9)
msg244188 - (view)	Author: EricG (Ericg)	Date: 2015-05-27 15:59
I have a file whose first four bytes are 1F 8B 08 00. If I use gunzip from the command line, it outputs:

    gzip: zImage_extracted.gz: decompression OK, trailing garbage ignored

and correctly decompresses the file. However, if I use the gzip module to read and decompress the data, I get the following exception thrown:

      File "/usr/lib/python3.4/gzip.py", line 360, in read
        while self._read(readsize):
      File "/usr/lib/python3.4/gzip.py", line 433, in _read
        if not self._read_gzip_header():
      File "/usr/lib/python3.4/gzip.py", line 297, in _read_gzip_header
        raise OSError('Not a gzipped file')

I believe the problem I am facing is the same one described in this SO question and answer: http://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data

This would appear to be a serious bug in the gzip module that needs to be fixed.
msg244214 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-05-27 18:47
Can you add a public copy of a file that fails this way? There are several open issues with gzip, like Issue1159051, that might cover this but it's hard to know for sure without a test case.
msg244230 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-05-28 00:26
I suspect Eric’s file has non-zero, non-gzip garbage bytes appended to the end of it. Assuming I am right, here is a way to reproduce that scenario:

    >>> from gzip import GzipFile
    >>> from io import BytesIO
    >>> file = BytesIO()
    >>> with GzipFile(fileobj=file, mode="wb") as z:
    ...     z.write(b"data")
    ...
    4
    >>> file.write(b"garbage")
    7
    >>> file.seek(0)
    0
    >>> GzipFile(fileobj=file).read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/proj/python/cpython/Lib/gzip.py", line 274, in read
        return self._buffer.read(size)
      File "/home/proj/python/cpython/Lib/gzip.py", line 461, in read
        if not self._read_gzip_header():
      File "/home/proj/python/cpython/Lib/gzip.py", line 409, in _read_gzip_header
        raise OSError('Not a gzipped file (%r)' % magic)
    OSError: Not a gzipped file (b'ga')

This is a bit different to Issue 1508475. That one is about cases where the “gzip” trailer has been truncated, although the compressed data is probably intact. This case is the converse: extra data has been added.

All of the “gzip”, “bzip2” and XZ Utils (for LZMA) command-line decompressors happily extract the compressed data without an error exit status, but emit warning messages:

    gzip: stdin: decompression OK, trailing garbage ignored
    bzip2: (stdin): trailing garbage after EOF ignored
    xz: (stdin): Unexpected end of input

In Python, the “bz2” and LZMA modules successfully extract the compressed data, and ignore the non-compressed garbage at the end without even a warning. On the other hand, the “gzip” module has special code to ignore trailing zero bytes (Issue 2846), but treats any other trailing non-gzip data as an error.

So I think a strong argument could be made for the ability to extract all the compressed data from a file even if there is garbage appended. The question is, how would this support be added? Perhaps the mechanism chosen could also be integrated with a fix for Issue 1508475.

Some options:

* Silently ignore the condition by default like the other compression modules (consistent, but could silently swallow real errors)
* An optional new GzipFile(strict=False) mode
* Perhaps an exception deferred until close() is called
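Until one of these options exists in the gzip module itself, the lenient behaviour can be approximated with the zlib module. The following is only a sketch under stated assumptions, not a proposed patch: the function name decompress_members and its strict flag are hypothetical (the flag merely echoes the GzipFile(strict=False) idea above), and the whole input is assumed to fit in memory. It relies on zlib.decompressobj(wbits=31), which parses the gzip header and trailer, and on the decompressor's unused_data attribute, which holds whatever follows the end of a member.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def decompress_members(raw, strict=False):
    """Decompress every gzip member at the start of raw.

    Returns (data, trailing), where trailing is whatever follows the
    last member. Hypothetical helper: with strict=True, any trailing
    bytes are treated as an error instead of being returned.
    """
    if not raw.startswith(GZIP_MAGIC):
        raise ValueError("data does not start with a gzip member")

    chunks = []
    rest = raw
    while rest.startswith(GZIP_MAGIC):
        # wbits=31 selects the gzip container; zlib checks the CRC32 and
        # length trailer and raises zlib.error if they do not match.
        decomp = zlib.decompressobj(wbits=31)
        chunks.append(decomp.decompress(rest))
        if not decomp.eof:
            raise EOFError("compressed stream ended before its trailer")
        # Bytes after the end of this member: another member or garbage.
        rest = decomp.unused_data

    if rest and strict:
        raise ValueError("%d bytes of trailing garbage" % len(rest))
    return b"".join(chunks), rest
```

One limitation of this sketch: garbage that happens to begin with the bytes 1F 8B is treated as another member and will most likely raise zlib.error rather than being returned as trailing data.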
msg245368 - (view)	Author: Nick Czeczulin (nczeczulin)	Date: 2015-06-15 06:58
The spec allows for multi-member files. Some libraries and utilities seem to solve this problem (incorrectly?) by simply ignoring everything past the first member -- even when valid (e.g., DotNetZip, 7-Zip).

For 2.7 and 3.4, the data that has been decompressed but not yet read before the exception was raised is still available. Modifying Martin's example slightly:

    >>> f = BytesIO()
    >>> with GzipFile(fileobj=f, mode="wb") as z:
    ...     z.write(b"data")
    ...
    4
    >>> f.write(b"garbage")
    7
    >>> f.seek(0)
    0
    >>> with GzipFile(fileobj=f, mode="rb") as z:
    ...     try:
    ...         z.read(1)
    ...         z.read()
    ...     except OSError as e:
    ...         z.extrabuf[z.offset - z.extrastart:]
    ...         e
    ...
    b'd'
    b'ata'
    OSError('Not a gzipped file',)

My issue is that catching and handling this specific exception is a little more involved because there are 3(?) different OSErrors (IOError on 2.7) that could potentially be raised during the read. But mostly:

    OSError('CRC check failed 0x447ba3f9 != 0x225cb2a3',)

-- would be a bad one to mistake for it.

Maybe a specific Exception type to catch for an invalid header, and a better method to read the remaining buffer when handling it?
msg245369 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-06-15 08:17
Just noticed in my previous message I mentioned Issue 1508475 a few times when I meant to say Issue 1159051.

In Python 3.5, a workaround is not so easy because we would need to access the internal buffer of a BufferedReader. One potential workaround is to use read1():

    >>> z.read1(1)
    b'd'
    >>> z.read1()
    b'ata'
    >>> z.read1()
    OSError: Not a gzipped file (b'ga')

The only practical way to allow for an exception and read() returning all the data is to defer the exception until close() is called. Another option might be to store a list of defects, similar to “email.message.Message.defects”.
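The read1() workaround can be wrapped in a small loop that keeps whatever was decompressed before the error. This is a minimal sketch, not part of the gzip module: the helper name read_until_error is hypothetical, and the caller is assumed to inspect the returned exception, because a CRC failure is also reported as an OSError and must not be mistaken for harmless trailing garbage.

```python
import gzip


def read_until_error(fileobj):
    """Read a gzip stream with read1(), keeping data seen before an error.

    Returns (data, error). error is None for a clean stream, otherwise
    the OSError that stopped the read (bad header, CRC mismatch, ...).
    A truncated stream raises EOFError, which is deliberately not caught.
    """
    chunks = []
    error = None
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as z:
        while True:
            try:
                # read1() returns what is available from the current
                # member, so data is handed over before the next header
                # is parsed.
                chunk = z.read1()
            except OSError as exc:
                error = exc
                break
            if not chunk:
                break  # clean end of stream
            chunks.append(chunk)
    return b"".join(chunks), error
```

For the reproduction above, calling read_until_error(file) after file.seek(0) would be expected to return b'data' together with the "Not a gzipped file" exception.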
msg407148 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-11-27 14:47
Reproduced on 3.11:

    >>> from gzip import GzipFile
    >>> from io import BytesIO
    >>> file = BytesIO()
    >>> with GzipFile(fileobj=file, mode="wb") as z:
    ...     z.write(b"data")
    ...
    4
    >>> file.write(b"garbage")
    7
    >>> file.seek(0)
    0
    >>> GzipFile(fileobj=file).read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 301, in read
        return self._buffer.read(size)
               ^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-654/Lib/_compression.py", line 118, in readall
        while data := self.read(sys.maxsize):
                      ^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 499, in read
        if not self._read_gzip_header():
               ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 468, in _read_gzip_header
        last_mtime = _read_gzip_header(self._fp)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 428, in _read_gzip_header
        raise BadGzipFile('Not a gzipped file (%r)' % magic)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    gzip.BadGzipFile: Not a gzipped file (b'ga')
msg407280 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-11-29 14:37
From the spec: https://datatracker.ietf.org/doc/html/rfc1952

    2.2. File format

    A gzip file consists of a series of "members" (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them.

Gzip files with garbage after them are corrupted or not spec compliant. Therefore the gzip module should raise an error in this case.
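To make the member layout concrete, here is a small sketch that decodes the fixed 10-byte header that starts every member in RFC 1952 (ID1, ID2, CM, FLG, MTIME, XFL, OS). The function name parse_member_header is hypothetical and the sketch ignores the optional fields selected by FLG. It shows why the reader fails on trailing garbage: the bytes after the first member do not begin with the magic 1F 8B.

```python
import struct


def parse_member_header(buf):
    """Decode the fixed 10-byte gzip member header from RFC 1952.

    Hypothetical helper for illustration; optional fields (extra,
    name, comment, header CRC) that may follow are not parsed.
    """
    if len(buf) < 10:
        raise ValueError("need at least 10 bytes for a gzip header")
    # Little-endian: 2-byte magic, CM, FLG, 4-byte MTIME, XFL, OS.
    magic, method, flags, mtime, xfl, os_code = struct.unpack(
        "<2sBBIBB", buf[:10]
    )
    return {
        "is_gzip": magic == b"\x1f\x8b",
        "magic": magic,
        "method": method,  # 8 means deflate
        "flags": flags,
        "mtime": mtime,
        "xfl": xfl,
        "os": os_code,
    }
```

Applied to a file whose first four bytes are 1F 8B 08 00, this reports a gzip member using deflate with no optional fields; applied to data starting with b"garbage", the magic comes back as b'ga', matching the error message in the tracebacks above.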
msg407282 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-11-29 14:53
Whoops. Sorry, I spoke too soon. If the gzip command-line tool ignores trailing garbage, it seems only logical that Python's gzip module should too. I believe it can be fixed quite easily. The code should raise a warning though. I will make a PR.
msg409410 - (view)	Author: Ruben Vorderman (rhpvorderman) *	Date: 2021-12-31 09:44
ping

History
Date	User	Action	Args
2022-04-11 14:58:17	admin	set	github: 68489
2021-12-31 09:44:00	rhpvorderman	set	messages: + msg409410
2021-11-29 15:28:28	rhpvorderman	set	keywords: + patch stage: patch review pull_requests: + pull_request28076
2021-11-29 14:53:12	rhpvorderman	set	messages: + msg407282
2021-11-29 14:37:50	rhpvorderman	set	nosy: + rhpvorderman messages: + msg407280
2021-11-27 14:47:00	iritkatriel	set	versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.4 nosy: + iritkatriel messages: + msg407148 type: behavior
2015-06-15 08:17:58	martin.panter	set	messages: + msg245369 components: + Library (Lib), - Extension Modules
2015-06-15 06:58:54	nczeczulin	set	nosy: + nczeczulin messages: + msg245368
2015-05-28 00:26:40	martin.panter	set	nosy: + martin.panter messages: + msg244230
2015-05-27 18:47:45	ned.deily	set	type: crash -> (no value) messages: + msg244214 nosy: + ned.deily
2015-05-27 15:59:03	Ericg	create