classification
Title: gzip module failing to decompress valid compressed file
Type: behavior
Stage: patch review
Components: Library (Lib)
Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Ericg, iritkatriel, martin.panter, nczeczulin, ned.deily, rhpvorderman
Priority: normal
Keywords: patch

Created on 2015-05-27 15:59 by Ericg, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL Status Linked Edit
PR 29847 open rhpvorderman, 2021-11-29 15:28
Messages (9)
msg244188 - (view) Author: EricG (Ericg) Date: 2015-05-27 15:59
I have a file whose first four bytes are 1F 8B 08 00 and if I use gunzip from the command line, it outputs:

gzip: zImage_extracted.gz: decompression OK, trailing garbage ignored

and correctly decompresses the file. However, if I use the gzip module to read and decompress the data, I get the following exception thrown:

  File "/usr/lib/python3.4/gzip.py", line 360, in read
    while self._read(readsize):
  File "/usr/lib/python3.4/gzip.py", line 433, in _read
    if not self._read_gzip_header():
  File "/usr/lib/python3.4/gzip.py", line 297, in _read_gzip_header
    raise OSError('Not a gzipped file')

I believe the problem I am facing is the same one described here in this SO question and answer:

http://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data


This would appear to be a serious bug in the gzip module that needs to be fixed.
msg244214 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-05-27 18:47
Can you add a public copy of a file that fails this way?  There are several open issues with gzip, like Issue1159051, that might cover this but it's hard to know for sure without a test case.
msg244230 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-28 00:26
I suspect Eric’s file has non-zero, non-gzip garbage bytes appended to the end of it. Assuming I am right, here is a way to reproduce that scenario:

>>> from gzip import GzipFile
>>> from io import BytesIO
>>> file = BytesIO()
>>> with GzipFile(fileobj=file, mode="wb") as z:
...     z.write(b"data")
... 
4
>>> file.write(b"garbage")
7
>>> file.seek(0)
0
>>> GzipFile(fileobj=file).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/home/proj/python/cpython/Lib/gzip.py", line 461, in read
    if not self._read_gzip_header():
  File "/home/proj/python/cpython/Lib/gzip.py", line 409, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'ga')

This is a bit different to Issue 1508475. That one is about cases where the “gzip” trailer has been truncated, although the compressed data is probably intact. This case is the converse: extra data has been added.

All of the “gzip”, “bzip2” and XZ Utils (for LZMA) command-line decompressors happily extract the compressed data without an error exit status, but emit warning messages:

gzip: stdin: decompression OK, trailing garbage ignored
bzip2: (stdin): trailing garbage after EOF ignored
xz: (stdin): Unexpected end of input

In Python, the “bz2” and LZMA modules successfully extract the compressed data, and ignore the non-compressed garbage at the end without even a warning. On the other hand, the “gzip” module has special code to ignore trailing zero bytes (Issue 2846), but treats any other trailing non-gzip data as an error.
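
The difference between the modules can be checked directly. The snippet below is only a small illustration using the one-shot bz2.decompress() and gzip.decompress() helpers, and it assumes a Python version in which the behaviour reported in this issue has not been changed; the variable names are made up for the example.

```python
import bz2
import gzip

payload = b"data"
garbage = b"garbage"

# bz2.decompress() stops quietly at the first chunk that is not a valid
# bzip2 stream, as long as at least one stream was decoded before it.
bz2_result = bz2.decompress(bz2.compress(payload) + garbage)

# gzip.decompress() tolerates zero padding after the last member ...
gzip_zero_padded = gzip.decompress(gzip.compress(payload) + b"\x00" * 16)

# ... but rejects any other trailing bytes.
try:
    gzip.decompress(gzip.compress(payload) + garbage)
except OSError as exc:  # gzip.BadGzipFile (an OSError subclass) on 3.8+
    gzip_error = exc
else:
    gzip_error = None

print(bz2_result, gzip_zero_padded, repr(gzip_error))
```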

So I think a strong argument could be made for the ability to extract all the compressed data from a file even if there is garbage appended. The question is, how would this support be added? Perhaps the mechanism chosen could also be integrated with a fix for Issue 1508475. Some options:

* Silently ignore the condition by default like the other compression modules (consistent, but could silently swallow real errors)
* An optional new GzipFile(strict=False) mode
* Perhaps an exception deferred until close() is called
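
None of these options exists in the gzip module as discussed here. Until one does, a caller who wants the lenient behaviour can get it outside GzipFile by driving zlib directly. What follows is only a sketch of that workaround, not a proposed patch: the function name decompress_ignoring_trailing_garbage and the GZIP_MAGIC constant are hypothetical, and the whole input is assumed to fit in memory.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def decompress_ignoring_trailing_garbage(data):
    """Return (payload, trailing) for a bytes object holding gzip members.

    Members are decoded one after another for as long as the remaining
    input starts with the gzip magic number; whatever is left over is
    handed back to the caller instead of raising an error.
    """
    chunks = []
    remaining = data
    while remaining[:2] == GZIP_MAGIC:
        # wbits = 16 + MAX_WBITS makes zlib parse the gzip header and
        # verify the CRC32/length trailer of the member.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(decomp.decompress(remaining))
        if not decomp.eof:
            raise EOFError("gzip member ended before its end-of-stream marker")
        remaining = decomp.unused_data
    if not chunks:
        raise OSError("Not a gzipped file")
    return b"".join(chunks), remaining


payload, trailing = decompress_ignoring_trailing_garbage(
    gzip.compress(b"data") + b"garbage"
)
print(payload, trailing)
```

A corrupted member still raises zlib.error, so real damage is not swallowed. Garbage that happens to begin with the bytes 1F 8B is treated as another member and will also raise, which a caller may or may not want.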
msg245368 - (view) Author: Nick Czeczulin (nczeczulin) Date: 2015-06-15 06:58
The spec allows for multi-member files. Some libraries and utilities seem to solve this problem (incorrectly?) by simply ignoring everything past the first member -- even when valid (e.g., DotNetZip, 7-Zip)

For 2.7 and 3.4, the data that has been decompressed but not yet read before the exception was raised is still available:

Modifying Martin's example slightly:

>>> f = BytesIO()
>>> with GzipFile(fileobj=f, mode="wb") as z:
...     z.write(b"data")
...
4
>>> f.write(b"garbage")
7
>>> f.seek(0)
0
>>> with GzipFile(fileobj=f, mode="rb") as z:
...     try:
...         z.read(1)
...         z.read()
...     except OSError as e:
...         z.extrabuf[z.offset - z.extrastart:]
...         e
...
b'd'
b'ata'
OSError('Not a gzipped file',)

My issue is that catching and handling this specific exception is a little more involved because there are 3(?) different OSErrors (IOError on 2.7) that could potentially be raised during the read. But mostly:
OSError('CRC check failed 0x447ba3f9 != 0x225cb2a3',) -- would be a bad one to mistake for it.

Maybe a specific Exception type to catch for an invalid header, and a better method to read the remaining buffer when handling it?
msg245369 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-06-15 08:17
Just noticed in my previous message I mentioned Issue 1508475 a few times when I meant to say Issue 1159051.

In Python 3.5, a workaround is not so easy because we would need to access the internal buffer of a BufferedReader. One potential workaround is to use read1():

>>> z.read1(1)
b'd'
>>> z.read1()
b'ata'
>>> z.read1()
OSError: Not a gzipped file (b'ga')

The only practical way to allow for an exception and read() returning all the data is to defer the exception until close() is called. Another option might be to store a list of defects, similar to “email.message.Message.defects”.
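
The read1() workaround can be wrapped in a loop that keeps whatever was decompressed before the error. This is a rough sketch only: read_until_error is a hypothetical helper, the chunk size is arbitrary, and it relies on read1() returning the data of a member before the following header is parsed, as in the session above.

```python
import gzip
import io


def read_until_error(fileobj, chunk_size=8192):
    """Return (payload, error); error is None if the stream ended cleanly."""
    chunks = []
    error = None
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as z:
        while True:
            try:
                chunk = z.read1(chunk_size)
            except OSError as exc:
                # Raised for a bad header, but also for a CRC or length
                # mismatch, so the caller still has to inspect the error.
                error = exc
                break
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks), error


payload, error = read_until_error(
    io.BytesIO(gzip.compress(b"data") + b"garbage")
)
print(payload, repr(error))
```

Returning the exception instead of raising it is one way to approximate the "list of defects" idea without changing GzipFile itself.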
msg407148 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-11-27 14:47
Reproduced on 3.11:

>>> from gzip import GzipFile
>>> from io import BytesIO
>>> file = BytesIO()
>>> with GzipFile(fileobj=file, mode="wb") as z:
...     z.write(b"data")
... 
4
>>> file.write(b"garbage")
7
>>> file.seek(0)
0
>>> GzipFile(fileobj=file).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 301, in read
    return self._buffer.read(size)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/_compression.py", line 118, in readall
    while data := self.read(sys.maxsize):
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 499, in read
    if not self._read_gzip_header():
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 468, in _read_gzip_header
    last_mtime = _read_gzip_header(self._fp)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iritkatriel/src/cpython-654/Lib/gzip.py", line 428, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gzip.BadGzipFile: Not a gzipped file (b'ga')
msg407280 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-29 14:37
From the spec:

https://datatracker.ietf.org/doc/html/rfc1952


   2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets).  The format of each member is specified in the following
      section.  The members simply appear one after another in the file,
      with no additional information before, between, or after them.


Gzip files with garbage after them are corrupted or not spec compliant. Therefore the gzip module should raise an error in this case.
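
For reference, the multi-member case that the spec does allow is already handled by the module. A minimal check, with made-up payloads:

```python
import gzip
import io

# Two independent gzip members written back to back, as RFC 1952 permits.
members = gzip.compress(b"first ") + gzip.compress(b"second")

# Both the one-shot helper and GzipFile return the concatenated payloads.
one_shot = gzip.decompress(members)
with gzip.GzipFile(fileobj=io.BytesIO(members), mode="rb") as z:
    streamed = z.read()

print(one_shot, streamed)
```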
msg407282 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-11-29 14:53
Whoops. Sorry, I spoke out of turn. If the gzip command-line tool implements it, it seems only logical that Python's *gzip* module should too.
I believe it can be fixed quite easily. The code should raise a warning though. I will make a PR.
msg409410 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2021-12-31 09:44
ping
History
Date  User  Action  Args
2022-04-11 14:58:17  admin  set  github: 68489
2021-12-31 09:44:00  rhpvorderman  set  messages: + msg409410
2021-11-29 15:28:28  rhpvorderman  set  keywords: + patch; stage: patch review; pull_requests: + pull_request28076
2021-11-29 14:53:12  rhpvorderman  set  messages: + msg407282
2021-11-29 14:37:50  rhpvorderman  set  nosy: + rhpvorderman; messages: + msg407280
2021-11-27 14:47:00  iritkatriel  set  versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.4; nosy: + iritkatriel; messages: + msg407148; type: behavior
2015-06-15 08:17:58  martin.panter  set  messages: + msg245369; components: + Library (Lib), - Extension Modules
2015-06-15 06:58:54  nczeczulin  set  nosy: + nczeczulin; messages: + msg245368
2015-05-28 00:26:40  martin.panter  set  nosy: + martin.panter; messages: + msg244230
2015-05-27 18:47:45  ned.deily  set  type: crash -> (no value); messages: + msg244214; nosy: + ned.deily
2015-05-27 15:59:03  Ericg  create