classification
Title: gzip module failing to decompress valid compressed file
Type: Stage:
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ericg, martin.panter, nczeczulin, ned.deily
Priority: normal Keywords:

Created on 2015-05-27 15:59 by Ericg, last changed 2015-06-15 08:17 by martin.panter.

Messages (5)
msg244188 - (view) Author: EricG (Ericg) Date: 2015-05-27 15:59
I have a file whose first four bytes are 1F 8B 08 00 and if I use gunzip from the command line, it outputs:

gzip: zImage_extracted.gz: decompression OK, trailing garbage ignored

and correctly decompresses the file. However, if I use the gzip module to read and decompress the data, I get the following exception thrown:

  File "/usr/lib/python3.4/gzip.py", line 360, in read
    while self._read(readsize):
  File "/usr/lib/python3.4/gzip.py", line 433, in _read
    if not self._read_gzip_header():
  File "/usr/lib/python3.4/gzip.py", line 297, in _read_gzip_header
    raise OSError('Not a gzipped file')

I believe the problem I am facing is the same one described here in this SO question and answer:

http://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data


This would appear to be a serious bug in the gzip module that needs to be fixed.
msg244214 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-05-27 18:47
Can you add a public copy of a file that fails this way?  There are several open issues with gzip, like Issue1159051, that might cover this but it's hard to know for sure without a test case.
msg244230 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-28 00:26
I suspect Eric’s file has non-zero, non-gzip garbage bytes appended to the end of it. Assuming I am right, here is a way to reproduce that scenario:

>>> from gzip import GzipFile
>>> from io import BytesIO
>>> file = BytesIO()
>>> with GzipFile(fileobj=file, mode="wb") as z:
...     z.write(b"data")
... 
4
>>> file.write(b"garbage")
7
>>> file.seek(0)
0
>>> GzipFile(fileobj=file).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/home/proj/python/cpython/Lib/gzip.py", line 461, in read
    if not self._read_gzip_header():
  File "/home/proj/python/cpython/Lib/gzip.py", line 409, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'ga')

This is a bit different to Issue 1508475. That one is about cases where the “gzip” trailer has been truncated, although the compressed data is probably intact. This case is the converse: extra data has been added.

All of the “gzip”, “bzip2” and XZ Utils (for LZMA) command-line decompressors happily extract the compressed data without an error exit status, but emit warning messages:

gzip: stdin: decompression OK, trailing garbage ignored
bzip2: (stdin): trailing garbage after EOF ignored
xz: (stdin): Unexpected end of input

In Python, the “bz2” and LZMA modules successfully extract the compressed data, and ignore the non-compressed garbage at the end without even a warning. On the other hand, the “gzip” module has special code to ignore trailing zero bytes (Issue 2846), but treats any other trailing non-gzip data as an error.

So I think a strong argument could be made for the ability to extract all the compressed data from a file even if there is garbage appended. The question is, how would this support be added? Perhaps the mechanism chosen could also be integrated with a fix for Issue 1508475. Some options:

* Silently ignore the condition by default like the other compression modules (consistent, but could silently swallow real errors)
* An optional new GzipFile(strict=False) mode
* Perhaps an exception deferred until close() is called
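
Until one of these options exists, the lenient behaviour can be approximated outside the gzip module. The following is only a sketch, assuming the whole file fits in memory; gunzip_ignoring_garbage is a hypothetical helper name, not an existing API. It uses zlib.decompressobj in gzip mode and its unused_data attribute to separate the compressed members from whatever follows them:

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def gunzip_ignoring_garbage(raw):
    """Decompress every gzip member at the start of raw.

    Returns (data, garbage), where garbage is whatever follows the last
    member. Hypothetical helper, not part of the gzip module.
    """
    if raw[:2] != GZIP_MAGIC:
        raise OSError("Not a gzipped file (%r)" % raw[:2])
    chunks = []
    rest = raw
    while rest[:2] == GZIP_MAGIC:
        # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(d.decompress(rest))
        if not d.eof:
            raise EOFError("compressed data ended before end of stream")
        # Bytes after the end of this member: next member or garbage
        rest = d.unused_data
    return b"".join(chunks), rest


# Same scenario as the reproduction above: valid member plus garbage
blob = gzip.compress(b"data") + b"garbage"
data, garbage = gunzip_ignoring_garbage(blob)
print(data, garbage)
```

Note that garbage which happens to begin with the gzip magic bytes would still be handed to zlib and raise zlib.error, so this is not a complete answer to the question of what the default should be.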
msg245368 - (view) Author: Nick Czeczulin (nczeczulin) Date: 2015-06-15 06:58
The spec allows for multi-member files. Some libraries and utilities seem to solve this problem (incorrectly?) by simply ignoring everything past the first member -- even when valid (e.g., DotNetZip, 7-Zip)

For 2.7 and 3.4, the data that has been decompressed but not yet read before the exception was raised is still available:

Modifying Martin's example slightly:

>>> f = BytesIO()
>>> with GzipFile(fileobj=f, mode="wb") as z:
...     z.write(b"data")
...
4
>>> f.write(b"garbage")
7
>>> f.seek(0)
0
>>> with GzipFile(fileobj=f, mode="rb") as z:
...     try:
...         z.read(1)
...         z.read()
...     except OSError as e:
...         z.extrabuf[z.offset - z.extrastart:]
...         e
...
b'd'
b'ata'
OSError('Not a gzipped file',)

My issue is that catching and handling this specific exception is a little more involved because there are 3(?) different OSErrors (IOError on 2.7) that could potentially be raised during the read. But mostly:
OSError('CRC check failed 0x447ba3f9 != 0x225cb2a3',) -- would be a bad one to mistake for it.

Maybe a specific Exception type to catch for an invalid header, and a better method to read the remaining buffer when handling it?
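
To make that suggestion concrete, here is a rough sketch of what a dedicated exception could look like. TrailingGarbageError and gunzip_strict are hypothetical names invented for illustration; nothing like them exists in the gzip module. The sketch decompresses with zlib directly, so a CRC failure surfaces as zlib.error and truncation as EOFError, and neither can be mistaken for the trailing-garbage case:

```python
import gzip
import zlib


class TrailingGarbageError(OSError):
    """Hypothetical: non-gzip bytes follow at least one valid gzip member."""

    def __init__(self, data, garbage):
        super().__init__("Not a gzipped file (%r)" % garbage[:2])
        self.data = data        # everything decompressed before the garbage
        self.garbage = garbage  # the raw trailing bytes


def gunzip_strict(raw):
    """Decompress all members; raise TrailingGarbageError on trailing data."""
    out = []
    rest = raw
    while rest:
        if rest[:2] != b"\x1f\x8b":
            if not out:
                # Nothing valid was read at all: not a gzip file
                raise OSError("Not a gzipped file (%r)" % rest[:2])
            raise TrailingGarbageError(b"".join(out), rest)
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        # A bad CRC or corrupt stream raises zlib.error here
        out.append(d.decompress(rest))
        if not d.eof:
            raise EOFError("compressed data ended before end of stream")
        rest = d.unused_data
    return b"".join(out)


try:
    gunzip_strict(gzip.compress(b"data") + b"garbage")
except TrailingGarbageError as e:
    recovered = e.data  # the caller gets the good data without buffer tricks
```

Carrying the decompressed data on the exception is one possible answer to the "better method to read the remaining buffer" part; whether that is appropriate for large files is an open question.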
msg245369 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-06-15 08:17
Just noticed in my previous message I mentioned Issue 1508475 a few times when I meant to say Issue 1159051.

In Python 3.5, a workaround is not so easy because we would need to access the internal buffer of a BufferedReader. One potential workaround is to use read1():

>>> z.read1(1)
b'd'
>>> z.read1()
b'ata'
>>> z.read1()
OSError: Not a gzipped file (b'ga')

The only practical way to allow for an exception and read() returning all the data is to defer the exception until close() is called. Another option might be to store a list of defects, similar to “email.message.Message.defects”.
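
A minimal sketch of how the deferred-exception and defects-list ideas could fit together, assuming the input is small enough to read in one go. LenientGzipReader is a hypothetical class written for illustration and is not how GzipFile is implemented: read() returns all the data it can, problems are collected in a defects list, and close() raises the first one.

```python
import gzip
import io
import zlib


class LenientGzipReader:
    """Hypothetical reader: read() never raises for trailing garbage;
    problems are recorded in .defects and reported by close()."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.defects = []

    def read(self):
        raw = self._fileobj.read()
        out = []
        while raw:
            if raw[:2] != b"\x1f\x8b":
                # Record the problem instead of raising immediately
                self.defects.append(
                    OSError("Not a gzipped file (%r)" % raw[:2]))
                break
            d = zlib.decompressobj(16 + zlib.MAX_WBITS)
            try:
                out.append(d.decompress(raw))
            except zlib.error as exc:
                # Output of the failing member is lost in this sketch
                self.defects.append(exc)
                break
            if not d.eof:
                self.defects.append(
                    EOFError("compressed data ended before end of stream"))
                break
            raw = d.unused_data
        return b"".join(out)

    def close(self):
        # Deferred error: raised only once the data has been handed over
        if self.defects:
            raise self.defects[0]


reader = LenientGzipReader(io.BytesIO(gzip.compress(b"data") + b"garbage"))
payload = reader.read()
print(payload, reader.defects)
```

A caller that does not care about trailing garbage can inspect defects and skip close(); a caller that does care gets the exception from close(), after the data has already been returned.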
History
Date                 User           Action  Args
2015-06-15 08:17:58  martin.panter  set     messages: + msg245369; components: + Library (Lib), - Extension Modules
2015-06-15 06:58:54  nczeczulin     set     nosy: + nczeczulin; messages: + msg245368
2015-05-28 00:26:40  martin.panter  set     nosy: + martin.panter; messages: + msg244230
2015-05-27 18:47:45  ned.deily      set     messages: + msg244214; nosy: + ned.deily; type: crash ->
2015-05-27 15:59:03  Ericg          create