
Author martin.panter
Recipients Ericg, martin.panter, ned.deily
Date 2015-05-28.00:26:39
Message-id <1432772800.88.0.542267139732.issue24301@psf.upfronthosting.co.za>
In-reply-to
Content
I suspect Eric’s file has non-zero, non-gzip garbage bytes appended to the end of it. Assuming I am right, here is a way to reproduce that scenario:

>>> from gzip import GzipFile
>>> from io import BytesIO
>>> file = BytesIO()
>>> with GzipFile(fileobj=file, mode="wb") as z:
...     z.write(b"data")
... 
4
>>> file.write(b"garbage")
7
>>> file.seek(0)
0
>>> GzipFile(fileobj=file).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/home/proj/python/cpython/Lib/gzip.py", line 461, in read
    if not self._read_gzip_header():
  File "/home/proj/python/cpython/Lib/gzip.py", line 409, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'ga')

This is a bit different to Issue 1508475. That one is about cases where the “gzip” trailer has been truncated, although the compressed data is probably intact. This case is the converse: extra data has been added.

All of the “gzip”, “bzip2” and XZ Utils (for LZMA) command-line decompressors happily extract the compressed data without an error exit status, but emit warning messages:

gzip: stdin: decompression OK, trailing garbage ignored
bzip2: (stdin): trailing garbage after EOF ignored
xz: (stdin): Unexpected end of input

In Python, the “bz2” and “lzma” modules successfully extract the compressed data, and ignore the non-compressed garbage at the end without even a warning. On the other hand, the “gzip” module has special code to ignore trailing zero bytes (Issue 2846), but treats any other trailing non-gzip data as an error.
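
For comparison, here is a rough sketch of the bz2 and lzma behaviour described above (assuming the current file-object implementations; output shown as I would expect it):

>>> import bz2, lzma
>>> from io import BytesIO
>>> # Valid bz2 stream followed by non-bz2 garbage; BZ2File stops quietly
>>> bz2.BZ2File(BytesIO(bz2.compress(b"data") + b"garbage")).read()
b'data'
>>> # Same story with LZMAFile
>>> lzma.LZMAFile(BytesIO(lzma.compress(b"data") + b"garbage")).read()
b'data'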

So I think a strong argument could be made for the ability to extract all the compressed data even if there is garbage appended. The question is, how would this support be added? Perhaps the mechanism chosen could also be integrated with a fix for Issue 1508475. Some options:

* Silently ignore the condition by default like the other compression modules (consistent, but could silently swallow real errors)
* An optional new GzipFile(strict=False) mode
* Perhaps an exception deferred until close() is called
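
In the meantime, it is already possible to dig the data out by dropping down to zlib. This is only a rough sketch to show the idea, not a proposal for the gzip API; it continues the session from the reproduction above:

>>> import zlib
>>> # wbits = 16 + MAX_WBITS tells zlib to expect a gzip wrapper
>>> decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
>>> decomp.decompress(file.getvalue())
b'data'
>>> decomp.unused_data  # everything after the first gzip member
b'garbage'

The zlib wrapper still verifies the CRC-32 and length fields in the gzip trailer, so the compressed data itself is checked; only the bytes after the end of the member are left untouched in unused_data.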