classification
Title: Handle corrupted gzip files with unexpected EOF
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ajaksu2, benjamin.peterson, calvin, doko, georg.brandl, larry, loewis, nadeem.vawda, python-dev, serhiy.storchaka, yeeeev
Priority: normal Keywords: patch

Created on 2005-03-08 13:57 by calvin, last changed 2013-05-12 10:47 by python-dev.

Files
File name Uploaded Description Edit
t.gz calvin, 2005-03-08 13:57 test gzip file with missing/corrupted CRC checksum
t.py calvin, 2005-03-08 13:58 test script
gzip_eof.diff calvin, 2005-03-08 14:00 patch
test_gzip_error.py calvin, 2007-10-25 20:58
gzip_eof-3.4.patch serhiy.storchaka, 2013-01-14 19:03 Patch for 3.4 review
gzip_eof-3.4_2.patch serhiy.storchaka, 2013-01-20 11:51 review
lzma_deferred_error.patch serhiy.storchaka, 2013-01-24 07:43 Defer the error if it is possible to return some data review
Messages (20)
msg47892 - (view) Author: Bastian Kleineidam (calvin) Date: 2005-03-08 13:57
The GzipFile algorithm crashes when reading a corrupted
.gz file (attached as t.gz) with a missing CRC checksum
at the end.
Tested with python2.3, python2.4 and CVS python on a
Debian Linux system.
$ python2.3 t.py
Traceback (most recent call last):
  File "t.py", line 4, in ?
    print gzip.GzipFile('', 'rb', 9, fileobj).read()
  File "/usr/lib/python2.3/gzip.py", line 217, in read
    self._read(readsize)
  File "/usr/lib/python2.3/gzip.py", line 289, in _read
    self._read_eof()
  File "/usr/lib/python2.3/gzip.py", line 305, in _read_eof
    crc32 = read32(self.fileobj)
  File "/usr/lib/python2.3/gzip.py", line 40, in read32
    return struct.unpack("<l", input.read(4))[0]
struct.error: unpack str size does not match format

The attached patch (against current CVS) tries to cope
with this situation by
a) detecting the missing data by examining the rewind
value and
b) assuming that EOF is reached and returning the
buffered uncompressed data (by raising EOFError)

For history I encountered this kind of bug when
downloading HTML pages with Content-Encoding: gzip. It
seems some versions of the mod_gzip Apache module are
producing corrupted gzip data.
msg47893 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-03-06 13:02
I don't think it is right to treat this as EOF. I haven't fully understood the problem (can you explain the format error encountered: what is expected, what is sent instead?), however, I notice that gzip itself also complains about an "unexpected EOF" error, so I don't think we should silently decompress the file. Producing an exception indicating the proper problem and including the data received so far in it would be more appropriate.

Also, can you provide an example file for the test suite that isn't copyrighted by somebody else?
msg56754 - (view) Author: Bastian Kleineidam (calvin) Date: 2007-10-25 20:58
Here is a new test script that works with simple strings and no file
objects. It reproduces the error by cutting off the last two bytes of
the GZIP data.

The resulting struct error is due to the read() methods missing a check
that the requested amount of data is actually returned. In this case
read(4) returned 2 bytes instead of 4, and the struct raises an error.

I think the easiest way to handle this is to introduce a
read_save(fileobj, size) method that checks that the read() data is of
the requested size, else raise an error (perhaps an IOError?).

btw: you can remove the t.{gz,py} files, the test_gzip_error.py replaces
them.
msg82185 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-02-15 23:41
Confirmed on trunk with test_gzip_error.py:

struct.error: unpack requires a string argument of length 4
msg143719 - (view) Author: Yoav Weiss (yeeeev) Date: 2011-09-08 08:04
What is the reason that the currently submitted patch is not good enough and current stage is "needs patch"?
The current patch seem to solve this issue, which is a very common one when dealing with gzip files coming from the Internet.
In any case, an indication on *why* the current patch is not good enough will help create a better patch that may be good enough.
msg179969 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-14 19:03
At the moment gzip can raise two errors on unexpected EOF: struct.error from struct.unpack() or TypeError from ord(). Both bz2 and lzma raise EOFError in such cases.

The proposed patch converts both truncated gzip errors to EOFError as for bz2 and lzma. Added similar tests for gzip, bz2 and lzma.

I doubt about backward compatibility. It's obvious that struct.error and TypeError are unintentional, and EOFError is purposed for this case. However users can catch undocumented but de facto exceptions and doesn't expect EOFError.
msg180265 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2013-01-19 21:57
I've reviewed the patch and posted some comments on Rietveld.


> I doubt about backward compatibility. It's obvious that struct.error and TypeError are unintentional, and EOFError is purposed for this case. However users can catch undocumented but de facto exceptions and doesn't expect EOFError.

I think it's fine for us to change it to raise EOFError in these cases.
msg180288 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-20 11:51
Here is an updated patch addressing Nadeem Vawda's comments. Thank you.
msg180289 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2013-01-20 12:00
The updated patch looks good to me.
msg180399 - (view) Author: Roundup Robot (python-dev) Date: 2013-01-22 15:17
New changeset 174332b89a0d by Serhiy Storchaka in branch '3.2':
Issue #1159051: GzipFile now raises EOFError when reading a corrupted file
http://hg.python.org/cpython/rev/174332b89a0d

New changeset 87171e88847b by Serhiy Storchaka in branch '3.3':
Issue #1159051: GzipFile now raises EOFError when reading a corrupted file
http://hg.python.org/cpython/rev/87171e88847b

New changeset f2f947cdc5fe by Serhiy Storchaka in branch 'default':
Issue #1159051: GzipFile now raises EOFError when reading a corrupted file
http://hg.python.org/cpython/rev/f2f947cdc5fe

New changeset 214d8909513d by Serhiy Storchaka in branch '2.7':
Issue #1159051: GzipFile now raises EOFError when reading a corrupted file
http://hg.python.org/cpython/rev/214d8909513d
msg180513 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-24 07:43
Actually previous patch doesn't fix original problem, it only ensure that GzipFile consistent with BZ2File and LZMAFile.

To fix original problem we need other patch, and this patch looks as new feature for 3.4. Here is a sample patch for LZMAFile. BZ2File patch will be similar, and GzipFile patch will be more different and complex.

Now error doesn't raised immediately when read the file unexpectedly ended if some data can be read. Instead maximal possible part of read data returned and exception raising deferred to next read (see tests).

Perhaps we need a new flag for constructor or for read() which enables a new behavior (what will be a good name for this?). Or we can use a special value for size argument which means "read to the end as much as possible" (we can differentiate the behavior for size<0 and size=None). Unconditional enabling a new behavior for size >=0 is safe.
msg181237 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2013-02-03 01:32
I think the new behavior should be controlled by a constructor flag, maybe
named "defer_errors". I don't like the idea of adding the flag to read(),
since that makes us diverge from the standard file interface. Making a
distinction between size<0 and size=None seems confusing and error-prone,
not to mention that we (again) would have read() work differently from most
other file classes.

I'd prefer it if the new behavior is not enabled by default for size>=0,
even if this wouldn't break well-behaved code. Having a flag that only
controls the size<0 case is inelegant, and I don't think we should change
the default behavior unless there is a clear benefit to doing so.
msg184259 - (view) Author: Matthias Klose (doko) * (Python committer) Date: 2013-03-15 21:23
this change breaks a test case in the bzr testsuite; will try to get to it next week. See https://launchpad.net/bugs/1116079
msg184345 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-16 19:46
tuned_gzip does dangerous things, it overloads private methods of GzipFile.

From Bazaar 2.3 Release Notes:

* Stop using ``bzrlib.tuned_gzip.GzipFile``. It is incompatible with
  python-2.7 and was only used for Knit format repositories, which haven't
  been recommended since 2007. The file itself will be removed in the next
  release. (John Arbash Meinel)

Current version is 2.6b2. bzrlib.tuned_gzip.GzipFile should be removed two releases ago.
msg184688 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-19 21:58
I will be offline some time. Feel free to revert these changes in 2.7-3.3 if it 
is necessary.
msg185555 - (view) Author: Matthias Klose (doko) * (Python committer) Date: 2013-03-30 13:00
another test case failure with this patch:
https://launchpad.net/ubuntu/+archive/test-rebuild-20130329/+build/4416983

reproducible with feedparser 5.1.3 from pypi, on x86 (but not x86_64).

ERROR: test_gzip_struct_error (__main__.TestCompression)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./feedparsertest.py", line 433, in test_gzip_struct_error
    f = feedparser.parse('http://localhost:8097/tests/compression/gzip-struct-error.gz')
  File "/build/buildd/feedparser-5.1.2/feedparser/feedparser.py", line 3836, in parse
    data = gzip.GzipFile(fileobj=_StringIO(data)).read()
  File "/usr/lib/python2.7/gzip.py", line 253, in read
    while self._read(readsize):
  File "/usr/lib/python2.7/gzip.py", line 323, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 340, in _read_eof
    crc32, isize = struct.unpack("<II", self._read_exact(8))
  File "/usr/lib/python2.7/gzip.py", line 189, in _read_exact
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

----------------------------------------------------------------------
Ran 4237 tests in 5.190s

FAILED (errors=1)
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 43939)
msg188313 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-03 17:39
In both cases broken applications use the undocumented implementation details. But such changes are extremely strong for bugfix release and they should not be done without a special need. I propose to revert these changes in 2.7, 3.2 and 3.3 (possibly leaving in the default branch). Unfortunately, I was not online right before the release of the latest bugfix release and failed to do this. Fortunately, it is now possible to fix this in regression fix releases.
msg188946 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-11 19:08
New changeset abc780332b60 by Benjamin Peterson in branch '2.7':
backout 214d8909513d for regressions (#1159051)
http://hg.python.org/cpython/rev/abc780332b60
msg189002 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-12 10:32
New changeset 854ba6f414a8 by Georg Brandl in branch '3.2':
Issue #1159051: Back out a fix for handling corrupted gzip files that
http://hg.python.org/cpython/rev/854ba6f414a8
msg189010 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-12 10:47
New changeset 9c2831fe84e9 by Georg Brandl in branch '3.3':
Back out patch for #1159051, which caused backwards compatibility problems.
http://hg.python.org/cpython/rev/9c2831fe84e9

New changeset 5400e8fbc1de by Georg Brandl in branch 'default':
null-merge reversion of #1159051 patch from 3.3
http://hg.python.org/cpython/rev/5400e8fbc1de
History
Date User Action Args
2013-05-12 10:47:24python-devsetmessages: + msg189010
2013-05-12 10:46:25georg.brandlsetversions: - Python 2.7, Python 3.2, Python 3.3
2013-05-12 10:46:19georg.brandlsetpriority: release blocker -> normal
2013-05-12 10:32:41python-devsetmessages: + msg189002
2013-05-11 19:08:39python-devsetmessages: + msg188946
2013-05-03 17:39:58serhiy.storchakasetpriority: normal -> release blocker
versions: + Python 2.7, Python 3.2, Python 3.3
nosy: + benjamin.peterson, georg.brandl

messages: + msg188313

type: enhancement -> behavior
2013-03-30 13:00:16dokosetmessages: + msg185555
2013-03-19 21:58:04serhiy.storchakasetmessages: + msg184688
2013-03-19 21:06:33dokosetpriority: release blocker -> normal
2013-03-16 19:46:48serhiy.storchakasetmessages: + msg184345
2013-03-15 21:23:08dokosetpriority: normal -> release blocker
nosy: + larry, doko
messages: + msg184259

2013-02-03 01:32:54nadeem.vawdasetmessages: + msg181237
2013-01-24 07:43:51serhiy.storchakasetfiles: + lzma_deferred_error.patch
type: behavior -> enhancement
messages: + msg180513

versions: - Python 2.7, Python 3.2, Python 3.3
2013-01-22 15:17:01python-devsetnosy: + python-dev
messages: + msg180399
2013-01-20 12:00:13nadeem.vawdasetmessages: + msg180289
2013-01-20 11:51:52serhiy.storchakasetfiles: + gzip_eof-3.4_2.patch

messages: + msg180288
2013-01-19 21:57:12nadeem.vawdasetmessages: + msg180265
2013-01-14 19:03:15serhiy.storchakasetfiles: + gzip_eof-3.4.patch
versions: + Python 3.3, Python 3.4, - Python 3.1
nosy: + nadeem.vawda

messages: + msg179969

stage: needs patch -> patch review
2013-01-13 20:31:38serhiy.storchakasetassignee: serhiy.storchaka

nosy: + serhiy.storchaka
2011-09-08 08:04:03yeeeevsetnosy: + yeeeev
messages: + msg143719
2010-08-21 12:35:59BreamoreBoysetstage: needs patch
type: behavior
versions: + Python 3.1, Python 2.7, Python 3.2
2009-02-15 23:41:25ajaksu2setnosy: + ajaksu2
messages: + msg82185
2007-10-25 20:58:44calvinsetfiles: + test_gzip_error.py
messages: + msg56754
2005-03-08 13:57:41calvincreate