classification
Title: Gzip fails for file over 2**32 bytes
Type: behavior Stage: resolved
Components: Extension Modules, Library (Lib) Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: martin.panter Nosy List: Ben Cipollini, Matthew.Brett, martin.panter, nadeem.vawda, python-dev, serhiy.storchaka, twouters
Priority: normal Keywords: 3.5regression, patch

Created on 2015-11-14 19:48 by Ben Cipollini, last changed 2015-11-21 11:03 by martin.panter. This issue is now closed.

Files
File name Uploaded Description Edit
zlib-size_t.patch martin.panter, 2015-11-16 06:54 Cap zlib to UINT_MAX review
zlib-size_t.2.patch martin.panter, 2015-11-20 08:11 review
zlib-Py_ssize_t.patch martin.panter, 2015-11-20 12:51 review
zlib-Py_ssize_t.2.patch martin.panter, 2015-11-21 00:37 review
Messages (18)
msg254668 - (view) Author: Ben Cipollini (Ben Cipollini) Date: 2015-11-14 19:48
Gzip fails when opening a file more than 2**32 bytes. This is a new issue in Python 3.5.

We hit this opening large neuroimaging files from the Human Connectome Project. See https://github.com/nipy/nibabel/issues/362 for more details.

When size is > 2**32, we get the following error on Python 3.5:

/usr/lib/python3.5/gzip.py in read(self, size)
    467             buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
    468 
--> 469             uncompress = self._decompressor.decompress(buf, size)
    470             if self._decompressor.unconsumed_tail != b"":
    471                 self._fp.prepend(self._decompressor.unconsumed_tail)

OverflowError: Python int too large for C unsigned int
msg254684 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-15 11:13
Thanks for the report. Can you confirm if this demo illustrates your problem? For me, I only have 2 GiB of memory so I get a MemoryError, which seems reasonable for my situation.

from gzip import GzipFile
from io import BytesIO
file = BytesIO()
writer = GzipFile(fileobj=file, mode="wb")
writer.write(b"data")
writer.close()
file.seek(0)
reader = GzipFile(fileobj=file, mode="rb")
data = reader.read(2**32)  # Ideally this should return b"data"

Assuming that triggers the OverflowError, then the heart of the problem is that the zlib.decompressobj.decompress() method does not accept such large numbers for the length limit:

>>> import zlib
>>> decompressor = zlib.decompressobj(wbits=16 + 15)
>>> decompressor.decompress(file.getvalue(), 2**32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large for C unsigned int

I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library. Would this be okay for a 3.5.1 bug fix release, or would it be considered a feature change?

Failing that, another option would be to cap the limit in the gzip library, and just document the zlib limitation. I already have a patch in Issue 23200 documenting another quirk when max_length=0.

The same problem may also apply to the LZMA and bzip2 modules; I need to check.
msg254686 - (view) Author: Ben Cipollini (Ben Cipollini) Date: 2015-11-15 12:20
@Martin: yes, that reproduces the problem.

>I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library.

If this cap is implemented, is there any workaround how we can efficiently open gzipped files over 4GB? Such files exist, exist in high-priority, government-funded datasets, and neuroinformatics in Python requires a way to open them efficiently.

>another option would be to cap the limit in the gzip library

Again, these limitations are new in Python 3.5 and currently block us from using Python 3.5 to run neuroinformatics efficiently. Is there any chance to revert this new behavior, or any known memory-efficient workaround?
msg254717 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-16 06:54
Ben: By adding the cap, it means when the lower-level zlib module is asked to decode say 5 GiB, it will decode at most 4 GiB (rather than raising an exception). The 4 GiB limit (UINT_MAX to be precise) is due to the underlying zlib library; it’s not Python’s doing, and the limit was present before 3.5 as well.

In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it will actually feed the decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data, which will decompress to at most 8 MB or so (max compression ratio is 1:1032). Then it will feed the next 8 KiB chunk. So it does not matter whether the limit is 4 or 5 GiB because there can only be 8ish MB data per internal decompress() call.

Before 3.5, it first feeds a 1 KiB chunk, but exponentially increases the chunk size up to 10 MiB in each internal iteration.

If you think the new 3.5 implementation (with OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical. Usually I find once you process 10s or 100s of kB of data at once, any increases in the buffer size have negligible effect on the little time spent in native Python code, and times actually get worse as your buffers get too big, probably due to ineffective use of fast memory caching.

===

I propose this patch for the 3.5 branch, which implements the internal capping in the zlib module. It changes the zlib module to accept sizes greater than UINT_MAX, but internally limits them. I think this is the most practical way forward, because it does not require Python code to know the value of UINT_MAX (which could in theory vary by platform).

I would be good to get a review of my patch. In particular, is this change okay for the 3.5 maintainence branch? It would also be good to confirm that my patch fixes the original problem, i.e. read(2**32) works on computers with more than 4 GiB of memory (which I cannot test).

This regression is caused by Issue 23529, where I make sure that GzipFile.read(size) doesn’t buffer huge amounts of data from a decompression bomb.
msg254727 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-16 12:14
Is 3.4 affected? A uint_converter was added in issue18294.
msg254754 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-16 20:42
3.4 shouldn’t be affected by the gzip regression. In 3.4 the gzip module does not set max_length to limit the decompression size.

In 3.4 the underlying zlib module will also raise OverflowError if max_lenth=2**32, and my patch could be applied, but I don’t think it is worth it at this stage.
msg254865 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-19 00:16
I am inclined to commit my patch to get this fixed in the upcoming 3.5.1 release, unless anyone thinks that could be a problem.
msg254884 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-19 08:08
Could you please add tests for other two functions? And tests for the gzip module?
msg254887 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-19 08:34
Okay. For the gzip module, I cannot easily test this myself. Quickly looking at other cases, I guess it would look something like this, but I need to spend some time understanging the bigmemtest decorator properly:

@unittest.skipIf(sys.maxsize < _4G, "Requires non-32-bit system")
@test.support.bigmemtest(_4G, 1, dry_run=False)
def test_large_read(self, size):
    ...
    data = reader.read(size)  # Should not raise OverflowError
    self.assertEqual(data, b"data")
msg254891 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-19 09:28
You can commit your patch right now (it shouldn't make things worse) and add new tests later.
msg254955 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-20 08:11
Actually it did make an existing bug a bit worse. The bug is triggered with valid sizes greater than LONG_MAX, due to an exception not being cleared, and my change made this bug more dramatic, turning the OverflowError into a crash. This new patch should fix that.

I’m now looking at adding the other tests that require allocating lots of memory.
msg254963 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-20 09:45
Ah, I found yet one bug.

>>> zlib.decompress(zlib.compress(b'abcd'), 0, sys.maxsize+1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: Negative size passed to PyBytes_FromStringAndSize

There are similar bugs in decompressor's methods decompress() and flush() but it is hard to reproduce them.

The maximal length value should be bounded not only with UINT_MAX, but with PY_SSIZE_T_MAX too.

I would merge sval and uval into one variable of type Py_ssize_t.
msg254972 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-20 12:51
Thanks for your testing Serhiy. I can reproduce this on 32-bit Linux (also by building with CC="gcc -m32"). It is easy to produce the bug with flush(), but with Decompress.decompress() it would need over 2 GiB of memory and data to expand the buffer enough to trigger the error.

>>> zlib.decompressobj().flush(sys.maxsize + 1)
SystemError: Negative size passed to PyBytes_FromStringAndSize

In zlib-Py_ssize_t.patch, I added the bigmem tests, but I am not able to actually test them. The gzip test would require 8 GiB of memory. I also changed the converter to Py_ssize_t to fix this latest bug.
msg254976 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-20 14:23
Added new comments.
msg255032 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-21 00:37
New patch addressing comments. I used the undocumented API _PyLong_FromNbInt() to call __int__() rather than __index__().
msg255044 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-21 07:00
Thank you Martin, the patch LGTM.

The difference between __int__() and __index__() is that float has the former, but not the latter. For better compatibility we have to use __int__() in maintained releases. But obviously __index__() is more appropriate semantically. float buffer length doesn't make sense. We have to change __int__ to __index__ in future releases. But this is separate issue.
msg255048 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-11-21 10:59
New changeset 80f6f77a7cc3 by Martin Panter in branch '3.5':
Issue #25626: Change zlib to accept Py_ssize_t and cap to UINT_MAX
https://hg.python.org/cpython/rev/80f6f77a7cc3

New changeset afa1b6cd77a5 by Martin Panter in branch 'default':
Issue #25626: Merge zlib fix from 3.5
https://hg.python.org/cpython/rev/afa1b6cd77a5

New changeset 5117f0b3be09 by Martin Panter in branch 'default':
Issue #25626: Add news to 3.6 section
https://hg.python.org/cpython/rev/5117f0b3be09
msg255049 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-21 11:03
Thank you Serhiy for you help, and Ben for reporting this.
History
Date User Action Args
2015-11-21 11:03:41martin.pantersetstatus: open -> closed
resolution: fixed
messages: + msg255049

stage: commit review -> resolved
2015-11-21 10:59:44python-devsetnosy: + python-dev
messages: + msg255048
2015-11-21 07:00:59serhiy.storchakasetassignee: martin.panter
stage: patch review -> commit review
2015-11-21 07:00:18serhiy.storchakasetmessages: + msg255044
2015-11-21 00:37:36martin.pantersetfiles: + zlib-Py_ssize_t.2.patch

messages: + msg255032
2015-11-20 14:23:12serhiy.storchakasetmessages: + msg254976
2015-11-20 12:51:35martin.pantersetfiles: + zlib-Py_ssize_t.patch

messages: + msg254972
2015-11-20 09:45:44serhiy.storchakasetmessages: + msg254963
2015-11-20 08:11:42martin.pantersetfiles: + zlib-size_t.2.patch

messages: + msg254955
2015-11-19 09:28:28serhiy.storchakasetmessages: + msg254891
2015-11-19 08:34:26martin.pantersetmessages: + msg254887
2015-11-19 08:08:18serhiy.storchakasetmessages: + msg254884
2015-11-19 00:16:58martin.pantersetmessages: + msg254865
components: + Extension Modules
2015-11-16 20:42:43martin.pantersetmessages: + msg254754
2015-11-16 12:14:39serhiy.storchakasetmessages: + msg254727
2015-11-16 07:02:21serhiy.storchakasetnosy: + serhiy.storchaka
2015-11-16 06:54:40martin.pantersetfiles: + zlib-size_t.patch
keywords: + patch
messages: + msg254717

stage: patch review
2015-11-15 12:20:34Ben Cipollinisetmessages: + msg254686
2015-11-15 11:13:22martin.pantersetnosy: + martin.panter
messages: + msg254684

keywords: + 3.5regression
type: crash -> behavior
2015-11-14 20:16:04SilentGhostsetnosy: + twouters, nadeem.vawda

components: + Library (Lib)
versions: + Python 3.6
2015-11-14 19:55:00Matthew.Brettsetnosy: + Matthew.Brett
2015-11-14 19:48:21Ben Cipollinicreate