
Gzip fails for file over 2**32 bytes #69812

Closed
BenCipollini mannequin opened this issue Nov 14, 2015 · 18 comments
Labels
  • extension-modules: C modules in the Modules dir
  • stdlib: Python modules in the Lib dir
  • type-bug: An unexpected behavior, bug, or error

Comments

@BenCipollini
Mannequin

BenCipollini mannequin commented Nov 14, 2015

BPO 25626
Nosy @Yhg1s, @vadmium, @serhiy-storchaka
Files
  • zlib-size_t.patch: Cap zlib to UINT_MAX
  • zlib-size_t.2.patch
  • zlib-Py_ssize_t.patch
  • zlib-Py_ssize_t.2.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/vadmium'
    closed_at = <Date 2015-11-21.11:03:41.734>
    created_at = <Date 2015-11-14.19:48:21.730>
    labels = ['extension-modules', 'type-bug', 'library']
    title = 'Gzip fails for file over 2**32 bytes'
    updated_at = <Date 2015-11-21.11:03:41.732>
    user = 'https://bugs.python.org/BenCipollini'

    bugs.python.org fields:

    activity = <Date 2015-11-21.11:03:41.732>
    actor = 'martin.panter'
    assignee = 'martin.panter'
    closed = True
    closed_date = <Date 2015-11-21.11:03:41.734>
    closer = 'martin.panter'
    components = ['Extension Modules', 'Library (Lib)']
    creation = <Date 2015-11-14.19:48:21.730>
    creator = 'Ben Cipollini'
    dependencies = []
    files = ['41053', '41090', '41096', '41107']
    hgrepos = []
    issue_num = 25626
    keywords = ['patch', '3.5regression']
    message_count = 18.0
    messages = ['254668', '254684', '254686', '254717', '254727', '254754', '254865', '254884', '254887', '254891', '254955', '254963', '254972', '254976', '255032', '255044', '255048', '255049']
    nosy_count = 7.0
    nosy_names = ['twouters', 'nadeem.vawda', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'Matthew.Brett', 'Ben Cipollini']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25626'
    versions = ['Python 3.5', 'Python 3.6']

    @BenCipollini
    Mannequin Author

    BenCipollini mannequin commented Nov 14, 2015

Gzip fails when opening a file of more than 2**32 bytes. This is a new issue in Python 3.5.

We hit this when opening large neuroimaging files from the Human Connectome Project. See nipy/nibabel#362 for more details.

    When size is > 2**32, we get the following error on Python 3.5:

    /usr/lib/python3.5/gzip.py in read(self, size)
    467 buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
    468
    --> 469 uncompress = self._decompressor.decompress(buf, size)
    470 if self._decompressor.unconsumed_tail != b"":
    471 self._fp.prepend(self._decompressor.unconsumed_tail)

    OverflowError: Python int too large for C unsigned int

@BenCipollini BenCipollini mannequin added the type-crash (A hard crash of the interpreter, possibly with a core dump) label Nov 14, 2015
@SilentGhost SilentGhost mannequin added the stdlib (Python modules in the Lib dir) label Nov 14, 2015
    @vadmium
    Member

    vadmium commented Nov 15, 2015

Thanks for the report. Can you confirm whether this demo illustrates your problem? I only have 2 GiB of memory, so I get a MemoryError, which seems reasonable for my situation.

    from gzip import GzipFile
    from io import BytesIO
    file = BytesIO()
    writer = GzipFile(fileobj=file, mode="wb")
    writer.write(b"data")
    writer.close()
    file.seek(0)
    reader = GzipFile(fileobj=file, mode="rb")
    data = reader.read(2**32)  # Ideally this should return b"data"

    Assuming that triggers the OverflowError, then the heart of the problem is that the zlib.decompressobj.decompress() method does not accept such large numbers for the length limit:

    >>> import zlib
    >>> decompressor = zlib.decompressobj(wbits=16 + 15)
    >>> decompressor.decompress(file.getvalue(), 2**32)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: Python int too large for C unsigned int

    I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library. Would this be okay for a 3.5.1 bug fix release, or would it be considered a feature change?

    Failing that, another option would be to cap the limit in the gzip library, and just document the zlib limitation. I already have a patch in bpo-23200 documenting another quirk when max_length=0.

    The same problem may also apply to the LZMA and bzip2 modules; I need to check.
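
The same capping idea can also be sketched at the caller level, which doubles as a workaround on an affected 3.5.0: clamp each read request to 2**32 - 1 and loop until the requested amount or EOF is reached. This is only an illustration; the read_capped helper and the assumption that UINT_MAX equals 2**32 - 1 are mine, not part of gzip.

_UINT_MAX = 2**32 - 1  # assumed value of the platform's UINT_MAX

def read_capped(reader, size):
    # Hypothetical helper: read up to `size` bytes from an open GzipFile,
    # never asking for more than UINT_MAX bytes in a single read() call.
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = reader.read(min(remaining, _UINT_MAX))
        if not chunk:  # EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Example (hypothetical): data = read_capped(gzip.GzipFile("huge.nii.gz"), 2**33)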

@vadmium vadmium added type-bug (An unexpected behavior, bug, or error) and removed type-crash (A hard crash of the interpreter, possibly with a core dump) labels Nov 15, 2015
    @BenCipollini
    Mannequin Author

    BenCipollini mannequin commented Nov 15, 2015

    @martin: yes, that reproduces the problem.

    I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library.

If this cap is implemented, is there any workaround that lets us efficiently open gzipped files over 4 GB? Such files exist in high-priority, government-funded datasets, and neuroinformatics in Python requires a way to open them efficiently.

    another option would be to cap the limit in the gzip library

Again, these limitations are new in Python 3.5 and currently block us from running neuroinformatics workloads efficiently on Python 3.5. Is there any chance of reverting this new behavior, or is there any known memory-efficient workaround?

    @vadmium
    Member

    vadmium commented Nov 16, 2015

Ben: with the cap added, when the lower-level zlib module is asked to decode, say, 5 GiB, it will decode at most 4 GiB rather than raise an exception. The 4 GiB limit (UINT_MAX to be precise) is due to the underlying zlib library; it's not Python's doing, and the limit was present before 3.5 as well.

In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it will actually feed the decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data, which will decompress to at most 8 MB or so (the maximum compression ratio is 1:1032). Then it will feed the next 8 KiB chunk. So it does not matter whether the limit is 4 or 5 GiB, because each internal decompress() call can only produce about 8 MB of data.

Before 3.5, it first feeds a 1 KiB chunk, and exponentially increases the chunk size, up to 10 MiB, in each internal iteration.

If you think the new 3.5 implementation (with the OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical: usually I find that once you process tens or hundreds of kB of data at once, further increases in buffer size have a negligible effect on the little time spent in native Python code, and times actually get worse as buffers become too big, probably due to ineffective use of fast memory caching.
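
To make that concrete, here is a rough sketch of the 3.5 read path. It is simplified: the real code in gzip.py also handles the unconsumed tail, CRC checking and multi-member files, and the function name read_sketch is just for illustration.

import io
import zlib

def read_sketch(fp, size):
    # Simplified view of GzipFile.read(size) in 3.5: feed fixed 8 KiB
    # compressed chunks and pass `size` only as zlib's max_length cap.
    decompressor = zlib.decompressobj(wbits=16 + 15)  # expects a gzip header
    output = []
    while size > 0:
        buf = fp.read(io.DEFAULT_BUFFER_SIZE)  # 8 KiB of compressed input
        if not buf:
            break
        # 8 KiB of input expands to at most ~8 MiB (1:1032), so whether the
        # cap is 4 GiB or 5 GiB makes no difference to any single call.
        data = decompressor.decompress(buf, size)
        output.append(data)
        size -= len(data)
    return b"".join(output)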

    ===

    I propose this patch for the 3.5 branch, which implements the internal capping in the zlib module. It changes the zlib module to accept sizes greater than UINT_MAX, but internally limits them. I think this is the most practical way forward, because it does not require Python code to know the value of UINT_MAX (which could in theory vary by platform).

It would be good to get a review of my patch. In particular, is this change okay for the 3.5 maintenance branch? It would also be good to confirm that my patch fixes the original problem, i.e. that read(2**32) works on computers with more than 4 GiB of memory (which I cannot test).

    This regression is caused by bpo-23529, where I make sure that GzipFile.read(size) doesn’t buffer huge amounts of data from a decompression bomb.
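
Assuming the patch behaves as described, the lower-level reproducer from earlier in the thread should then succeed rather than raise. A quick sketch of the expected post-patch behaviour (not code taken from the patch itself):

import gzip
import zlib

compressed = gzip.compress(b"data")
decompressor = zlib.decompressobj(wbits=16 + 15)
# With the internal cap, a max_length larger than UINT_MAX is accepted and
# silently limited to UINT_MAX instead of raising OverflowError.
assert decompressor.decompress(compressed, 2**32) == b"data"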

    @serhiy-storchaka
    Member

    Is 3.4 affected? A uint_converter was added in bpo-18294.

    @vadmium
    Member

    vadmium commented Nov 16, 2015

    3.4 shouldn’t be affected by the gzip regression. In 3.4 the gzip module does not set max_length to limit the decompression size.

In 3.4 the underlying zlib module will also raise OverflowError if max_length=2**32, and my patch could be applied, but I don’t think it is worth it at this stage.

    @vadmium
    Member

    vadmium commented Nov 19, 2015

    I am inclined to commit my patch to get this fixed in the upcoming 3.5.1 release, unless anyone thinks that could be a problem.

    @vadmium vadmium added the extension-modules C modules in the Modules dir label Nov 19, 2015
    @serhiy-storchaka
    Member

Could you please add tests for the other two functions? And tests for the gzip module?

    @vadmium
    Member

    vadmium commented Nov 19, 2015

Okay. For the gzip module, I cannot easily test this myself. Quickly looking at other cases, I guess it would look something like this, but I need to spend some time understanding the bigmemtest decorator properly:

@unittest.skipIf(sys.maxsize < _4G, "Requires non-32-bit system")
@test.support.bigmemtest(_4G, 1, dry_run=False)
def test_large_read(self, size):
    ...
    data = reader.read(size)  # Should not raise OverflowError
    self.assertEqual(data, b"data")
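
A fleshed-out version might look like the following. This is a guess at how such a test could be written, not the test that was eventually committed; the class name and test body are illustrative, and bigmemtest skips the test unless regrtest is told (via -M) that enough memory is available.

import gzip
import io
import sys
import unittest
from test.support import bigmemtest, _4G

class GzipLargeReadTest(unittest.TestCase):
    @unittest.skipIf(sys.maxsize < _4G, "Requires a 64-bit build")
    @bigmemtest(_4G, memuse=1, dry_run=False)
    def test_large_read(self, size):
        # Write a tiny gzip member, then ask read() for more than 4 GiB;
        # before the fix this raised OverflowError inside zlib.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as writer:
            writer.write(b"data")
        buf.seek(0)
        with gzip.GzipFile(fileobj=buf, mode="rb") as reader:
            data = reader.read(size)  # should not raise OverflowError
        self.assertEqual(data, b"data")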

    @serhiy-storchaka
    Member

    You can commit your patch right now (it shouldn't make things worse) and add new tests later.

    @vadmium
    Member

    vadmium commented Nov 20, 2015

    Actually it did make an existing bug a bit worse. The bug is triggered with valid sizes greater than LONG_MAX, due to an exception not being cleared, and my change made this bug more dramatic, turning the OverflowError into a crash. This new patch should fix that.

    I’m now looking at adding the other tests that require allocating lots of memory.

    @serhiy-storchaka
    Member

Ah, I found yet another bug.

    >>> zlib.decompress(zlib.compress(b'abcd'), 0, sys.maxsize+1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    SystemError: Negative size passed to PyBytes_FromStringAndSize

There are similar bugs in the decompressor's decompress() and flush() methods, but they are harder to reproduce.

The maximum length value should be bounded not only by UINT_MAX, but by PY_SSIZE_T_MAX too.

    I would merge sval and uval into one variable of type Py_ssize_t.

    @vadmium
    Member

    vadmium commented Nov 20, 2015

Thanks for your testing, Serhiy. I can reproduce this on 32-bit Linux (also by building with CC="gcc -m32"). It is easy to trigger the bug with flush(), but with Decompress.decompress() it would need over 2 GiB of memory and data to expand the buffer enough to trigger the error.

    >>> zlib.decompressobj().flush(sys.maxsize + 1)
    SystemError: Negative size passed to PyBytes_FromStringAndSize

    In zlib-Py_ssize_t.patch, I added the bigmem tests, but I am not able to actually test them. The gzip test would require 8 GiB of memory. I also changed the converter to Py_ssize_t to fix this latest bug.

    @serhiy-storchaka
    Member

Added new review comments.

    @vadmium
    Member

    vadmium commented Nov 21, 2015

    New patch addressing comments. I used the undocumented API _PyLong_FromNbInt() to call __int__() rather than __index__().

    @serhiy-storchaka
    Member

    Thank you Martin, the patch LGTM.

The difference between __int__() and __index__() is that float has the former but not the latter. For better compatibility we have to use __int__() in maintained releases, but obviously __index__() is more appropriate semantically: a float buffer length doesn't make sense. We should change __int__ to __index__ in future releases, but that is a separate issue.
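
For reference, the distinction in standard Python behaviour (operator.index() is the Python-level face of __index__):

>>> import operator
>>> (2.0).__int__()      # float defines __int__ ...
2
>>> operator.index(2.0)  # ... but not __index__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer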

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 21, 2015

    New changeset 80f6f77a7cc3 by Martin Panter in branch '3.5':
    Issue bpo-25626: Change zlib to accept Py_ssize_t and cap to UINT_MAX
    https://hg.python.org/cpython/rev/80f6f77a7cc3

    New changeset afa1b6cd77a5 by Martin Panter in branch 'default':
    Issue bpo-25626: Merge zlib fix from 3.5
    https://hg.python.org/cpython/rev/afa1b6cd77a5

    New changeset 5117f0b3be09 by Martin Panter in branch 'default':
    Issue bpo-25626: Add news to 3.6 section
    https://hg.python.org/cpython/rev/5117f0b3be09

    @vadmium
    Member

    vadmium commented Nov 21, 2015

Thank you Serhiy for your help, and Ben for reporting this.

    @vadmium vadmium closed this as completed Nov 21, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    AA-Turner pushed a commit to AA-Turner/devguide that referenced this issue Sep 13, 2023
    GitHub-Issue-Link: python/cpython#69812
    
    The underlying zlib library stores sizes in “unsigned int”. The corresponding
    Python parameters are all sizes of buffers filled in by zlib, so it is okay
    to reduce higher values to the UINT_MAX internal cap. OverflowError is still
    raised for sizes that do not fit in Py_ssize_t.
    
    Sizes are now limited to Py_ssize_t rather than unsigned long, because Python
    byte strings cannot be larger than Py_ssize_t. Previously this could result
    in a SystemError on 32-bit platforms.
    
    This resolves a regression in the gzip module when reading more than UINT_MAX
    or LONG_MAX bytes in one call, introduced by revision 62723172412c.
    erlend-aasland pushed a commit to python/devguide that referenced this issue Sep 26, 2023
    GitHub-Issue-Link: python/cpython#69812
    