msg254668 | Author: Ben Cipollini (Ben Cipollini) | Date: 2015-11-14 19:48
Gzip fails when opening a file larger than 2**32 bytes. This is a new issue in Python 3.5.
We hit this opening large neuroimaging files from the Human Connectome Project. See https://github.com/nipy/nibabel/issues/362 for more details.
When size is > 2**32, we get the following error on Python 3.5:
/usr/lib/python3.5/gzip.py in read(self, size)
467 buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
468
--> 469 uncompress = self._decompressor.decompress(buf, size)
470 if self._decompressor.unconsumed_tail != b"":
471 self._fp.prepend(self._decompressor.unconsumed_tail)
OverflowError: Python int too large for C unsigned int

msg254684 | Author: Martin Panter (martin.panter) | Date: 2015-11-15 11:13
Thanks for the report. Can you confirm whether this demo illustrates your problem? I only have 2 GiB of memory, so I get a MemoryError, which seems reasonable for my situation.
from gzip import GzipFile
from io import BytesIO
file = BytesIO()
writer = GzipFile(fileobj=file, mode="wb")
writer.write(b"data")
writer.close()
file.seek(0)
reader = GzipFile(fileobj=file, mode="rb")
data = reader.read(2**32) # Ideally this should return b"data"
Assuming that triggers the OverflowError, the heart of the problem is that the zlib.decompressobj.decompress() method does not accept such large numbers for the length limit:
>>> import zlib
>>> decompressor = zlib.decompressobj(wbits=16 + 15)
>>> decompressor.decompress(file.getvalue(), 2**32)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python int too large for C unsigned int
I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library. Would this be okay for a 3.5.1 bug fix release, or would it be considered a feature change?
Failing that, another option would be to cap the limit in the gzip library, and just document the zlib limitation. I already have a patch in Issue 23200 documenting another quirk when max_length=0.
The same problem may also apply to the LZMA and bzip2 modules; I need to check.
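A quick way to check might look something like this (a sketch, not from the thread; it assumes the max_length parameter added to both decompressors in 3.5 and simply reports whether each one rejects a limit above 2**32):

import bz2
import lzma

# Probe whether the bz2 and lzma decompressors also choke on a
# length limit above 2**32 on this build.
for name, compress, Decompressor in (
    ("bz2", bz2.compress, bz2.BZ2Decompressor),
    ("lzma", lzma.compress, lzma.LZMADecompressor),
):
    blob = compress(b"data")
    decompressor = Decompressor()
    try:
        print(name, decompressor.decompress(blob, 2**32))
    except OverflowError as exc:
        print(name, "OverflowError:", exc)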

msg254686 | Author: Ben Cipollini (Ben Cipollini) | Date: 2015-11-15 12:20
@Martin: yes, that reproduces the problem.
>I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library.
If this cap is implemented, is there any workaround that lets us efficiently open gzipped files over 4 GB? Such files exist in high-priority, government-funded datasets, and neuroinformatics in Python needs a way to open them efficiently.
>another option would be to cap the limit in the gzip library
Again, these limitations are new in Python 3.5 and currently block us from using Python 3.5 to run neuroinformatics efficiently. Is there any chance of reverting this new behavior, or is there a known memory-efficient workaround?
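One possible interim workaround (a sketch, not from the thread; the helper name and chunk size are illustrative) is to stream the file in bounded chunks so that no single read() call requests anywhere near 2**32 bytes:

import gzip

CHUNK = 64 * 1024 * 1024  # 64 MiB per call; an arbitrary, illustrative size

def read_gzipped_bytes(path):
    # Read the member in bounded pieces so no single read() asks for
    # more than the zlib limit; the full result is still held in memory.
    parts = []
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            parts.append(chunk)
    return b"".join(parts)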

msg254717 | Author: Martin Panter (martin.panter) | Date: 2015-11-16 06:54
Ben: By adding the cap, it means when the lower-level zlib module is asked to decode say 5 GiB, it will decode at most 4 GiB (rather than raising an exception). The 4 GiB limit (UINT_MAX to be precise) is due to the underlying zlib library; it’s not Python’s doing, and the limit was present before 3.5 as well.
In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it will actually feed the decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data, which will decompress to at most 8 MB or so (the maximum compression ratio is 1:1032). Then it will feed the next 8 KiB chunk. So it does not matter whether the limit is 4 or 5 GiB, because there can only be about 8 MB of data per internal decompress() call.
Before 3.5, it first feeds a 1 KiB chunk, but exponentially increases the chunk size up to 10 MiB in each internal iteration.
If you think the new 3.5 implementation (with OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical. Usually I find once you process 10s or 100s of kB of data at once, any increases in the buffer size have negligible effect on the little time spent in native Python code, and times actually get worse as your buffers get too big, probably due to ineffective use of fast memory caching.
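As a rough pure-Python illustration of the loop described above (a sketch, not the actual gzip.py code; the max_length default is arbitrary):

import io
import zlib

def stream_gzip_decompress(fileobj, max_length=8 * 1024 * 1024):
    # Sketch of the 3.5-style loop: feed small compressed chunks and let
    # max_length bound how much each decompress() call may return.
    decompressor = zlib.decompressobj(wbits=16 + 15)  # expect a gzip header
    while not decompressor.eof:
        if decompressor.unconsumed_tail:
            buf = decompressor.unconsumed_tail   # input held back by the limit
        else:
            buf = fileobj.read(io.DEFAULT_BUFFER_SIZE)
            if not buf:
                break                            # truncated stream
        data = decompressor.decompress(buf, max_length)
        if data:
            yield data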
===
I propose this patch for the 3.5 branch, which implements the internal capping in the zlib module. It changes the zlib module to accept sizes greater than UINT_MAX, but internally limits them. I think this is the most practical way forward, because it does not require Python code to know the value of UINT_MAX (which could in theory vary by platform).
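In pure-Python terms the intended capping semantics are roughly as follows (a sketch only; the real patch is C code, and UINT_MAX here is an assumed 2**32 - 1 rather than the platform constant):

UINT_MAX = 2**32 - 1  # assumed value; the real cap is the platform's C UINT_MAX

def capped_decompress(decompressor, data, max_length):
    # Accept any Python int for the limit, but never pass the C binding
    # a value larger than it can represent as an unsigned int.
    return decompressor.decompress(data, min(max_length, UINT_MAX))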
It would be good to get a review of my patch. In particular, is this change okay for the 3.5 maintenance branch? It would also be good to confirm that my patch fixes the original problem, i.e. that read(2**32) works on computers with more than 4 GiB of memory (which I cannot test).
This regression is caused by Issue 23529, where I make sure that GzipFile.read(size) doesn’t buffer huge amounts of data from a decompression bomb.

msg254727 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2015-11-16 12:14
Is 3.4 affected? A uint_converter was added in Issue 18294.

msg254754 | Author: Martin Panter (martin.panter) | Date: 2015-11-16 20:42
3.4 shouldn’t be affected by the gzip regression. In 3.4 the gzip module does not set max_length to limit the decompression size.
In 3.4 the underlying zlib module will also raise OverflowError if max_length=2**32, and my patch could be applied, but I don’t think it is worth it at this stage.

msg254865 | Author: Martin Panter (martin.panter) | Date: 2015-11-19 00:16
I am inclined to commit my patch to get this fixed in the upcoming 3.5.1 release, unless anyone thinks that could be a problem.

msg254884 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2015-11-19 08:08
Could you please add tests for the other two functions? And tests for the gzip module?

msg254887 | Author: Martin Panter (martin.panter) | Date: 2015-11-19 08:34
Okay. For the gzip module, I cannot easily test this myself. Quickly looking at other cases, I guess it would look something like this, but I need to spend some time understanding the bigmemtest decorator properly:
@unittest.skipIf(sys.maxsize < _4G, "Requires non-32-bit system")
@test.support.bigmemtest(_4G, 1, dry_run=False)
def test_large_read(self, size):
    ...
    data = reader.read(size)  # Should not raise OverflowError
    self.assertEqual(data, b"data")
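A fuller, self-contained sketch of what such a test might look like (the class name, memuse figure and scaffolding are assumptions, not the test that was eventually committed):

import sys
import unittest
from gzip import GzipFile
from io import BytesIO
from test import support

_4G = 4 * 1024 ** 3

class LargeReadTest(unittest.TestCase):
    @unittest.skipIf(sys.maxsize < _4G, "Requires non-32-bit system")
    @support.bigmemtest(_4G, memuse=1, dry_run=False)
    def test_large_read(self, size):
        # Round-trip a tiny payload, then ask read() for far more than
        # 2**32 bytes; before the fix this raised OverflowError.
        file = BytesIO()
        with GzipFile(fileobj=file, mode="wb") as writer:
            writer.write(b"data")
        file.seek(0)
        with GzipFile(fileobj=file, mode="rb") as reader:
            data = reader.read(size)  # should not raise OverflowError
        self.assertEqual(data, b"data")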

msg254891 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2015-11-19 09:28
You can commit your patch right now (it shouldn't make things worse) and add new tests later.

msg254955 | Author: Martin Panter (martin.panter) | Date: 2015-11-20 08:11
Actually it did make an existing bug a bit worse. The bug is triggered with valid sizes greater than LONG_MAX, due to an exception not being cleared, and my change made this bug more dramatic, turning the OverflowError into a crash. This new patch should fix that.
I’m now looking at adding the other tests that require allocating lots of memory.
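A hypothetical way to exercise that follow-up bug, modelled on the interactive examples elsewhere in this thread (the exact behaviour depends on the platform's LONG_MAX and on which patch is applied; on an unpatched build the failure could be a SystemError or crash as described above):

import sys
import zlib

# A legitimate Python int for the limit that exceeds C LONG_MAX on
# typical 64-bit builds; on a fixed interpreter this either succeeds
# or raises a clean OverflowError.
decompressor = zlib.decompressobj()
try:
    print(decompressor.decompress(zlib.compress(b"data"), sys.maxsize + 1))
except OverflowError as exc:
    print("OverflowError:", exc)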

msg254963 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2015-11-20 09:45
Ah, I found yet another bug.
>>> zlib.decompress(zlib.compress(b'abcd'), 0, sys.maxsize+1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SystemError: Negative size passed to PyBytes_FromStringAndSize
There are similar bugs in the decompressor's decompress() and flush() methods, but they are hard to reproduce.
The maximum length value should be bounded not only by UINT_MAX, but also by PY_SSIZE_T_MAX.
I would merge sval and uval into one variable of type Py_ssize_t.
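In pure-Python terms the bound being described is roughly (a sketch; UINT_MAX is again an assumed 2**32 - 1, and sys.maxsize stands in for PY_SSIZE_T_MAX):

import sys

UINT_MAX = 2**32 - 1  # assumed; the real value is the platform's C UINT_MAX

def bounded_length(requested):
    # The effective limit must fit both the unsigned int that zlib sees
    # and the Py_ssize_t used when sizing the output bytes object.
    return min(requested, UINT_MAX, sys.maxsize)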

msg254972 | Author: Martin Panter (martin.panter) | Date: 2015-11-20 12:51
Thanks for your testing Serhiy. I can reproduce this on 32-bit Linux (also by building with CC="gcc -m32"). It is easy to produce the bug with flush(), but with Decompress.decompress() it would need over 2 GiB of memory and data to expand the buffer enough to trigger the error.
>>> zlib.decompressobj().flush(sys.maxsize + 1)
SystemError: Negative size passed to PyBytes_FromStringAndSize
In zlib-Py_ssize_t.patch, I added the bigmem tests, but I am not able to actually test them. The gzip test would require 8 GiB of memory. I also changed the converter to Py_ssize_t to fix this latest bug.

msg254976 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2015-11-20 14:23
Added new comments.

msg255032 | Author: Martin Panter (martin.panter) | Date: 2015-11-21 00:37
New patch addressing comments. I used the undocumented API _PyLong_FromNbInt() to call __int__() rather than __index__().

msg255044 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2015-11-21 07:00
Thank you Martin, the patch LGTM.
The difference between __int__() and __index__() is that float has the former but not the latter. For better compatibility we have to use __int__() in maintained releases, but obviously __index__() is more appropriate semantically: a float buffer length doesn't make sense. We have to change __int__ to __index__ in future releases, but that is a separate issue.
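A small illustration of that distinction, using operator.index(), which is the Python-level spelling of __index__():

import operator

print(int(4.0))           # 4: float defines __int__()
try:
    operator.index(4.0)   # float does not define __index__()
except TypeError as exc:
    print("TypeError:", exc)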

msg255048 | Author: Roundup Robot (python-dev) | Date: 2015-11-21 10:59
New changeset 80f6f77a7cc3 by Martin Panter in branch '3.5':
Issue #25626: Change zlib to accept Py_ssize_t and cap to UINT_MAX
https://hg.python.org/cpython/rev/80f6f77a7cc3
New changeset afa1b6cd77a5 by Martin Panter in branch 'default':
Issue #25626: Merge zlib fix from 3.5
https://hg.python.org/cpython/rev/afa1b6cd77a5
New changeset 5117f0b3be09 by Martin Panter in branch 'default':
Issue #25626: Add news to 3.6 section
https://hg.python.org/cpython/rev/5117f0b3be09

msg255049 | Author: Martin Panter (martin.panter) | Date: 2015-11-21 11:03
Thank you Serhiy for your help, and Ben for reporting this.

Date | User | Action | Args
2022-04-11 14:58:23 | admin | set | github: 69812
2015-11-21 11:03:41 | martin.panter | set | status: open -> closed; resolution: fixed; messages: + msg255049; stage: commit review -> resolved
2015-11-21 10:59:44 | python-dev | set | nosy: + python-dev; messages: + msg255048
2015-11-21 07:00:59 | serhiy.storchaka | set | assignee: martin.panter; stage: patch review -> commit review
2015-11-21 07:00:18 | serhiy.storchaka | set | messages: + msg255044
2015-11-21 00:37:36 | martin.panter | set | files: + zlib-Py_ssize_t.2.patch; messages: + msg255032
2015-11-20 14:23:12 | serhiy.storchaka | set | messages: + msg254976
2015-11-20 12:51:35 | martin.panter | set | files: + zlib-Py_ssize_t.patch; messages: + msg254972
2015-11-20 09:45:44 | serhiy.storchaka | set | messages: + msg254963
2015-11-20 08:11:42 | martin.panter | set | files: + zlib-size_t.2.patch; messages: + msg254955
2015-11-19 09:28:28 | serhiy.storchaka | set | messages: + msg254891
2015-11-19 08:34:26 | martin.panter | set | messages: + msg254887
2015-11-19 08:08:18 | serhiy.storchaka | set | messages: + msg254884
2015-11-19 00:16:58 | martin.panter | set | messages: + msg254865; components: + Extension Modules
2015-11-16 20:42:43 | martin.panter | set | messages: + msg254754
2015-11-16 12:14:39 | serhiy.storchaka | set | messages: + msg254727
2015-11-16 07:02:21 | serhiy.storchaka | set | nosy: + serhiy.storchaka
2015-11-16 06:54:40 | martin.panter | set | files: + zlib-size_t.patch; keywords: + patch; messages: + msg254717; stage: patch review
2015-11-15 12:20:34 | Ben Cipollini | set | messages: + msg254686
2015-11-15 11:13:22 | martin.panter | set | nosy: + martin.panter; messages: + msg254684; keywords: + 3.5regression; type: crash -> behavior
2015-11-14 20:16:04 | SilentGhost | set | nosy: + twouters, nadeem.vawda; components: + Library (Lib); versions: + Python 3.6
2015-11-14 19:55:00 | Matthew.Brett | set | nosy: + Matthew.Brett
2015-11-14 19:48:21 | Ben Cipollini | create |