Issue 25626: Gzip fails for file over 2**32 bytes

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69812

classification

Title:	Gzip fails for file over 2**32 bytes
Type:	behavior	Stage:	resolved
Components:	Extension Modules, Library (Lib)	Versions:	Python 3.6, Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	martin.panter	Nosy List:	Ben Cipollini, Matthew.Brett, martin.panter, nadeem.vawda, python-dev, serhiy.storchaka, twouters
Priority:	normal	Keywords:	3.5regression, patch

Created on 2015-11-14 19:48 by Ben Cipollini, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
zlib-size_t.patch	martin.panter, 2015-11-16 06:54	Cap zlib to UINT_MAX	review
zlib-size_t.2.patch	martin.panter, 2015-11-20 08:11		review
zlib-Py_ssize_t.patch	martin.panter, 2015-11-20 12:51		review
zlib-Py_ssize_t.2.patch	martin.panter, 2015-11-21 00:37		review

Messages (18)
msg254668 - (view)	Author: Ben Cipollini (Ben Cipollini)	Date: 2015-11-14 19:48
Gzip fails when opening a file more than 232 bytes. This is a new issue in Python 3.5. We hit this opening large neuroimaging files from the Human Connectome Project. See https://github.com/nipy/nibabel/issues/362 for more details. When size is > 232, we get the following error on Python 3.5: /usr/lib/python3.5/gzip.py in read(self, size) 467 buf = self._fp.read(io.DEFAULT_BUFFER_SIZE) 468 --> 469 uncompress = self._decompressor.decompress(buf, size) 470 if self._decompressor.unconsumed_tail != b"": 471 self._fp.prepend(self._decompressor.unconsumed_tail) OverflowError: Python int too large for C unsigned int
msg254684 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-15 11:13
Thanks for the report. Can you confirm if this demo illustrates your problem? For me, I only have 2 GiB of memory so I get a MemoryError, which seems reasonable for my situation. from gzip import GzipFile from io import BytesIO file = BytesIO() writer = GzipFile(fileobj=file, mode="wb") writer.write(b"data") writer.close() file.seek(0) reader = GzipFile(fileobj=file, mode="rb") data = reader.read(232) # Ideally this should return b"data" Assuming that triggers the OverflowError, then the heart of the problem is that the zlib.decompressobj.decompress() method does not accept such large numbers for the length limit: >>> import zlib >>> decompressor = zlib.decompressobj(wbits=16 + 15) >>> decompressor.decompress(file.getvalue(), 232) Traceback (most recent call last): File "<stdin>", line 1, in <module> OverflowError: Python int too large for C unsigned int I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library. Would this be okay for a 3.5.1 bug fix release, or would it be considered a feature change? Failing that, another option would be to cap the limit in the gzip library, and just document the zlib limitation. I already have a patch in Issue 23200 documenting another quirk when max_length=0. The same problem may also apply to the LZMA and bzip2 modules; I need to check.
msg254686 - (view)	Author: Ben Cipollini (Ben Cipollini)	Date: 2015-11-15 12:20
@Martin: yes, that reproduces the problem. >I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library. If this cap is implemented, is there any workaround how we can efficiently open gzipped files over 4GB? Such files exist, exist in high-priority, government-funded datasets, and neuroinformatics in Python requires a way to open them efficiently. >another option would be to cap the limit in the gzip library Again, these limitations are new in Python 3.5 and currently block us from using Python 3.5 to run neuroinformatics efficiently. Is there any chance to revert this new behavior, or any known memory-efficient workaround?
msg254717 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-16 06:54
Ben: By adding the cap, it means when the lower-level zlib module is asked to decode say 5 GiB, it will decode at most 4 GiB (rather than raising an exception). The 4 GiB limit (UINT_MAX to be precise) is due to the underlying zlib library; it’s not Python’s doing, and the limit was present before 3.5 as well. In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it will actually feed the decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data, which will decompress to at most 8 MB or so (max compression ratio is 1:1032). Then it will feed the next 8 KiB chunk. So it does not matter whether the limit is 4 or 5 GiB because there can only be 8ish MB data per internal decompress() call. Before 3.5, it first feeds a 1 KiB chunk, but exponentially increases the chunk size up to 10 MiB in each internal iteration. If you think the new 3.5 implementation (with OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical. Usually I find once you process 10s or 100s of kB of data at once, any increases in the buffer size have negligible effect on the little time spent in native Python code, and times actually get worse as your buffers get too big, probably due to ineffective use of fast memory caching. === I propose this patch for the 3.5 branch, which implements the internal capping in the zlib module. It changes the zlib module to accept sizes greater than UINT_MAX, but internally limits them. I think this is the most practical way forward, because it does not require Python code to know the value of UINT_MAX (which could in theory vary by platform). I would be good to get a review of my patch. In particular, is this change okay for the 3.5 maintainence branch? It would also be good to confirm that my patch fixes the original problem, i.e. read(2**32) works on computers with more than 4 GiB of memory (which I cannot test). This regression is caused by Issue 23529, where I make sure that GzipFile.read(size) doesn’t buffer huge amounts of data from a decompression bomb.
msg254727 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-11-16 12:14
Is 3.4 affected? A uint_converter was added in issue18294.
msg254754 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-16 20:42
3.4 shouldn’t be affected by the gzip regression. In 3.4 the gzip module does not set max_length to limit the decompression size. In 3.4 the underlying zlib module will also raise OverflowError if max_lenth=2**32, and my patch could be applied, but I don’t think it is worth it at this stage.
msg254865 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-19 00:16
I am inclined to commit my patch to get this fixed in the upcoming 3.5.1 release, unless anyone thinks that could be a problem.
msg254884 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-11-19 08:08
Could you please add tests for other two functions? And tests for the gzip module?
msg254887 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-19 08:34
Okay. For the gzip module, I cannot easily test this myself. Quickly looking at other cases, I guess it would look something like this, but I need to spend some time understanging the bigmemtest decorator properly: @unittest.skipIf(sys.maxsize < _4G, "Requires non-32-bit system") @test.support.bigmemtest(_4G, 1, dry_run=False) def test_large_read(self, size): ... data = reader.read(size) # Should not raise OverflowError self.assertEqual(data, b"data")
msg254891 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-11-19 09:28
You can commit your patch right now (it shouldn't make things worse) and add new tests later.
msg254955 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-20 08:11
Actually it did make an existing bug a bit worse. The bug is triggered with valid sizes greater than LONG_MAX, due to an exception not being cleared, and my change made this bug more dramatic, turning the OverflowError into a crash. This new patch should fix that. I’m now looking at adding the other tests that require allocating lots of memory.
msg254963 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-11-20 09:45
Ah, I found yet one bug. >>> zlib.decompress(zlib.compress(b'abcd'), 0, sys.maxsize+1) Traceback (most recent call last): File "<stdin>", line 1, in <module> SystemError: Negative size passed to PyBytes_FromStringAndSize There are similar bugs in decompressor's methods decompress() and flush() but it is hard to reproduce them. The maximal length value should be bounded not only with UINT_MAX, but with PY_SSIZE_T_MAX too. I would merge sval and uval into one variable of type Py_ssize_t.
msg254972 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-20 12:51
Thanks for your testing Serhiy. I can reproduce this on 32-bit Linux (also by building with CC="gcc -m32"). It is easy to produce the bug with flush(), but with Decompress.decompress() it would need over 2 GiB of memory and data to expand the buffer enough to trigger the error. >>> zlib.decompressobj().flush(sys.maxsize + 1) SystemError: Negative size passed to PyBytes_FromStringAndSize In zlib-Py_ssize_t.patch, I added the bigmem tests, but I am not able to actually test them. The gzip test would require 8 GiB of memory. I also changed the converter to Py_ssize_t to fix this latest bug.
msg254976 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-11-20 14:23
Added new comments.
msg255032 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-21 00:37
New patch addressing comments. I used the undocumented API _PyLong_FromNbInt() to call __int__() rather than __index__().
msg255044 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-11-21 07:00
Thank you Martin, the patch LGTM. The difference between __int__() and __index__() is that float has the former, but not the latter. For better compatibility we have to use __int__() in maintained releases. But obviously __index__() is more appropriate semantically. float buffer length doesn't make sense. We have to change __int__ to __index__ in future releases. But this is separate issue.
msg255048 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-11-21 10:59
New changeset 80f6f77a7cc3 by Martin Panter in branch '3.5': Issue #25626: Change zlib to accept Py_ssize_t and cap to UINT_MAX https://hg.python.org/cpython/rev/80f6f77a7cc3 New changeset afa1b6cd77a5 by Martin Panter in branch 'default': Issue #25626: Merge zlib fix from 3.5 https://hg.python.org/cpython/rev/afa1b6cd77a5 New changeset 5117f0b3be09 by Martin Panter in branch 'default': Issue #25626: Add news to 3.6 section https://hg.python.org/cpython/rev/5117f0b3be09
msg255049 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-11-21 11:03
Thank you Serhiy for you help, and Ben for reporting this.

History
Date	User	Action	Args
2022-04-11 14:58:23	admin	set	github: 69812
2015-11-21 11:03:41	martin.panter	set	status: open -> closed resolution: fixed messages: + msg255049 stage: commit review -> resolved
2015-11-21 10:59:44	python-dev	set	nosy: + python-dev messages: + msg255048
2015-11-21 07:00:59	serhiy.storchaka	set	assignee: martin.panter stage: patch review -> commit review
2015-11-21 07:00:18	serhiy.storchaka	set	messages: + msg255044
2015-11-21 00:37:36	martin.panter	set	files: + zlib-Py_ssize_t.2.patch messages: + msg255032
2015-11-20 14:23:12	serhiy.storchaka	set	messages: + msg254976
2015-11-20 12:51:35	martin.panter	set	files: + zlib-Py_ssize_t.patch messages: + msg254972
2015-11-20 09:45:44	serhiy.storchaka	set	messages: + msg254963
2015-11-20 08:11:42	martin.panter	set	files: + zlib-size_t.2.patch messages: + msg254955
2015-11-19 09:28:28	serhiy.storchaka	set	messages: + msg254891
2015-11-19 08:34:26	martin.panter	set	messages: + msg254887
2015-11-19 08:08:18	serhiy.storchaka	set	messages: + msg254884
2015-11-19 00:16:58	martin.panter	set	messages: + msg254865 components: + Extension Modules
2015-11-16 20:42:43	martin.panter	set	messages: + msg254754
2015-11-16 12:14:39	serhiy.storchaka	set	messages: + msg254727
2015-11-16 07:02:21	serhiy.storchaka	set	nosy: + serhiy.storchaka
2015-11-16 06:54:40	martin.panter	set	files: + zlib-size_t.patch keywords: + patch messages: + msg254717 stage: patch review
2015-11-15 12:20:34	Ben Cipollini	set	messages: + msg254686
2015-11-15 11:13:22	martin.panter	set	nosy: + martin.panter messages: + msg254684 keywords: + 3.5regression type: crash -> behavior
2015-11-14 20:16:04	SilentGhost	set	nosy: + twouters, nadeem.vawda components: + Library (Lib) versions: + Python 3.6
2015-11-14 19:55:00	Matthew.Brett	set	nosy: + Matthew.Brett
2015-11-14 19:48:21	Ben Cipollini	create