
Gzip fails for file over 2**32 bytes #69812

Closed
BenCipollini mannequin opened this issue Nov 14, 2015 · 18 comments
Labels
  • extension-modules: C modules in the Modules dir
  • stdlib: Python modules in the Lib dir
  • type-bug: An unexpected behavior, bug, or error

Comments

@BenCipollini
Mannequin

BenCipollini mannequin commented Nov 14, 2015

BPO 25626
Nosy @Yhg1s, @vadmium, @serhiy-storchaka
Files
  • zlib-size_t.patch: Cap zlib to UINT_MAX
  • zlib-size_t.2.patch
  • zlib-Py_ssize_t.patch
  • zlib-Py_ssize_t.2.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/vadmium'
    closed_at = <Date 2015-11-21.11:03:41.734>
    created_at = <Date 2015-11-14.19:48:21.730>
    labels = ['extension-modules', 'type-bug', 'library']
    title = 'Gzip fails for file over 2**32 bytes'
    updated_at = <Date 2015-11-21.11:03:41.732>
    user = 'https://bugs.python.org/BenCipollini'

    bugs.python.org fields:

    activity = <Date 2015-11-21.11:03:41.732>
    actor = 'martin.panter'
    assignee = 'martin.panter'
    closed = True
    closed_date = <Date 2015-11-21.11:03:41.734>
    closer = 'martin.panter'
    components = ['Extension Modules', 'Library (Lib)']
    creation = <Date 2015-11-14.19:48:21.730>
    creator = 'Ben Cipollini'
    dependencies = []
    files = ['41053', '41090', '41096', '41107']
    hgrepos = []
    issue_num = 25626
    keywords = ['patch', '3.5regression']
    message_count = 18.0
    messages = ['254668', '254684', '254686', '254717', '254727', '254754', '254865', '254884', '254887', '254891', '254955', '254963', '254972', '254976', '255032', '255044', '255048', '255049']
    nosy_count = 7.0
    nosy_names = ['twouters', 'nadeem.vawda', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'Matthew.Brett', 'Ben Cipollini']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25626'
    versions = ['Python 3.5', 'Python 3.6']

    @BenCipollini
    Mannequin Author

    BenCipollini mannequin commented Nov 14, 2015

Gzip fails when opening a file of more than 2**32 bytes. This is a new issue in Python 3.5.

We hit this when opening large neuroimaging files from the Human Connectome Project. See nipy/nibabel#362 for more details.

    When size is > 2**32, we get the following error on Python 3.5:

    /usr/lib/python3.5/gzip.py in read(self, size)
    467 buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
    468
    --> 469 uncompress = self._decompressor.decompress(buf, size)
    470 if self._decompressor.unconsumed_tail != b"":
    471 self._fp.prepend(self._decompressor.unconsumed_tail)

    OverflowError: Python int too large for C unsigned int

@BenCipollini BenCipollini mannequin added the type-crash (A hard crash of the interpreter, possibly with a core dump) label Nov 14, 2015
@SilentGhost SilentGhost mannequin added the stdlib (Python modules in the Lib dir) label Nov 14, 2015
    @vadmium
    Member

    vadmium commented Nov 15, 2015

Thanks for the report. Can you confirm whether this demo illustrates your problem? I only have 2 GiB of memory, so I get a MemoryError, which seems reasonable for my situation.

    from gzip import GzipFile
    from io import BytesIO
    file = BytesIO()
    writer = GzipFile(fileobj=file, mode="wb")
    writer.write(b"data")
    writer.close()
    file.seek(0)
    reader = GzipFile(fileobj=file, mode="rb")
    data = reader.read(2**32)  # Ideally this should return b"data"

    Assuming that triggers the OverflowError, then the heart of the problem is that the zlib.decompressobj.decompress() method does not accept such large numbers for the length limit:

    >>> import zlib
    >>> decompressor = zlib.decompressobj(wbits=16 + 15)
    >>> decompressor.decompress(file.getvalue(), 2**32)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: Python int too large for C unsigned int

    I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library. Would this be okay for a 3.5.1 bug fix release, or would it be considered a feature change?

    Failing that, another option would be to cap the limit in the gzip library, and just document the zlib limitation. I already have a patch in bpo-23200 documenting another quirk when max_length=0.

    The same problem may also apply to the LZMA and bzip2 modules; I need to check.
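
The same capping idea can also be sketched at the caller level, which doubles as a workaround on an affected 3.5.0: clamp each read request to 2**32 - 1 and loop until the requested amount or EOF is reached. This is only an illustration; the read_capped helper and the assumption that UINT_MAX equals 2**32 - 1 are mine, not part of gzip.

_UINT_MAX = 2**32 - 1  # assumed value of the platform's UINT_MAX

def read_capped(reader, size):
    # Hypothetical helper: read up to `size` bytes from an open GzipFile,
    # never asking for more than UINT_MAX bytes in a single read() call.
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = reader.read(min(remaining, _UINT_MAX))
        if not chunk:  # EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Example (hypothetical): data = read_capped(gzip.GzipFile("huge.nii.gz"), 2**33)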

@vadmium vadmium added type-bug (An unexpected behavior, bug, or error) and removed type-crash (A hard crash of the interpreter, possibly with a core dump) labels Nov 15, 2015
    @BenCipollini
    Mannequin Author

    BenCipollini mannequin commented Nov 15, 2015

    @martin: yes, that reproduces the problem.

    I think the ideal fix would be to cap the limit at 2**32 - 1 in the zlib library.

If this cap is implemented, is there any workaround that lets us efficiently open gzipped files over 4 GB? Such files exist in high-priority, government-funded datasets, and neuroinformatics in Python requires a way to open them efficiently.

    another option would be to cap the limit in the gzip library

Again, these limitations are new in Python 3.5 and currently block us from running neuroinformatics workloads efficiently on Python 3.5. Is there any chance of reverting this new behavior, or is there any known memory-efficient workaround?

    @vadmium
    Member

    vadmium commented Nov 16, 2015

Ben: with the cap added, when the lower-level zlib module is asked to decode, say, 5 GiB, it will decode at most 4 GiB rather than raise an exception. The 4 GiB limit (UINT_MAX to be precise) is due to the underlying zlib library; it's not Python's doing, and the limit was present before 3.5 as well.

In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it will actually feed the decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data, which will decompress to at most 8 MB or so (the maximum compression ratio is 1:1032). Then it will feed the next 8 KiB chunk. So it does not matter whether the limit is 4 or 5 GiB, because each internal decompress() call can only produce about 8 MB of data.

Before 3.5, it first feeds a 1 KiB chunk, and exponentially increases the chunk size, up to 10 MiB, in each internal iteration.

If you think the new 3.5 implementation (with the OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical: usually I find that once you process tens or hundreds of kB of data at once, further increases in buffer size have a negligible effect on the little time spent in native Python code, and times actually get worse as buffers become too big, probably due to ineffective use of fast memory caching.
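
To make that concrete, here is a rough sketch of the 3.5 read path. It is simplified: the real code in gzip.py also handles the unconsumed tail, CRC checking and multi-member files, and the function name read_sketch is just for illustration.

import io
import zlib

def read_sketch(fp, size):
    # Simplified view of GzipFile.read(size) in 3.5: feed fixed 8 KiB
    # compressed chunks and pass `size` only as zlib's max_length cap.
    decompressor = zlib.decompressobj(wbits=16 + 15)  # expects a gzip header
    output = []
    while size > 0:
        buf = fp.read(io.DEFAULT_BUFFER_SIZE)  # 8 KiB of compressed input
        if not buf:
            break
        # 8 KiB of input expands to at most ~8 MiB (1:1032), so whether the
        # cap is 4 GiB or 5 GiB makes no difference to any single call.
        data = decompressor.decompress(buf, size)
        output.append(data)
        size -= len(data)
    return b"".join(output)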

    ===

    I propose this patch for the 3.5 branch, which implements the internal capping in the zlib module. It changes the zlib module to accept sizes greater than UINT_MAX, but internally limits them. I think this is the most practical way forward, because it does not require Python code to know the value of UINT_MAX (which could in theory vary by platform).

It would be good to get a review of my patch. In particular, is this change okay for the 3.5 maintenance branch? It would also be good to confirm that my patch fixes the original problem, i.e. that read(2**32) works on computers with more than 4 GiB of memory (which I cannot test).

    This regression is caused by bpo-23529, where I make sure that GzipFile.read(size) doesn’t buffer huge amounts of data from a decompression bomb.
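
Assuming the patch behaves as described, the lower-level reproducer from earlier in the thread should then succeed rather than raise. A quick sketch of the expected post-patch behaviour (not code taken from the patch itself):

import gzip
import zlib

compressed = gzip.compress(b"data")
decompressor = zlib.decompressobj(wbits=16 + 15)
# With the internal cap, a max_length larger than UINT_MAX is accepted and
# silently limited to UINT_MAX instead of raising OverflowError.
assert decompressor.decompress(compressed, 2**32) == b"data"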

    @serhiy-storchaka
    Member

    Is 3.4 affected? A uint_converter was added in bpo-18294.

    @vadmium
    Member

    vadmium commented Nov 16, 2015

    3.4 shouldn’t be affected by the gzip regression. In 3.4 the gzip module does not set max_length to limit the decompression size.

In 3.4 the underlying zlib module will also raise OverflowError if max_length=2**32, and my patch could be applied, but I don’t think it is worth it at this stage.

    @vadmium
    Member

    vadmium commented Nov 19, 2015

    I am inclined to commit my patch to get this fixed in the upcoming 3.5.1 release, unless anyone thinks that could be a problem.

    @vadmium vadmium added the extension-modules C modules in the Modules dir label Nov 19, 2015
    @serhiy-storchaka
    Member

Could you please add tests for the other two functions? And tests for the gzip module?

    @vadmium
    Member

    vadmium commented Nov 19, 2015

Okay. For the gzip module, I cannot easily test this myself. Quickly looking at other cases, I guess it would look something like this, but I need to spend some time understanding the bigmemtest decorator properly:

@unittest.skipIf(sys.maxsize < _4G, "Requires non-32-bit system")
@test.support.bigmemtest(_4G, 1, dry_run=False)
def test_large_read(self, size):
    ...
    data = reader.read(size)  # Should not raise OverflowError
    self.assertEqual(data, b"data")
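
A fleshed-out version might look like the following. This is a guess at how such a test could be written, not the test that was eventually committed; the class name and test body are illustrative, and bigmemtest skips the test unless regrtest is told (via -M) that enough memory is available.

import gzip
import io
import sys
import unittest
from test.support import bigmemtest, _4G

class GzipLargeReadTest(unittest.TestCase):
    @unittest.skipIf(sys.maxsize < _4G, "Requires a 64-bit build")
    @bigmemtest(_4G, memuse=1, dry_run=False)
    def test_large_read(self, size):
        # Write a tiny gzip member, then ask read() for more than 4 GiB;
        # before the fix this raised OverflowError inside zlib.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as writer:
            writer.write(b"data")
        buf.seek(0)
        with gzip.GzipFile(fileobj=buf, mode="rb") as reader:
            data = reader.read(size)  # should not raise OverflowError
        self.assertEqual(data, b"data")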

    @serhiy-storchaka
    Member

    You can commit your patch right now (it shouldn't make things worse) and add new tests later.

    @vadmium
    Member

    vadmium commented Nov 20, 2015

    Actually it did make an existing bug a bit worse. The bug is triggered with valid sizes greater than LONG_MAX, due to an exception not being cleared, and my change made this bug more dramatic, turning the OverflowError into a crash. This new patch should fix that.

    I’m now looking at adding the other tests that require allocating lots of memory.

    @serhiy-storchaka
    Member

Ah, I found yet another bug.

    >>> zlib.decompress(zlib.compress(b'abcd'), 0, sys.maxsize+1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    SystemError: Negative size passed to PyBytes_FromStringAndSize

There are similar bugs in the decompressor's decompress() and flush() methods, but they are harder to reproduce.

The maximum length value should be bounded not only by UINT_MAX, but by PY_SSIZE_T_MAX too.

    I would merge sval and uval into one variable of type Py_ssize_t.

    @vadmium
    Member

    vadmium commented Nov 20, 2015

Thanks for your testing, Serhiy. I can reproduce this on 32-bit Linux (also by building with CC="gcc -m32"). It is easy to trigger the bug with flush(), but with Decompress.decompress() it would need over 2 GiB of memory and data to expand the buffer enough to trigger the error.

    >>> zlib.decompressobj().flush(sys.maxsize + 1)
    SystemError: Negative size passed to PyBytes_FromStringAndSize

    In zlib-Py_ssize_t.patch, I added the bigmem tests, but I am not able to actually test them. The gzip test would require 8 GiB of memory. I also changed the converter to Py_ssize_t to fix this latest bug.

    @serhiy-storchaka
    Member

Added new review comments.

    @vadmium
    Member

    vadmium commented Nov 21, 2015

    New patch addressing comments. I used the undocumented API _PyLong_FromNbInt() to call __int__() rather than __index__().

    @serhiy-storchaka
    Member

    Thank you Martin, the patch LGTM.

The difference between __int__() and __index__() is that float has the former but not the latter. For better compatibility we have to use __int__() in maintained releases, but obviously __index__() is more appropriate semantically: a float buffer length doesn't make sense. We should change __int__ to __index__ in future releases, but that is a separate issue.
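
For reference, the distinction in standard Python behaviour (operator.index() is the Python-level face of __index__):

>>> import operator
>>> (2.0).__int__()      # float defines __int__ ...
2
>>> operator.index(2.0)  # ... but not __index__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer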

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 21, 2015

    New changeset 80f6f77a7cc3 by Martin Panter in branch '3.5':
    Issue bpo-25626: Change zlib to accept Py_ssize_t and cap to UINT_MAX
    https://hg.python.org/cpython/rev/80f6f77a7cc3

    New changeset afa1b6cd77a5 by Martin Panter in branch 'default':
    Issue bpo-25626: Merge zlib fix from 3.5
    https://hg.python.org/cpython/rev/afa1b6cd77a5

    New changeset 5117f0b3be09 by Martin Panter in branch 'default':
    Issue bpo-25626: Add news to 3.6 section
    https://hg.python.org/cpython/rev/5117f0b3be09

    @vadmium
    Member

    vadmium commented Nov 21, 2015

Thank you Serhiy for your help, and Ben for reporting this.

    @vadmium vadmium closed this as completed Nov 21, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    AA-Turner pushed a commit to AA-Turner/devguide that referenced this issue Sep 13, 2023
    GitHub-Issue-Link: python/cpython#69812
    
    The underlying zlib library stores sizes in “unsigned int”. The corresponding
    Python parameters are all sizes of buffers filled in by zlib, so it is okay
    to reduce higher values to the UINT_MAX internal cap. OverflowError is still
    raised for sizes that do not fit in Py_ssize_t.
    
    Sizes are now limited to Py_ssize_t rather than unsigned long, because Python
    byte strings cannot be larger than Py_ssize_t. Previously this could result
    in a SystemError on 32-bit platforms.
    
    This resolves a regression in the gzip module when reading more than UINT_MAX
    or LONG_MAX bytes in one call, introduced by revision 62723172412c.
    erlend-aasland pushed a commit to python/devguide that referenced this issue Sep 26, 2023
    GitHub-Issue-Link: python/cpython#69812
    