This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: zlib decompress as_bytearray flag
Type: enhancement
Components: Extension Modules
Versions: Python 3.6

process
Status: open
Priority: normal
Keywords: patch
Nosy List: llllllllll, martin.panter, nadeem.vawda, twouters

Created on 2016-02-18 04:08 by llllllllll, last changed 2022-04-11 14:58 by admin.

Files
zlib.patch (uploaded by llllllllll, 2016-02-18 04:08)
Messages (4)
msg260424 - Author: Joe Jevnik (llllllllll) Date: 2016-02-18 04:08
Adds a keyword-only flag to zlib.decompress and zlib.Decompress.decompress that indicates the return value should be a bytearray instead of bytes.

The use case for this is reading compressed data into a mutable buffer for further operations, without requiring the user to copy the data first. Often, if a user is choosing to compress the data, there can be a real cost to copying the uncompressed bytes into a mutable buffer.
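
For illustration, a hypothetical call under the proposed API (the keyword name comes from the issue title and exists only in the attached patch):

import zlib

compressed = zlib.compress(b"some payload")

# Hypothetical until the patch lands: ask for a mutable result directly.
buf = zlib.decompress(compressed, as_bytearray=True)
assert isinstance(buf, bytearray)
buf[0:4] = b"SOME"  # mutate in place, no intermediate bytes copy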

The impetus for this change comes from a patch for Pandas (https://github.com/pydata/pandas/pull/12359). I have also worked on a similar feature for blosc, another popular compression library for Python (https://github.com/Blosc/python-blosc/pull/107).

My hope is to fix the hacky solution presented in the pandas patch by supporting this feature directly in the compression libraries.

As a side note: this is my first time using Argument Clinic and I love it. It was so simple to add this argument; thank you very much for developing such an amazing tool!
msg260447 - Author: Martin Panter (martin.panter) (Python committer) Date: 2016-02-18 10:48
This is an unusual technique. I can’t think of any other standard library routine that returns a new bytearray instead of a bytes object.

Normally there are sister functions with an -into() suffix that accept a pre-allocated buffer. Examples are BufferedIOBase.readinto(), struct.pack_into(), socket.recv_into(). What do you think about adding a decompressobj.decompress_into(input, output) method instead? If necessary you could use it in a wrapper, something like:

import zlib

def decompress_as_bytearray(data, wbits=15, bufsize=16384):
    # Note: decompressobj.decompress_into() is the method proposed above;
    # it does not exist yet, so this wrapper is only a sketch.
    decompressor = zlib.decompressobj(wbits=wbits)
    buffer = bytearray(bufsize + 1)  # Overallocate to help detect EOF
    offset = 0
    while True:
        with memoryview(buffer) as view:
            offset += decompressor.decompress_into(data, view[offset:])
        if offset < len(buffer):
            break  # Buffer was not filled, so the output is complete
        data = decompressor.unconsumed_tail
        buffer *= 2  # Expand the buffer
    del buffer[offset:]  # Trim the unused tail
    if not decompressor.eof:
        raise zlib.error("Incomplete data")
    return buffer
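
For comparison, here is the existing -into() convention with an API that already exists today (BufferedIOBase.readinto on a BytesIO); the caller owns and preallocates the buffer:

import io

src = io.BytesIO(b"hello world")
buf = bytearray(5)       # caller-owned, preallocated buffer
n = src.readinto(buf)    # fills buf in place and returns the byte count
assert n == 5 and buf == bytearray(b"hello")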

If anything is added for zlib decompression, it would be good to add equivalent functionality in the bz2 and LZMA modules.
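
Both modules already mirror zlib's incremental decompressor shape, so a parallel addition would be natural (a quick check using only existing APIs):

import bz2, lzma

# bz2 and lzma expose the same one-shot and incremental decompression
# surface as zlib, so any new method or flag could be mirrored there.
assert bz2.BZ2Decompressor().decompress(bz2.compress(b"data")) == b"data"
assert lzma.LZMADecompressor().decompress(lzma.compress(b"data")) == b"data"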
msg260480 - Author: Joe Jevnik (llllllllll) Date: 2016-02-18 19:51
The recipe you showed looks very complicated for the expected use case of decompressing all of the data into a mutable buffer. One difference I see between this and struct.pack_into or socket.recv_into is that in both of those cases we know how large our buffer needs to be. With zlib.decompress there is no simple way for users to preallocate a buffer to hold the result; they have to resort to some looping with a lot of resizes, as in the sketch below. I think that returning either a mutable or an immutable buffer hits 95% of use cases. If we go with the decompress_into method, I would strongly advise putting decompress_as_bytearray into the standard library so that users do not need to write this themselves.
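
For reference, the loop users must write today with only existing APIs; every append past the bytearray's current capacity may trigger a resize:

import zlib

def decompress_to_bytearray(data, chunk=16384):
    # Chunked decompression with today's API: max_length caps each step,
    # and unconsumed_tail carries the remaining compressed input forward.
    d = zlib.decompressobj()
    out = bytearray(d.decompress(data, chunk))
    while d.unconsumed_tail:
        out += d.decompress(d.unconsumed_tail, chunk)
    if not d.eof:
        raise zlib.error("incomplete or truncated stream")
    return out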

Thank you for pointing me at the bz2 and LZMA modules. When we come to an agreement on the API I will update bz2 and LZMA so that they match. Should I also update the gzip module?
msg260481 - Author: Martin Panter (martin.panter) (Python committer) Date: 2016-02-18 21:16
If you don’t know how large the buffer is when you call zlib.decompress(), there may be copying involved as the buffer is expanded. Have you found that copying the bytes result into a bytearray is significantly worse than expanding a bytearray (e.g. timing results)?
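
A minimal timeit sketch for that comparison (a measurement recipe, not results):

import timeit, zlib

payload = zlib.compress(bytes(10_000_000))

plain = timeit.timeit("zlib.decompress(payload)",
                      globals=globals(), number=10)
copied = timeit.timeit("bytearray(zlib.decompress(payload))",
                       globals=globals(), number=10)
print(f"bytes only: {plain:.3f}s  with bytearray copy: {copied:.3f}s")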

The trouble with this proposal is that to me it doesn’t seem like there is a high demand for it. That is why I suggested the more minimal, low-level decompress_into() version. I think I would like to hear other opinions.

As for updating GzipFile, that uses an intermediate BufferedReader, so we cannot completely avoid copying. But it might be worthwhile updating the internal readinto() method anyway. Possibly also with the zipfile module, though I am not so familiar with that.
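
To make the GzipFile point concrete: readinto() already works there through the inherited BufferedIOBase machinery, but an intermediate copy remains (a small demonstration, assuming the default inherited implementation):

import gzip, io

blob = gzip.compress(b"hello world")
f = gzip.GzipFile(fileobj=io.BytesIO(blob))
buf = bytearray(11)
n = f.readinto(buf)  # inherited readinto(): decompresses, then copies into buf
assert n == 11 and buf == bytearray(b"hello world")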
History
Date                 User           Action  Args
2022-04-11 14:58:27  admin          set     github: 70567
2016-02-20 00:56:51  terry.reedy    set     nosy: + twouters, nadeem.vawda
2016-02-18 21:16:21  martin.panter  set     messages: + msg260481
2016-02-18 19:51:31  llllllllll     set     messages: + msg260480
2016-02-18 10:48:46  martin.panter  set     nosy: + martin.panter; messages: + msg260447
2016-02-18 04:08:20  llllllllll     create