Issue 26379: zlib decompress as_bytearray flag

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70567

classification

Title:	zlib decompress as_bytearray flag
Type:	enhancement	Stage:
Components:	Extension Modules	Versions:	Python 3.6

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	llllllllll, martin.panter, nadeem.vawda, twouters
Priority:	normal	Keywords:	patch

Created on 2016-02-18 04:08 by llllllllll, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
zlib.patch	llllllllll, 2016-02-18 04:08		review

Messages (4)
msg260424 - (view)	Author: Joe Jevnik (llllllllll) *	Date: 2016-02-18 04:08
Adds a keyword only flag to zlib.decompress and zlib.Decompress.decompress which marks that the return type should be a bytearray instead of bytes. The use case for this is reading compressed data into a mutable array to do further operations without requiring that the user copy the data first. Often, if a user is choosing to compress the data there might be a real cost to copying the uncompressed bytes into a mutable buffer. The impetus for this change comes from a patch for Pandas (https://github.com/pydata/pandas/pull/12359). I have also worked on a similar feature for blosc, another popular compression library for python (https://github.com/Blosc/python-blosc/pull/107). My hope is to fix the hacky solution presented in the pandas patch by supporting this feature directly in the compression libraries. As a side note: this is my first time using the argument clinic and I love it. It was so simple to add this argument, thank you very much for developing such an amazing tool!
msg260447 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-02-18 10:48
This is an unusual technique. I can’t think of any other standard library routine that returns a new bytearray instead of a bytes object. Normally there are sister functions with an -into() suffix that accept a pre-allocated buffer. Examples are BufferedIOBase.readinto(), struct.pack_into(), socket.recv_into(). What do you think about adding a decompressobj.decompress_into(input, output) method instead? If necessary you could use it in a wrapper, something like: def decompress_as_bytearray(data, wbits=15, bufsize=16384): decompressor = zlib.decompressobj(wbits=wbits) buffer = bytearray(bufsize + 1) # Overallocate to help detect EOF offset = 0 while True: with memoryview(buffer) as view: offset += decompressor.decompress_into(data, view[offset:]) if offset < len(buffer): break data = decompressor.unconsumed_tail buffer *= 2 # Expand the buffer del buffer[offset:] if not decompressor.eof: raise zlib.error("Incomplete data") return buffer If anything is added for zlib decompression, it would be good to add equivalent functionality in the bz2 and LZMA modules.
msg260480 - (view)	Author: Joe Jevnik (llllllllll) *	Date: 2016-02-18 19:51
The recipe you showed looks like it is very complicated for the expected use case of decompressing all of the data into a mutable buffer. One difference I see with this and struct.pack_into or socket.recv_into is that in both of those cases we know how large our buffer needs to be. With zlib.decompress there is not a simple way for users to preallocate a buffer to hold the result and need to resort to some looping with a lot of resizes. I think that returning either a mutable or immutable buffer hits 95% of use cases. If we go with the decompress_into method I would strongly advise putting decompress_as_bytearray into the standard library so that users do not need write this themselves. Thank you for pointing me at the bz2 and LZMA modules. When we come to an agreement on the API I will update bz2 and LZMA so that they match. Should I also update the gzip module?
msg260481 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-02-18 21:16
If you don’t know how large the buffer is when you call zlib.decompress(), there may be copying involved as the buffer is expanded. Have you found that the copying the bytes result into a bytearray is significantly worse than expanding a bytearray (e.g. timing results)? The trouble with this proposal is that to me it doesn’t seem like there is a high demand for it. That is why I suggested the more minimal, low-level decompress_into() version. I think I would like to hear other opinions. As for updating GzipFile, that uses an intermediate BufferedReader, so we cannot completely avoid copying. But it might be worthwhile updating the internal readinto() method anyway. Possibly also with the zipfile module, though I am not so familiar with that.

History
Date	User	Action	Args
2022-04-11 14:58:27	admin	set	github: 70567
2016-02-20 00:56:51	terry.reedy	set	nosy: + twouters, nadeem.vawda
2016-02-18 21:16:21	martin.panter	set	messages: + msg260481
2016-02-18 19:51:31	llllllllll	set	messages: + msg260480
2016-02-18 10:48:46	martin.panter	set	nosy: + martin.panter messages: + msg260447
2016-02-18 04:08:20	llllllllll	create