Issue 5804: Add an 'offset' argument to zlib.decompress

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50054

classification

Title:	Add an 'offset' argument to zlib.decompress
Type:	enhancement	Stage:
Components:	Extension Modules	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, gregory.p.smith, kristjan.jonsson, martin.panter, nadeem.vawda, pitrou, r.david.murray, rbcollins
Priority:	normal	Keywords:	patch

Created on 2009-04-21 09:51 by kristjan.jonsson, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
zlibpatch.patch	kristjan.jonsson, 2009-04-21 09:51
zlibpatch.patch	kristjan.jonsson, 2009-04-21 16:13
zlibpatch2.patch	kristjan.jonsson, 2009-04-23 19:18

Messages (16)
msg86227 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2009-04-21 09:51
It can be useful to get whatever remaining data there exists in the decompression stream for further processing. This is currently possible by creating a zlib.decompressobj and accessing its unused_data member, but that is quite cumbersome. The attached patch adds a 'tail' keyword argument to zlib.decompress which causes the result to be a 2-tuple, with the first part containing the decompressed data, and the second containing any remaining data
msg86240 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-04-21 15:43
As usual the patch should come with some unit tests. Also, it seems you don't test the return value of PyString_FromStringAndSize before building your tuple.
msg86241 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2009-04-21 15:46
Unittests coming up. It is not necessary to test the return value, it is done by Py_BuildValue(). I think it is common to do it this way, but if You think explicit testing is better, then that's no problem.
msg86244 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-04-21 15:57
> It is not necessary to test the return value, it is done by > Py_BuildValue(). Sorry, it's ok then.
msg86247 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2009-04-21 16:13
New patch, with explicit error test and tests in test_zlib.py, both for the "tail" parameter and the hitherto untested "unused_data" attribute. I'm not completely happy with the name of the argument, "tail", any suggestions? Also, since I made compress a "keywords" method, should I apply that to the rest of the functions?
msg86284 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2009-04-22 10:26
Note, I thought of the "tail" argument to reflect the "unused_data" member of the decompressobj(). But a more useful way is actually this: result, offset = zlib.decompress(buffer, offset=0) if offset<len(buffer): result2, offset = zlib.decompress(buffer, offset=offset) This avoids data copying as much as possible. The presence of a non- default offset argument, would trigger the "tuple" return value. We could add the same argument to decompressobj.decompress, and add "unused_offset" member there for symmetry. Any thoughts?
msg86373 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2009-04-23 19:18
I've added a new patch, which instead uses an "offset" parameter. This is more natural, perhaps, and also avoids data copying. Complete with tests and documentation update.
msg87550 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2009-05-10 21:05
Overall, this looks good. Some mostly minor comments in this review. http://codereview.appspot.com/63060/diff/1/2 File Doc/library/zlib.rst (right): http://codereview.appspot.com/63060/diff/1/2#newcode136 Line 136: When specified, it will cause the function's return value to be a (uncompressed, offset) how about this wording: ..."to be a tuple of (uncompressed, offset), with the"... http://codereview.appspot.com/63060/diff/1/2#newcode142 Line 142: Added the offset argument missing . at the end of the sentence. Also, mention that this function now accepts keyword arguments. http://codereview.appspot.com/63060/diff/1/3 File Lib/test/test_zlib.py (right): http://codereview.appspot.com/63060/diff/1/3#newcode110 Line 110: c = zlib.compress(a)+zlib.compress(b) please surround the + with spaces for readability and consistency with other code in the file.\ http://codereview.appspot.com/63060/diff/1/4 File Modules/zlibmodule.c (right): http://codereview.appspot.com/63060/diff/1/4#newcode193 Line 193: "the start position in the string to start decompresion and, if spceified,\n" typo: specified. http://codereview.appspot.com/63060/diff/1/4#newcode296 Line 296: offset = length-zst.avail_in+offset; leave spaces around your - and + operators and this becomes easier to read. http://codereview.appspot.com/63060
msg87551 - (view)	Author: Robert Collins (rbcollins) *	Date: 2009-05-10 21:14
Maybe I'm missing something, but isn't the offset parameter just another way of spelling buffer(input, offset)? I like the avoiding of copying, just wondering if having a magic parameter to get a tuple is really better than (say) result, used = zlib.decompress2(source) if used<len(source): result2, used = zlib.decompress2(buffer(source, used)) This changes two things. Rather than a magic parameter it adds a new function which always returns tuples (meaning that you don't need magic to make sure users don't accidentally pass offset=None and get a single result when they wanted a tuple), and uses the built in buffer facility to avoid copying rather than relying on delayed slicing.
msg87567 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2009-05-11 08:56
Well, yes and no. I agree that it is annoying to have to add another parameter to get a returning tuple. But you have to weigh that against adding another conveinence API. On the other hand, I think the offset argument is quite convenient. If we were to add a "decompress2" function, I would like to keep the offset because it removes the extra overhead of having to create a temporary buffer object, at no extra cost. Also, note that your example code becomes more complex, for a loop. You would need to: results = [] used =0 while used<len(source): r, u = zlib.decompress2(buffer(source, used)) used += u results.append(r) as opposed to: r, used = zlib.decompress2(source, used) results.append(r) You have to make sure you add the buffer's hidden offset to the output offset. There is a third option. The "offset" could be a list containg the offset in its 0th position, a sort of "byref" passing, but that is not in the general Python spirit, is it? This would give us the loop: o = [0] while len(source)>o[0] results.append(zlib.decompress(source, offset=offsetlist))
msg87568 - (view)	Author: Robert Collins (rbcollins) *	Date: 2009-05-11 09:02
Well, I think its relatively uncommon to be doing such a loop with a static buffer anyway - often you'll instead be reading from disk or a network stream; if we could make those cases simpler and avoid copying that would be great. Anyhow, no strong opinions here, just saw it going by before added my 2c. -Rob
msg110993 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-21 02:47
This has been reviewed see msg87550, can we get this into 3.2?
msg111017 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-07-21 07:25
The function returns different kind of data depending on the value of the last parameter. I don't like it at all. Why is decompressobj not enough?
msg113415 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-08-09 13:53
decompressobj is indeed "enough". But if you are doing a lot of this (decompressing chunks), then using the unused_data, as it is, involves a lot of copying. If there were a "unused_data_pos" or some equivalent, then it would be possible to continue decommpressing with buffer(olddata, unused_data_pos), without copying the source data. The point of the "offset" keyword argument was to have a way to get this end position also without having an explicit decompressobj. I agree that having the result type change to a tuple is not good. So, a suggestion: 1) Add the unused_data_pos to the decompressobj. 2) (opotional) Add the automatic retrieval of this with a decompress_ex method, for convenience, making it usable without creating a decompressobj 3) (optional) Add the "offset" kw argument to decompress and (decompress_ex) making the creation of a temporary buffer() object unnecessary.
msg229045 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-10-10 23:24
Moving this from commit review back to no selection, since there doesn't yet seem to be an agreement on an API.
msg234040 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-01-14 22:27
A different test case for “unused_data” attribute was added in 2012 for Issue 16350, so that part is no longer needed. If this feature goes ahead, it might be nice to also update the bzip and LZMA modules for consistency. In Python 3, the equivalent of the buffer() option would look like this, assuming the memory view is in byte format: zlib.decompress(memoryview(source)[unused_offset:]) Another option would be to copy some paint from the struct.unpack_from() bikeshed: [data, offset] = zlib.decompress_from(buffer, offset)

History
Date	User	Action	Args
2022-04-11 14:56:48	admin	set	github: 50054
2015-01-14 22:27:13	martin.panter	set	nosy: + martin.panter messages: + msg234040
2014-10-10 23:24:51	r.david.murray	set	versions: + Python 3.5, - Python 3.2 nosy: + r.david.murray messages: + msg229045 stage: commit review ->
2014-02-03 19:14:31	BreamoreBoy	set	nosy: - BreamoreBoy
2012-01-26 13:07:45	nadeem.vawda	set	nosy: + nadeem.vawda
2010-08-09 13:53:55	kristjan.jonsson	set	messages: + msg113415
2010-07-21 07:25:33	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg111017
2010-07-21 02:47:37	BreamoreBoy	set	versions: + Python 3.2, - Python 3.1, Python 2.7 nosy: + BreamoreBoy messages: + msg110993 stage: patch review -> commit review
2009-05-11 09:02:04	rbcollins	set	messages: + msg87568
2009-05-11 08:56:24	kristjan.jonsson	set	messages: + msg87567
2009-05-10 21:14:40	rbcollins	set	nosy: + rbcollins messages: + msg87551
2009-05-10 21:05:29	gregory.p.smith	set	messages: + msg87550
2009-05-10 20:49:45	gregory.p.smith	set	nosy: + gregory.p.smith
2009-05-08 09:45:40	kristjan.jonsson	set	title: Add a "tail" argument to zlib.decompress -> Add an 'offset' argument to zlib.decompress
2009-04-23 19:18:40	kristjan.jonsson	set	files: + zlibpatch2.patch messages: + msg86373
2009-04-22 10:26:02	kristjan.jonsson	set	messages: + msg86284
2009-04-21 16:13:14	kristjan.jonsson	set	files: + zlibpatch.patch messages: + msg86247
2009-04-21 15:57:20	pitrou	set	messages: + msg86244
2009-04-21 15:46:42	kristjan.jonsson	set	messages: + msg86241
2009-04-21 15:43:15	pitrou	set	priority: normal versions: + Python 3.1 nosy: + pitrou messages: + msg86240 stage: patch review
2009-04-21 09:51:41	kristjan.jonsson	create