classification
Title: Add an 'offset' argument to zlib.decompress
Type: enhancement Stage:
Components: Extension Modules Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, gregory.p.smith, kristjan.jonsson, martin.panter, nadeem.vawda, pitrou, r.david.murray, rbcollins
Priority: normal Keywords: patch

Created on 2009-04-21 09:51 by kristjan.jonsson, last changed 2015-01-14 22:27 by martin.panter.

Files
File name Uploaded Description Edit
zlibpatch.patch kristjan.jonsson, 2009-04-21 09:51
zlibpatch.patch kristjan.jonsson, 2009-04-21 16:13
zlibpatch2.patch kristjan.jonsson, 2009-04-23 19:18
Messages (16)
msg86227 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-21 09:51
It can be useful to get whatever remaining data there exists in the 
decompression stream for further processing.  This is currently possible 
by creating a zlib.decompressobj and accessing its unused_data member, 
but that is quite cumbersome.
The attached patch adds a 'tail' keyword argument to zlib.decompress 
which causes the result to be a 2-tuple, with the first part containing 
the decompressed data, and the second containing any remaining data
msg86240 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-21 15:43
As usual the patch should come with some unit tests.
Also, it seems you don't test the return value of
PyString_FromStringAndSize before building your tuple.
msg86241 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-21 15:46
Unittests coming up.
It is not necessary to test the return value, it is done by 
Py_BuildValue().  I think it is common to do it this way, but if You 
think explicit testing is better, then that's no problem.
msg86244 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-21 15:57
> It is not necessary to test the return value, it is done by 
> Py_BuildValue().

Sorry, it's ok then.
msg86247 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-21 16:13
New patch, with explicit error test and tests in test_zlib.py, both for 
the "tail" parameter and the hitherto untested "unused_data" attribute.

I'm not completely happy with the name of the argument, "tail", any 
suggestions?  Also, since I made compress a "keywords" method, should I 
apply that to the rest of the functions?
msg86284 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-22 10:26
Note, I thought of the "tail" argument to reflect the "unused_data" 
member of the decompressobj().  But a more useful way is actually this:
result, offset = zlib.decompress(buffer, offset=0)
if offset<len(buffer):
    result2, offset = zlib.decompress(buffer, offset=offset)

This avoids data copying as much as possible.  The presence of a non-
default offset argument, would trigger the "tuple" return value.

We could add the same argument to decompressobj.decompress, and 
add "unused_offset" member there for symmetry.

Any thoughts?
msg86373 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-23 19:18
I've added a new patch, which instead uses an "offset" parameter.  This 
is more natural, perhaps, and also avoids data copying.  Complete with 
tests and documentation update.
msg87550 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-05-10 21:05
Overall, this looks good.  Some mostly minor comments in this review.

http://codereview.appspot.com/63060/diff/1/2
File Doc/library/zlib.rst (right):

http://codereview.appspot.com/63060/diff/1/2#newcode136
Line 136: When specified, it will cause the function's return value to
be a (uncompressed, offset)
how about this wording:

..."to be a tuple of (uncompressed, offset), with the"...

http://codereview.appspot.com/63060/diff/1/2#newcode142
Line 142: Added the *offset* argument
missing . at the end of the sentence.  Also, mention that this function
now accepts keyword arguments.

http://codereview.appspot.com/63060/diff/1/3
File Lib/test/test_zlib.py (right):

http://codereview.appspot.com/63060/diff/1/3#newcode110
Line 110: c = zlib.compress(a)+zlib.compress(b)
please surround the + with spaces for readability and consistency with
other code in the file.\

http://codereview.appspot.com/63060/diff/1/4
File Modules/zlibmodule.c (right):

http://codereview.appspot.com/63060/diff/1/4#newcode193
Line 193: "the start position in the string to start decompresion and,
if spceified,\n"
typo: specified.

http://codereview.appspot.com/63060/diff/1/4#newcode296
Line 296: offset = length-zst.avail_in+offset;
leave spaces around your - and + operators and this becomes easier to
read.

http://codereview.appspot.com/63060
msg87551 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2009-05-10 21:14
Maybe I'm missing something, but isn't the offset parameter just another
way of spelling
buffer(input, offset)?

I like the avoiding of copying, just wondering if having a magic
parameter to get a tuple is really better than (say)
result, used = zlib.decompress2(source)
if used<len(source):
    result2, used = zlib.decompress2(buffer(source, used))

This changes two things. Rather than a magic parameter it adds a new
function which always returns tuples (meaning that you don't need magic
to make sure users don't accidentally pass offset=None and get a single
result when they wanted a tuple), and uses the built in buffer facility
to avoid copying rather than relying on delayed slicing.
msg87567 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-05-11 08:56
Well, yes and no.
I agree that it is annoying to have to add another parameter to get a 
returning tuple.  But you have to weigh that against adding another 
conveinence API.
On the other hand, I think the offset argument is quite convenient.  If 
we were to add a "decompress2" function, I would like to keep the 
offset because it removes the extra overhead of having to create a 
temporary buffer object, at no extra cost.

Also, note that your example code becomes more complex, for a loop.  
You would need to:
results = []
used =0
while used<len(source):
    r, u = zlib.decompress2(buffer(source, used))
    used += u
    results.append(r)

as opposed to:
    r, used = zlib.decompress2(source, used)
    results.append(r)

You have to make sure you add the buffer's hidden offset to the output 
offset.

There is a third option.  The "offset" could be a list containg the 
offset in its 0th position, a sort of "byref" passing, but that is not 
in the general Python spirit, is it?

This would give us the loop:
o = [0]
while len(source)>o[0]
    results.append(zlib.decompress(source, offset=offsetlist))
msg87568 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2009-05-11 09:02
Well, I think its relatively uncommon to be doing such a loop with a
static buffer anyway - often you'll instead be reading from disk or a
network stream; if we could make those cases simpler and avoid copying
that would be great.

Anyhow, no strong opinions here, just saw it going by before added my
2c.

-Rob
msg110993 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-21 02:47
This has been reviewed see msg87550, can we get this into 3.2?
msg111017 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-21 07:25
The function returns different kind of data depending on the value of the last parameter. I don't like it at all.

Why is decompressobj not enough?
msg113415 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2010-08-09 13:53
decompressobj is indeed "enough".  But if you are doing a lot of this (decompressing chunks), then using the unused_data, as it is, involves a lot of copying.
If there were a "unused_data_pos" or some equivalent, then it would be possible to continue decommpressing with buffer(olddata, unused_data_pos), without copying the source data.

The point of the "offset" keyword argument was to have a way to get this end position also without having an explicit decompressobj.

I agree that having the result type change to a tuple is not good.

So, a suggestion:
1) Add the unused_data_pos to the decompressobj.
2) (opotional) Add the automatic retrieval of this with a decompress_ex method, for convenience, making it usable without creating a decompressobj
3) (optional) Add the "offset" kw argument to decompress and (decompress_ex) making the creation of a temporary buffer() object unnecessary.
msg229045 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-10 23:24
Moving this from commit review back to no selection, since there doesn't yet seem to be an agreement on an API.
msg234040 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-14 22:27
A different test case for “unused_data” attribute was added in 2012 for Issue 16350, so that part is no longer needed.

If this feature goes ahead, it might be nice to also update the bzip and LZMA modules for consistency.

In Python 3, the equivalent of the buffer() option would look like this, assuming the memory view is in byte format:

zlib.decompress(memoryview(source)[unused_offset:])

Another option would be to copy some paint from the struct.unpack_from() bikeshed:

[data, offset] = zlib.decompress_from(buffer, offset)
History
Date User Action Args
2015-01-14 22:27:13martin.pantersetnosy: + martin.panter
messages: + msg234040
2014-10-10 23:24:51r.david.murraysetversions: + Python 3.5, - Python 3.2
nosy: + r.david.murray

messages: + msg229045

stage: commit review ->
2014-02-03 19:14:31BreamoreBoysetnosy: - BreamoreBoy
2012-01-26 13:07:45nadeem.vawdasetnosy: + nadeem.vawda
2010-08-09 13:53:55kristjan.jonssonsetmessages: + msg113415
2010-07-21 07:25:33amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg111017
2010-07-21 02:47:37BreamoreBoysetversions: + Python 3.2, - Python 3.1, Python 2.7
nosy: + BreamoreBoy

messages: + msg110993

stage: patch review -> commit review
2009-05-11 09:02:04rbcollinssetmessages: + msg87568
2009-05-11 08:56:24kristjan.jonssonsetmessages: + msg87567
2009-05-10 21:14:40rbcollinssetnosy: + rbcollins
messages: + msg87551
2009-05-10 21:05:29gregory.p.smithsetmessages: + msg87550
2009-05-10 20:49:45gregory.p.smithsetnosy: + gregory.p.smith
2009-05-08 09:45:40kristjan.jonssonsettitle: Add a "tail" argument to zlib.decompress -> Add an 'offset' argument to zlib.decompress
2009-04-23 19:18:40kristjan.jonssonsetfiles: + zlibpatch2.patch

messages: + msg86373
2009-04-22 10:26:02kristjan.jonssonsetmessages: + msg86284
2009-04-21 16:13:14kristjan.jonssonsetfiles: + zlibpatch.patch

messages: + msg86247
2009-04-21 15:57:20pitrousetmessages: + msg86244
2009-04-21 15:46:42kristjan.jonssonsetmessages: + msg86241
2009-04-21 15:43:15pitrousetpriority: normal
versions: + Python 3.1
nosy: + pitrou

messages: + msg86240

stage: patch review
2009-04-21 09:51:41kristjan.jonssoncreate