classification
Title: Add an 'offset' argument to zlib.decompress
Type: enhancement Stage: commit review
Components: Extension Modules Versions: Python 3.2
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, amaury.forgeotdarc, gregory.p.smith, krisvale, nadeem.vawda, pitrou, rbcollins
Priority: normal Keywords: patch

Created on 2009-04-21 09:51 by krisvale, last changed 2012-01-26 13:07 by nadeem.vawda.

Files
File name Uploaded Description Edit
zlibpatch.patch krisvale, 2009-04-21 09:51 review
zlibpatch.patch krisvale, 2009-04-21 16:13 review
zlibpatch2.patch krisvale, 2009-04-23 19:18 review
Messages (14)
msg86227 - (view) Author: Kristján Valur Jónsson (krisvale) * Date: 2009-04-21 09:51
It can be useful to get whatever remaining data there exists in the 
decompression stream for further processing.  This is currently possible 
by creating a zlib.decompressobj and accessing its unused_data member, 
but that is quite cumbersome.
The attached patch adds a 'tail' keyword argument to zlib.decompress 
which causes the result to be a 2-tuple, with the first part containing 
the decompressed data, and the second containing any remaining data
msg86240 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-21 15:43
As usual the patch should come with some unit tests.
Also, it seems you don't test the return value of
PyString_FromStringAndSize before building your tuple.
msg86241 - (view) Author: Kristján Valur Jónsson (krisvale) * Date: 2009-04-21 15:46
Unittests coming up.
It is not necessary to test the return value, it is done by 
Py_BuildValue().  I think it is common to do it this way, but if You 
think explicit testing is better, then that's no problem.
msg86244 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-21 15:57
> It is not necessary to test the return value, it is done by 
> Py_BuildValue().

Sorry, it's ok then.
msg86247 - (view) Author: Kristján Valur Jónsson (krisvale) * Date: 2009-04-21 16:13
New patch, with explicit error test and tests in test_zlib.py, both for 
the "tail" parameter and the hitherto untested "unused_data" attribute.

I'm not completely happy with the name of the argument, "tail", any 
suggestions?  Also, since I made compress a "keywords" method, should I 
apply that to the rest of the functions?
msg86284 - (view) Author: Kristján Valur Jónsson (krisvale) * Date: 2009-04-22 10:26
Note, I thought of the "tail" argument to reflect the "unused_data" 
member of the decompressobj().  But a more useful way is actually this:
result, offset = zlib.decompress(buffer, offset=0)
if offset<len(buffer):
    result2, offset = zlib.decompress(buffer, offset=offset)

This avoids data copying as much as possible.  The presence of a non-
default offset argument, would trigger the "tuple" return value.

We could add the same argument to decompressobj.decompress, and 
add "unused_offset" member there for symmetry.

Any thoughts?
msg86373 - (view) Author: Kristján Valur Jónsson (krisvale) * Date: 2009-04-23 19:18
I've added a new patch, which instead uses an "offset" parameter.  This 
is more natural, perhaps, and also avoids data copying.  Complete with 
tests and documentation update.
msg87550 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-05-10 21:05
Overall, this looks good.  Some mostly minor comments in this review.

http://codereview.appspot.com/63060/diff/1/2
File Doc/library/zlib.rst (right):

http://codereview.appspot.com/63060/diff/1/2#newcode136
Line 136: When specified, it will cause the function's return value to
be a (uncompressed, offset)
how about this wording:

..."to be a tuple of (uncompressed, offset), with the"...

http://codereview.appspot.com/63060/diff/1/2#newcode142
Line 142: Added the *offset* argument
missing . at the end of the sentence.  Also, mention that this function
now accepts keyword arguments.

http://codereview.appspot.com/63060/diff/1/3
File Lib/test/test_zlib.py (right):

http://codereview.appspot.com/63060/diff/1/3#newcode110
Line 110: c = zlib.compress(a)+zlib.compress(b)
please surround the + with spaces for readability and consistency with
other code in the file.\

http://codereview.appspot.com/63060/diff/1/4
File Modules/zlibmodule.c (right):

http://codereview.appspot.com/63060/diff/1/4#newcode193
Line 193: "the start position in the string to start decompresion and,
if spceified,\n"
typo: specified.

http://codereview.appspot.com/63060/diff/1/4#newcode296
Line 296: offset = length-zst.avail_in+offset;
leave spaces around your - and + operators and this becomes easier to
read.

http://codereview.appspot.com/63060
msg87551 - (view) Author: Robert Collins (rbcollins) Date: 2009-05-10 21:14
Maybe I'm missing something, but isn't the offset parameter just another
way of spelling
buffer(input, offset)?

I like the avoiding of copying, just wondering if having a magic
parameter to get a tuple is really better than (say)
result, used = zlib.decompress2(source)
if used<len(source):
    result2, used = zlib.decompress2(buffer(source, used))

This changes two things. Rather than a magic parameter it adds a new
function which always returns tuples (meaning that you don't need magic
to make sure users don't accidentally pass offset=None and get a single
result when they wanted a tuple), and uses the built in buffer facility
to avoid copying rather than relying on delayed slicing.
msg87567 - (view) Author: Kristján Valur Jónsson (krisvale) * Date: 2009-05-11 08:56
Well, yes and no.
I agree that it is annoying to have to add another parameter to get a 
returning tuple.  But you have to weigh that against adding another 
conveinence API.
On the other hand, I think the offset argument is quite convenient.  If 
we were to add a "decompress2" function, I would like to keep the 
offset because it removes the extra overhead of having to create a 
temporary buffer object, at no extra cost.

Also, note that your example code becomes more complex, for a loop.  
You would need to:
results = []
used =0
while used<len(source):
    r, u = zlib.decompress2(buffer(source, used))
    used += u
    results.append(r)

as opposed to:
    r, used = zlib.decompress2(source, used)
    results.append(r)

You have to make sure you add the buffer's hidden offset to the output 
offset.

There is a third option.  The "offset" could be a list containg the 
offset in its 0th position, a sort of "byref" passing, but that is not 
in the general Python spirit, is it?

This would give us the loop:
o = [0]
while len(source)>o[0]
    results.append(zlib.decompress(source, offset=offsetlist))
msg87568 - (view) Author: Robert Collins (rbcollins) Date: 2009-05-11 09:02
Well, I think its relatively uncommon to be doing such a loop with a
static buffer anyway - often you'll instead be reading from disk or a
network stream; if we could make those cases simpler and avoid copying
that would be great.

Anyhow, no strong opinions here, just saw it going by before added my
2c.

-Rob
msg110993 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-21 02:47
This has been reviewed see msg87550, can we get this into 3.2?
msg111017 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-21 07:25
The function returns different kind of data depending on the value of the last parameter. I don't like it at all.

Why is decompressobj not enough?
msg113415 - (view) Author: Kristján Valur Jónsson (krisvale) * Date: 2010-08-09 13:53
decompressobj is indeed "enough".  But if you are doing a lot of this (decompressing chunks), then using the unused_data, as it is, involves a lot of copying.
If there were a "unused_data_pos" or some equivalent, then it would be possible to continue decommpressing with buffer(olddata, unused_data_pos), without copying the source data.

The point of the "offset" keyword argument was to have a way to get this end position also without having an explicit decompressobj.

I agree that having the result type change to a tuple is not good.

So, a suggestion:
1) Add the unused_data_pos to the decompressobj.
2) (opotional) Add the automatic retrieval of this with a decompress_ex method, for convenience, making it usable without creating a decompressobj
3) (optional) Add the "offset" kw argument to decompress and (decompress_ex) making the creation of a temporary buffer() object unnecessary.
History
Date User Action Args
2012-01-26 13:07:45nadeem.vawdasetnosy: + nadeem.vawda
2010-08-09 13:53:55krisvalesetmessages: + msg113415
2010-07-21 07:25:33amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg111017
2010-07-21 02:47:37BreamoreBoysetversions: + Python 3.2, - Python 3.1, Python 2.7
nosy: + BreamoreBoy

messages: + msg110993

stage: patch review -> commit review
2009-05-11 09:02:04rbcollinssetmessages: + msg87568
2009-05-11 08:56:24krisvalesetmessages: + msg87567
2009-05-10 21:14:40rbcollinssetnosy: + rbcollins
messages: + msg87551
2009-05-10 21:05:29gregory.p.smithsetmessages: + msg87550
2009-05-10 20:49:45gregory.p.smithsetnosy: + gregory.p.smith
2009-05-08 09:45:40krisvalesettitle: Add a "tail" argument to zlib.decompress -> Add an 'offset' argument to zlib.decompress
2009-04-23 19:18:40krisvalesetfiles: + zlibpatch2.patch

messages: + msg86373
2009-04-22 10:26:02krisvalesetmessages: + msg86284
2009-04-21 16:13:14krisvalesetfiles: + zlibpatch.patch

messages: + msg86247
2009-04-21 15:57:20pitrousetmessages: + msg86244
2009-04-21 15:46:42krisvalesetmessages: + msg86241
2009-04-21 15:43:15pitrousetpriority: normal
versions: + Python 3.1
nosy: + pitrou

messages: + msg86240

stage: patch review
2009-04-21 09:51:41krisvalecreate