classification
Title: zlib.Decompress.decompress() after EOF discards existing value of unused_data
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, nadeem.vawda, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2012-10-28 16:25 by nadeem.vawda, last changed 2012-11-11 09:49 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
zlib_unused_data_test.py nadeem.vawda, 2012-10-28 16:25
zlib_accum_unused_data.patch serhiy.storchaka, 2012-10-28 18:03 review
zlib_accum_unused_data_2.patch serhiy.storchaka, 2012-10-29 12:11 review
zlib_unused_data_3.patch serhiy.storchaka, 2012-11-07 06:23 review
Messages (18)
msg174056 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-10-28 16:25
From issue 5210:

amaury.forgeotdarc wrote:
> Hm, I tried a modified version of your first test, and I found another
> problem with the current zlib library;
> starting with the input:
> x = x1 + x2 + HAMLET_SCENE    # both compressed and uncompressed data
>
> The following scenario is OK:
> dco.decompress(x) # returns HAMLET_SCENE
> dco.unused_data   # returns HAMLET_SCENE
>
> But this one:
> for c in x:
>     dco.decompress(c) # will return HAMLET_SCENE, in several pieces
> dco.unused_data   # holds only one byte, the last value of c!
>
> This is a bug IMO: unused_data should accumulate all the extra uncompressed 
> data.

Ideally, I would prefer to raise an EOFError if decompress() is called
after end-of-stream is reached (for consistency with BZ2Decompressor).
However, accumulating the data in unused_data is closer to being backward-
compatible, so it's probably the better approach to take.
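On an interpreter with the fix applied, the scenario above can be sketched as follows (the alphabet/digits test data is an arbitrary stand-in for HAMLET_SCENE; on unfixed versions unused_data would end up holding only the last byte fed in):

```python
import zlib

# Compressed data followed by trailing uncompressed bytes, fed to the
# decompressor one byte at a time (as in the quoted loop over c in x).
stream = zlib.compress(b"abcdefghijklmnopqrstuvwxyz") + b"0123456789"

dco = zlib.decompressobj()
out = b"".join(dco.decompress(stream[i:i + 1]) for i in range(len(stream)))

assert out == b"abcdefghijklmnopqrstuvwxyz"
# With the fix, every byte seen after end-of-stream is accumulated;
# before the fix, unused_data held only the last byte.
assert dco.unused_data == b"0123456789"
```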
msg174058 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-28 17:13
Yes, accumulating the data will be backward-compatible.

But what if we add a special flag for zlib.decompressobj() that makes the decoder compatible with bz2 and lzma?  I.e.:

1) unconsumed_tail is always empty (the unconsumed data is accumulated in an internal buffer, so there is no need to manually prepend it to the next chunk of data).  Or even non-existent.
2) EOFError is raised if decompress() is called after end-of-stream is reached.

Of course, it is a new feature.
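For comparison, a minimal sketch of the BZ2Decompressor behavior referenced here (the test data is arbitrary):

```python
import bz2

# After end-of-stream, BZ2Decompressor.decompress() raises EOFError
# instead of silently accepting further data.
dco = bz2.BZ2Decompressor()
result = dco.decompress(bz2.compress(b"hello"))
assert result == b"hello"
assert dco.eof

raised = False
try:
    dco.decompress(b"more")
except EOFError:
    raised = True
assert raised
```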
msg174059 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-10-28 17:29
Interesting idea, but I'm not sure it would be worth the effort. It would
make the code and API more complicated without really helping users, and
would be an added maintenance burden.
msg174064 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-28 18:03
Here is a patch for your suggestion.
msg174065 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-28 18:10
Actually, the current code contains a bug.  If memory allocation for unused_data fails, unused_data becomes NULL, and accessing the attribute can then crash.  The patch fixes this.
msg174109 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-29 12:11
From Nadeem Vawda's mail:

> I wasn't suggesting that you try to resize the existing unused_data
> object; I agree that this would be a bad idea. What I was thinking of
> was something like this:
> 
>     size_t old_unused_size = PyBytes_GET_SIZE(self->unused_data);
>     size_t new_unused_size = old_unused_size + self->zst.avail_in;
>     PyObject *new_unused_data = PyBytes_FromStringAndSize(NULL, 0);
>     if (_PyBytes_Resize(&new_unused_data, new_unused_size) < 0) {
>         Py_DECREF(RetVal);
>         goto error;
>     }
>     memcpy(PyBytes_AS_STRING(new_unused_data),
>            PyBytes_AS_STRING(self->unused_data),
>            old_unused_size);
>     memcpy(PyBytes_AS_STRING(new_unused_data) + old_unused_size,
>            self->zst.next_in, self->zst.avail_in);
>     Py_DECREF(self->unused_data);
>     self->unused_data = new_unused_data;
> 
> Basically, hacking around the fact that (AFAICT) there's no direct way to
> allocate an uninitialized bytes object. That way we only do one allocation,
> and only copy the new data once.

This hack is not needed: if the first argument of PyBytes_FromStringAndSize() is NULL, the contents of the bytes object are left uninitialized.  Moreover, the hack is invalid, because the empty bytes object is shared and can't be resized.

Patch updated.  The concatenation is optimized as you suggested, and one bug is fixed.
msg174131 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-29 16:50
There is one more memory management bug (with unconsumed_tail).

flush() does not update unconsumed_tail and unused_data.

>>> import zlib
>>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
>>> dco = zlib.decompressobj()
>>> dco.decompress(x, 1)
b'a'
>>> dco.flush()
b'bcdefghijklmnopqrstuvwxyz'
>>> dco.unconsumed_tail
b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
>>> dco.unused_data
b''

What should unconsumed_tail be equal to after EOF? b'' or unused_data?

>>> import zlib
>>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
>>> dco = zlib.decompressobj()
>>> dco.decompress(x)
b'abcdefghijklmnopqrstuvwxyz'
>>> dco.flush()
b''
>>> dco.unconsumed_tail
b''
>>> dco.unused_data
b'0123456789'
>>> dco = zlib.decompressobj()
>>> dco.decompress(x, 1000)
b'abcdefghijklmnopqrstuvwxyz'
>>> dco.flush()
b''
>>> dco.unconsumed_tail
b'0123456789'
>>> dco.unused_data
b'0123456789'
msg174845 - (view) Author: Roundup Robot (python-dev) Date: 2012-11-04 23:44
New changeset be882735e0b6 by Nadeem Vawda in branch '3.2':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/be882735e0b6

New changeset 4182119c3f0a by Nadeem Vawda in branch '3.3':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/4182119c3f0a

New changeset bcdb3836be72 by Nadeem Vawda in branch 'default':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/bcdb3836be72
msg174848 - (view) Author: Roundup Robot (python-dev) Date: 2012-11-04 23:57
New changeset 1c54def5947c by Nadeem Vawda in branch '2.7':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/1c54def5947c
msg174849 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-11-04 23:58
Fixed. Thanks for the patch!


> This hacking is not needed, if first argument of PyBytes_FromStringAndSize()
> is NULL, the contents of the bytes object are uninitialized.

Oh, cool. I didn't know about that.


> What should unconsumed_tail be equal after EOF? b'' or unused_data?

Definitely b''. unconsumed_tail is meant to hold compressed data that should
be passed in to the next call to decompress(). If we are at EOF, then
decompress() should not be called again, and so it would be misleading to have
unconsumed_tail be non-empty.
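The semantics described here can be sketched as a decompression loop on a current (fixed) Python; the 64-byte max_length and the test data are arbitrary choices:

```python
import zlib

# Decompress with a max_length limit; input that could not be consumed is
# saved in unconsumed_tail and must be passed back on the next call.
plain = b"abcdefghijklmnopqrstuvwxyz" * 100
compressed = zlib.compress(plain)

dco = zlib.decompressobj()
chunks = []
buf = compressed
while buf and not dco.eof:
    chunks.append(dco.decompress(buf, 64))  # at most 64 bytes of output
    buf = dco.unconsumed_tail

assert b"".join(chunks) == plain
assert dco.eof
assert dco.unconsumed_tail == b""  # empty once end-of-stream is reached
```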
msg174854 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-11-05 00:31
> flush() does not update unconsumed_tail and unused_data.
>
> >>> import zlib
> >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
> >>> dco = zlib.decompressobj()
> >>> dco.decompress(x, 1)
> b'a'
> >>> dco.flush()
> b'bcdefghijklmnopqrstuvwxyz'
> >>> dco.unconsumed_tail
> b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
> >>> dco.unused_data
> b''

I see another bug here - described in issue 16411.
msg174910 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-05 13:58
These were not idle questions.  I wrote the patch, and I had to know which behavior is correct.

Here's the patch.  It fixes a potential memory bug (unconsumed_tail was set to NULL on out-of-memory), resets unconsumed_tail to b'' after EOF, and updates unconsumed_tail and unused_data in flush().

I slightly changed the test for the previous case (it was O(N^2) for large data).  I checked it on unfixed Python, and the bug was caught.
msg175031 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-11-06 23:34
> These were not idle questions.  I wrote the patch, and I had to know
> what behavior is correct.

Ah, sorry. I assumed you were going to submit a separate patch to fix the
unconsumed_tail issues.

> Here's the patch.  It fixes potential memory bug (unconsumed_tail sets
> to NULL in case of out of memory), resets the unconsumed_tail to b''
> after EOF, updates unconsumed_tail and unused_data in flush().

Did you perhaps forget to attach the patch? The only ones I see are those
that you uploaded last week.
msg175045 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-07 06:23
Sorry. Here is a patch.
msg175116 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-07 18:01
Also note that your variant of the overflow check causes undefined behavior.  Python is gradually getting rid of such code.
msg175308 - (view) Author: Roundup Robot (python-dev) Date: 2012-11-11 01:26
New changeset 94256d7804b8 by Nadeem Vawda in branch '2.7':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/94256d7804b8

New changeset 85886268c40b by Nadeem Vawda in branch '3.2':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/85886268c40b

New changeset 654eb5163ba1 by Nadeem Vawda in branch '3.3':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/654eb5163ba1

New changeset 4440e45c10f9 by Nadeem Vawda in branch 'default':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/4440e45c10f9
msg175309 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-11-11 01:37
New patch committed. Once again, thanks for all your work on this issue!
msg175336 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-11 09:49
I see you adjusted the patch quite a bit.  It looks good, thank you.
History
Date User Action Args
2012-11-11 09:49:38serhiy.storchakasetmessages: + msg175336
2012-11-11 01:37:56nadeem.vawdasetstatus: open -> closed

messages: + msg175309
stage: patch review -> resolved
2012-11-11 01:26:27python-devsetmessages: + msg175308
2012-11-07 18:01:38serhiy.storchakasetmessages: + msg175116
2012-11-07 06:23:30serhiy.storchakasetfiles: + zlib_unused_data_3.patch

messages: + msg175045
2012-11-06 23:34:12nadeem.vawdasetmessages: + msg175031
2012-11-05 13:58:46serhiy.storchakasetstatus: closed -> open

messages: + msg174910
stage: resolved -> patch review
2012-11-05 00:31:40nadeem.vawdasetmessages: + msg174854
2012-11-04 23:58:58nadeem.vawdasetstatus: open -> closed
resolution: fixed
messages: + msg174849

stage: patch review -> resolved
2012-11-04 23:57:30python-devsetmessages: + msg174848
2012-11-04 23:44:27python-devsetnosy: + python-dev
messages: + msg174845
2012-10-29 16:50:50serhiy.storchakasetmessages: + msg174131
2012-10-29 12:11:33serhiy.storchakasetfiles: + zlib_accum_unused_data_2.patch

messages: + msg174109
2012-10-28 18:10:38serhiy.storchakasetmessages: + msg174065
2012-10-28 18:03:23serhiy.storchakasetstage: needs patch -> patch review
2012-10-28 18:03:11serhiy.storchakasetfiles: + zlib_accum_unused_data.patch
keywords: + patch
messages: + msg174064
2012-10-28 17:29:12nadeem.vawdasetmessages: + msg174059
2012-10-28 17:13:52serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg174058
2012-10-28 16:25:19nadeem.vawdacreate