msg174056 - (view) |
Author: Nadeem Vawda (nadeem.vawda) *  |
Date: 2012-10-28 16:25 |
From issue 5210:
amaury.forgeotdarc wrote:
> Hm, I tried a modified version of your first test, and I found another
> problem with the current zlib library;
> starting with the input:
> x = x1 + x2 + HAMLET_SCENE # both compressed and uncompressed data
>
> The following scenario is OK:
> dco.decompress(x) # returns HAMLET_SCENE
> dco.unused_data # returns HAMLET_SCENE
>
> But this one:
> for c in x:
> dco.decompress(x) # will return HAMLET_SCENE, in several pieces
> dco.unused_data # only one character, the last of (c in x)!
>
> This is a bug IMO: unused_data should accumulate all the extra uncompressed
> data.
Ideally, I would prefer to raise an EOFError if decompress() is called
after end-of-stream is reached (for consistency with BZ2Decompressor).
However, accumulating the data in unused_data is closer to being backward-
compatible, so it's probably the better approach to take.
|
msg174058 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-10-28 17:13 |
Yes, accumulating the data will be backward-compatible.
But what if add a special flag for zlib.decompress, which makes bz2 and lzma compatible decoder? I.e.:
1) unconsumed_tail is always empty (the unconsumed data accumulated in internal buffer, no need manually add it to next chunk of data). Or even non-existent.
2) EOFError raised if decompress() is called after end-of-stream is reached.
Of course, it is a new feature.
|
msg174059 - (view) |
Author: Nadeem Vawda (nadeem.vawda) *  |
Date: 2012-10-28 17:29 |
Interesting idea, but I'm not sure it would be worth the effort. It would
make the code and API more complicated, so it wouldn't really help users,
and would be an added maintenance burden.
|
msg174064 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-10-28 18:03 |
Here is a patch for your suggestion.
|
msg174065 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-10-28 18:10 |
Actually the current code contains a bug. If memory allocation for unused_data fails then unused_data will become NULL and using this properties can crash. The patch fixes this.
|
msg174109 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-10-29 12:11 |
From Nadeem Vawda's mail:
> I wasn't suggesting that you try to resize the existing unused_data
> object; I agree that this would be a bad idea. What I was thinking of
> was something like this:
>
> size_t old_unused_size = PyBytes_GET_SIZE(self->unused_data);
> size_t new_unused_size = old_unused_size + self->zst.avail_in;
> PyObject *new_unused_data = PyBytes_FromStringAndSize(NULL, 0);
> if (_PyBytes_Resize(&new_unused_data, new_unused_size) < 0) {
> Py_DECREF(RetVal);
> goto error;
> }
> memcpy(PyBytes_AS_STRING(new_unused_data),
> PyBytes_AS_STRING(self->unused_data),
> old_unused_size);
> memcpy(PyBytes_AS_STRING(new_unused_data) + old_unused_size,
> self->zst.next_in, self->zst.avail_in);
> Py_DECREF(self->unused_data);
> self->unused_data = new_unused_data;
>
> Basically, hacking around the fact that (AFAICT) there's no direct way to
> allocate an uninitialized bytes object. That way we only do one allocation,
> and only copy the new data once.
This hacking is not needed, if first argument of PyBytes_FromStringAndSize() is NULL, the contents of the bytes object are uninitialized. And more, this hacking is invalid, because empty bytes object shared and can't be resized.
Patch updated. The concatenation optimized as you suggested, one bug fixed.
|
msg174131 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-10-29 16:50 |
There is yet one memory management bug (with unconsumed_tail).
flush() does not update unconsumed_tail and unused_data.
>>> import zlib
>>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
>>> dco = zlib.decompressobj()
>>> dco.decompress(x, 1)
b'a'
>>> dco.flush()
b'bcdefghijklmnopqrstuvwxyz'
>>> dco.unconsumed_tail
b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
>>> dco.unused_data
b''
What should unconsumed_tail be equal after EOF? b'' or unused_data?
>>> import zlib
>>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
>>> dco = zlib.decompressobj()
>>> dco.decompress(x)
b'abcdefghijklmnopqrstuvwxyz'
>>> dco.flush()
b''
>>> dco.unconsumed_tail
b''
>>> dco.unused_data
b'0123456789'
>>> dco = zlib.decompressobj()
>>> dco.decompress(x, 1000)
b'abcdefghijklmnopqrstuvwxyz'
>>> dco.flush()
b''
>>> dco.unconsumed_tail
b'0123456789'
>>> dco.unused_data
b'0123456789'
|
msg174845 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-04 23:44 |
New changeset be882735e0b6 by Nadeem Vawda in branch '3.2':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/be882735e0b6
New changeset 4182119c3f0a by Nadeem Vawda in branch '3.3':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/4182119c3f0a
New changeset bcdb3836be72 by Nadeem Vawda in branch 'default':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/bcdb3836be72
|
msg174848 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-04 23:57 |
New changeset 1c54def5947c by Nadeem Vawda in branch '2.7':
Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF.
http://hg.python.org/cpython/rev/1c54def5947c
|
msg174849 - (view) |
Author: Nadeem Vawda (nadeem.vawda) *  |
Date: 2012-11-04 23:58 |
Fixed. Thanks for the patch!
> This hacking is not needed, if first argument of PyBytes_FromStringAndSize()
> is NULL, the contents of the bytes object are uninitialized.
Oh, cool. I didn't know about that.
> What should unconsumed_tail be equal after EOF? b'' or unused_data?
Definitely b''. unconsumed_tail is meant to hold compressed data that should
be passed in to the next call to decompress(). If we are at EOF, then
decompress() should not be called again, and so it would be misleading to have
unconsumed_tail be non-empty.
|
msg174854 - (view) |
Author: Nadeem Vawda (nadeem.vawda) *  |
Date: 2012-11-05 00:31 |
> flush() does not update unconsumed_tail and unused_data.
>
> >>> import zlib
> >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789'
> >>> dco = zlib.decompressobj()
> >>> dco.decompress(x, 1)
> b'a'
> >>> dco.flush()
> b'bcdefghijklmnopqrstuvwxyz'
> >>> dco.unconsumed_tail
> b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789'
> >>> dco.unused_data
> b''
I see another bug here - described in issue 16411.
|
msg174910 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-05 13:58 |
These were not idle questions. I wrote the patch, and I had to know what behavior is correct.
Here's the patch. It fixes potential memory bug (unconsumed_tail sets to NULL in case of out of memory), resets the unconsumed_tail to b'' after EOF, updates unconsumed_tail and unused_data in flush().
I a little changed test for the previous case (here was O(N^2) for large data). I checked it on non-fixed Python, bug was catched.
|
msg175031 - (view) |
Author: Nadeem Vawda (nadeem.vawda) *  |
Date: 2012-11-06 23:34 |
> These were not idle questions. I wrote the patch, and I had to know
> what behavior is correct.
Ah, sorry. I assumed you were going to submit a separate patch to fix the
unconsumed_tail issues.
> Here's the patch. It fixes potential memory bug (unconsumed_tail sets
> to NULL in case of out of memory), resets the unconsumed_tail to b''
> after EOF, updates unconsumed_tail and unused_data in flush().
Did you perhaps forget to attach the patch? The only ones I see are those
that you uploaded last week.
|
msg175045 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-07 06:23 |
Sorry. Here is a patch.
|
msg175116 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-07 18:01 |
Also note that your variant of check for overflow causes undefined behavior. Python is gradually getting rid of such code.
|
msg175308 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-11 01:26 |
New changeset 94256d7804b8 by Nadeem Vawda in branch '2.7':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/94256d7804b8
New changeset 85886268c40b by Nadeem Vawda in branch '3.2':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/85886268c40b
New changeset 654eb5163ba1 by Nadeem Vawda in branch '3.3':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/654eb5163ba1
New changeset 4440e45c10f9 by Nadeem Vawda in branch 'default':
Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush().
http://hg.python.org/cpython/rev/4440e45c10f9
|
msg175309 - (view) |
Author: Nadeem Vawda (nadeem.vawda) *  |
Date: 2012-11-11 01:37 |
New patch committed. Once again, thanks for all your work on this issue!
|
msg175336 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-11 09:49 |
I see you much adjusted the patch. It looks good, thanks you.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:37 | admin | set | github: 60554 |
2012-11-11 09:49:38 | serhiy.storchaka | set | messages:
+ msg175336 |
2012-11-11 01:37:56 | nadeem.vawda | set | status: open -> closed
messages:
+ msg175309 stage: patch review -> resolved |
2012-11-11 01:26:27 | python-dev | set | messages:
+ msg175308 |
2012-11-07 18:01:38 | serhiy.storchaka | set | messages:
+ msg175116 |
2012-11-07 06:23:30 | serhiy.storchaka | set | files:
+ zlib_unused_data_3.patch
messages:
+ msg175045 |
2012-11-06 23:34:12 | nadeem.vawda | set | messages:
+ msg175031 |
2012-11-05 13:58:46 | serhiy.storchaka | set | status: closed -> open
messages:
+ msg174910 stage: resolved -> patch review |
2012-11-05 00:31:40 | nadeem.vawda | set | messages:
+ msg174854 |
2012-11-04 23:58:58 | nadeem.vawda | set | status: open -> closed resolution: fixed messages:
+ msg174849
stage: patch review -> resolved |
2012-11-04 23:57:30 | python-dev | set | messages:
+ msg174848 |
2012-11-04 23:44:27 | python-dev | set | nosy:
+ python-dev messages:
+ msg174845
|
2012-10-29 16:50:50 | serhiy.storchaka | set | messages:
+ msg174131 |
2012-10-29 12:11:33 | serhiy.storchaka | set | files:
+ zlib_accum_unused_data_2.patch
messages:
+ msg174109 |
2012-10-28 18:10:38 | serhiy.storchaka | set | messages:
+ msg174065 |
2012-10-28 18:03:23 | serhiy.storchaka | set | stage: needs patch -> patch review |
2012-10-28 18:03:11 | serhiy.storchaka | set | files:
+ zlib_accum_unused_data.patch keywords:
+ patch messages:
+ msg174064
|
2012-10-28 17:29:12 | nadeem.vawda | set | messages:
+ msg174059 |
2012-10-28 17:13:52 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages:
+ msg174058
|
2012-10-28 16:25:19 | nadeem.vawda | create | |