Issue 16350: zlib.Decompress.decompress() after EOF discards existing value of unused_data

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/60554

classification

Title:	zlib.Decompress.decompress() after EOF discards existing value of unused_data
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.2, Python 3.3, Python 3.4, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, nadeem.vawda, python-dev, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2012-10-28 16:25 by nadeem.vawda, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
zlib_unused_data_test.py	nadeem.vawda, 2012-10-28 16:25
zlib_accum_unused_data.patch	serhiy.storchaka, 2012-10-28 18:03		review
zlib_accum_unused_data_2.patch	serhiy.storchaka, 2012-10-29 12:11		review
zlib_unused_data_3.patch	serhiy.storchaka, 2012-11-07 06:23		review

Messages (18)
msg174056 - (view)	Author: Nadeem Vawda (nadeem.vawda) *	Date: 2012-10-28 16:25
From issue 5210: amaury.forgeotdarc wrote: > Hm, I tried a modified version of your first test, and I found another > problem with the current zlib library; > starting with the input: > x = x1 + x2 + HAMLET_SCENE # both compressed and uncompressed data > > The following scenario is OK: > dco.decompress(x) # returns HAMLET_SCENE > dco.unused_data # returns HAMLET_SCENE > > But this one: > for c in x: > dco.decompress(x) # will return HAMLET_SCENE, in several pieces > dco.unused_data # only one character, the last of (c in x)! > > This is a bug IMO: unused_data should accumulate all the extra uncompressed > data. Ideally, I would prefer to raise an EOFError if decompress() is called after end-of-stream is reached (for consistency with BZ2Decompressor). However, accumulating the data in unused_data is closer to being backward- compatible, so it's probably the better approach to take.
msg174058 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-28 17:13
Yes, accumulating the data will be backward-compatible. But what if add a special flag for zlib.decompress, which makes bz2 and lzma compatible decoder? I.e.: 1) unconsumed_tail is always empty (the unconsumed data accumulated in internal buffer, no need manually add it to next chunk of data). Or even non-existent. 2) EOFError raised if decompress() is called after end-of-stream is reached. Of course, it is a new feature.
msg174059 - (view)	Author: Nadeem Vawda (nadeem.vawda) *	Date: 2012-10-28 17:29
Interesting idea, but I'm not sure it would be worth the effort. It would make the code and API more complicated, so it wouldn't really help users, and would be an added maintenance burden.
msg174064 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-28 18:03
Here is a patch for your suggestion.
msg174065 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-28 18:10
Actually the current code contains a bug. If memory allocation for unused_data fails then unused_data will become NULL and using this properties can crash. The patch fixes this.
msg174109 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-29 12:11
From Nadeem Vawda's mail: > I wasn't suggesting that you try to resize the existing unused_data > object; I agree that this would be a bad idea. What I was thinking of > was something like this: > > size_t old_unused_size = PyBytes_GET_SIZE(self->unused_data); > size_t new_unused_size = old_unused_size + self->zst.avail_in; > PyObject *new_unused_data = PyBytes_FromStringAndSize(NULL, 0); > if (_PyBytes_Resize(&new_unused_data, new_unused_size) < 0) { > Py_DECREF(RetVal); > goto error; > } > memcpy(PyBytes_AS_STRING(new_unused_data), > PyBytes_AS_STRING(self->unused_data), > old_unused_size); > memcpy(PyBytes_AS_STRING(new_unused_data) + old_unused_size, > self->zst.next_in, self->zst.avail_in); > Py_DECREF(self->unused_data); > self->unused_data = new_unused_data; > > Basically, hacking around the fact that (AFAICT) there's no direct way to > allocate an uninitialized bytes object. That way we only do one allocation, > and only copy the new data once. This hacking is not needed, if first argument of PyBytes_FromStringAndSize() is NULL, the contents of the bytes object are uninitialized. And more, this hacking is invalid, because empty bytes object shared and can't be resized. Patch updated. The concatenation optimized as you suggested, one bug fixed.
msg174131 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-29 16:50
There is yet one memory management bug (with unconsumed_tail). flush() does not update unconsumed_tail and unused_data. >>> import zlib >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789' >>> dco = zlib.decompressobj() >>> dco.decompress(x, 1) b'a' >>> dco.flush() b'bcdefghijklmnopqrstuvwxyz' >>> dco.unconsumed_tail b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789' >>> dco.unused_data b'' What should unconsumed_tail be equal after EOF? b'' or unused_data? >>> import zlib >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789' >>> dco = zlib.decompressobj() >>> dco.decompress(x) b'abcdefghijklmnopqrstuvwxyz' >>> dco.flush() b'' >>> dco.unconsumed_tail b'' >>> dco.unused_data b'0123456789' >>> dco = zlib.decompressobj() >>> dco.decompress(x, 1000) b'abcdefghijklmnopqrstuvwxyz' >>> dco.flush() b'' >>> dco.unconsumed_tail b'0123456789' >>> dco.unused_data b'0123456789'
msg174845 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-11-04 23:44
New changeset be882735e0b6 by Nadeem Vawda in branch '3.2': Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF. http://hg.python.org/cpython/rev/be882735e0b6 New changeset 4182119c3f0a by Nadeem Vawda in branch '3.3': Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF. http://hg.python.org/cpython/rev/4182119c3f0a New changeset bcdb3836be72 by Nadeem Vawda in branch 'default': Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF. http://hg.python.org/cpython/rev/bcdb3836be72
msg174848 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-11-04 23:57
New changeset 1c54def5947c by Nadeem Vawda in branch '2.7': Issue #16350: Fix zlib decompressor handling of unused_data with multiple calls to decompress() after EOF. http://hg.python.org/cpython/rev/1c54def5947c
msg174849 - (view)	Author: Nadeem Vawda (nadeem.vawda) *	Date: 2012-11-04 23:58
Fixed. Thanks for the patch! > This hacking is not needed, if first argument of PyBytes_FromStringAndSize() > is NULL, the contents of the bytes object are uninitialized. Oh, cool. I didn't know about that. > What should unconsumed_tail be equal after EOF? b'' or unused_data? Definitely b''. unconsumed_tail is meant to hold compressed data that should be passed in to the next call to decompress(). If we are at EOF, then decompress() should not be called again, and so it would be misleading to have unconsumed_tail be non-empty.
msg174854 - (view)	Author: Nadeem Vawda (nadeem.vawda) *	Date: 2012-11-05 00:31
> flush() does not update unconsumed_tail and unused_data. > > >>> import zlib > >>> x = zlib.compress(b'abcdefghijklmnopqrstuvwxyz') + b'0123456789' > >>> dco = zlib.decompressobj() > >>> dco.decompress(x, 1) > b'a' > >>> dco.flush() > b'bcdefghijklmnopqrstuvwxyz' > >>> dco.unconsumed_tail > b'NIMK\xcf\xc8\xcc\xca\xce\xc9\xcd\xcb/(,*.)-+\xaf\xa8\xac\x02\x00\x90\x86\x0b 0123456789' > >>> dco.unused_data > b'' I see another bug here - described in issue 16411.
msg174910 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-11-05 13:58
These were not idle questions. I wrote the patch, and I had to know what behavior is correct. Here's the patch. It fixes potential memory bug (unconsumed_tail sets to NULL in case of out of memory), resets the unconsumed_tail to b'' after EOF, updates unconsumed_tail and unused_data in flush(). I a little changed test for the previous case (here was O(N^2) for large data). I checked it on non-fixed Python, bug was catched.
msg175031 - (view)	Author: Nadeem Vawda (nadeem.vawda) *	Date: 2012-11-06 23:34
> These were not idle questions. I wrote the patch, and I had to know > what behavior is correct. Ah, sorry. I assumed you were going to submit a separate patch to fix the unconsumed_tail issues. > Here's the patch. It fixes potential memory bug (unconsumed_tail sets > to NULL in case of out of memory), resets the unconsumed_tail to b'' > after EOF, updates unconsumed_tail and unused_data in flush(). Did you perhaps forget to attach the patch? The only ones I see are those that you uploaded last week.
msg175045 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-11-07 06:23
Sorry. Here is a patch.
msg175116 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-11-07 18:01
Also note that your variant of check for overflow causes undefined behavior. Python is gradually getting rid of such code.
msg175308 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-11-11 01:26
New changeset 94256d7804b8 by Nadeem Vawda in branch '2.7': Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush(). http://hg.python.org/cpython/rev/94256d7804b8 New changeset 85886268c40b by Nadeem Vawda in branch '3.2': Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush(). http://hg.python.org/cpython/rev/85886268c40b New changeset 654eb5163ba1 by Nadeem Vawda in branch '3.3': Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush(). http://hg.python.org/cpython/rev/654eb5163ba1 New changeset 4440e45c10f9 by Nadeem Vawda in branch 'default': Issue #16350, part 2: Set unused_data (and unconsumed_tail) correctly in decompressobj().flush(). http://hg.python.org/cpython/rev/4440e45c10f9
msg175309 - (view)	Author: Nadeem Vawda (nadeem.vawda) *	Date: 2012-11-11 01:37
New patch committed. Once again, thanks for all your work on this issue!
msg175336 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-11-11 09:49
I see you much adjusted the patch. It looks good, thanks you.

History
Date	User	Action	Args
2022-04-11 14:57:37	admin	set	github: 60554
2012-11-11 09:49:38	serhiy.storchaka	set	messages: + msg175336
2012-11-11 01:37:56	nadeem.vawda	set	status: open -> closed messages: + msg175309 stage: patch review -> resolved
2012-11-11 01:26:27	python-dev	set	messages: + msg175308
2012-11-07 18:01:38	serhiy.storchaka	set	messages: + msg175116
2012-11-07 06:23:30	serhiy.storchaka	set	files: + zlib_unused_data_3.patch messages: + msg175045
2012-11-06 23:34:12	nadeem.vawda	set	messages: + msg175031
2012-11-05 13:58:46	serhiy.storchaka	set	status: closed -> open messages: + msg174910 stage: resolved -> patch review
2012-11-05 00:31:40	nadeem.vawda	set	messages: + msg174854
2012-11-04 23:58:58	nadeem.vawda	set	status: open -> closed resolution: fixed messages: + msg174849 stage: patch review -> resolved
2012-11-04 23:57:30	python-dev	set	messages: + msg174848
2012-11-04 23:44:27	python-dev	set	nosy: + python-dev messages: + msg174845
2012-10-29 16:50:50	serhiy.storchaka	set	messages: + msg174131
2012-10-29 12:11:33	serhiy.storchaka	set	files: + zlib_accum_unused_data_2.patch messages: + msg174109
2012-10-28 18:10:38	serhiy.storchaka	set	messages: + msg174065
2012-10-28 18:03:23	serhiy.storchaka	set	stage: needs patch -> patch review
2012-10-28 18:03:11	serhiy.storchaka	set	files: + zlib_accum_unused_data.patch keywords: + patch messages: + msg174064
2012-10-28 17:29:12	nadeem.vawda	set	messages: + msg174059
2012-10-28 17:13:52	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg174058
2012-10-28 16:25:19	nadeem.vawda	create