
classification
Title: zlib decompress of sync-flushed data fails
Components: Extension Modules

process
Status: closed
Resolution: fixed
Assigned To: akuchling
Nosy List: abo, akuchling, fdrake, gvanrossum
Priority: normal

Created on 2000-12-08 07:25 by abo, last changed 2022-04-10 16:03 by admin. This issue is now closed.

Messages (16)
msg2593 - (view) Author: Donovan Baarda (abo) * Date: 2000-12-08 07:25
I'm not sure if this is just an undocumented limitation or a genuine bug. I'm using Python 1.5.2 on Windows NT.

A single decompress of a large amount (16K+) of compressed data that has been sync-flushed fails to produce all the data up to the sync-flush. The data remains inside the decompressor until further compressed data or a final flush is issued. Note that the 'unused_data' attribute does not show that there is further data in the decompressor to process (it shows '').

A workaround is to decompress the data in smaller chunks. Note that compressing data in smaller chunks is not required, as the problem is in the decompressor, not the compressor.

The following code demonstrates the problem, and raises an exception once the compressed data reaches 17K:

from zlib import *
from random import *

# create compressor and decompressor
c=compressobj(9)
d=decompressobj()

# try data sizes of 1-63K
for l in range(1,64):
    # generate random data stream
    a=''
    for i in range(l*1024):
        a=a+chr(randint(0,255))
    # compress, sync-flush, and decompress
    t=d.decompress(c.compress(a)+c.flush(Z_SYNC_FLUSH))
    # if decompressed data is different from input data, barf
    if len(t) != len(a):
        print len(a),len(t),len(d.unused_data)
        raise error
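
As a companion to the report above, here is a sketch of the smaller-chunks workaround, written for illustration in modern bytes-based Python (which postdates this report); the 4K chunk size is an arbitrary choice:

import os
import zlib

CHUNK = 4096  # arbitrary, comfortably under the 16K threshold seen above

c = zlib.compressobj(9)
d = zlib.decompressobj()

payload = os.urandom(64 * 1024)  # incompressible, like the random test data
blob = c.compress(payload) + c.flush(zlib.Z_SYNC_FLUSH)

# Feed the decompressor small pieces instead of one large buffer.
out = b""
for i in range(0, len(blob), CHUNK):
    out += d.decompress(blob[i:i + CHUNK])

assert out == payload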
msg2594 - (view) Author: Donovan Baarda (abo) * Date: 2000-12-08 07:28
Argh... SF killed all my indents... sorry about that. You should be able to figure it out, but if not email me and I can send a copy.
msg2595 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2000-12-08 15:44
I *think* this may have been fixed in Python 2.0.

I'm assigning this to Andrew who can confirm that and close the bug report (if it is fixed).
msg2596 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2000-12-08 17:26
Python 2.0 demonstrates the problem, too.

I'm not sure what this is: a zlibmodule bug/oversight or simply a problem with zlib's API. Looking at zlib.h, it appears you'd have to call inflate() with the flush parameter set to Z_SYNC_FLUSH to get the remaining data. Unfortunately this doesn't seem to help: the .flush() method doesn't support an argument, but when I patch zlibmodule.c to allow one, .flush(Z_SYNC_FLUSH) always fails with error -5 (buffer error), perhaps because it expects there to be some new data.

(The DEFAULTALLOC constant in zlibmodule.c is 16K, but this seems to be unrelated to the problem showing up with more than 16K of data, since changing DEFAULTALLOC to 32K or 1K makes no difference to the size of data at which the bug shows up.)

In short, I have no idea what's at fault, or if it can or should be fixed.  Unless you or someone else submits a patch, I'll just leave it alone, and mark this bug as closed and "Won't fix".
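
For what it's worth, the plain no-argument .flush() route mentioned in the original report can be sketched as follows; this is a sketch based on the behaviour described in this thread (shown in modern bytes syntax), and it is a one-shot workaround only, since flush() finalizes the decompressor:

import os
import zlib

c = zlib.compressobj(9)
d = zlib.decompressobj()

payload = os.urandom(32 * 1024)
blob = c.compress(payload) + c.flush(zlib.Z_SYNC_FLUSH)

out = d.decompress(blob)  # may hold back output past ~16K on affected builds
out += d.flush()          # releases any pending output, but ends the stream
assert out == payload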
msg2597 - (view) Author: Donovan Baarda (abo) * Date: 2000-12-08 22:50
I'm not sure I'm happy with it just being marked closed. AFAICT, the implementation definitely doesn't do what the documentation says, so to save people like me time when they hit this, I'd prefer the bug at least be reassigned to Documentation so that the limitation is documented.

From my reading of the documentation as it stands, the fact that there is more pending data in the decompressor should be indicated by its "unused_data" attribute. The tests seem to show that "decompress()" is only processing 16K of compressed data each call, which would suggest that "unused_data" should contain the rest. However, in all my tests that attribute has always been empty. Perhaps the bug is in there somewhere?

Another slight strangeness: even if "unused_data" did contain something, the only way to get it out is by feeding in more compressed data, or issuing a flush(), thus ending the decompression...

I guess that since I've been bitten by this, it's up to me to fix it. I've got the source to 2.0 and I'll have a look and see if I can submit a patch.

<sigh> and I was coding this app in Python to avoid coding in C :-)
msg2598 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2000-12-12 21:32
OK, assigned to Fred. You may ask Andrew what to write. :-)
msg2599 - (view) Author: Donovan Baarda (abo) * Date: 2000-12-12 23:18
Further comments...

After looking at the C code, a few things became clear: I need to read more about C/Python interfacing, and the "unused_data" attribute will only contain data if additional data is fed to a decompressor at the end of a complete compressed stream.

The purpose of the "unused_data" attribute is not clear in the documentation, so that should probably be clarified (mind you, I am looking at pre-2.0 docs so maybe it already has?).

The failure to produce all data up to a sync-flush is something else... I'm still looking into it. I'm not sure if it is an inherent limitation of zlib, something that needs to be fixed in zlib, or something that needs to be fixed in the python interface. If it is an inherent limitation, I'd like to characterise it a bit better before documenting it. If it is something that needs to be fixed in either zlib or the python interface, I'd like to fix it.

Unfortunately, this is a bit beyond me at the moment, mainly in time, but also a bit in skill (I need to read the Python/C interfacing documentation). Maybe over the Christmas holidays I'll get a chance to fix it.
msg2600 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2000-12-19 04:13
Andrew, please summarize what doc changes are needed, or make the changes (whichever is easier for you is fine).
msg2601 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2000-12-19 19:48
.unused_data is really a red herring; PyZlib_objdecompress() loops until zst->avail_in is zero, so .unused_data must always be empty by definition. (The attribute is there to support gzip-format files that may contain multiple compressed streams concatenated together.)

I still have no idea what the documentation should say; "don't pass more than 16K of compressed data when you're expecting a sync-flush"? I can't see a way to explain this coherently without a big long explanation that will confuse people who don't care about this problem. (Add a special note, or a known-bugs subsection, maybe?)

A simple C test program should be written, in order to check whether it's the zlib library itself that's doing this.
msg2602 - (view) Author: Donovan Baarda (abo) * Date: 2000-12-20 23:17
I have had a look at this in more detail (Python/C interfacing is actually pretty easy :-).

I dunno whether to go into details here, but I have noticed that inflate is being called with Z_NO_FLUSH when the zlib docs suggest Z_SYNC_FLUSH or Z_FINISH. However, the zlib implementation only treats Z_FINISH differently, so this should not make a difference (but it might in future versions of zlib).

As you have said, .unused_data will only contain data if something is appended to a complete compressed object in the stream fed to a decompressor. Perhaps the docs need something to clarify this, perhaps in the form of an example?
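
A minimal example along those lines, using two complete zlib streams rather than a gzip file, and assuming the documented (post-fix, modern) behaviour of unused_data:

import zlib

first = zlib.compress(b"first stream")
second = zlib.compress(b"second stream")

d = zlib.decompressobj()
out = d.decompress(first + second)

# Bytes past the end of the first complete stream are not decompressed;
# they are stashed in unused_data instead.
assert out == b"first stream"
assert d.unused_data == second

# A fresh decompressor picks up the next concatenated stream from there.
d2 = zlib.decompressobj()
assert d2.decompress(d.unused_data) == b"second stream"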

I am going to write some test programs to see if the fault is in zlib. At this stage I suspect that it is, and I'm hoping over Christmas to get a patch done.
msg2603 - (view) Author: Donovan Baarda (abo) * Date: 2001-01-20 14:22
OK, Christmas is over, and I didn't really get a chance to look at it much.

However, some preliminary testing of the zlib library suggests that the fault is _not_ in zlib. A C program that does the same compress/decompress of random data using steadily increasing block sizes does not show the same fault...

I've noticed what might be a potential problem in the inflateInit2 call in PyZlib_decompressobj. The next_in and avail_in fields are not being initialized before it's called. The zlib manual says they should be initialized, because inflateInit will attempt to determine the compression method if it can. I'm not sure why this hasn't tripped already... does PyObject_New zero memory when it allocates things?

I also think I've found the main problem... the PyZlib_objdecompress routine's main decompression loop thinks decompression is complete when all available input has been processed. However, it is possible that all available input has been processed but there is more pending output. The code as it stands does correctly allocate more space when this occurs, but the loop terminates without calling inflate one more time.
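
A schematic model of that loop defect in Python (the real code is C in zlibmodule.c; inflate_step() is a hypothetical stand-in for one inflate() call, returning how much input it consumed plus at most bufsize bytes of output):

def decompress_buggy(inflate_step, data, bufsize=16 * 1024):
    # Stops as soon as the input is consumed, even though a call that
    # completely filled its output buffer may still have output pending.
    out = b""
    while data:
        consumed, produced = inflate_step(data, bufsize)
        data, out = data[consumed:], out + produced
    return out

def decompress_fixed(inflate_step, data, bufsize=16 * 1024):
    # Calls inflate_step() once more whenever the previous call filled the
    # whole output buffer, draining pending output after the input runs out.
    out = b""
    while True:
        consumed, produced = inflate_step(data, bufsize)
        data, out = data[consumed:], out + produced
        if not data and len(produced) < bufsize:
            return out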

I'm putting together a patch tomorrow. Right now I'm going to bed...
msg2604 - (view) Author: Donovan Baarda (abo) * Date: 2001-01-22 23:31
I've got a patch that I'm currently testing. When it's ready, how do you want me to submit it: through the SF patch manager?

I've analysed the code pretty extensively by now, and I believe I've found a few other potential problems (though I haven't bothered to create test cases to trigger them). The original problem I found could affect compression as well as decompression. There were some memory allocations that didn't raise exceptions if they failed, and potential memory leaks when exceptions did occur. Also, inflateEnd would be called twice in a decompression object if it was flushed before it was destroyed.

The strange thing is that compression, decompression, and both flush routines perform basically the same function, but they were all structured totally differently (as if they had been written by different people). Also, there didn't seem to be any consistency in the coding style, so I didn't know what style to use. I tried to resist fixing more than the initial problem I identified, but in the end I couldn't help myself.

As an aside, I noticed that the 2.0 version of zlibmodule.c is Python 2.0-specific (it wouldn't work on my 1.5.2 system, so I had to compile 2.0 to test my patch).

How do you want me to submit my patch: one big patch, or several smaller ones for each fix? I have been using prcs to track my changes, but I have been a bit lax in checking in each small change, so creating separate patches will take a little work.
msg2605 - (view) Author: Donovan Baarda (abo) * Date: 2001-01-22 23:34
Oh yeah, since I now know that it's definitely a zlibmodule.c bug, not a documentation problem, you can probably change this bug's category from "Documentation" to something more appropriate.
msg2606 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2001-01-23 04:16
Yay! Please submit your patch using the Patch Manager on SourceForge so that we can look at it and discuss it. You can submit it as one big patch; obviously correct changes can be applied right away, causing the patch to shrink, and disentangling the changes shouldn't be too difficult.

The zlib module is really ugly code, and could probably do with a restructuring; I've wondered before if it would be worth trashing it and rewriting from scratch with the same API.  Maybe for 2.2...  If you want to be ambitious and do a more serious restructuring, feel free to do so, but it probably wouldn't get into 2.1.

And the bug has been reclassified under Extension Modules.
msg2607 - (view) Author: Donovan Baarda (abo) * Date: 2001-01-23 16:35
OK, patch submitted. Fire away :-)

Overall the code is fairly ugly, but not too bad. I dunno if it warrants a complete rewrite, but certainly there is a heap of redundancy that could be cleaned up. It would be a nice simple project for someone with time, but that ain't me :-)
msg2608 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2001-02-21 02:17
Fixed by the checkin of patch #103373.
History
Date                 User   Action  Args
2022-04-10 16:03:32  admin  set     github: 33564
2000-12-08 07:25:46  abo    create