Message 177213 - Python tracker


Author	nadeem.vawda
Recipients	Arfrever, christian.heimes, eric.araujo, nadeem.vawda, pitrou, serhiy.storchaka
Date	2012-12-09.13:11:54
Message-id	<1355058715.45.0.527712147983.issue15955@psf.upfronthosting.co.za>

Content

>>     # Using zlib's interface
>>     while not d.eof:
>>         compressed = d.unconsumed_tail or f.read(8192)
>>         if not compressed:
>>             raise ValueError('End-of-stream marker not found')
>>         output = d.decompress(compressed, 8192)
>>         # <process output>
>
> This is not usable with bzip2. Bzip2 uses a large block size, so unconsumed_tail
> can be non-empty while decompress() still returns b''. With zlib you can possibly
> see the same effect on some inputs when reading one byte at a time.

I don't see how this is a problem. If (for some strange reason) the
application-specific processing code can't handle empty blocks properly, you can
just stick "if not output: continue" before it.
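
As a minimal sketch of that suggestion, the loop can be wrapped in a generator
that skips empty blocks. The name iter_decompressed, the chunk_size parameter
and the sample payload are illustrative assumptions, not part of any proposed
API; only the zlib calls come from the standard library.

```python
import io
import zlib


def iter_decompressed(f, chunk_size=8192):
    """Yield non-empty decompressed blocks from the binary file object f.

    Hypothetical helper for illustration only.
    """
    d = zlib.decompressobj()
    while not d.eof:
        # Finish any leftover input before reading more from the file.
        compressed = d.unconsumed_tail or f.read(chunk_size)
        if not compressed:
            raise ValueError('End-of-stream marker not found')
        output = d.decompress(compressed, chunk_size)
        if not output:
            continue  # nothing to process yet; go round again
        yield output


payload = b'spam and eggs ' * 50000
stream = io.BytesIO(zlib.compress(payload))
assert b''.join(iter_decompressed(stream)) == payload
```

The "if not output: continue" line is the only change from the loop quoted
above; processing code placed after it never sees an empty block.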


> Actually it should be:
>
>     # Using zlib's interface
>     while not d.eof:
>         output = d.decompress(d.unconsumed_tail, 8192)
>         while not output and not d.eof:
>             compressed = f.read(8192)
>             if not compressed:
>                 raise ValueError('End-of-stream marker not found')
>             output = d.decompress(d.unconsumed_tail + compressed, 8192)
>         # <process output>
>
> Note that you should use d.unconsumed_tail + compressed as input, and therefore
> do an unnecessary copy of the data.

Why is this necessary? If unconsumed_tail is b'', then there's no need to
prepend it (and the concatenation would be a no-op anyway). If unconsumed_tail
does contain data, then we don't need to read additional compressed data from
the file until we've finished decompressing the data we already have.
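
One way to check this for zlib is to instrument the nested loop and record how
long unconsumed_tail is each time new data is read from the file. This is only
a sketch: nested_loop and tail_lengths are hypothetical names, and the result
says something about zlib's existing interface only, not about what a bzip2
decompressor would do.

```python
import io
import zlib


def nested_loop(f, chunk_size=8192):
    """Run the nested loop quoted above on the binary file object f.

    Returns the decompressed data and the length of unconsumed_tail
    observed at every point where it is concatenated with fresh input.
    Hypothetical helper for illustration only.
    """
    d = zlib.decompressobj()
    blocks, tail_lengths = [], []
    while not d.eof:
        output = d.decompress(d.unconsumed_tail, chunk_size)
        while not output and not d.eof:
            compressed = f.read(chunk_size)
            if not compressed:
                raise ValueError('End-of-stream marker not found')
            # Record how much leftover input is being prepended.
            tail_lengths.append(len(d.unconsumed_tail))
            output = d.decompress(d.unconsumed_tail + compressed, chunk_size)
        blocks.append(output)
    return b''.join(blocks), tail_lengths


payload = b'spam and eggs ' * 50000
data, tail_lengths = nested_loop(io.BytesIO(zlib.compress(payload)))
assert data == payload
assert not any(tail_lengths)  # the tail was empty at every read
```

With zlib, decompress() only leaves data in unconsumed_tail when it has filled
its output limit, so an empty result implies an empty tail and the
concatenation copies nothing extra.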


> Without an explicit unconsumed_tail you can write input data into the internal
> mutable buffer; it will be more efficient for a large buffer (hundreds of KB)
> and small input chunks (several KB).

Are you proposing that the decompressor object maintain its own buffer, and
copy the input data into it before passing it to the decompression library?
Doesn't that just duplicate work that the library is already doing for us?
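
To make the question concrete, here is a rough sketch of what such a
self-buffering decompressor might look like if layered on top of zlib. The
class BufferingDecompressor, its needs_input property and the read_all helper
are all hypothetical names invented for this illustration; this is one guess
at the shape of the idea, not the proposal itself.

```python
import io
import zlib


class BufferingDecompressor:
    """Hypothetical wrapper that hides unconsumed_tail behind its own buffer."""

    def __init__(self):
        self._d = zlib.decompressobj()
        self._pending = bytearray()  # internal mutable input buffer

    @property
    def eof(self):
        return self._d.eof

    @property
    def needs_input(self):
        # More input is needed only once the buffered input is used up.
        return not self._pending and not self._d.eof

    def decompress(self, data=b'', max_length=8192):
        self._pending += data  # the extra copy being asked about
        output = self._d.decompress(bytes(self._pending), max_length)
        # Keep whatever zlib did not consume for the next call.
        self._pending[:] = self._d.unconsumed_tail
        return output


def read_all(f, chunk_size=8192):
    """Decompress the whole binary file object f using the wrapper."""
    d = BufferingDecompressor()
    blocks = []
    while not d.eof:
        chunk = b''
        if d.needs_input:
            chunk = f.read(chunk_size)
            if not chunk:
                raise ValueError('End-of-stream marker not found')
        blocks.append(d.decompress(chunk, chunk_size))
    return b''.join(blocks)


payload = b'spam and eggs ' * 50000
assert read_all(io.BytesIO(zlib.compress(payload))) == payload
```

In this sketch the caller no longer touches unconsumed_tail, but every input
chunk is copied into the wrapper's buffer before zlib sees it.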
