Message 209999 - Python tracker


Author	nadeem.vawda
Recipients	Arfrever, christian.heimes, eric.araujo, martin.panter, nadeem.vawda, nikratio, pitrou, serhiy.storchaka
Date	2014-02-02.16:15:48
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1391357748.84.0.657370954238.issue15955@psf.upfronthosting.co.za>
In-reply-to

Content

After some consideration, I've come to agree with Serhiy that it would be better
to keep a private internal buffer, rather than having the user manage unconsumed
input data. I'm also in favor of having a flag to indicate whether the
decompressor needs more input to produce more decompressed data. (I'd prefer to
call it 'needs_input' or similar, though - 'data_ready' feels too vague to me.)

In msg176883 and msg177228, Serhiy raises the possibility that the decompressor
might be unable to produce decompressed output from a given piece of (non-empty)
input, but will still leave the input unconsumed. I do not think that this can
actually happen (based on the libraries' documentation), but this API will work
even if that situation can occur.

So, to summarize, the API will look like this:

    class LZMADecompressor:

        ...

        def decompress(self, data, max_length=-1):
            """Decompresses *data*, returning uncompressed data as bytes.

            If *max_length* is nonnegative, returns at most *max_length* bytes
            of decompressed data. If this limit is reached and further output
            can be produced, *self.needs_input* will be set to False. In this
            case, the next call to *decompress()* should provide *data* as b''
            to obtain more of the output.

            If all of the input data was decompressed and returned (either
            because this was less than *max_length* bytes, or because
            *max_length* was negative), *self.needs_input* will be set to True.
            """
            ...
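
For illustration, here is a minimal sketch of how a caller might drive the API
described above. It assumes the interface is implemented as proposed; the lzma
module in Python 3.5 and later provides decompress(data, max_length) and
needs_input in this form. The helper name iter_decompressed and the default
limit of 4096 are hypothetical, chosen only for this example.

```python
import lzma


def iter_decompressed(chunks, max_length=4096):
    """Yield decompressed pieces of at most max_length bytes each.

    Hypothetical helper; assumes max_length > 0 so every call can progress.
    """
    decomp = lzma.LZMADecompressor()
    for chunk in chunks:
        data = chunk
        while True:
            piece = decomp.decompress(data, max_length)
            if piece:
                yield piece
            if decomp.eof:
                return  # end of stream; leftover bytes are in unused_data
            if decomp.needs_input:
                break  # buffered input is exhausted; fetch the next chunk
            data = b""  # more output is pending; drain it without new input
```

The caller never sees unconsumed input: it only checks needs_input (and eof)
to decide whether to pass b'' or the next chunk.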

Data not consumed due to the use of 'max_length' should be saved in an internal
buffer (that is not exposed to Python code at all), which is then prepended to
any data provided in the next call to decompress() before providing the data to
the underlying compression library. The cases where either the internal buffer
or the new data is empty should be optimized to avoid unnecessary allocations
or copies, since these will be the most common cases.
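
The buffering scheme can be sketched in pure Python. This is only an
illustration, not the proposed C implementation: it wraps zlib.decompressobj()
(whose existing API exposes unconsumed_tail to the user) and hides that tail in
a private attribute. The class name BufferedDecompressor and the attribute
_pending are hypothetical.

```python
import zlib


class BufferedDecompressor:
    """Illustrative wrapper (hypothetical name), not the proposed C code."""

    def __init__(self):
        self._raw = zlib.decompressobj()
        self._pending = b""  # private: input not yet consumed
        self.needs_input = True

    @property
    def eof(self):
        return self._raw.eof

    def decompress(self, data, max_length=-1):
        # Fast paths: only concatenate when both pieces are non-empty.
        if not self._pending:
            buf = data
        elif not data:
            buf = self._pending
        else:
            buf = self._pending + data

        if max_length == 0:
            # zlib treats 0 as "no limit", so handle a zero limit here.
            self._pending, out = buf, b""
        else:
            out = self._raw.decompress(buf, max(max_length, 0))
            self._pending = self._raw.unconsumed_tail

        # More output may be available if input is left over or if the
        # output limit was hit exactly; otherwise new input is required.
        hit_limit = max_length >= 0 and len(out) == max_length
        self.needs_input = not (self.eof or self._pending or hit_limit)
        return out
```

In the common cases (nothing buffered, or b'' passed to drain pending output)
no concatenation takes place, which is the optimization described above.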

Note that this API does not need a Python-level 'unconsumed_tail' attribute -
its role is served by the internal buffer (which is private to the C module
implementation). This is not to be confused with the already-existing
'unused_data' attribute that stores data found after the end of the compressed
stream. 'unused_data' should continue to work as before, regardless of whether
decompress() is called with a max_length argument or not.
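
To make the distinction concrete, here is a small sketch, again assuming the
API is implemented as proposed (as in the lzma module of Python 3.5 and later).
The trailing bytes b"TRAILING" are arbitrary example data placed after the end
of the compressed stream.

```python
import lzma

payload = b"hello world " * 5
stream = lzma.compress(payload) + b"TRAILING"

decomp = lzma.LZMADecompressor()
out = decomp.decompress(stream, max_length=10)
while not decomp.eof:
    # Input held back by max_length stays in the private internal buffer,
    # so b"" is enough to obtain the next piece of output.
    out += decomp.decompress(b"", max_length=10)

# Data found after the end of the stream is reported via unused_data.
print(out == payload, decomp.unused_data)
```

Only the bytes after the end of the stream become visible to Python code; the
input that was temporarily held back because of max_length never does.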

As a starting point, I would suggest writing a patch for LZMADecompressor first,
since its implementation is a bit simpler than BZ2Decompressor's. Once this patch
and an analogous one for BZ2Decompressor have been committed, we can then
convert GzipFile, BZ2File and LZMAFile to use this feature.

If you have any questions while you're working on this issue, feel free to send
them my way.

History
Date	User	Action	Args
2014-02-02 16:15:48	nadeem.vawda	set	recipients: + nadeem.vawda, pitrou, christian.heimes, eric.araujo, Arfrever, nikratio, martin.panter, serhiy.storchaka
2014-02-02 16:15:48	nadeem.vawda	set	messageid: <1391357748.84.0.657370954238.issue15955@psf.upfronthosting.co.za>
2014-02-02 16:15:48	nadeem.vawda	link	issue15955 messages
2014-02-02 16:15:48	nadeem.vawda	create