Author nadeem.vawda
Recipients Arfrever, christian.heimes, eric.araujo, martin.panter, nadeem.vawda, nikratio, pitrou, serhiy.storchaka
Date 2014-02-02.16:15:48
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1391357748.84.0.657370954238.issue15955@psf.upfronthosting.co.za>
In-reply-to
Content
After some consideration, I've come to agree with Serhiy that it would be better
to keep a private internal buffer, rather than having the user manage unconsumed
input data. I'm also in favor of having a flag to indicate whether the
decompressor needs more input to produce more decompressed data. (I'd prefer to
call it 'needs_input' or similar, though - 'data_ready' feels too vague to me.)

In msg176883 and msg177228, Serhiy raises the possibility that the compressor
might be unable to produce decompressed output from a given piece of (non-empty)
input, but will still leave the input unconsumed. I do not think that this can
actually happen (based on the libraries' documentation), but this API will work
even if that situation can occur.

So, to summarize, the API will look like this:

    class LZMADecompressor:

        ...

        def decompress(self, data, max_length=-1):
            """Decompresses *data*, returning uncompressed data as bytes.

            If *max_length* is nonnegative, returns at most *max_length* bytes
            of decompressed data. If this limit is reached and further output
            can be produced, *self.needs_input* will be set to False. In this
            case, the next call to *decompress()* should provide *data* as b''
            to obtain more of the output.

            If all of the input data was decompressed and returned (either
            because this was less than *max_length* bytes, or because
            *max_length* was negative), *self.needs_input* will be set to True.
            """
            ...

Data not consumed due to the use of 'max_length' should be saved in an internal
buffer (that is not exposed to Python code at all), which is then prepended to
any data provided in the next call to decompress() before providing the data to
the underlying compression library. The cases where either the internal buffer
or the new data are empty should be optimized to avoid unnecessary allocations
or copies, since these will be the most common cases.

Note that this API does not need a Python-level 'unconsumed_tail' attribute -
its role is served by the internal buffer (which is private to the C module
implementation). This is not to be confused with the already-existing
'unused_data' attribute that stores data found after the end of the compressed
stream. 'unused_data' should continue to work as before, regardless of whether
decompress() is called with a max_length argument or not.

As a starting point I would suggest writing a patch for LZMADecompressor first,
since its implementation is a bit simpler than BZ2Decompressor. Once this patch
and an analogous one for BZ2Decompressor have been committed, we can then
convert GzipFile, BZ2File and LZMAFile to use this feature.

If you have any questions while you're working on this issue, feel free to send
them my way.
History
Date User Action Args
2014-02-02 16:15:48nadeem.vawdasetrecipients: + nadeem.vawda, pitrou, christian.heimes, eric.araujo, Arfrever, nikratio, martin.panter, serhiy.storchaka
2014-02-02 16:15:48nadeem.vawdasetmessageid: <1391357748.84.0.657370954238.issue15955@psf.upfronthosting.co.za>
2014-02-02 16:15:48nadeem.vawdalinkissue15955 messages
2014-02-02 16:15:48nadeem.vawdacreate