
Limit decompressed data when reading from LZMAFile and BZ2File #67717

Closed
vadmium opened this issue Feb 26, 2015 · 23 comments
Labels
stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

vadmium (Member) commented Feb 26, 2015

BPO 23529
Nosy @pitrou, @vadmium, @serhiy-storchaka, @wm75
Files
  • LZMAFile.v2.patch
  • LZMAFile-etc.v3.patch: LZMAFile and BZ2File done; no open(buffering)
  • LZMAFile-etc.v4.patch: LZMAFile, BZ2File, GzipFile done
  • LZMAFile-etc.v5.patch
  • LZMAFile-etc.v6.patch: Dropped buffer_size parameter
  • LZMAFile-etc.v7.patch
  • LZMAFile-etc.v8.patch
  • LZMAFile-etc.v9.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2015-04-10.22:33:50.397>
    created_at = <Date 2015-02-26.11:06:26.181>
    labels = ['type-feature', 'library']
    title = 'Limit decompressed data when reading from LZMAFile and BZ2File'
    updated_at = <Date 2015-04-10.22:33:50.396>
    user = 'https://github.com/vadmium'

    bugs.python.org fields:

    activity = <Date 2015-04-10.22:33:50.396>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-04-10.22:33:50.397>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2015-02-26.11:06:26.181>
    creator = 'martin.panter'
    dependencies = []
    files = ['38245', '38367', '38397', '38514', '38619', '38674', '38686', '38733']
    hgrepos = ['301']
    issue_num = 23529
    keywords = ['patch']
    message_count = 23.0
    messages = ['236663', '236747', '236862', '236902', '237424', '237586', '238070', '238124', '238250', '238372', '238675', '238742', '238773', '239195', '239215', '239245', '239297', '239369', '239370', '239449', '239562', '240453', '240454']
    nosy_count = 7.0
    nosy_names = ['pitrou', 'Arfrever', 'nikratio', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'wolma']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue23529'
    versions = ['Python 3.5']

    vadmium (Member Author) commented Feb 26, 2015

    This is a follow-on from bpo-15955, which added low-level support for limiting the amount of data returned by the LZMA and bzip2 decompressors. The high-level LZMAFile and BZ2File reader APIs need to make use of the new low-level max_length parameter.

    I am starting off with a patch for LZMAFile, based on a patch I posted to bpo-15955. I split out a _RawReader class that does the actual decompress() calls, and then wrapped that in a BufferedReader. The LZMAFile then just delegates the read methods to the BufferedReader. This avoids needing any special code to implement buffering, readline(), etc. This involved some changes in the API though:

    • LZMAFile now uses BufferedReader.peek(). The current implementation seems appropriate, but I am not comfortable with the current specification in the documentation, which says it is allowed to not return any useful data. See bpo-5811.
    • read() now accepts size=None, because BufferedReader does. I had to change a test case for this.
    • BufferedReader.seek() raises a different exception for invalid “whence”

    Once the LZMAFile patch sees a bit of review and looks like it might be acceptable, I plan to change the BZ2File class in a similar manner.
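
    A minimal sketch of the layering described above (class and variable names here are illustrative, not the patch’s exact code):

    ```python
    import io
    import lzma

    class _RawReader(io.RawIOBase):
        """Adapt a decompressor object to the raw-stream interface."""

        def __init__(self, fp, decomp):
            self._fp = fp
            self._decomp = decomp

        def readable(self):
            return True

        def readinto(self, b):
            data = b""
            while not self._decomp.eof:
                if self._decomp.needs_input:
                    raw = self._fp.read(io.DEFAULT_BUFFER_SIZE)
                    if not raw:
                        break  # truncated stream; real code would raise EOFError
                else:
                    raw = b""  # drain output left over from an earlier call
                data = self._decomp.decompress(raw, max_length=len(b))
                if data:
                    break
            b[:len(data)] = data
            return len(data)

    # BufferedReader then provides buffering, readline(), peek(), etc. for free:
    f = io.BufferedReader(_RawReader(open("archive.xz", "rb"),
                                     lzma.LZMADecompressor()))
    ```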

    vadmium added the stdlib and type-feature labels Feb 26, 2015
    pitrou (Member) commented Feb 27, 2015

    LZMAFile now uses BufferedReader.peek(). The current implementation seems appropriate, but I am not comfortable with the current specification in the documentation, which says it is allowed to not return any useful data.

    What do you mean by "useful data"? peek() should always return at least one byte (except at EOF or on non-blocking streams, of course).

    vadmium (Member Author) commented Feb 28, 2015

    ## BufferedReader.peek() ##
    See <https://bugs.python.org/issue5811#msg233750>, but basically my concern is that the documentation says “the number of bytes returned may be less or more than requested”, without any mention of EOF or other conditions.
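
    To illustrate that weak guarantee with plain io objects (nothing specific to the patch):

    ```python
    import io

    buffered = io.BufferedReader(io.BytesIO(b"x" * 10000), buffer_size=8192)

    # peek(n) does not advance the position, and the length of its result is
    # only loosely related to n: here it typically returns the whole buffer,
    # and at EOF it may simply return b"" with no error.
    print(len(buffered.peek(300)))  # usually 8192, not 300
    print(len(buffered.read(10)))   # 10 -- only read() advances the position
    ```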

    ## Buffer sizing ##
    In the code review, Nikolaus raised the idea of allowing a custom “buffer_size” parameter for the BufferedReader. I think this would need a bit of consideration about how it should work:

    1. Should it be a direct wrapper around BufferedReader(buffer_size=...)?
    2. Should it also support an unbuffered reader mode like open(buffering=0), socket.makefile(buffering=0), and subprocess.Popen(bufsize=0)?
    3. Should there be any consideration for buffering in write mode (mirroring the equivalent open(), etc parameters)?

    ## Common raw decompression stream class ##
    Having a common base class mapping the generic decompressor object API to the RawIOBase API is a good thing. I will try to make one in my next patch iteration. In the meantime, here are a couple of issues to consider:

    • What module should it reside in? Perhaps “gzip”, because it is the most commonly-used of the three decompressors? Perhaps “io”, being the most relevant common dependency? Or a brand new internal module?!
    • The zlib.decompressobj() API is slightly different, since it has the “unconsumed_tail” attribute and flush() methods, and decompress(max_length=0) does not mean return zero bytes (bpo-23200).

    In bpo-23528, Nikolaus also pointed out that “GzipFile would need to additionally overwrite read() and write() in order to handle the CRC and gzip header.”

    nikratio (Mannequin) commented Feb 28, 2015

    On Feb 27 2015, Martin Panter <report@bugs.python.org> wrote:

    In the code review, Nikolaus raised the idea of allowing a custom
    “buffer_size” parameter for the BufferedReader. I think this would
    need a bit of consideration about how it should work:

    1. Should it be a direct wrapper around
      BufferedReader(buffer_size=...)?

    I'm not 100% sure what you mean by that, but I think the answer is Yes.

    2. Should it also support an unbuffered reader mode like
      open(buffering=0), socket.makefile(buffering=0), and
      subprocess.Popen(bufsize=0)?

    Yes, in that case the DecompressReader should simply be used without
    encapsulating it in BufferedReader.

    3. Should there be any consideration for buffering in write mode
      (mirroring the equivalent open(), etc parameters)?

    I don't think so. There's not much to be gained by buffering data before
    passing it to the compressor, so the only point of buffering is to avoid
    short writes (or reads in case of decompressing). But the low-level
    compressor code already avoids short writes.

    Best,
    -Nikolaus
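
    A sketch of the dispatch these answers imply (shown with a plain FileIO standing in for the decompressing raw reader; the helper name and its buffering parameter are hypothetical, not part of the patch):

    ```python
    import io

    def open_buffered(filename, buffering=-1):
        # In the compression case, "raw" would be the RawIOBase adapter
        # around the decompressor rather than a plain FileIO.
        raw = io.FileIO(filename, "rb")
        if buffering == 0:
            return raw  # unbuffered: hand back the raw reader itself
        size = io.DEFAULT_BUFFER_SIZE if buffering < 0 else buffering
        return io.BufferedReader(raw, buffer_size=size)
    ```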


    vadmium (Member Author) commented Mar 7, 2015

    I am posting LZMAFile-etc.v3.patch, where I have implemented a “buffer_size” parameter to the buffered LZMAFile etc classes. I have not implemented open(buffering=...) for the time being (which should probably delegate to the buffer_size parameter or return a raw _DecompressReader object, at least for read mode). Other changes:

    • Restored the _MODE_WRITE = 3 value
    • Explained the _pos and _size attributes in _DecompressReader
    • Factored out decomp_factory, decomp_args and decomp_error parameters to make _DecompressReader generic (see the sketch below)
    • BZ2File modified similarly to LZMAFile
    • I removed the deprecated and unused BZ2File(buffering=...) parameter; buffer_size takes its place. The old buffering parameter appears to have been documented but never implemented in CPython, so hopefully this is not a big deal.

    Still to do: need to find a better home for the _DecompressReader and _BaseStream classes. Currently they live in “lzma”, but apparently any of the gzip, bz2 and lzma modules may be unavailable, so they would have to live elsewhere. Possible options are the io module, or a brand-new internal module (e.g. Lib/_compression.py). Thoughts?

    Also I am about to see if I can make GzipFile use the _DecompressReader class. I will have to add GzipFile(buffer_size=...) as a keyword-only parameter since the third parameter position is already taken. There are quite a few quirks with gzip and zlib compared to bz2 and lzma, so I will see how I go.
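
    Based on the decomp_factory/decomp_error description above, construction of the generic reader presumably looks something like this (a sketch; the real parameter list may differ):

    ```python
    import io

    class _DecompressReader(io.RawIOBase):
        """Generic raw reader, parameterized by the decompressor to use."""

        def __init__(self, fp, decomp_factory, decomp_error, **decomp_args):
            self._fp = fp
            self._decomp_factory = decomp_factory  # e.g. lzma.LZMADecompressor
            self._decomp_args = decomp_args        # passed through to the factory
            self._decomp_error = decomp_error      # exception meaning "bad data"
            self._decompressor = decomp_factory(**decomp_args)

    # Each module would then plug in its own pieces, for example:
    #   _DecompressReader(fp, lzma.LZMADecompressor, lzma.LZMAError, format=fmt)
    #   _DecompressReader(fp, bz2.BZ2Decompressor, OSError)
    ```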

    vadmium (Member Author) commented Mar 9, 2015

    I have decided not to reuse the same _DecompressReader for the “gzip” module. This is because the code for parsing the “gzip” header pulls in data by calling read(), and I would have to adapt this code to accept whatever data is pushed to it via a decompress() method. However, I have rewritten the “gzip” module to use BufferedReader, removing its own readline() etc. implementations.
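
    The “pull” style in question looks roughly like this, condensed from the shape of the gzip header parser (FNAME/FCOMMENT/FHCRC handling elided):

    ```python
    import struct

    def _read_gzip_header(fp):
        # The parser calls fp.read() whenever it needs more bytes ("pull"),
        # instead of consuming chunks pushed into a decompress() method.
        if fp.read(2) != b"\x1f\x8b":
            raise OSError("Not a gzipped file")
        method, flag, mtime = struct.unpack("<BBIxx", fp.read(8))
        if flag & 4:  # FEXTRA: skip the extra field
            extra_len, = struct.unpack("<H", fp.read(2))
            fp.read(extra_len)
    ```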

    I have split the BaseStream and DecompressReader classes into a new file called Lib/_compression.py. Let me know if there is anything else to be done to add a new file in the Lib/ directory, or if you have any better ideas where these classes should live.

    Other changes in the new patch (LZMAFile-etc.v4.patch):

    • Fix documented GzipFile.peek(n) signature to match implementation
    • Removed unused and inconsistent code paths in gzip._PaddedFile
    • New code assumes zlib.decompressobj().flush() returns a limited amount of data, since it has no “max_length” parameter. In reality I think flush() does not do anything at all and should be deprecated or removed; see bpo-23200.

    I hope that the patch does not need any more major changes, so it should be ready for review. There are a few more suggested enhancements that I have not implemented. While they would be nice, I would prefer to handle them in separate bugs and patches. I believe strengthening Python against gzip bombs is important, and any extra refactoring of the code will probably make things harder to review, more likely to break, and less likely to make it into 3.5.

    Some possible enhancements I did not do:

    • Support open(buffering=...) parameter, passing through to the [...]File(buffer_size=...) parameter
    • open(buffering=0) returning an unbuffered reader object, probably just a direct DecompressReader instance
    • detach() method on the BufferedIOBase file classes
    • Further factoring out of a common CompressedFile base class
    • Apply the buffer size parameter to write mode
    • Rewrite the gzip module to use the common DecompressReader base class

    vadmium changed the title from “Limit decompressed data when reading from LZMAFile and BZ2File” to “Limit decompressed data when reading from LZMAFile, BZ2File, GzipFile” Mar 9, 2015
    nikratio (Mannequin) commented Mar 14, 2015

    On Mar 06 2015, Martin Panter <report@bugs.python.org> wrote:

    Still to do: Need to find a better home for the _DecompressReader and
    _BaseStream classes. Currently it lives in “lzma”, but apparently it
    is possible for any of the gzip, bz2, lzma modules to not be
    importable, so it would have to live elsewhere.

    Yes.

    Possible options are
    the io module, or a brand new internal module
    (e.g. Lib/_compression.py). Thoughts?

    I think a new internal module would be the right choice, but I don't
    know what needs to be done to properly add it to the build system. So
    for now I'd just put it in the io module and wait for a core committer
    to complain :-).

    Best,
    -Nikolaus


    nikratio changed the title from “Limit decompressed data when reading from LZMAFile, BZ2File, GzipFile” back to “Limit decompressed data when reading from LZMAFile and BZ2File” Mar 14, 2015
    vadmium (Member Author) commented Mar 15, 2015

    Well my last patch just added the _compression.py file without doing anything special, and it seems to be installed properly with “make install” so I will assume nothing else needs to be done.

    vadmium (Member Author) commented Mar 16, 2015

    Posting LZMAFile-etc.v5.patch with the following changes:

    • Merged with current code
    • Changed BZ2File(buffer_size=...) to a keyword-only parameter and restored the previous unused “buffering” parameter. Also changed the LZMAFile parameter to keyword-only for consistency.
    • Dropped documenting how buffer_size affects the compressed data chunk size or the fast-forward chunk size for seeking
    • Dropped backslash from \* in function signatures
    • Documented and tested that buffer_size=0 is not valid for the LZMAFile etc classes
    • Clarified a few code comments

    There is still Nikolaus’s concern about setting the buffer size to zero and doing short reads, discussed on Rietveld. Apart from that, I think this patch addresses the rest of the comments. Let me know if I missed something!

    nikratio (Mannequin) commented Mar 18, 2015

    If you want to add support for buffer_size=0 in a separate patch/issue, I think that's fine. But in that case I would not add a buffer_size parameter now at all. IMO, not having it is better than having it but not supporting zero (even if that limitation is documented, it's pretty surprising).

    wm75 (Mannequin) commented Mar 20, 2015

    The patch has a huge block replacement in the gzip.py module starting at GzipFile.write, which makes it large and hard to identify changes. Could that be ameliorated or is it too much work right now?

    vadmium (Member Author) commented Mar 21, 2015

    Wolfgang: Unfortunately, I think that block of changes is largely unavoidable. The write() method should not have changed in reality, but all the init and read methods around it have been split into two classes, and the diff has decided to compare the old GzipFile methods to the new _DecompressReader methods. If you like, I could try preparing a pseudo-diff file that forces a comparison of old GzipFile to new GzipFile to highlight the changes there.

    Nikolaus: I will try to split my buffer_size parameter changes out and post them in a separate issue that can be enhanced further.

    vadmium (Member Author) commented Mar 21, 2015

    I am posting v6, which basically drops the buffer_size parameters. The change is also pushed as a series of revisions to a Bitbucket repository; see <https://bitbucket.org/vadmium/cpython/branches/compare/compression%0D@>. I thought that might help eliminate the giant blocks of changes in gzip.py, but it did not.

    The previous changes for the buffer_size parameters are in the “buffer_size” bookmark in that repository.

    nikratio (Mannequin) commented Mar 24, 2015

    As discussed in Rietveld, here's an attempt to reuse more of DecompressReader for GzipFile. There is still an unexplained test failure (test_read_truncated).

    vadmium (Member Author) commented Mar 25, 2015

    Wolfgang: If the patch here is too massive, perhaps you might like to review my patch at bpo-23528 first. :) It only fixes the “gzip” module, but is definitely simpler, so might have more chance of making it into 3.4 or 3.5 on time.

    ===

    Changes introduced by patch v7, if anyone’s interested: <https://bitbucket.org/vadmium/cpython/commits/b7c43f7>. A few problems and behaviour changes are introduced:

    • GzipFile(filename=...).close() now relies on the garbage collector to close the file
    • EOFError("Compressed file ended before the end-of-stream marker was reached") repaced with “struct.error” exception in some cases, e.g. GzipFile(fileobj=BytesIO(gzip_data[:3])).read()
    • Premature EOF when stream ends in some cases, e.g. GzipFile(fileobj=BytesIO(empty_gzip + gzip_data)).read()

    However, I think the general idea of inheriting the common reader class and overriding as necessary is sound; the code just got too messed up in the process.
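
    Handling the concatenated-streams case above is what the common reader’s loop has to get right: when one member ends, decompression must restart on the leftover bytes. A standalone sketch of the idea (not the patch’s code), using bz2 for concreteness:

    ```python
    import bz2

    def read_all_members(raw):
        # Restart the decompressor on unused_data to handle concatenated
        # streams; real code also tolerates trailing garbage by catching
        # the decompressor's error on the restart.
        out = b""
        while raw:
            d = bz2.BZ2Decompressor()
            out += d.decompress(raw)
            raw = d.unused_data
        return out
    ```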

    vadmium (Member Author) commented Mar 25, 2015

    Another behaviour change in v7 is that the seekable() return value is now inherited from the underlying file, instead of always being True. I mentioned on Rietveld why this could be a bad idea, but as it is undocumented and untested, I will leave the new behaviour unless someone complains.

    Posting v8, with these changes from v7:

    • Reverted _read_exact() and read() loop changes to fix tests
    • Restored myfileobj attribute and test that it is closed
    • Cleaned up the gzip read() loop so that it always returns data if available, rather than raising EOFError
    • Documented that read(None) is now allowed
    • Other minor changes addressing review comments

    nikratio (Mannequin) commented Mar 26, 2015

    Except for the pointless 'myfileobj' stuff in gzip.py, rev 8 of the patch looks good to me.

    (Btw, I'm not actually in favor of e.g. the seekable() change. The previous patch was intended as a proof of concept, to see what would be necessary to inherit more from DecompressReader and whether it's actually worth it. Having thought about it for a while longer, I don't think there's a significant gain. But I'm happy to see that you were able to cherry-pick some useful pieces out of it nevertheless.)

    pitrou (Member) commented Mar 27, 2015

    Is this still a work in progress, or are you looking for a review?

    nikratio (Mannequin) commented Mar 27, 2015

    I believe Martin's patch (v8) is ready for a core committer review. At least I can't find anything to criticize anymore :-).

    vadmium (Member Author) commented Mar 28, 2015

    Yes my patch should be ready, unless we want to work on factoring more common logic out of the gzip read() method.

    vadmium (Member Author) commented Mar 30, 2015

    Patch v9:

    • Incorporated _PaddedFile.rewind() into seek() for simplicity
    • Restored support for fast-forwarding over non-seekable streams
    • Factored common checks into _check_can_seek()
    • Documented “mtime” attribute and implemented it as a read-only decompression-only property

    Antoine highlighted the fact that BufferedReader.peek() does not guarantee how much data it returns, so I removed the code trying to return at least 100 bytes.

    I revisited merging the gzip read() loop with the others, but gave up again. I tried adding calls to _start_member() and _end_member(), with more “if” statements and flags to account for the different edge cases. But then I remembered the “needs_input” flag of LZMA and bzip versus the “unconsumed_tail” of zlib. Handling this would probably require another delegation to a new _decompress_chunk() method, at which point I think the whole read() method will probably just be a mess of hacks trying to marry the two not-quite-compatible implementations. So I think it is better to override read() with a gzip-specific implementation, unless someone wants to prove me wrong with their own patch :)
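
    The incompatibility can be seen side by side in a standalone comparison (not code from the patch):

    ```python
    import bz2
    import zlib

    payload = b"x" * 100000

    # bz2/lzma style: the object buffers unconsumed input itself; the caller
    # checks needs_input and passes b"" to drain pending output.
    d = bz2.BZ2Decompressor()
    out = d.decompress(bz2.compress(payload), max_length=1024)
    while not d.needs_input and not d.eof:
        out += d.decompress(b"", max_length=1024)

    # zlib style: unconsumed input comes back as unconsumed_tail, and the
    # caller must feed it in again on the next call.
    z = zlib.decompressobj()
    out = z.decompress(zlib.compress(payload), 1024)
    while z.unconsumed_tail:
        out += z.decompress(z.unconsumed_tail, 1024)
    ```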

    python-dev (Mannequin) commented Apr 10, 2015

    New changeset 62723172412c by Antoine Pitrou in branch 'default':
    Issue bpo-23529: Limit the size of decompressed data when reading from
    https://hg.python.org/cpython/rev/62723172412c

    pitrou (Member) commented Apr 10, 2015

    Thank you very much for being so perseverant! The patch is now pushed to the default branch.

    pitrou closed this as completed Apr 10, 2015
    ezio-melotti transferred this issue from another repository Apr 10, 2022