bz2, lzma: add option to limit output size #60159
Comments
The gzip, bz2 and lzma file readers have no method to get the actual size and compression ratio of a compressed file. In my opinion it's useful to know how much disk space a file will need before it's decompressed.
As far as I can tell, there is no way to find this out reliably without decompressing the entire file. With gzip, the file trailer contains the uncompressed size modulo 2^32, but this seems less than useful. It appears that the other two formats do not store the total uncompressed data size in any form. For bz2 and lzma, one can get the uncompressed size by doing f.seek(0, 2) followed by f.tell(). However this approach is ugly and potentially very slow, so I would be reluctant to add a method based on it to the (BZ2|LZMA)File classes.
I've checked the implementation of the gzip command. It uses some tricks to get the resulting size of a compressed file. The bzip2 command doesn't have this ability. lzmainfo can print the actual size of the file, but it says "Unknown" for my test file. Also see bpo-16043 for why this would be a useful feature.
You don't need to know the final size. You just need to be able to pass an output size limit. This would be an easy feature to add to functions like zlib.decompress(), since they already handle sizing of the output buffer.
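For reference, zlib's decompressor objects already expose exactly this pattern: Decompress.decompress() accepts a max_length argument, and input that could not be consumed yet lands in unconsumed_tail. A minimal sketch using that real, existing interface (the payload here is just an illustration):

import zlib

compressed = zlib.compress(b'x' * 1000000)   # ~1 MB payload, tiny blob
d = zlib.decompressobj()
data = compressed
while data:
    # At most 8192 bytes of output per call; input that was not consumed
    # yet is kept in d.unconsumed_tail for the next call.
    output = d.decompress(data, 8192)
    # <process output>
    data = d.unconsumed_tail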
Agree with Antoine. This would be a desirable feature.
I agree that being able to limit output size is useful and desirable. As an alternative, I propose a thin wrapper around the underlying C API:

def decompress_into(self, src, dst, src_start=0, dst_start=0): ...

This would store decompressed data in a caller-provided bytearray. The implementation should be extremely simple - it does not need to do any buffer management. I think it could also be useful for optimizing the implementation of the file reader classes. (Aside: if implemented for zlib, this could also be a nicer (I think) alternative to the existing interface.)
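For illustration, a read loop over this hypothetical decompress_into() might look as follows. The method was never implemented, and this sketch deliberately glosses over exactly the question (partially consumed input) that the rest of the thread goes on to debate:

import lzma

buf = bytearray(8192)                     # caller-provided output buffer
d = lzma.LZMADecompressor()
with open('data.xz', 'rb') as f:          # placeholder file name
    while not d.eof:
        src = f.read(8192)
        # Hypothetical API: returns the number of bytes written into buf.
        n = d.decompress_into(src, buf)
        # <process buf[:n]>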
unconsumed_tail should be a private, hidden attribute that automatically prepends any unconsumed data to the next input. This should not be very complicated, but benchmarks are needed to show which approach is more efficient.
I suspect that it will be slower than the decompress_into() approach, but benchmarks would have to confirm that.
I've tried reimplementing LZMAFile in terms of the decompress_into() method. In addition, decompress_into() is more complicated to work with than I had expected.
I don't think this is a good idea. In order to have predictable memory usage, compare:

while not d.eof:
    output = d.decompress(b'', 8192)
    if not output:
        compressed = f.read(8192)
        if not compressed:
            raise ValueError('End-of-stream marker not found')
        output = d.decompress(compressed, 8192)
    # <process output>

with:

# Using zlib's interface
while not d.eof:
    compressed = d.unconsumed_tail or f.read(8192)
    if not compressed:
        raise ValueError('End-of-stream marker not found')
    output = d.decompress(compressed, 8192)
    # <process output>

A related, but orthogonal proposal: we might want to rethink how unconsumed_tail is exposed.
This is not usable with bzip2. Bzip2 uses a large block size, and unconsumed_tail could end up holding a large amount of input data.
It looks interesting. However, the data would have to be copied anyway if the input is not consumed in a single call.
Actually it should be:

# Using zlib's interface
while not d.eof:
    output = d.decompress(d.unconsumed_tail, 8192)
    while not output and not d.eof:
        compressed = f.read(8192)
        if not compressed:
            raise ValueError('End-of-stream marker not found')
        output = d.decompress(d.unconsumed_tail + compressed, 8192)
    # <process output>

Note that you have to use d.unconsumed_tail + compressed as input, and therefore make an unnecessary copy of the data. Without an explicit unconsumed_tail you can write input data into an internal mutable buffer, which is more efficient for a large buffer (hundreds of KB) and small input chunks (several KB).
I don't see how this is a problem.
Why is this necessary? If unconsumed_tail is b'', then there's no need to prepend it to the new input.
Are you proposing that the decompressor object maintain its own buffer of unconsumed input?
And then hang, because d.unconsumed_tail is not empty and no new data will be read.
What if unconsumed_tail is not empty, but holds less data than is needed to decompress at least one byte of output?
unconsumed_tail is such a buffer, and when we call the decompressor with a new chunk of data the tail has to be prepended to it.
This is possible in zlib, but not in bz2, according to the manual. For xz, I'm not sure whether this problem could occur; I had assumed that it could not. (As an aside, it would be nice if the documentation for the zlib module spelled out this behavior.)
The lack of output size limiting has security implications as well. Without being able to limit the size of the uncompressed data returned per call, it is not possible to decompress untrusted lzma or bz2 data without becoming susceptible to a DoS attack, as the attacker can force allocation of gigantic buffers by sending just a tiny amount of compressed data.
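The asymmetry is easy to demonstrate with the real bz2 module; the sizes below are illustrative:

import bz2

# ~100 MB of zeros compresses down to a tiny blob...
bomb = bz2.compress(b'\x00' * (100 * 1024 * 1024))
print(len(bomb))        # on the order of kilobytes

# ...but without an output limit, a single decompress() call on that
# blob forces allocation of the full ~100 MB at once.
data = bz2.BZ2Decompressor().decompress(bomb)
print(len(data))        # 104857600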
Nadeem, did you have a chance to look at this again, or do you already have a partial patch? If not, I'd like to try working on a patch.
No, I'm afraid I haven't had a chance to do any work on this issue since my last message. I would be happy to review a patch for this, but before you start writing one, we should agree on what the API will look like.
Is there any reason why unconsumed_tail needs to be exposed? I would suggest instead introducing a boolean attribute data_ready that indicates that more decompressed data can be provided without additional compressed input. Example:

# decomp = decompressor object
# infh = compressed input stream
# outfh = decompressed output stream
while not decomp.eof:
    if decomp.data_ready:
        buf = decomp.decompress(max_size=BUFSIZE)
        # or maybe:
        #buf = decomp.decompress(b'', max_size=BUFSIZE)
    else:
        buf = infh.read(BUFSIZE)
        if not buf:
            raise RuntimeError('Unexpected end of compressed stream')
        buf = decomp.decompress(buf, max_size=BUFSIZE)
    outfh.write(buf)

This is short, easily readable (in my opinion) and also avoids the problem where the decompressor blocks because it needs more data even though there still is an unconsumed tail.
Let me be more precise: my suggestion is not to remove unconsumed_tail entirely. In other words, you could still do:

while not decomp.eof:
    # ...
if decomp.unconsumed_tail:
    raise RuntimeError('unexpected data after end of compressed stream')

but until decomp.eof is True, decomp.unconsumed_tail could (as far as I can tell) be None, no matter whether there is uncompressed data in the internal buffer or not.
After some consideration, I've come to agree with Serhiy that it would be better to keep any unconsumed input buffered inside the decompressor object. In msg176883 and msg177228, Serhiy raises the possibility that the decompressor might need more input before it can produce any output. So, to summarize, the API will look like this:

class LZMADecompressor:

    def decompress(self, data, max_length=-1):
        """Decompresses *data*, returning uncompressed data as bytes."""

Data not consumed due to the use of 'max_length' should be saved in an internal buffer. Note that this API does not need a Python-level 'unconsumed_tail' attribute. As a starting point I would suggest writing a patch for LZMADecompressor first; BZ2Decompressor can then follow the same approach. If you have any questions while you're working on this issue, feel free to ask.
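With the API as it eventually shipped in Python 3.5 (decompress(data, max_length=-1) plus a needs_input attribute that the final patches added), a bounded-memory read loop looks roughly like this:

import lzma

d = lzma.LZMADecompressor()
with open('data.xz', 'rb') as f:          # placeholder file name
    while not d.eof:
        if d.needs_input:
            data = f.read(8192)
            if not data:
                raise EOFError('End-of-stream marker not found')
        else:
            # Unconsumed input is buffered inside the decompressor.
            data = b''
        output = d.decompress(data, max_length=8192)
        # <process output>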
Thanks Nadeem. I'll get going.
I have attached a patch adding the max_length parameter to LZMADecompressor.decompress(). I have chosen to store the pointer to any unconsumed input in the lzma stream object itself. The new convention is: if lzs.next_in != NULL, then there is valid data that needs to be prepended. Since I expect that max_length will often be specified as a keyword argument, I had to change the Argument Clinic definition to allow all arguments to be specified as keyword arguments. I hope that's ok. Test cases seem to run fine. No reference leaks with -R : either. Comments?
Thanks for the patch, Nikolaus. I'm afraid I haven't had a chance to look at it yet.
I've posted a review at http://bugs.python.org/review/15955/. (For some reason, it looks like Rietveld didn't send out email notifications. But maybe it never sends a notification to the sender? Hmm.)
Yes, Rietveld never sends a notification to the sender. I've received a notification for your review.
I've attached the second iteration of the patch. I've factored out a new function decompress_buf, and added two new (C only) instance variables input_buffer and input_buffer_size. I've tested the patch by enabling Py_USING_MEMORY_DEBUGGER in Objects/obmalloc.c and compiling --with-pydebug --without-pymalloc. Output is:

$ valgrind --tool=memcheck --suppressions=Misc/valgrind-python.supp ./python -m test -R : -v test_lzma
==18635== Memcheck, a memory error detector
==18635== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==18635== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==18635== Command: ./python -m test -R : -v test_lzma
==18635==
== CPython 3.5.0a0 (default:a8f3ca72f703+, Apr 11 2014, 21:48:07) [GCC 4.8.2]
[....]
Ran 103 tests in 43.655s

OK

When running the tests only once (no -R option), the number of possibly lost bytes is 544,068 in 1,131 blocks. When running without the patch, the number is 545,740 bytes in 1,134 blocks (for one iteration). I guess this means that Python is using interior pointers, so these blocks are not truly lost?
Nadeem, did you get a chance to look at this?
If people are worried about the best low-level decompressor API, maybe leave that as a future enhancement, and just rely on using the existing file reader APIs. I would expect them to have a sensible decompressed buffer size limit; however, “bzip2” and LZMA look susceptible to zip bombing:

>>> GzipFile(fileobj=gzip_bomb).read(1)
b'\x00'
>>> BZ2File(bzip_bomb).read(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/bz2.py", line 293, in read
    return self._read_block(size)
  File "/usr/lib/python3.4/bz2.py", line 254, in _read_block
    while n > 0 and self._fill_buffer():
  File "/usr/lib/python3.4/bz2.py", line 218, in _fill_buffer
    self._buffer = self._decompressor.decompress(rawblock)
MemoryError
>>> z = LZMAFile(lzma_bomb)
>>> z.read(1)
b'\x00'  # Slight delay before returning
>>> len(z._buffer)
55675075  # Decompressed much more data than I asked for
*ping*. It's been another 8 months. It would be nice if someone could review the patch.
It turns out that GzipFile.read(<size>) etc. is also susceptible to decompression bombing. Here is a patch to test and fix that, making use of the existing “max_length” parameter in the “zlib” module.
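For the gzip case, the same mechanism is reachable through zlib alone, since decompressobj(wbits=31) handles the gzip framing; a minimal sketch:

import zlib

d = zlib.decompressobj(wbits=31)          # 31 = expect gzip header/trailer
with open('data.gz', 'rb') as f:          # placeholder file name
    while not d.eof:
        chunk = d.unconsumed_tail or f.read(8192)
        if not chunk:
            raise EOFError('Truncated gzip stream')
        output = d.decompress(chunk, 8192)
        # <process output>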
Here is a patch for the higher-level LZMAFile implementation to use Nikolaus’s “max_length” parameter. It depends on Nikolaus’s patch also being applied. I split out a _RawReader class that does the actual decompress() calls, and then wrapped that in a BufferedReader. This avoids needing any special code to implement buffering, readline(), etc. As far as I can see, only a few significant API changes result.
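The structure Martin describes maps onto the standard io-module pattern: a raw reader whose readinto() decompresses at most len(b) bytes per call, wrapped in io.BufferedReader. A condensed sketch, assuming the low-level max_length patch is applied; the names are illustrative, not the patch's actual code:

import io
import lzma

class _RawReader(io.RawIOBase):
    """Raw reader that yields decompressed bytes with bounded memory use."""

    def __init__(self, fp):
        self._fp = fp
        self._decomp = lzma.LZMADecompressor()

    def readable(self):
        return True

    def readinto(self, b):
        while True:
            if self._decomp.eof:
                return 0                   # signals EOF to BufferedReader
            if self._decomp.needs_input:
                data = self._fp.read(8192)
                if not data:
                    raise EOFError('Compressed stream ended prematurely')
            else:
                data = b''                 # drain the internal buffer
            # Never decompress more than the caller's buffer can hold.
            out = self._decomp.decompress(data, max_length=len(b))
            if out:
                b[:len(out)] = out
                return len(out)

# BufferedReader then provides read(), readline(), peek(), etc. for free:
# reader = io.BufferedReader(_RawReader(open('data.xz', 'rb')))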
We still need a patch for max_length in BZ2Decompressor, and to use it in BZ2File. Also, I think my GzipFile patch should be considered a bug fix (rather than an enhancement), e.g. the fix for bpo-16043 assumes GzipFile.read(<size>) is limited. I’m adding v4 of Nikolaus’s low-level LZMA patch. I merged v3 with the recent default branch and fixed the conflicts, since I couldn’t tell what revision it was originally based on. I also made some fixes based on my review comments.
I'm still quite interested in doing this; the only reason I haven't done it yet is a lack of time.
I've posted a review of the latest lzma decompressor patch. I think it'll be able to go in once the comments are addressed.
Adding issue15955_lzma_r5.diff. The main change from r4 is the removal of the test I accidentally added there. Nikolaus, I hope I am not getting in your way by updating this patch, but I thought I should take responsibility for removing that test.
New changeset 00e552a23bcc by Antoine Pitrou in branch 'default':
I've committed the LZMADecompressor patch. Now let's tackle the rest.
Nikolaus, do you still plan on doing the bzip module? If not, I could have a go when I get a chance. I’m also keen for the GzipFile decompression to be fixed, if anyone wants to review my gzip-bomb.patch.
Yes, I still plan to do it, but I haven't started yet. That said, I certainly won't be offended if someone else implements the feature either. Please just let me know when you start working on this (I'll do the same), so there's no duplication of efforts.
On a more practical note: I believe Nadeem at one point said that the bz2 module is not exactly an example of good stdlib coding, while the lzma module's implementation is quite clean. Therefore, one idea I had for the bz2 module was to "port" the lzma module to use the bzip2 library (instead of adding a max_size parameter to the bz2 module). Just something to consider if you find time to work on this before me.
I've started to work on the bzip module. I'll attach a work-in-progress patch if I get stuck or run out of time.
Attached is a patch for the bz2 module.
The new bz2 parameter still needs documenting in the library reference. However, I reviewed the doc string, C code, and tests, and I can't find anything wrong.
Updated patch attached.
Attached rev3 with fixed indentation and an explicit cast to avoid a compiler warning on 32-bit systems.
Rev3 still seems to have the same weird indentation as rev2. Are you using some sort of diff --ignore-all-space option by accident?
I used "diff -w" for revision 2, but revision 3 should be good. Can you check again?
One more try before I give up.
grmblfhargs#@$@#
Revision 7 of the bz2 patch is attached. Apologies to everyone for the flood of revisions, and many thanks to Martin for noticing the problems every time; it boggles my mind that it took me five tries to add a simple cast. Antoine, I think this is ready for commit now.
Martin, I'll try to review your GzipFile patch, but maybe it would make sense to open a separate issue for it? I think the LZMAFile patch has not yet been reviewed or committed, and we probably want a patch for BZ2File too. The review page is already pretty cluttered, so I think it would make sense to use this issue for the low-level compressor/decompressor API and handle the *File API separately.
New changeset 10dab1b596d4 by Antoine Pitrou in branch 'default':
Thank you, Nikolaus, and thank you, Martin, for the reviews.