This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: bz2, lzma: add option to limit output size
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, christian.heimes, eric.araujo, martin.panter, nadeem.vawda, nikratio, pitrou, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-09-17 12:22 by christian.heimes, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name                 Uploaded                         Description
issue15955.diff           nikratio, 2014-03-25 04:03
issue15955_r2.diff        nikratio, 2014-04-12 05:04
issue15955_r3.diff        nikratio, 2014-04-12 18:39
gzip-bomb.patch           martin.panter, 2015-01-08 14:38  Moved to Issue 23528
LZMAFile.patch            martin.panter, 2015-01-10 03:31  New version in Issue 23529
issue15955_lzma_r4.diff   martin.panter, 2015-01-11 23:57  Adds max_length support in LZMADecompressor
issue15955_lzma_r5.diff   martin.panter, 2015-01-17 03:58  Adds max_length support in LZMADecompressor
issue15955_bz2.diff       nikratio, 2015-02-21 05:04       Adds max_length support in BZ2Decompressor
issue15955_bz2_rev2.diff  nikratio, 2015-02-22 20:11       Adds max_length support in BZ2Decompressor
issue15955_bz2_rev3.diff  nikratio, 2015-02-23 16:44
issue15955_bz2_rev5.diff  nikratio, 2015-02-24 01:39
issue15955_bz2_rev6.diff  nikratio, 2015-02-24 01:58
issue15955_bz2_rev7.diff  nikratio, 2015-02-24 16:34       Adds max_length support in BZ2Decompressor
Messages (56)
msg170598 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-09-17 12:22
The gzip, bz2 and lzma file readers have no method to get the actual size and compression ratio of a compressed file. In my opinion it's useful to know how much disk space a file will need before it's decompressed.
msg171059 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-09-23 16:46
As far as I can tell, there is no way to find this out reliably without decompressing the entire file. With gzip, the file trailer contains the uncompressed size modulo 2^32, but this seems less than useful. It appears that the other two formats do not store the total uncompressed data size in any form.

For bz2 and lzma, one can get the uncompressed size by doing f.seek(0, 2) followed by f.tell(). However, this approach is ugly and potentially very slow, so I would be reluctant to add a method based on it to the (BZ2|LZMA)File classes.
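
For illustration, that seek-based approach would look like the following sketch ('example.bz2' is a placeholder filename; the seek forces the whole stream to be decompressed just to learn its length):

    import bz2

    with bz2.BZ2File('example.bz2') as f:
        f.seek(0, 2)      # io.SEEK_END; emulated by decompressing everything
        size = f.tell()   # uncompressed size in bytes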
msg171248 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-09-25 11:03
I've checked the implementation of the gzip command. It uses some tricks to get the resulting size of a compressed file. The bzip2 command doesn't have this ability. lzmainfo can print the uncompressed size of a file, but it says "Unknown" for my test file.

Also see #16043 why this would be a useful feature.
msg171257 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-25 12:26
You don't need to know the final size. You just need to be able to pass an output size limit. This would be an easy feature to add to functions like zlib.decompress(), since they already handle sizing of the output buffer.
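
For reference, zlib's decompressor objects already expose such a limit through the max_length parameter and the unconsumed_tail attribute; a minimal sketch (compressed_data is a placeholder for zlib-compressed input):

    import zlib

    d = zlib.decompressobj()
    chunk = d.decompress(compressed_data, 8192)  # returns at most 8192 bytes
    leftover = d.unconsumed_tail                 # input not yet consumed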
msg171304 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-09-25 18:12
Agree with Antoine. This would be a desirable feature.
msg174860 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-11-05 01:25
I agree that being able to limit output size is useful and desirable, but
I'm not keen on copying the max_length/unconsumed_tail approach used by
zlib's decompressor class. It feels awkward to use, and it complicates
the implementation of the existing decompress() method, which is already
unwieldy enough.

As an alternative, I propose a thin wrapper around the underlying C API:

    def decompress_into(self, src, dst, src_start=0, dst_start=0): ...

This would store decompressed data in a caller-provided bytearray, and
return a pair of integers indicating the end points of the consumed and
produced data in the respective buffers.
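
A caller loop under this hypothetical signature might look like the following sketch (decompress_into() was never committed; 'process' stands in for application code):

    dst = bytearray(8192)
    src_pos = 0
    while src_pos < len(src):
        src_pos, dst_end = d.decompress_into(src, dst, src_start=src_pos)
        process(dst[:dst_end])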

The implementation should be extremely simple - it does not need to do
any memory allocation or reference management.

I think it could also be useful for optimizing the implementation of
BZ2File and LZMAFile. I plan to write a prototype and run some benchmarks
some time in the next few weeks.

(Aside: if implemented for zlib, this could also be a nicer (I think)
 solution for the problem raised in issue 5804.)
msg174912 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-05 14:50
unconsumed_tail should be a private, hidden attribute which is automatically prepended to any new input data.  This should not be very complicated.  But benchmarks are needed to show which approach is more efficient.
msg175028 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-11-06 23:29
I suspect that it will be slower than the decompress_into() approach, but
as you say, we need to do benchmarks to see for sure.
msg176811 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-12-02 21:52
I've tried reimplementing LZMAFile in terms of the decompress_into()
method, and it has ended up not being any faster than the existing
implementation. (It is _slightly_ faster for readinto() with a large
buffer size, but all other cases it was either of equal performance or
significantly slower.)

In addition, decompress_into() is more complicated to work with than I
had expected, so I withdraw my objection to the approach based on
max_length/unconsumed_tail.


> unconsumed_tail should be a private, hidden attribute which is automatically prepended to any new input data.

I don't think this is a good idea. In order to have predictable memory
usage, the caller will need to ensure that the current input is fully
decompressed before passing in the next block of compressed data. This
can be done more simply with the interface used by zlib. Compare:

    while not d.eof:
        output = d.decompress(b'', 8192)
        if not output:
            compressed = f.read(8192)
            if not compressed:
                raise ValueError('End-of-stream marker not found')
            output = d.decompress(compressed, 8192)
        # <process output>

with:

    # Using zlib's interface
    while not d.eof:
        compressed = d.unconsumed_tail or f.read(8192)
        if not compressed:
            raise ValueError('End-of-stream marker not found')
        output = d.decompress(compressed, 8192)
        # <process output>


A related, but orthogonal proposal: We might want to make unconsumed_tail
a memoryview (provided the input data is known to be immutable), to avoid
creating an unnecessary copy of the data.
msg176883 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-04 08:49
>     # Using zlib's interface
>     while not d.eof:
>         compressed = d.unconsumed_tail or f.read(8192)
>         if not compressed:
>             raise ValueError('End-of-stream marker not found')
>         output = d.decompress(compressed, 8192)
>         # <process output>

This is not usable with bzip2. Bzip2 uses a large block size, so unconsumed_tail
can be non-empty while decompress() returns b''. With zlib you can possibly
see the same effect on some inputs when reading one byte at a time.

> A related, but orthogonal proposal: We might want to make unconsumed_tail
> a memoryview (provided the input data is known to be immutable), to avoid
> creating an unnecessary copy of the data.

It looks interesting. However, the data has to be copied anyway if the input
data is not a bytes object.
msg176896 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-04 11:31
Actually it should be:

    # Using zlib's interface
    while not d.eof:
        output = d.decompress(d.unconsumed_tail, 8192)
        while not output and not d.eof:
            compressed = f.read(8192)
            if not compressed:
                raise ValueError('End-of-stream marker not found')
            output = d.decompress(d.unconsumed_tail + compressed, 8192)
        # <process output>

Note that you have to use d.unconsumed_tail + compressed as input, and therefore make an unnecessary copy of the data.

Without an explicit unconsumed_tail you can write input data into an internal mutable buffer; this will be more efficient for a large buffer (hundreds of KB) and small input chunks (several KB).
msg177213 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-12-09 13:11
>>     # Using zlib's interface
>>     while not d.eof:
>>         compressed = d.unconsumed_tail or f.read(8192)
>>         if not compressed:
>>             raise ValueError('End-of-stream marker not found')
>>         output = d.decompress(compressed, 8192)
>>         # <process output>
>
> This is not usable with bzip2. Bzip2 uses a large block size, so unconsumed_tail
> can be non-empty while decompress() returns b''. With zlib you can possibly
> see the same effect on some inputs when reading one byte at a time.

I don't see how this is a problem. If (for some strange reason) the
application-specific processing code can't handle empty blocks properly, you can
just stick "if not output: continue" before it.


> Actually it should be:
>
>     # Using zlib's interface
>     while not d.eof:
>         output = d.decompress(d.unconsumed_tail, 8192)
>         while not output and not d.eof:
>             compressed = f.read(8192)
>             if not compressed:
>                 raise ValueError('End-of-stream marker not found')
>             output = d.decompress(d.unconsumed_tail + compressed, 8192)
>         # <process output>
>
> Note that you have to use d.unconsumed_tail + compressed as input, and therefore
> make an unnecessary copy of the data.

Why is this necessary? If unconsumed_tail is b'', then there's no need to
prepend it (and the concatenation would be a no-op anyway). If unconsumed_tail
does contain data, then we don't need to read additional compressed data from
the file until we've finished decompressing the data we already have.


> Without an explicit unconsumed_tail you can write input data into an internal
> mutable buffer; this will be more efficient for a large buffer (hundreds of KB)
> and small input chunks (several KB).

Are you proposing that the decompressor object maintain its own buffer, and
copy the input data into it before passing it to the decompression library?
Doesn't that just duplicate work that the library is already doing for us?
msg177228 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-09 17:20
> you can just stick "if not output: continue" before it.

And then it hangs, because d.unconsumed_tail is not empty and no new data will be
read.

> Why is this necessary? If unconsumed_tail is b'', then there's no need to
> prepend it (and the concatenation would be a no-op anyway). If
> unconsumed_tail does contain data, then we don't need to read additional
> compressed data from the file until we've finished decompressing the data
> we already have.

What if unconsumed_tail is not empty but contains less than is needed to
decompress at least one byte? We need to read more data until unconsumed_tail
grows enough to be decompressed.

> Are you proposing that the decompressor object maintain its own buffer, and
> copy the input data into it before passing it to the decompression library?
> Doesn't that just duplicate work that the library is already doing for us?

unconsumed_tail is such a buffer. When we call the decompressor with a new chunk
of data, we have to allocate a buffer of size len(unconsumed_tail) + len(compressed)
and copy len(unconsumed_tail) bytes from unconsumed_tail and len(compressed) bytes
from the new data. But when you use an internal buffer, you only have to copy the
new data.
msg180271 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2013-01-19 22:24
> What if unconsumed_tail is not empty but contains less than is needed to
> decompress at least one byte? We need to read more data until unconsumed_tail
> grows enough to be decompressed.

This is possible in zlib, but not in bz2. According to the manual [1], it is
perfectly OK to supply one byte at a time.

For xz, I'm not sure whether this problem could occur. I had assumed that it
could not, but I may be mistaken ;-). Unfortunately liblzma has no proper
manual, so I'll have to dig into the implementation to find out, and I haven't
had the time to do this yet.


[As an aside, it would be nice if the documentation for the zlib module
 mentioned this problem. We can't assume that users of the Python module are
 familiar with the C API for zlib...]


[1] http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html
msg187534 - (view) Author: Nikolaus Rath (nikratio) * Date: 2013-04-21 23:04
The lack of output size limiting has security implications as well.

Without being able to limit the size of the uncompressed data returned per call, it is not possible to decompress untrusted lzma or bz2 data without becoming susceptible to a DoS attack, as the attacker can force allocation of gigantic buffers by sending just a tiny amount of compressed data.
msg208522 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-20 04:53
Nadeem, did you have a chance to look at this again, or do you have any partial patch already?

If not, I'd like to try working on a patch.
msg208763 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2014-01-22 09:46
No, I'm afraid I haven't had a chance to do any work on this issue since my last
message.

I would be happy to review a patch for this, but before you start writing one,
we should settle on how the API will look. I'll review the existing discussion
in detail over the weekend and come up with something that avoids the potential
problems raised by Serhiy.
msg209151 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-25 04:56
Is there any reason why unconsumed_tail needs to be exposed?

I would suggest instead introducing a boolean attribute data_ready that indicates that more decompressed data can be provided without additional compressed input.

Example:

# decomp = decompressor object
# infh = compressed input stream
# outfh = decompressed output stream
while not decomp.eof:
    if decomp.data_ready:
        buf = decomp.decompress(max_size=BUFSIZE)
        # or maybe:
        #buf = decomp.decompress(b'', max_size=BUFSIZE)
    else:
        buf = infh.read(BUFSIZE)
        if not buf:
            raise RuntimeError('Unexpected end of compressed stream')
        buf = decomp.decompress(buf, max_size=BUFSIZE)

    assert len(buf) > 0
    outfh.write(buf)

This is short, easily readable (in my opinion) and also avoids the problem where the decompressor blocks because it needs more data even though there still is an unconsumed tail.
msg209228 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-25 18:46
Let me be more precise: 

My suggestion is not to remove `unconsumed_tail` entirely, but I think its value needs to be defined only when the end of the compressed stream has been reached.

In other words, you could still do:

while not decomp.eof:
  # ...
if decomp.unconsumed_tail:
    raise RuntimeError('unexpected data after end of compressed stream')

but as long as decomp.eof is not True, decomp.unconsumed_tail could (as far as I can tell) be None, regardless of whether there is unconsumed data in the internal buffer or not.
msg209999 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2014-02-02 16:15
After some consideration, I've come to agree with Serhiy that it would be better
to keep a private internal buffer, rather than having the user manage unconsumed
input data. I'm also in favor of having a flag to indicate whether the
decompressor needs more input to produce more decompressed data. (I'd prefer to
call it 'needs_input' or similar, though - 'data_ready' feels too vague to me.)

In msg176883 and msg177228, Serhiy raises the possibility that the decompressor
might be unable to produce decompressed output from a given piece of (non-empty)
input, but will still leave the input unconsumed. I do not think that this can
actually happen (based on the libraries' documentation), but this API will work
even if that situation can occur.

So, to summarize, the API will look like this:

    class LZMADecompressor:

        ...

        def decompress(self, data, max_length=-1):
            """Decompresses *data*, returning uncompressed data as bytes.

            If *max_length* is nonnegative, returns at most *max_length* bytes
            of decompressed data. If this limit is reached and further output
            can be produced, *self.needs_input* will be set to False. In this
            case, the next call to *decompress()* should provide *data* as b''
            to obtain more of the output.

            If all of the input data was decompressed and returned (either
            because this was less than *max_length* bytes, or because
            *max_length* was negative), *self.needs_input* will be set to True.
            """
            ...

Data not consumed due to the use of 'max_length' should be saved in an internal
buffer (that is not exposed to Python code at all), which is then prepended to
any data provided in the next call to decompress() before providing the data to
the underlying compression library. The cases where either the internal buffer
or the new data are empty should be optimized to avoid unnecessary allocations
or copies, since these will be the most common cases.

Note that this API does not need a Python-level 'unconsumed_tail' attribute -
its role is served by the internal buffer (which is private to the C module
implementation). This is not to be confused with the already-existing
'unused_data' attribute that stores data found after the end of the compressed
stream. 'unused_data' should continue to work as before, regardless of whether
decompress() is called with a max_length argument or not.
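
To make the intended usage concrete, a consumer loop under this API could look like the following sketch (f stands for an assumed file object yielding the compressed input):

    d = LZMADecompressor()
    while not d.eof:
        if d.needs_input:
            data = f.read(8192)
            if not data:
                raise ValueError('End-of-stream marker not found')
        else:
            data = b''              # drain buffered output first
        output = d.decompress(data, max_length=8192)
        # <process output>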

As a starting point I would suggest writing a patch for LZMADecompressor first,
since its implementation is a bit simpler than BZ2Decompressor. Once this patch
and an analogous one for BZ2Decompressor have been committed, we can then
convert GzipFile, BZ2File and LZMAFile to use this feature.

If you have any questions while you're working on this issue, feel free to send
them my way.
msg214773 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-03-25 02:09
Thanks Nadeem. I'll get going.
msg214774 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-03-25 04:03
I have attached a patch adding the max_length parameter to LZMADecompressor.decompress(). 

I have chosen to store the pointer to any unconsumed input in the lzma stream object itself. The new convention is: if lzs.next_in != NULL, then there is valid data that needs to be prepended.

Since I expect that max_length will often be specified as a keyword argument, I had to change the Argument Clinic definition to allow all arguments to be specified as keyword arguments. I hope that's OK.

Testcases seem to run fine. No reference leaks with -R : either.

Comments?
msg215211 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2014-03-30 22:45
Thanks for the patch, Nikolaus. I'm afraid I haven't had a chance to look
over it yet; this past week has been a bit crazy for me. I'll definitely
get back to you with a review in the next week, though.
msg215657 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2014-04-06 14:10
I've posted a review at http://bugs.python.org/review/15955/. (For some reason, it looks like Rietveld didn't send out email notifications. But maybe it never sends a notification to the sender? Hmm.)
msg215661 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-04-06 15:59
Yes, Rietveld never sends a notification to the sender. I've received a 
notification.
msg215959 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-04-12 05:04
I've attached the second iteration of the patch. I've factored out a new function decompress_buf, and added two new (C only) instance variables input_buffer and input_buffer_size.

I've tested the patch by enabling Py_USING_MEMORY_DEBUGGER in Objects/obmalloc.c and compiling --with-pydebug --without-pymalloc. Output is:

$ valgrind --tool=memcheck --suppressions=Misc/valgrind-python.supp ./python -m test -R : -v test_lzma
==18635== Memcheck, a memory error detector
==18635== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==18635== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==18635== Command: ./python -m test -R : -v test_lzma
==18635== 
== CPython 3.5.0a0 (default:a8f3ca72f703+, Apr 11 2014, 21:48:07) [GCC 4.8.2]
[....]
Ran 103 tests in 43.655s

OK
beginning 9 repetitions
123456789
[...]
OK
.
1 test OK.
==18635== 
==18635== HEAP SUMMARY:
==18635==     in use at exit: 2,328,705 bytes in 13,625 blocks
==18635==   total heap usage: 2,489,623 allocs, 2,475,998 frees, 91,305,463,593 bytes allocated
==18635== 
==18635== LEAK SUMMARY:
==18635==    definitely lost: 0 bytes in 0 blocks
==18635==    indirectly lost: 0 bytes in 0 blocks
==18635==      possibly lost: 963,533 bytes in 1,334 blocks
==18635==    still reachable: 1,365,172 bytes in 12,291 blocks
==18635==         suppressed: 0 bytes in 0 blocks
==18635== Rerun with --leak-check=full to see details of leaked memory
==18635== 
==18635== For counts of detected and suppressed errors, rerun with: -v
==18635== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)


When running the tests only once (no -R option), the number of possibly lost bytes is 544,068 in 1,131 blocks. When running without the patch, the number is 545,740 bytes in 1,134 blocks (for one iteration). I guess this means that Python is using interior pointers, so these blocks are not truly lost?
msg220580 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-06-14 21:32
Nadeem, did you get a chance to look at this?
msg220675 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2014-06-15 21:30
Sorry, I just haven't had any free time lately, and may still not be able
to give this the attention it deserves for another couple of weeks.

Serhiy, would you be interested in reviewing Nikolaus' patch?
msg225219 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-08-12 04:38
*ping*
msg226685 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-09-10 07:11
If people are worried about the best low-level decompressor API, maybe leave that as a future enhancement and just rely on using the existing file reader APIs. I would expect them to have a sensible decompressed buffer size limit; however, bzip2 and LZMA look susceptible to zip bombing:

>>> GzipFile(fileobj=gzip_bomb).read(1)
b'\x00'
>>> BZ2File(bzip_bomb).read(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/bz2.py", line 293, in read
    return self._read_block(size)
  File "/usr/lib/python3.4/bz2.py", line 254, in _read_block
    while n > 0 and self._fill_buffer():
  File "/usr/lib/python3.4/bz2.py", line 218, in _fill_buffer
    self._buffer = self._decompressor.decompress(rawblock)
MemoryError
>>> z = LZMAFile(lzma_bomb)
>>> z.read(1)
b'\x00'  # Slight delay before returning
>>> len(z._buffer)
55675075  # Decompressed much more data than I asked for
msg232953 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-12-20 00:36
*ping*. It's been another 8 months. It would be nice if someone could review the patch.
msg233661 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-08 14:38
It turns out that GzipFile.read(<size>) etc. is also susceptible to decompression bombing. Here is a patch to test and fix that, making use of the existing “max_length” parameter in the “zlib” module.
msg233799 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-10 03:31
Here is a patch for the higher-level LZMAFile implementation to use Nikolaus’s “max_length” parameter. It depends on Nikolaus’s patch also being applied.

I split out a _RawReader class that does the actual decompress() calls, and then wrapped that in a BufferedReader (roughly sketched after the list below). This avoids needing any special code to implement buffering, readline(), etc. The only significant changes in the API that I can see are:

* LZMAFile now inherits the useless specification of BufferedReader.peek(), losing the guarantee of returning at least a single byte. I questioned the BufferedReader specification at <https://bugs.python.org/issue5811#msg233750>.
* read() now accepts size=None, because BufferedReader does. I had to change a test case for this.
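
A rough sketch of that wrapping pattern, under assumed names (_RawReader here is illustrative, not the exact code in the patch):

    import io
    import lzma

    class _RawReader(io.RawIOBase):
        """Feed decompressed bytes to BufferedReader on demand."""

        def __init__(self, fp, decompressor):
            self._fp = fp
            self._decompressor = decompressor

        def readable(self):
            return True

        def readinto(self, b):
            d = self._decompressor
            while not d.eof:
                if d.needs_input:
                    chunk = self._fp.read(8192)
                    if not chunk:
                        raise EOFError('Compressed stream ended unexpectedly')
                else:
                    chunk = b''
                data = d.decompress(chunk, max_length=len(b))
                if data:
                    b[:len(data)] = data
                    return len(data)
            return 0  # signals EOF to BufferedReader

    # BufferedReader then provides read(), readline(), peek(), ... for free:
    # reader = io.BufferedReader(_RawReader(fp, lzma.LZMADecompressor()))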
msg233865 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-11 23:57
We still need a patch for max_length in BZ2Decompressor, and to use it in BZ2File. Also, I think my GzipFile patch should be considered as a bug fix (rather than enhancement), e.g. the fix for Issue 16043 assumes GzipFile.read(<size>) is limited.

I’m adding v4 of Nikolaus’s low-level LZMA patch. I merged v3 with the recent default branch and fixed the conflicts, since I couldn’t tell what revision it was originally based on. I also made some fixes based on my review comments.
msg233948 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-01-13 16:30
Martin Panter <report@bugs.python.org> writes:
> We still need a patch for max_length in BZ2Decompressor, and to use it
> in BZ2File.

I'm still quite interested to do this. The only reason I haven't done it
yet is that I'm waiting for the LZMA patch to be reviewed and committed
first (no need to make any mistakes twice).

Best,
-Nikolaus
msg234115 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-01-16 08:18
I've posted a review of the latest lzma decompressor patch. I think it'll be able to go in once the comments are addressed.
msg234160 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-17 03:58
Adding issue15955_lzma_r5.diff. Main changes from r4:

* Consistent Py_ssize_t type for data_size
* max_size → max_length to match Python parameter name
* Arranged for EOF handling to occur before, and instead of, saving the input buffer
* Removed my LZMAFile test

Nikolaus, I hope I am not getting in your way by updating this patch. I thought I should take responsibility for removing the test I accidentally added in r4.
msg234172 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-01-17 15:22
New changeset 00e552a23bcc by Antoine Pitrou in branch 'default':
Issue #15955: Add an option to limit output size when decompressing LZMA data.
https://hg.python.org/cpython/rev/00e552a23bcc
msg234173 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-01-17 15:23
I've committed the LZMADecompressor patch. Now let's tackle the rest.
msg235084 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-31 07:36
Nikolaus, do you still plan on doing the bzip module? If not, I could have a go when I get a chance. I’m also keen for the GzipFile decompression to be fixed, if anyone wants to review my gzip-bomb.patch.
msg235123 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-01-31 18:33
Yes, I still plan to do it, but I haven't started yet.

That said, I certainly won't be offended if someone else implements the feature either. Please just let me know when you start working on this (I'll do the same), so there's no duplication of efforts.
msg235124 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-01-31 18:37
On a more practical note: 

I believe Nadeem at one point said that the bz2 module is not exactly an example of good stdlib coding, while the lzma module's implementation is quite clean.

Therefore, one idea I had for the bz2 module was to "port" the lzma module to use the bzip2 library (instead of adding a max_length parameter to the existing bz2 module). Just something to consider if you find time to work on this before me.
msg236248 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-20 03:44
I've started to work on the bz2 module. I'll attach a work-in-progress patch if I get stuck or run out of time.
msg236351 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-21 05:04
Attached is a patch for the bz2 module.
msg236356 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-02-21 10:10
The new bzip parameter still needs documenting in the library reference.

However I reviewed the doc string, C code, and tests, and I can’t find anything wrong.
msg236416 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-22 20:11
Updated patch attached.
msg236450 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-23 16:44
Attached rev3 with fixed indentation and explicit cast to avoid compiler warning on 32-bit systems.
msg236463 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-02-23 20:53
Rev3 still seems to have the same weird indentation as rev2. Are you using some sort of diff --ignore-all-space option by accident?
msg236468 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-23 22:22
On Feb 23 2015, Martin Panter <report@bugs.python.org> wrote:
> Rev3 still seems to have the same weird indentation as rev2. Are you
> using some sort of diff --ignore-all-space option by accident?

I used "diff -w" for revision 2, but revision 3 should be good. Can you
give me a line number where you see problems in rev3? 

Best,
-Nikolaus
msg236473 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-24 01:39
One more try before I give up.
msg236474 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-24 01:58
grmblfhargs#@$@#
msg236514 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-24 16:34
Revision 7 of the bz2 patch is attached. 

Apologies to everyone for the flood of revisions, and many thanks to Martin for noticing the problems every time; it boggles my mind that it took me 5 tries to add a simple cast.

Antoine, I think this is ready for commit now.
msg236647 - (view) Author: Nikolaus Rath (nikratio) * Date: 2015-02-26 05:35
Martin, I'll try to review your GzipFile patch. But maybe it would make sense to open a separate issue for this?

I think the LZMAFile patch has not yet been reviewed or committed, and we probably want a patch for BZ2File too. The review page is already pretty cluttered right now, so I think it would make sense to use this issue for the low-level compressor/decompressor API and handle the *File API separately.
msg236664 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-02-26 11:12
Yes I think that is a good idea:

* Moved gzip-bomb.patch to Issue 23528
* Opened Issue 23529 with a new version of the LZMAFile patch, and plan to post a BZ2File patch there as well when it is ready
msg236665 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-02-26 12:09
New changeset 10dab1b596d4 by Antoine Pitrou in branch 'default':
Issue #15955: Add an option to limit the output size in bz2.decompress().
https://hg.python.org/cpython/rev/10dab1b596d4
msg236666 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-02-26 12:12
Thank you, Nikolaus, and thank you Martin for the reviews.
History
Date User Action Args
2022-04-11 14:57:36 admin set github: 60159
2015-02-26 12:12:10 pitrou set status: open -> closed
title: gzip, bz2, lzma: add option to limit output size -> bz2, lzma: add option to limit output size
messages: + msg236666

resolution: fixed
stage: patch review -> resolved
2015-02-26 12:09:58 python-dev set messages: + msg236665
2015-02-26 11:12:33 martin.panter set messages: + msg236664
2015-02-26 05:35:49 nikratio set messages: + msg236647
2015-02-24 16:34:52 nikratio set files: + issue15955_bz2_rev7.diff

messages: + msg236514
2015-02-24 01:58:25 nikratio set files: + issue15955_bz2_rev6.diff

messages: + msg236474
2015-02-24 01:39:55 nikratio set files: + issue15955_bz2_rev5.diff

messages: + msg236473
2015-02-23 22:22:45 nikratio set messages: + msg236468
2015-02-23 20:53:06 martin.panter set messages: + msg236463
2015-02-23 16:44:15 nikratio set files: + issue15955_bz2_rev3.diff

messages: + msg236450
2015-02-22 20:11:56 nikratio set files: + issue15955_bz2_rev2.diff

messages: + msg236416
2015-02-21 10:10:13 martin.panter set messages: + msg236356
2015-02-21 05:04:05 nikratio set files: + issue15955_bz2.diff

messages: + msg236351
2015-02-20 03:44:30 nikratio set messages: + msg236248
2015-01-31 18:37:26 nikratio set messages: + msg235124
2015-01-31 18:33:37 nikratio set messages: + msg235123
2015-01-31 07:36:45 martin.panter set messages: + msg235084
2015-01-17 15:23:09 pitrou set messages: + msg234173
2015-01-17 15:22:31 python-dev set nosy: + python-dev
messages: + msg234172
2015-01-17 03:58:39 martin.panter set files: + issue15955_lzma_r5.diff

messages: + msg234160
2015-01-16 08:18:42 pitrou set messages: + msg234115
2015-01-13 16:30:52 nikratio set messages: + msg233948
2015-01-11 23:57:38 martin.panter set files: + issue15955_lzma_r4.diff

messages: + msg233865
2015-01-11 23:24:02 pitrou set stage: needs patch -> patch review
2015-01-10 03:32:01 martin.panter set files: + LZMAFile.patch

messages: + msg233799
2015-01-08 14:38:37 martin.panter set files: + gzip-bomb.patch

messages: + msg233661
2014-12-20 00:36:30 nikratio set messages: + msg232953
2014-09-10 07:11:22 martin.panter set messages: + msg226685
2014-08-12 04:38:33 nikratio set messages: + msg225219
2014-06-16 21:30:27 vstinner set nosy: + vstinner
2014-06-15 21:30:52 nadeem.vawda set messages: + msg220675
2014-06-14 21:32:02 nikratio set messages: + msg220580
2014-04-12 18:39:36 nikratio set files: + issue15955_r3.diff
2014-04-12 05:04:21 nikratio set files: + issue15955_r2.diff

messages: + msg215959
2014-04-06 15:59:13 serhiy.storchaka set messages: + msg215661
2014-04-06 14:10:57 nadeem.vawda set messages: + msg215657
2014-03-30 22:45:59 nadeem.vawda set messages: + msg215211
2014-03-25 04:03:56 nikratio set files: + issue15955.diff
keywords: + patch
messages: + msg214774
2014-03-25 02:09:00 nikratio set messages: + msg214773
2014-02-02 16:15:48 nadeem.vawda set messages: + msg209999
2014-01-25 18:46:13 nikratio set messages: + msg209228
2014-01-25 04:56:53 nikratio set messages: + msg209151
2014-01-22 09:46:47 nadeem.vawda set messages: + msg208763
versions: + Python 3.5, - Python 3.4
2014-01-20 04:53:49 nikratio set messages: + msg208522
2013-06-17 06:27:57 martin.panter set nosy: + martin.panter
2013-04-21 23:04:35 nikratio set nosy: + nikratio
messages: + msg187534
2013-04-21 22:09:34 serhiy.storchaka link issue17813 superseder
2013-01-20 15:14:03 christian.heimes unlink issue16043 dependencies
2013-01-19 22:24:11 nadeem.vawda set messages: + msg180271
2012-12-09 17:20:48 serhiy.storchaka set messages: + msg177228
2012-12-09 13:11:55 nadeem.vawda set messages: + msg177213
2012-12-04 11:31:31 serhiy.storchaka set messages: + msg176896
2012-12-04 08:49:43 serhiy.storchaka set messages: + msg176883
2012-12-03 07:57:32 Arfrever set nosy: + Arfrever
2012-12-02 21:52:49 nadeem.vawda set messages: + msg176811
2012-11-06 23:29:10 nadeem.vawda set messages: + msg175028
2012-11-05 14:50:34 serhiy.storchaka set messages: + msg174912
2012-11-05 01:25:16 nadeem.vawda set messages: + msg174860
stage: needs patch
2012-09-25 18:12:48 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg171304
2012-09-25 12:26:58 pitrou set nosy: + pitrou

messages: + msg171257
title: gzip, bz2, lzma: add method to get decompressed size -> gzip, bz2, lzma: add option to limit output size
2012-09-25 11:03:15 christian.heimes set messages: + msg171248
2012-09-25 10:58:41 christian.heimes link issue16043 dependencies
2012-09-23 16:46:34 nadeem.vawda set messages: + msg171059
2012-09-18 15:59:21 eric.araujo set nosy: + eric.araujo
2012-09-17 12:22:42 christian.heimes create