Message 254717 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	martin.panter
Recipients	Ben Cipollini, Matthew.Brett, martin.panter, nadeem.vawda, twouters
Date	2015-11-16.06:54:35
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1447656880.84.0.182555307113.issue25626@psf.upfronthosting.co.za>
In-reply-to

Content
Ben: By adding the cap, it means when the lower-level zlib module is asked to decode say 5 GiB, it will decode at most 4 GiB (rather than raising an exception). The 4 GiB limit (UINT_MAX to be precise) is due to the underlying zlib library; it’s not Python’s doing, and the limit was present before 3.5 as well. In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it will actually feed the decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data, which will decompress to at most 8 MB or so (max compression ratio is 1:1032). Then it will feed the next 8 KiB chunk. So it does not matter whether the limit is 4 or 5 GiB because there can only be 8ish MB data per internal decompress() call. Before 3.5, it first feeds a 1 KiB chunk, but exponentially increases the chunk size up to 10 MiB in each internal iteration. If you think the new 3.5 implementation (with OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical. Usually I find once you process 10s or 100s of kB of data at once, any increases in the buffer size have negligible effect on the little time spent in native Python code, and times actually get worse as your buffers get too big, probably due to ineffective use of fast memory caching. === I propose this patch for the 3.5 branch, which implements the internal capping in the zlib module. It changes the zlib module to accept sizes greater than UINT_MAX, but internally limits them. I think this is the most practical way forward, because it does not require Python code to know the value of UINT_MAX (which could in theory vary by platform). I would be good to get a review of my patch. In particular, is this change okay for the 3.5 maintainence branch? It would also be good to confirm that my patch fixes the original problem, i.e. read(2**32) works on computers with more than 4 GiB of memory (which I cannot test). This regression is caused by Issue 23529, where I make sure that GzipFile.read(size) doesn’t buffer huge amounts of data from a decompression bomb.

Ben: By adding the cap, it means when the lower-level zlib module is asked to decode say 5 GiB, it will decode at most 4 GiB (rather than raising an exception). The 4 GiB limit (UINT_MAX to be precise) is due to the underlying zlib library; it’s not Python’s doing, and the limit was present before 3.5 as well.

In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it will actually feed the decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data, which will decompress to at most 8 MB or so (max compression ratio is 1:1032). Then it will feed the next 8 KiB chunk. So it does not matter whether the limit is 4 or 5 GiB because there can only be 8ish MB data per internal decompress() call.

Before 3.5, it first feeds a 1 KiB chunk, but exponentially increases the chunk size up to 10 MiB in each internal iteration.

If you think the new 3.5 implementation (with OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical. Usually I find once you process 10s or 100s of kB of data at once, any increases in the buffer size have negligible effect on the little time spent in native Python code, and times actually get worse as your buffers get too big, probably due to ineffective use of fast memory caching.

===

I propose this patch for the 3.5 branch, which implements the internal capping in the zlib module. It changes the zlib module to accept sizes greater than UINT_MAX, but internally limits them. I think this is the most practical way forward, because it does not require Python code to know the value of UINT_MAX (which could in theory vary by platform).

I would be good to get a review of my patch. In particular, is this change okay for the 3.5 maintainence branch? It would also be good to confirm that my patch fixes the original problem, i.e. read(2**32) works on computers with more than 4 GiB of memory (which I cannot test).

This regression is caused by Issue 23529, where I make sure that GzipFile.read(size) doesn’t buffer huge amounts of data from a decompression bomb.

History
Date	User	Action	Args
2015-11-16 06:54:42	martin.panter	set	recipients: + martin.panter, twouters, nadeem.vawda, Matthew.Brett, Ben Cipollini
2015-11-16 06:54:40	martin.panter	set	messageid: <1447656880.84.0.182555307113.issue25626@psf.upfronthosting.co.za>
2015-11-16 06:54:40	martin.panter	link	issue25626 messages
2015-11-16 06:54:39	martin.panter	create