
Author martin.panter
Recipients Klamann, martin.panter, xiang.zhang
Date 2016-05-27.02:24:29
This is similar to, but different from, the other bug. The other bug was only about output limits for incrementally decompressed data. Klamann’s bug is about the actual size of the input (and possibly also output) buffers.

The gzip.compress() implementation uses zlib.compressobj.compress(), which does not accept 2 GiB or 4 GiB of input either.

The underlying zlib library uses “unsigned int” for the size of input and output chunks, so it has to be called multiple times to handle 4 GiB or more. In both Python 2 and 3, the one-shot compress() function only makes a single call into zlib, which explains why Python 3 cannot take 4 GiB.

Python 2 uses an “int” for the input buffer size, hence the 2 GiB limit.
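In the meantime, callers can work around the one-shot limit by feeding zlib.compressobj() input in chunks that stay below the unsigned-int limit, so each individual call into zlib only sees a small buffer. A minimal sketch of that idea (not code from the module; the 1 GiB chunk size is an arbitrary choice):

    import zlib

    def compress_large(data, level=9, chunk_size=1 << 30):
        # Feed the data to a compressobj in chunks well below UINT_MAX,
        # so each individual zlib call gets a buffer it can handle.
        co = zlib.compressobj(level)
        view = memoryview(data)
        parts = []
        for start in range(0, len(view), chunk_size):
            parts.append(co.compress(view[start:start + chunk_size]))
        parts.append(co.flush())
        return b"".join(parts)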

I tend to think of these cases as bugs, which could be fixed in 3.5 and 2.7. Sometimes others also treat adding 64-bit support as a bug fix, e.g. file.read() on Python 2 (Issue 21932). But other times it is handled as a new feature for the next Python version, e.g. os.read() was fixed in 3.5 but not 2.7 (Issue 21932), and random.getrandbits() was proposed for 3.6 only (Issue 27072).

This kind of bug is apparently already fixed for crc32() and adler32() in Python 2 and 3; see Issue 10276.

This line from zlib.compress() also worries me:

zst.avail_out = length + length/1000 + 12 + 1; /* unsigned ints */

I suspect it may overflow, but I don’t have enough memory to verify. You would need to compress just under 4 GiB of data that requires 5 MB or more when compressed (i.e. not all the same bytes, or maybe try level=0).
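To illustrate the suspected wraparound (this is only back-of-the-envelope arithmetic, not a verified reproduction): with an input length just under UINT_MAX, the computed value no longer fits in an unsigned int and wraps around to only a few MiB, so compressed output larger than that would not fit.

    UINT_MAX = 2**32 - 1
    length = UINT_MAX - 1000                   # input just under 4 GiB
    needed = length + length // 1000 + 12 + 1  # what the C code computes
    avail_out = needed % 2**32                 # what an unsigned int actually stores
    print(needed - UINT_MAX, avail_out)        # needed exceeds UINT_MAX; avail_out is only ~4 MiB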

Also, the logic for expanding the output buffer in each of zlib.decompress(), compressobj.compress(), decompressobj.decompress(), compressobj.flush(), and decompressobj.flush() looks faulty when the buffer size hits UINT_MAX. I suspect it may overwrite unallocated memory or do other funny stuff, but again I don’t have enough memory to verify. What happens when you decompress more than 4 GiB when the compressed input is less than 4 GiB?
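One way to probe that last question (untested here, since it needs well over 4 GiB of memory) would be to build a small compressed stream that expands past UINT_MAX, roughly:

    import zlib

    # ~5 GiB of zeros compresses to only a few MiB, but decompressing it
    # in one shot forces the output buffer to grow past UINT_MAX.
    co = zlib.compressobj(9)
    compressed = b"".join(co.compress(b"\0" * (1 << 30)) for _ in range(5))
    compressed += co.flush()
    out = zlib.decompress(compressed)          # the interesting call
    assert len(out) == 5 * (1 << 30)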

Code fixes that I think could be made:

1. Avoid the output buffer size overflow in the zlib.compress() function

2. Rewrite zlib.compress() to call deflate() in a loop, one iteration for each 4 GiB input or output chunk (a rough sketch of the chunking pattern follows this list)

3. Allow the zlib.decompress() function to expand the output buffer beyond 4 GiB

4. Rewrite zlib.decompress() to pass 4 GiB input chunks to inflate()

5. Allow the compressobj.compress() method to expand the output buffer beyond 4 GiB

6. Rewrite compressobj.compress() to pass 4 GiB input chunks to deflate()

7. Allow the decompressobj.decompress() method to expand the output buffer beyond 4 GiB

8. Rewrite decompressobj.decompress() to pass 4 GiB input chunks to inflate(), and to save 4 GiB in decompressobj.unconsumed_tail and unused_data

9. Change the two flush() methods to abort if they allocate UINT_MAX bytes, rather than pointing into unallocated memory (I don’t think this could happen in real usage, but the code shares the same problem as above.)
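
Items 2, 4, 6 and 8 all come down to the same chunking pattern: never pass zlib more than UINT_MAX bytes of input in one call, and never ask for more than UINT_MAX bytes of output in one call. As a rough model of that control flow, here is the decompression side written in Python rather than C (the names and the 1 GiB stand-in limits are mine, chosen only to keep the example testable):

    import zlib

    MAX_IN = 1 << 30   # stands in for the UINT_MAX cap on input per inflate() call
    MAX_OUT = 1 << 30  # stands in for the UINT_MAX cap on output per call

    def decompress_chunked(data):
        do = zlib.decompressobj()
        view = memoryview(data)
        pieces = []
        for start in range(0, len(view), MAX_IN):
            chunk = view[start:start + MAX_IN]
            # Bound the output of each call, and keep re-feeding whatever
            # input was left unconsumed (the role of unconsumed_tail).
            pieces.append(do.decompress(chunk, MAX_OUT))
            while do.unconsumed_tail:
                pieces.append(do.decompress(do.unconsumed_tail, MAX_OUT))
        pieces.append(do.flush())
        return b"".join(pieces)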