
Author eichin
Recipients
Date 2004-11-27.17:29:30
Content
One of the virtues of the gzip format is that one can reopen a file for
append and the file, as a whole, is still valid: each reopen simply writes
a new gzip header, starting a new member.  gzip.py (as tested on 2.1, 2.3,
and a freshly built 2.4rc1) doesn't deal well with more than a certain
number of appended headers.

The included test case generates (using gzip.py) such a file, runs 
gzip -tv on it to show that it is valid, and then tries to read it with 
gzip.py -- and it blows out, with 

OverflowError: long int too large to convert to int

in earlier releases, and a MemoryError in 2.4rc1.  What's going on is that
gzip.GzipFile.read keeps doubling readsize and calling _read again, but
each call to _read invokes _read_gzip_header and consumes only *one*
header.  Since readsize doubles once per header, older Pythons blow out
when the int fails to autopromote past 2**32, and 2.4 blows out trying to
call file.read with a huge value - but basically, with more than 30 or so
appended headers it fails.

The test case below is based on a real-world queueing case that 
generates over 200 appended headers - and isn't bounded in any 
useful way.  I'll think about ways to make GzipFile more clever, but 
I don't have a patch yet.
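The original attachment isn't reproduced here, but a minimal sketch of the
same shape of file is easy to write: append many members to one gzip file
(each reopen-for-append adds a header), then read it back through gzip.py.
File name and member count below are illustrative, not from the report;
on modern Pythons the read succeeds, which is how the fix eventually landed.

```python
import gzip
import os
import tempfile

# Hypothetical reconstruction of the failure mode (not the author's
# attached test case): append many gzip members to one file.
fname = os.path.join(tempfile.mkdtemp(), "appended.gz")
N_MEMBERS = 40  # past the ~30 headers where old gzip.py blew up

for i in range(N_MEMBERS):
    # Reopening for append writes a fresh gzip header, i.e. a new member.
    with gzip.open(fname, "ab") as f:
        f.write(("chunk %d\n" % i).encode())

# `gzip -t` accepts such a file; gzip.py on 2.1-2.4 failed here because
# GzipFile.read doubled readsize once per member header.  Current Pythons
# read all the members transparently.
with gzip.open(fname, "rb") as f:
    data = f.read()

print(len(data.splitlines()))  # one line per appended member
```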
History
Date                 User   Action  Args
2007-08-23 14:27:51  admin  link    issue1074261 messages
2007-08-23 14:27:51  admin  create