classification
Title: gzip.GzipFile to accept stream as fileobj.
Type: Stage:
Components: Extension Modules Versions: Python 2.4
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: antialize, belyi, georg.brandl, loewis, lucas_malor
Priority: normal Keywords: patch

Created on 2004-03-11 18:45 by belyi, last changed 2007-03-15 15:16 by lucas_malor. This issue is now closed.

Files
File name Uploaded Description Edit
gzip-stream.patch belyi, 2004-03-11 20:04
Messages (8)
msg45495 - (view) Author: Igor Belyi (belyi) Date: 2004-03-11 18:45
When gzip.GzipFile is initialized with a fileobj that does not have
tell() and seek() methods (a non-rewinding stream), it throws an
exception. The interesting thing is that it doesn't have to. The
following patch updates gzip.py to allow any stream with just a
read() method to be used. This is helpful if you want to be able to
do something like:
gzip.GzipFile(fileobj=urllib.urlopen("file:///README.gz")).readlines()
or use GzipFile with the sys.stdin stream.

But keep in mind that the seek() and rewind() methods of GzipFile
won't work for such a stream, even with the patch.

Igor
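
A minimal sketch of the use case above, written for Python 3 rather than the Python 2.4 this issue targets. The gzip documentation states that support for unseekable files was added in Python 3.2, so the sketch relies on that behaviour and not on this patch. ReadOnlyStream is a hypothetical stand-in for a urllib response or sys.stdin; it is not part of any library.

```python
import gzip
import io


class ReadOnlyStream:
    """Hypothetical non-rewinding stream: it offers read() and nothing else."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


payload = b"first line\nsecond line\n"
stream = ReadOnlyStream(gzip.compress(payload))

# The stream has no tell() or seek(), yet it can be decompressed.
with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
    lines = gz.readlines()
```

As the message says, seeking on the resulting GzipFile should not be expected to work when the underlying stream cannot seek.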
msg45496 - (view) Author: Igor Belyi (belyi) Date: 2004-03-11 20:04
The previous revision of the patch does not work correctly with
multiple compressed members in one stream. I've updated the patch file.
msg45497 - (view) Author: Igor Belyi (belyi) Date: 2004-03-19 04:27
I thought I should add a somewhat more verbose explanation of
the changes...

The current implementation of GzipFile() uses tell() and seek()
to move around in the stream of data in the following 2 cases:
1. When EOF is reached and the last 8 bytes of the file
contain the checksum and the uncompressed data size.
2. When some 'unused_data' is left after decompression,
meaning that the stream may contain more than one compressed
item.

My change introduces 2 helper buffers:
'inputbuf', which keeps data read from the stream but not yet used, and
'last8', which keeps the last 8 'used' bytes.

In addition, my change introduces a helper method, _read_internal(),
which is used instead of calling self.fileobj.read() directly.
In this method, data is read from the stream as needed with a
call to self.fileobj.read(), and the correct values of
'inputbuf' and 'last8' are maintained.

When case 1 above happens, we use the 'last8' buffer to read the
checksum and size.
When case 2 above happens, we add the value of 'unused_data'
to inputbuf.

There's one more self.fileobj.seek() call left, in the rewind()
method, but it is used only when the rewind() or seek() methods
of the GzipFile class are called. And it would not be
logical to expect those methods to work if the underlying
fileobj does not support them.

Igor
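
The two-buffer idea described above can be sketched roughly as follows. This is an illustration in Python 3, not the code from gzip-stream.patch: PushbackReader, its unread() method, decompress_members() and ReadOnlyStream are hypothetical names, and only 'inputbuf' and 'last8' are taken from the message. The sketch lets zlib parse the gzip container (wbits=16 + zlib.MAX_WBITS) and then checks that 'last8' holds the CRC32 and size trailer of each member.

```python
import gzip
import io
import struct
import zlib


class ReadOnlyStream:
    """Hypothetical stream that offers read() only."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


class PushbackReader:
    """Hypothetical helper; not the code from gzip-stream.patch."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.inputbuf = b""  # read from the stream but not yet used
        self.last8 = b""     # last 8 bytes counted as used
        self._recent = b""   # previous last8 plus the latest read

    def read(self, size):
        # Touch the underlying stream only when the buffer runs short.
        if len(self.inputbuf) < size:
            self.inputbuf += self.fileobj.read(size - len(self.inputbuf))
        data, self.inputbuf = self.inputbuf[:size], self.inputbuf[size:]
        self._recent = self.last8 + data
        self.last8 = self._recent[-8:]
        return data

    def unread(self, unused):
        # Case 2: put the unused tail of the latest read back into
        # inputbuf instead of seeking backwards, then recompute last8.
        if unused:
            self.inputbuf = unused + self.inputbuf
            self._recent = self._recent[:-len(unused)]
            self.last8 = self._recent[-8:]


def decompress_members(fileobj, chunk_size=16):
    """Return the payload of each gzip member found in fileobj."""
    reader = PushbackReader(fileobj)
    members = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            return members
        decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        out = bytearray()
        while True:
            out += decomp.decompress(chunk)
            if decomp.eof:
                break
            chunk = reader.read(chunk_size)
            if not chunk:
                raise EOFError("stream ended inside a gzip member")
        reader.unread(decomp.unused_data)
        # Case 1: the trailer (CRC32, size) is available in last8.
        crc, size = struct.unpack("<II", reader.last8)
        if crc != zlib.crc32(out) or size != len(out) & 0xFFFFFFFF:
            raise OSError("gzip trailer mismatch")
        members.append(bytes(out))


blob = gzip.compress(b"hello ") + gzip.compress(b"world")
members = decompress_members(ReadOnlyStream(blob))
```

The small chunk_size is chosen only so that the first member ends in the middle of a read, which exercises the push-back path.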
msg45499 - (view) Author: Jakob Truelsen (antialize) Date: 2006-06-19 08:35
Is there any reason this patch has not been accepted? If this patch
is accepted, then I have a patch to urllib2 to (automatically)
accept gzipped content as described here:
http://www.http-compression.com/#client_request. If there is
some reason this patch is not acceptable, please give details so it
can be fixed. I'm tired of using popen and gunzip :)
msg45500 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-03-06 14:51
The patch in this form is incomplete: it lacks test suite changes. Can somebody please provide patches to Lib/test/test_gzip.py that exercise this new functionality?
msg45501 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-03-08 20:59
It looks like Patch #1675951 provides the same feature, plus speedups.
msg45502 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-03-15 15:16
There's a problem with this patch. If my code has previously read some bytes of the GzipFile object, _read_gzip_header raises IOError, 'Not a gzipped file', because it starts to read at the current position, not at the start. Unfortunately, seek() cannot be used for urllib objects. I don't see any possible workaround.
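
One conceivable approach, sketched in Python 3 under the assumption that the situation is this: some bytes were consumed from the underlying stream before it was handed to GzipFile, and the caller still holds those bytes. Whether that matches the case reported here is not certain. PrependedStream and ReadOnlyStream are hypothetical names, not library classes.

```python
import gzip
import io


class ReadOnlyStream:
    """Hypothetical stream that offers read() only."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


class PrependedStream:
    """Hypothetical wrapper: replay already-read bytes, then continue."""

    def __init__(self, head, rest):
        self._head = head  # bytes consumed earlier by the caller
        self._rest = rest  # the stream, positioned after those bytes

    def read(self, size=-1):
        if size is None or size < 0:
            data, self._head = self._head, b""
            return data + self._rest.read()
        data, self._head = self._head[:size], self._head[size:]
        if len(data) < size:
            data += self._rest.read(size - len(data))
        return data


stream = ReadOnlyStream(gzip.compress(b"payload"))
head = stream.read(4)  # the caller peeks at the first bytes

# Decompression starts from the replayed header, so no seek() is needed.
with gzip.GzipFile(fileobj=PrependedStream(head, stream), mode="rb") as gz:
    data = gz.read()
```

This does not help when the consumed bytes have been discarded; in that case the gzip header is gone and cannot be recovered from a stream that does not seek.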
History
Date User Action Args
2011-03-20 21:28:29 ned.deily link issue11608 superseder
2004-03-11 18:45:17 belyi create