classification
Title: Make gzip module not require that underlying file object support seek
Type: behavior Stage:
Components: Versions: Python 3.2
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: doko, kraai, msuchy@redhat.com, r.david.murray, terry.reedy
Priority: normal Keywords:

Created on 2010-08-23 08:06 by doko, last changed 2010-11-25 13:16 by msuchy@redhat.com. This issue is now closed.

Files
File name Uploaded Description
gzipstream.py msuchy@redhat.com, 2010-11-25 13:16 gzipstream module
Messages (6)
msg114728 - Author: Matthias Klose (doko) * (Python committer) Date: 2010-08-23 08:06
[ forwarded from http://bugs.debian.org/571317 ]


"I'm writing a program that uses the popularity contest results.  Since
downloading the compressed results takes about a quarter of the time
it takes to download the uncompressed results, I'd like to use the
following construct to iterate over the results:

 for line in gzip.GzipFile(fileobj=urllib.request.urlopen('http://popcon.debian.org/by_vote.gz')):

Unfortunately, this fails with the following exception:

 Traceback (most recent call last):
   File "/home/kraai/bin/rc-bugs", line 76, in <module>
     main()
   File "/home/kraai/bin/rc-bugs", line 56, in main
     for line in gzip.GzipFile(fileobj=urllib.request.urlopen('http://popcon.debian.org/by_vote.gz')): 
   File "/usr/lib/python3.1/gzip.py", line 469, in __next__
     line = self.readline()
   File "/usr/lib/python3.1/gzip.py", line 424, in readline
     c = self.read(readsize)
   File "/usr/lib/python3.1/gzip.py", line 249, in read
     self._read(readsize)
   File "/usr/lib/python3.1/gzip.py", line 277, in _read
     pos = self.fileobj.tell()   # Save current position
 io.UnsupportedOperation: seek

I wish that the gzip module didn't require the underlying file object
to support seek so that this construct would work."
msg115122 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-27 19:39
I do not think your wish is sensibly possible. GzipFile wraps an object that is or simulates a file. This is necessary because GzipFile "simulates most of the methods of a file object, with the exception of the readinto() and truncate() methods." Note that seek, rewind, and tell are not excluded. For instance, I see no way that this:
    def rewind(self):
        '''Return the uncompressed stream file position indicator to the
        beginning of the file'''
        if self.mode != READ:
            raise IOError("Can't rewind in write mode")
        self.fileobj.seek(0)
        ...
could be implemented without seek, and without having a complete local copy of everything read.

urllib.request.urlopen returns a 'file-like' object that does not quite fully simulate a file. The downstream OP should save the gzip file locally using urlretrieve() and *then* open and iterate through it. Feel free to forward this suggestion to the OP.
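The download-then-open workaround suggested above could look roughly like the following. This is a minimal sketch, not code from the issue: the helper name `iter_remote_gzip_lines` is hypothetical, and it assumes the compressed file fits comfortably on local disk. Note that current Python documentation lists `urlretrieve()` under its legacy interface.

```python
import gzip
import os
import tempfile
import urllib.request


def iter_remote_gzip_lines(url):
    """Download a gzip file to a temporary local file, then yield its lines.

    Hypothetical helper for illustration; lines are yielded as bytes.
    """
    # Reserve a temporary path; urlretrieve() stores the still-compressed bytes there.
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        urllib.request.urlretrieve(url, path)
        # A regular local file supports seek() and tell(), so gzip can read it.
        with gzip.open(path, "rb") as f:
            for line in f:
                yield line
    finally:
        # Remove the temporary copy once iteration ends or fails.
        os.remove(path)
```

With this helper, the construct from the original report would become `for line in iter_remote_gzip_lines('http://popcon.debian.org/by_vote.gz'):`.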
msg115393 - Author: Matt Kraai (kraai) Date: 2010-09-02 16:13
I don't know the gzip format well enough, but I was hoping that it would be possible to iterate through the lines of a gzip-compressed stream without having to use any of the functions that would require seeking.
msg115745 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-07 10:12
Matt: if you want to learn the file format and propose a patch, I think it would be OK for gzip to duck-type the file object and only raise an error when a seek is explicitly requested.  After all, that's the way real file objects work.  A quick glance at the code, though, indicates this isn't a trivial refactoring.  I think it should be possible in theory since one can pipe a gzipped file into gunzip, and I don't think it buffers the whole file to unzip it...but I don't know for sure. Another issue is that if the patch substantially changes the memory/performance footprint it might get rejected on that basis.

If you (or anyone else) wants to work on a patch let me know and I'll reopen the issue.
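As an illustration of why seek-free reading should be possible in principle, here is a sketch that decompresses a gzip stream incrementally with `zlib.decompressobj`, calling only `read()` on the underlying object. It is not a patch to the gzip module: the function name `iter_gzip_lines` and the `chunk_size` parameter are hypothetical, and the sketch assumes a single-member gzip stream (concatenated members are not handled).

```python
import zlib


def iter_gzip_lines(fileobj, chunk_size=64 * 1024):
    """Yield decompressed lines (bytes) from a gzip stream using only read().

    Hypothetical helper for illustration; assumes one gzip member.
    """
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        pending += decomp.decompress(chunk)
        # Emit every complete line; keep the unfinished tail for the next chunk.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line + b"\n"
    # Flush whatever the decompressor still holds and emit the remainder.
    *lines, tail = (pending + decomp.flush()).split(b"\n")
    for line in lines:
        yield line + b"\n"
    if tail:
        yield tail
```

Only the current chunk and the current unfinished line are held in memory, so the whole file is never buffered. Neither `seek()` nor `tell()` is called on `fileobj`.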
msg115776 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-09-07 15:24
It is possible that only a fixed-size buffer is needed. If so, use of an alternate read mechanism could be conditioned on the underlying file(like) object not having seek.

It is also possible to direct a stream to a temporary file, but I think having the user do so explicitly is better, so that there are no surprises and the user has a file reference for any further work.

Or there could be a context manager class for creating temp files from streams (or URLs specifically) and deleting them when done. One could then write

with TempStreamFile(urlopen('xxx')) as f:
    for line in GzipFile(fileobj=f):
        ...
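One way such a context manager might be sketched is shown below. No `TempStreamFile` class exists in the standard library; the function `temp_stream_file` here is a hypothetical stand-in built from `tempfile.TemporaryFile` and `shutil.copyfileobj`, and it assumes the stream object provides a `read(size)` method.

```python
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def temp_stream_file(stream):
    """Copy a non-seekable stream into a temporary file and yield that file.

    Hypothetical helper for illustration; the temporary file is deleted
    automatically when the with-block exits.
    """
    with tempfile.TemporaryFile() as tmp:
        # Spool the whole stream to disk in fixed-size chunks.
        shutil.copyfileobj(stream, tmp)
        # Rewind so the consumer starts reading from the beginning.
        tmp.seek(0)
        yield tmp
```

The yielded file is an ordinary seekable binary file, so it can be passed as `fileobj` to `gzip.GzipFile` inside the `with` block, in the same shape as the usage proposed above.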
msg122360 - Author: Miroslav Suchý (msuchy@redhat.com) Date: 2010-11-25 13:16
I'm proposing a GzipStream class which inherits from gzip.GzipFile and handles streaming gzipped data.

You can use this module under either the Python or the GPLv2 license.

We use this module under Python 2.6. I am not sure whether it will work under Python 3.
History
Date User Action Args
2010-11-25 13:16:18 msuchy@redhat.com set files: + gzipstream.py; nosy: + msuchy@redhat.com; messages: + msg122360
2010-09-07 15:24:09 terry.reedy set messages: + msg115776
2010-09-07 10:12:52 r.david.murray set versions: - Python 3.3; nosy: + r.david.murray; messages: + msg115745; type: behavior
2010-09-02 16:13:01 kraai set nosy: + kraai; messages: + msg115393
2010-08-27 19:39:43 terry.reedy set status: open -> closed; nosy: + terry.reedy; messages: + msg115122; resolution: wont fix
2010-08-23 08:06:48 doko create