classification
Title: tarfile.extractfile in "r|" stream mode fails with filenames or members from getmembers()
Type: behavior Stage: needs patch
Components: Documentation Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: David.Nesting, docs@python, flying sheep, lars.gustaebel, martin.panter, wichert
Priority: low Keywords:

Created on 2010-11-16 18:17 by David.Nesting, last changed 2017-05-09 13:08 by flying sheep.

Messages (6)
msg121308 - Author: David Nesting (David.Nesting) Date: 2010-11-16 18:17
When opening a tarfile with mode "r|" (streaming mode), extractfile("filename") and extractfile(mytarfile.getmembers()[0]) raise "tarfile.StreamError: seeking backwards is not allowed".  extractfile(mytarfile.next()) succeeds.  A more complete test case:

"""
import tarfile
import StringIO

# Create a simple tar file in memory.  This could easily be a real tar file
# though.
data = StringIO.StringIO()
tf = tarfile.open(fileobj=data, mode="w")
tarinfo = tarfile.TarInfo(name="testfile")
filedata = StringIO.StringIO("test data")
tarinfo.size = len(filedata.getvalue())
tf.addfile(tarinfo, fileobj=filedata)
tf.close()
data.seek(0)

# Open as an uncompressed stream
tf = tarfile.open(fileobj=data, mode="r|")

#f = tf.extractfile("testfile")
#print "%s: %s" % (f.name, f.read())
#
#Traceback (most recent call last):
#  File "./bug.py", line 19, in <module>
#    print "%s: %s" % (f.name, f.read())
#  File "/usr/lib/python2.7/tarfile.py", line 815, in read
#    buf += self.fileobj.read()
#  File "/usr/lib/python2.7/tarfile.py", line 735, in read
#    return self.readnormal(size)
#  File "/usr/lib/python2.7/tarfile.py", line 742, in readnormal
#    self.fileobj.seek(self.offset + self.position)
#  File "/usr/lib/python2.7/tarfile.py", line 554, in seek
#    raise StreamError("seeking backwards is not allowed")
#tarfile.StreamError: seeking backwards is not allowed

#for member in tf.getmembers():
#  f = tf.extractfile(member)
#  print "%s: %s" % (f.name, f.read())
#
# Same traceback

while True:
  member = tf.next()
  if member is None:
    break
  f = tf.extractfile(member)
  print "%s: %s" % (f.name, f.read())

# This works.
"""

It appears that extractfile("filename") invokes getmember("filename"), which invokes getmembers().  getmembers() scans the entire file before returning results, and by doing so, it's read past and discarded the actual file data, which makes it impossible for us to actually extract it.
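
For readers on Python 3, the same behaviour can be sketched as follows. This is a minimal reproduction, not the reporter's original script: it assumes a current Python 3 `tarfile` module, and `build_archive`, `lookup_error` and `contents` are illustrative names. The lookup by name should raise `StreamError`, while plain iteration reads each member as the stream reaches it.

```python
import io
import tarfile


def build_archive():
    """Return an in-memory tar archive holding one member, "testfile"."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        payload = b"test data"
        info = tarfile.TarInfo(name="testfile")
        info.size = len(payload)
        tf.addfile(info, fileobj=io.BytesIO(payload))
    buf.seek(0)
    return buf


# Lookup by name: getmember() loads every header first, so the stream has
# already moved past the member's data when the read is attempted.
lookup_error = None
with tarfile.open(fileobj=build_archive(), mode="r|") as tf:
    try:
        tf.extractfile("testfile").read()
    except tarfile.StreamError as exc:
        lookup_error = exc

# Iteration: each member is read while the stream is still positioned at it.
contents = {}
with tarfile.open(fileobj=build_archive(), mode="r|") as tf:
    for member in tf:
        contents[member.name] = tf.extractfile(member).read()
```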

If this is accurate, this seems tricky to completely fix.  You could make getmembers() a generator that doesn't read too far ahead so that the file's contents are still available if someone wants to retrieve them for each file yielded.  getmember("filename") could just scan forward through the file until it hits a match, but you'd still lose the ability to do a getmember("filename") on a file that we skipped over.

If nothing else, document that extractfile("filename"), getmember() and getmembers() won't work reliably in streaming mode, and possibly raise an exception whenever someone tries, just to make the behavior consistent.
msg121339 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-11-17 09:32
This behaviour is intentional. A tar archive does not contain a central directory structure; it is just a chain of files. As a side effect, it is possible to have multiple files with the same name in one archive, e.g. when append mode was used. That's why the archive must be scanned from the beginning to the end as soon as you reference an archive member by its name.
The best way to deal with this issue in my opinion is to improve the documentation for the stream interface.
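
The duplicate-name case can be illustrated with a small sketch. It assumes Python 3 and an in-memory archive; the payloads and the variable names `names` and `resolved` are made up for the example. With a seekable archive, a lookup by name resolves to the last occurrence, which is why the whole archive has to be scanned first.

```python
import io
import tarfile

# Write two members that share one name, as repeated appends would.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for payload in (b"first version", b"second version"):
        info = tarfile.TarInfo(name="testfile")
        info.size = len(payload)
        tf.addfile(info, fileobj=io.BytesIO(payload))
buf.seek(0)

# In seekable mode the name lookup scans all headers and picks the last match.
with tarfile.open(fileobj=buf, mode="r") as tf:
    names = tf.getnames()
    resolved = tf.extractfile("testfile").read()
```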
msg121362 - Author: David Nesting (David.Nesting) Date: 2010-11-17 16:03
Thanks, Lars.  And this does make complete sense to me in retrospect.

Better documentation here would help a lot.  I'm happy to take a stab at this.  Short of labeling methods as "safe for streaming" versus "unsafe for streaming", it occurs to me that it would be a lot cleaner if TarFile were actually broken up into two classes: one streaming-safe, and the other layering random access convenience methods on top of that.  For compatibility's sake the open method should probably still return an instance of the composite class, but at least it keeps these logically separate internally and makes it easier to document.
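
One possible shape for such a split is sketched below. It is only an illustration of the idea, not a proposed patch: `StreamingTarReader` and `RandomAccessTarReader` are hypothetical wrapper classes around the existing `tarfile` API, written for Python 3.

```python
import io
import tarfile


class StreamingTarReader:
    """Hypothetical forward-only layer: iteration is the only access path."""

    def __init__(self, fileobj, mode="r|"):
        self._tar = tarfile.open(fileobj=fileobj, mode=mode)

    def __iter__(self):
        # Yield (TarInfo, file object or None) pairs. Each file object must
        # be read before the loop advances to the next member.
        for tarinfo in self._tar:
            yield tarinfo, self._tar.extractfile(tarinfo)

    def close(self):
        self._tar.close()


class RandomAccessTarReader(StreamingTarReader):
    """Hypothetical layer adding name lookups; needs a seekable file."""

    def __init__(self, fileobj):
        super().__init__(fileobj, mode="r:")

    def read_member(self, name):
        return self._tar.extractfile(name).read()


# Build a small archive in memory to exercise both layers.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tf:
    for name, payload in (("a.txt", b"alpha"), ("b.txt", b"beta")):
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, fileobj=io.BytesIO(payload))

streaming = StreamingTarReader(io.BytesIO(raw.getvalue()))
streamed = {info.name: f.read() for info, f in streaming}
streaming.close()

random_access = RandomAccessTarReader(io.BytesIO(raw.getvalue()))
picked = random_access.read_member("b.txt")
random_access.close()
```

Keeping name-based lookups out of the streaming class makes the unsupported calls impossible to reach, instead of relying on documentation alone.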
msg168938 - Author: Wichert Akkerman (wichert) Date: 2012-08-23 12:47
You could also look for the first matching file and extract that. That way you can at least implement something similar to what standard tar can do:

[fog;/tmp]-10> tar tf x.tar 
docs/
docs/index.rst
docs/glossary.rst
docs/Makefile
docs/conf.py
docs/changes.rst
[fog;/tmp]-12> cat x.tar| tar xf - docs/index.rst
[fog;/tmp]-13> ls docs 
index.rst
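
With the current API, a first-match lookup on a stream could be sketched roughly like this. The helper `read_first_match` is hypothetical, the archive contents are invented for the example, and Python 3 is assumed. Unlike getmember(), it stops at the first matching member and never seeks backwards.

```python
import io
import tarfile


def read_first_match(stream, name):
    """Return the bytes of the first regular file called `name`, or None."""
    with tarfile.open(fileobj=stream, mode="r|") as tf:
        for member in tf:
            if member.name == name and member.isfile():
                # Read immediately, while the stream is at this member.
                return tf.extractfile(member).read()
    return None


# Example archive with a few members under docs/.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tf:
    for name, payload in (
        ("docs/index.rst", b"Index\n"),
        ("docs/glossary.rst", b"Glossary\n"),
    ):
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, fileobj=io.BytesIO(payload))

found = read_first_match(io.BytesIO(archive.getvalue()), "docs/index.rst")
missing = read_first_match(io.BytesIO(archive.getvalue()), "docs/conf.py")
```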
msg264460 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-29 04:53
David, if you are still interested, I think specific suggestions or patches would be welcome (even if Lars assigned this to himself five years ago). I also like the idea of separate low-level and random access layers (I also thought of this).

One other problem with the documentation of this mode is that it points to the Examples section, but it is not obvious why. The only example specific to non-seeking mode was removed in r63411, apparently because it was obsolete:

The _only_ way to extract an uncompressed tar stream from “sys.stdin”:

tar = tarfile.open(mode="r|", fileobj=sys.stdin)
for tarinfo in tar:
    tar.extract(tarinfo)
msg293307 - Author: (flying sheep) * Date: 2017-05-09 13:08
Well, we should just allow

extractall(members=['foo', 'bar'])

Currently members only accepts TarInfo objects, not filenames, but it’s easy to accept both.

https://github.com/python/cpython/blob/74683fc6247c522ae955a6e7308b8ff51def35d8/Lib/tarfile.py#L1991-L1999

Something like:

filenames = set()
for member in members:
    if isinstance(member, TarInfo):
        pass  # do what's done now
    else:
        filenames.add(member)

for tarinfo in self:
    if tarinfo.name in filenames:
        self.extract(tarinfo)
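
Outside of tarfile itself, the same idea can be tried as a standalone helper. This is a sketch under assumptions: `extract_named` is a hypothetical function, not part of the tarfile API, and Python 3 is assumed (some Python 3 versions emit a DeprecationWarning when extract() is called without a filter argument).

```python
import io
import os
import tarfile
import tempfile


def extract_named(tf, names, path="."):
    """Extract members whose names are in `names` while iterating forward."""
    wanted = set(names)
    extracted = []
    for tarinfo in tf:
        if tarinfo.name in wanted:
            # Extract right away; the stream cannot return to this member.
            tf.extract(tarinfo, path=path)
            extracted.append(tarinfo.name)
    return extracted


# Example archive with three members; only two are requested.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tf:
    for name in ("foo", "bar", "baz"):
        payload = name.encode() + b" data"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, fileobj=io.BytesIO(payload))
archive.seek(0)

dest = tempfile.mkdtemp()
with tarfile.open(fileobj=archive, mode="r|") as tf:
    extracted = extract_named(tf, ["foo", "bar"], path=dest)
```

Because the helper only iterates forward, it works in "r|" mode, at the cost of extracting every occurrence of a duplicated name in archive order.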
History
Date | User | Action | Args
2017-05-09 13:08:51 | flying sheep | set | nosy: + flying sheep; messages: + msg293307
2016-04-29 04:53:15 | martin.panter | set | nosy: + martin.panter; messages: + msg264460
2012-08-23 12:47:39 | wichert | set | nosy: + wichert; messages: + msg168938
2011-07-22 19:45:42 | terry.reedy | set | versions: + Python 3.3, - Python 3.1
2010-11-19 14:18:17 | eric.araujo | set | nosy: + docs@python; stage: needs patch; components: + Documentation, - Library (Lib); versions: - Python 2.6, Python 3.3
2010-11-17 16:03:15 | David.Nesting | set | messages: + msg121362
2010-11-17 09:32:29 | lars.gustaebel | set | priority: normal -> low; nosy: + lars.gustaebel; versions: + Python 3.1, Python 3.2, Python 3.3; messages: + msg121339; assignee: lars.gustaebel
2010-11-16 18:17:43 | David.Nesting | create |