
classification
Title: tarfile chokes on ipython archive on Windows
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: arve_knudsen, lars.gustaebel, nnorwitz
Priority: normal Keywords:

Created on 2006-07-24 21:00 by arve_knudsen, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg29260 - Author: Arve Knudsen (arve_knudsen) Date: 2006-07-24 21:00
I'm trying to extract files from the latest ipython tar
archive, available from
http://ipython.scipy.org/dist/ipython-0.7.2.tar.gz,
using tarfile. This is on Windows XP, using Python
2.4.3. There is only a problem if I open the archive in
stream mode (the "mode" argument to tarfile.open is
"r|gz"), in which case tarfile raises StreamError. I'd
be happy if this error could be sorted out.

The following script should trigger the error:

import tarfile

f = file(r"ipython-0.7.2.tar.gz", "rb")
tar = tarfile.open(fileobj=f, mode="r|gz")
try:
    for m in tar:
        tar.extract(m)
finally:
    tar.close()
    f.close()

The resulting exception:
Traceback (most recent call last):
  File "tst.py", line 7, in ?
    tar.extract(m)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1335, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "C:\Program Files\Python24\lib\tarfile.py", line 1431, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1515, in makelink
    self._extract_member(self.getmember(linkpath), targetpath)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1423, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1461, in makefile
    copyfileobj(source, target)
  File "C:\Program Files\Python24\lib\tarfile.py", line 158, in copyfileobj
    shutil.copyfileobj(src, dst)
  File "C:\Program Files\Python24\lib\shutil.py", line 22, in copyfileobj
    buf = fsrc.read(length)
  File "C:\Program Files\Python24\lib\tarfile.py", line 551, in _readnormal
    self.fileobj.seek(self.offset + self.pos)
  File "C:\Program Files\Python24\lib\tarfile.py", line 420, in seek
    raise StreamError, "seeking backwards is not allowed"
tarfile.StreamError: seeking backwards is not allowed
msg29261 - Author: Neal Norwitz (nnorwitz) (Python committer) Date: 2006-07-25 03:35
I tested this on Linux with both 2.5 and 2.4.3+ without
problems.  I believe there were some fixes in this area. 
Could you try testing with the 2.4.3+ current which will
become 2.4.4 (or 2.5b2)?  If this is still a problem, it
looks like it may be Windows specific.
msg29262 - Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 07:29
Well yeah, it appears to be Windows-specific. I just tested
on Linux (Ubuntu), also with Python 2.4.3. I'll try 2.4.3+
on Windows to see if it makes any difference. Come to think
of it, I think I experienced this problem in the past on
Linux, but then I solved it by repacking ipython. Also, if I
pack it myself on Windows using bsdtar it works fine.
msg29263 - Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 08:04
Ok, I've verified now that the problem persists with Python
2.4.4 (from the 2.4 branch in svn). The exact same thing
happens.
msg29264 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2006-07-25 08:42
The traceback tells me that there is a hard link inside the
archive, which means that a file in the archive is referenced
twice. A hard link can be extracted as such only on platforms
that have an os.link() function. On Win32 hard links are not
supported by the file system, so tarfile works around this
by extracting the referenced file a second time. To do that,
tarfile has to seek back in the input file to access the
file's data again. But "seeking backwards is not allowed"
when a file is opened in streaming mode ;-)
If you do not necessarily need streaming mode for your
application, use "r:gz" or "r" instead and the problem will
be gone.
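
The difference between the two modes can be sketched in a few lines of present-day Python. This is only an illustration, not the reporter's script: the in-memory archive, its member names, and the helper functions make_sample_archive() and read_first_member_last() are all made up for the example. It walks the archive to the end and then goes back to read the first member, which requires a backward seek.

```python
import io
import tarfile


def make_sample_archive():
    """Build a small gzip-compressed tar archive in memory (made-up data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in (("a.txt", b"first"), ("b.txt", b"second")):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_first_member_last(data, mode):
    """Walk the whole archive, then go back and read the first member."""
    with tarfile.open(fileobj=io.BytesIO(data), mode=mode) as tar:
        members = [m for m in tar]  # consumes the archive up to its end
        # Reading members[0] now needs a seek back to the start of its data.
        return tar.extractfile(members[0]).read()


if __name__ == "__main__":
    data = make_sample_archive()
    # Random-access mode ("r:gz") can seek backwards.
    print(read_first_member_last(data, "r:gz"))
    # Stream mode ("r|gz") cannot, so the same read fails.
    try:
        read_first_member_last(data, "r|gz")
    except tarfile.StreamError as exc:
        print("stream mode:", exc)
```

Extracting a hard link on a platform without os.link() hits the same backward seek, because the data of the link target sits earlier in the stream.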
msg29265 - Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 08:59
Thanks for the clarification, Lars. I'd prefer to continue
with my current approach however, since it allows me to
report progress as the tarfile is unpacked/decompressed.
Also, I don't think it would be satisfactory at all if
tarfile would just die with a mysterious error in such cases.

In order to resolve this, why must tarfile extract the file
again, can't it copy the already extracted file?
msg29266 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2006-07-25 09:31
Copying the previously extracted file is no option. When the
archive is extracted inside a loop, you never know what
happens between two extract() calls. The original file could
have been renamed, changed or removed. Suppose you want to
extract just those members which are hard links:

for tarinfo in tar:
    if tarinfo.islnk():
        tar.extract(tarinfo)

I agree with you that the error message is bad because it
does not give the slightest idea of what's going wrong. I'll
see what I can do about that.

To work around your particular problem, my idea is to
subclass the TarFile class and replace the makelink() method
with one that simply copies the file as you proposed.
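
A rough sketch of the copy-based workaround follows, assuming a trusted archive. It is written as a plain extraction loop in present-day Python rather than as a TarFile subclass, so that it does not depend on tarfile internals. The function name extract_stream_copying_links() and the sample archive are hypothetical. Member names are not sanitized here, and the copy reflects whatever is on disk at that moment, which is exactly the caveat raised above.

```python
import io
import os
import shutil
import tarfile
import tempfile


def make_sample_archive():
    """Build an in-memory .tar.gz holding a file and a hard link to it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("a.txt")
        info.size = 5
        tar.addfile(info, io.BytesIO(b"hello"))
        link = tarfile.TarInfo("b.txt")
        link.type = tarfile.LNKTYPE  # hard link entry, carries no data
        link.linkname = "a.txt"
        tar.addfile(link)
    buf.seek(0)
    return buf


def extract_stream_copying_links(fileobj, dest, mode="r|gz"):
    """Extract sequentially in stream mode; hard links become copies."""
    extracted = []
    with tarfile.open(fileobj=fileobj, mode=mode) as tar:
        for member in tar:
            target = os.path.join(dest, member.name)
            if member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isreg():
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
            elif member.islnk():
                # Copy the file extracted earlier instead of seeking back
                # in the stream for the link target's data.
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copyfile(os.path.join(dest, member.linkname), target)
            else:
                continue  # other member types are skipped in this sketch
            extracted.append(member.name)
    return extracted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as dest:
        print(extract_stream_copying_links(make_sample_archive(), dest))
```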
msg29267 - Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 09:58
Yes, I admit that is a weakness of my proposed approach.
Perhaps it would be a better idea to extract hardlinked
files to a temporary location and copy those files when
needed, as a cache? The only problem that I can think of
with this approach is the overhead, but perhaps this could
be configurable through a keyword if you think it would pose
a significant problem (i.e. keeping extra copies of
potentially huge files)?

The temporary cache would be private to tarfile, so there
should be no need to worry about modifications to the
contained files.
msg29268 - Author: Arve Knudsen (arve_knudsen) Date: 2006-07-26 22:20
Regarding my last comment, sorry about the noise. After
giving it some more thought I realized it was not very
realistic implementation-wise, seeing as you can't know
whether a file is being linked to when you encounter it in
the stream (right?).

So I followed your suggestion instead and handled the links 
on the client level. What I think I'd like to see in 
TarFile though is an 'extractall' method with the ability 
to report progress to an optional callback, since I'm only 
opening in stream mode as a hack to implement this myself 
(by monitoring file position). From browsing tarfile's 
source it seems it might require some effort though (with 
e.g. BZ2File you can't know the amount of data without 
decompressing everything?).
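
Progress reporting by monitoring the file position does not strictly need stream mode. The sketch below, in present-day Python, opens the archive in random-access mode and reads the position of the underlying compressed file between members. The generator members_with_progress() and the sample data are hypothetical. The fraction is coarse, since gzip reads ahead in chunks, and it says nothing about the decompressed size.

```python
import io
import tarfile


def make_sample_archive(count=3, size=4096):
    """Build a gzip-compressed tar archive in memory (made-up data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for i in range(count):
            payload = bytes([65 + i]) * size
            info = tarfile.TarInfo("file%d.txt" % i)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def members_with_progress(archive, total_size):
    """Yield (tar, member, fraction) for each member.

    fraction is the share of the compressed input consumed so far,
    judged by the position of the underlying file object.
    """
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            yield tar, member, archive.tell() / total_size


def demo():
    """Read every member and record the progress value seen for it."""
    data = make_sample_archive()
    seen = []
    for tar, member, fraction in members_with_progress(io.BytesIO(data), len(data)):
        payload = tar.extractfile(member).read()
        seen.append((member.name, len(payload), fraction))
    return seen


if __name__ == "__main__":
    for name, size, fraction in demo():
        print("%s: %d bytes, %.0f%% of input read" % (name, size, fraction * 100))
```

For an archive this small the position reaches the end of the input almost immediately; the numbers only become informative for larger files.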
msg59477 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2008-01-07 18:55
I'm closing this issue because it is out of date. The new
TarFile.extractall() method in Python 2.5 provides a way to solve the
original problem IMO.
History
Date                 User            Action  Args
2022-04-11 14:56:19  admin           set     github: 43713
2008-01-07 18:55:05  lars.gustaebel  set     status: open -> closed; resolution: out of date; messages: + msg59477
2006-07-24 21:00:58  arve_knudsen    create