Author Jussi Judin
Recipients Jussi Judin
Date 2017-02-21.10:10:45
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1487671846.82.0.103027933595.issue29612@psf.upfronthosting.co.za>
In-reply-to
Content
I managed to create a tarball that brought out quite nasty behavior with tarfile.TarFile.extract() and tarfile.TarFile.extractall() functions when there are hard links inside a tarball that point to themselves with a file that is included in the tarball. In Python 2.7 it leads to an exception and with Python 3.4-3.6 it extracts the same file from the tarball multiple times.

First we create a tarball that causes this behavior:

$ mkdir -p tardata/1/2/3/4/5/6/7/8/9
$ dd if=/dev/zero of=tardata/1/2/3/4/5/6/7/8/9/zeros.data bs=1000000 count=500
# tar by default adds all directories recursively multiple times to the archive, but duplicates are created as hard links:
$ find tardata | xargs tar cvfz tardata.tar.gz

Then let's extract the tarball with tarfile module
Let following commands demonstrate what happens with the attached tartest.py file

$ python2.7.13 tartest.py noskip tardata.tar.gz /tmp/tardata-python-2.7.13
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
Traceback (most recent call last):
  File "tartest.py", line 17, in <module>
    unarchive(skip, archive, dest)
  File "tartest.py", line 12, in unarchive
    tar_fd.extract(info, dest)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2118, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2202, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2286, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory

And with Python 3.6.0 (and earlier Python 3 series based Pythons that I have tested):

$ time python3.6.0 tartest.py noskip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted 11 times
...
real    0m42.747s
user    0m17.564s
sys     0m6.144s

If we then make the tarfile skip extraction of hard links that point to themselves:

$ time python3.6.0 tartest.py skip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted once
...
Skipping tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- skipped hard links 10 times
...
real    0m2.688s
user    0m1.816s
sys     0m0.532s

From the used user CPU time it's obvious that there is happening a lot of unneeded decompression when we compare Python 3.6 results. If I use TarFile.extractall(), it behaves similarly as using TarFile.extract() individually on TarInfo objects. GNU tar seems to behave in such fashion that it skips over the extraction of the actual file data when it encounters this situation.
History
Date User Action Args
2017-02-21 10:10:46Jussi Judinsetrecipients: + Jussi Judin
2017-02-21 10:10:46Jussi Judinsetmessageid: <1487671846.82.0.103027933595.issue29612@psf.upfronthosting.co.za>
2017-02-21 10:10:46Jussi Judinlinkissue29612 messages
2017-02-21 10:10:45Jussi Judincreate