Message 288284 - Python tracker

Author	Jussi Judin
Recipients	Jussi Judin
Date	2017-02-21.10:10:45
Message-id	<1487671846.82.0.103027933595.issue29612@psf.upfronthosting.co.za>
In-reply-to

Content

I managed to create a tarball that triggers quite nasty behavior in tarfile.TarFile.extract() and tarfile.TarFile.extractall() when the tarball contains hard links that point to themselves, that is, to a file that is already included in the tarball under the same name. In Python 2.7 this leads to an exception, and with Python 3.4-3.6 the same file is extracted from the tarball multiple times.

First we create a tarball that causes this behavior:

$ mkdir -p tardata/1/2/3/4/5/6/7/8/9
$ dd if=/dev/zero of=tardata/1/2/3/4/5/6/7/8/9/zeros.data bs=1000000 count=500
# find lists every directory as well as the file, and tar by default recurses into each directory it is given, so the same file is added to the archive multiple times; the duplicates are stored as hard links:
$ find tardata | xargs tar cvfz tardata.tar.gz
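
The self-referencing members can also be inspected from Python. The following is a minimal sketch and not part of the original report: it builds a tiny archive in memory by hand instead of calling tar, and then lists the hard link members whose link target equals their own name. The helper names `build_archive` and `self_links` and the member name `data/zeros.data` are made up for illustration.

```python
import io
import tarfile


def build_archive(copies=3, payload=b"\0" * 1024):
    """Return the bytes of a tar archive that holds one regular file
    followed by `copies` hard link members pointing at their own name.
    (Hypothetical helper, hand-crafted to mimic the reported layout.)"""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        regular = tarfile.TarInfo("data/zeros.data")
        regular.size = len(payload)
        tar.addfile(regular, io.BytesIO(payload))
        for _ in range(copies):
            link = tarfile.TarInfo("data/zeros.data")
            link.type = tarfile.LNKTYPE
            link.linkname = "data/zeros.data"  # same as the member name
            tar.addfile(link)
    return buf.getvalue()


def self_links(archive_bytes):
    """Return the hard link members whose target is their own name."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tar:
        return [m for m in tar.getmembers()
                if m.islnk() and m.linkname == m.name]


links = self_links(build_archive())
print(len(links), "self-referencing hard link(s)")
```

On an archive created on disk, the same check can be done with `tarfile.open("tardata.tar.gz")` in place of the in-memory buffer.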

Then let's extract the tarball with the tarfile module.
The following commands demonstrate what happens with the attached tartest.py file:

$ python2.7.13 tartest.py noskip tardata.tar.gz /tmp/tardata-python-2.7.13
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
Traceback (most recent call last):
  File "tartest.py", line 17, in <module>
    unarchive(skip, archive, dest)
  File "tartest.py", line 12, in unarchive
    tar_fd.extract(info, dest)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2118, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2202, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2286, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory

And with Python 3.6.0 (and the earlier Python 3 releases that I have tested):

$ time python3.6.0 tartest.py noskip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted 11 times
...
real    0m42.747s
user    0m17.564s
sys     0m6.144s

If we then skip the extraction of hard links that point to themselves (the skip mode of tartest.py):

$ time python3.6.0 tartest.py skip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted once
...
Skipping tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- skipped hard links 10 times
...
real    0m2.688s
user    0m1.816s
sys     0m0.532s
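
The attached tartest.py is not reproduced in this message. As a rough sketch of what such a skip mode could look like (an assumption, not the attached script), the function below extracts members one by one and skips hard links whose target is their own name. The `unarchive(skip, archive, dest)` call shape is taken from the traceback above; the body and the `is_self_link` helper are hypothetical.

```python
import tarfile


def is_self_link(member):
    """True for a hard link member whose target is its own name."""
    return member.islnk() and member.linkname == member.name


def unarchive(skip, archive, dest):
    """Extract `archive` into `dest` member by member.

    When `skip` is true, hard links that point to themselves are
    reported and not extracted. Returns the number of skipped members.
    This sketch does no path sanitizing, so it is only suitable for
    trusted archives.
    """
    skipped = 0
    with tarfile.open(archive) as tar_fd:
        for info in tar_fd:
            if skip and is_self_link(info):
                print("Skipping", info.name)
                skipped += 1
                continue
            print(info.name)
            tar_fd.extract(info, dest)
    return skipped
```

Called as `unarchive(True, "tardata.tar.gz", "/tmp/tardata-out")`, this extracts the regular file once and only prints a line for each self-referencing link, so the file data is not decompressed again for the duplicates.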

The user CPU times of the two Python 3.6 runs (17.564 s versus 1.816 s) show that a lot of unneeded decompression is happening. TarFile.extractall() behaves the same way as calling TarFile.extract() individually on each TarInfo object. GNU tar, in contrast, seems to skip extracting the actual file data when it encounters this situation.
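
Since TarFile.extractall() shows the same behavior, the same workaround can be applied through its `members` argument. This is a sketch under the assumption that dropping self-referencing hard links is acceptable for the archive at hand; the function names `without_self_links` and `extract_all_skipping_self_links` are made up for illustration.

```python
import tarfile


def without_self_links(members):
    """Yield every member except hard links that point at their own name."""
    for member in members:
        if member.islnk() and member.linkname == member.name:
            continue
        yield member


def extract_all_skipping_self_links(archive, dest):
    """Extract `archive` into `dest`, leaving out self-referencing links.
    Only suitable for trusted archives: no path sanitizing is done here."""
    with tarfile.open(archive) as tar:
        tar.extractall(dest, members=without_self_links(tar))
```

Passing a generator keeps the archive from being scanned twice: members are filtered as they are read.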

History
Date	User	Action	Args
2017-02-21 10:10:46	Jussi Judin	set	recipients: + Jussi Judin
2017-02-21 10:10:46	Jussi Judin	set	messageid: <1487671846.82.0.103027933595.issue29612@psf.upfronthosting.co.za>
2017-02-21 10:10:46	Jussi Judin	link	issue29612 messages
2017-02-21 10:10:45	Jussi Judin	create