Message288284
I managed to create a tarball that triggers quite nasty behavior in the tarfile.TarFile.extract() and tarfile.TarFile.extractall() functions when the tarball contains hard links that point to themselves, i.e. hard link members whose link target is the same path as a file that is also included in the tarball. In Python 2.7 this leads to an exception, and with Python 3.4-3.6 the same file is extracted from the tarball multiple times.
First we create a tarball that causes this behavior:
$ mkdir -p tardata/1/2/3/4/5/6/7/8/9
$ dd if=/dev/zero of=tardata/1/2/3/4/5/6/7/8/9/zeros.data bs=1000000 count=500
# Because find lists every directory, tar adds the tree recursively multiple times; the duplicate entries are stored as hard links:
$ find tardata | xargs tar cvfz tardata.tar.gz
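The duplicated entries can be confirmed with the tarfile module itself: the problematic members are hard links whose linkname equals their own name. A minimal helper sketch (the function name is mine, not from the report):

```python
import tarfile

def self_linked_members(archive):
    """Return names of members that are hard links pointing to themselves."""
    with tarfile.open(archive) as tar:
        return [info.name for info in tar
                if info.islnk() and info.linkname == info.name]
```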
Then let's extract the tarball with the tarfile module. The following commands demonstrate what happens with the attached tartest.py file:
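The attachment itself is not reproduced here; a minimal sketch of what tartest.py might contain, reconstructed from the traceback below (only unarchive() and the skip/noskip argument are taken from the traceback, everything else is an assumption):

```python
# Hypothetical reconstruction of tartest.py; the real attachment may differ.
import sys
import tarfile

def unarchive(skip, archive, dest):
    with tarfile.open(archive) as tar_fd:
        for info in tar_fd:
            # A hard link "to itself" names its own path as the link target.
            if skip and info.islnk() and info.linkname == info.name:
                print("Skipping %s" % info.name)
                continue
            print(info.name)
            tar_fd.extract(info, dest)

if __name__ == "__main__" and len(sys.argv) == 4:
    unarchive(sys.argv[1] == "skip", sys.argv[2], sys.argv[3])
```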
$ python2.7.13 tartest.py noskip tardata.tar.gz /tmp/tardata-python-2.7.13
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
Traceback (most recent call last):
  File "tartest.py", line 17, in <module>
    unarchive(skip, archive, dest)
  File "tartest.py", line 12, in unarchive
    tar_fd.extract(info, dest)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2118, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2202, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2286, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory
And with Python 3.6.0 (and earlier Python 3 series based Pythons that I have tested):
$ time python3.6.0 tartest.py noskip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted 11 times
...
real 0m42.747s
user 0m17.564s
sys 0m6.144s
If we then make tartest.py skip the extraction of hard links that point to themselves:
$ time python3.6.0 tartest.py skip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted once
...
Skipping tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- skipped hard links 10 times
...
real 0m2.688s
user 0m1.816s
sys 0m0.532s
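The same skip can be applied when using TarFile.extractall() by filtering the member list before extraction. This filtering approach is my suggestion, not something from the attached script:

```python
import tarfile

def extract_skipping_self_links(archive, dest):
    """Extract archive to dest, skipping hard links that point to themselves."""
    with tarfile.open(archive) as tar:
        members = [m for m in tar.getmembers()
                   if not (m.islnk() and m.linkname == m.name)]
        tar.extractall(dest, members=members)
```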
Comparing the user CPU times makes it obvious that Python 3.6 performs a lot of unneeded decompression. TarFile.extractall() behaves the same way as calling TarFile.extract() individually on each TarInfo object. GNU tar handles this situation by skipping the extraction of the actual file data when it encounters such a hard link.
Date | User | Action | Args
2017-02-21 10:10:46 | Jussi Judin | set | recipients: + Jussi Judin
2017-02-21 10:10:46 | Jussi Judin | set | messageid: <1487671846.82.0.103027933595.issue29612@psf.upfronthosting.co.za>
2017-02-21 10:10:46 | Jussi Judin | link | issue29612 messages
2017-02-21 10:10:45 | Jussi Judin | create |