classification
Title: TarFile.extract() suffers from hard links inside tarball
Type: behavior
Stage: patch review
Components: Library (Lib)
Versions: Python 3.6, Python 3.5, Python 3.4, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Jussi Judin, Larry Cook, guettli, jtrouverie, lars.gustaebel
Priority: normal
Keywords: patch

Created on 2017-02-21 10:10 by Jussi Judin, last changed 2018-08-07 08:07 by joack.

Files
File name Uploaded Description
tartest.py Jussi Judin, 2017-02-21 10:10
Pull Requests
URL Status Linked
PR 5753 closed joack, 2018-02-19 12:47
PR 8700 open joack, 2018-08-07 08:07
Messages (7)
msg288284 - Author: Jussi Judin (Jussi Judin) Date: 2017-02-21 10:10
I managed to create a tarball that triggers quite nasty behavior in tarfile.TarFile.extract() and tarfile.TarFile.extractall() when the tarball contains hard links that point to themselves, that is, hard link entries whose target is a file of the same name that is also included in the tarball. In Python 2.7 this leads to an exception, and in Python 3.4-3.6 the same file is extracted from the tarball multiple times.

First we create a tarball that causes this behavior:

$ mkdir -p tardata/1/2/3/4/5/6/7/8/9
$ dd if=/dev/zero of=tardata/1/2/3/4/5/6/7/8/9/zeros.data bs=1000000 count=500
# tar recurses into directories by default, so passing it every path that find prints adds the same file to the archive multiple times; the duplicates are stored as hard links:
$ find tardata | xargs tar cvfz tardata.tar.gz

Then let's extract the tarball with the tarfile module.
The following commands demonstrate what happens with the attached tartest.py file:

$ python2.7.13 tartest.py noskip tardata.tar.gz /tmp/tardata-python-2.7.13
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
Traceback (most recent call last):
  File "tartest.py", line 17, in <module>
    unarchive(skip, archive, dest)
  File "tartest.py", line 12, in unarchive
    tar_fd.extract(info, dest)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2118, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2202, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2286, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory

And with Python 3.6.0 (and the earlier Python 3 releases that I have tested):

$ time python3.6.0 tartest.py noskip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted 11 times
...
real    0m42.747s
user    0m17.564s
sys     0m6.144s

If we then make the tarfile skip extraction of hard links that point to themselves:

$ time python3.6.0 tartest.py skip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted once
...
Skipping tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- skipped hard links 10 times
...
real    0m2.688s
user    0m1.816s
sys     0m0.532s

Comparing the user CPU time of the two Python 3.6 runs (17.6 s versus 1.8 s) makes it obvious that a lot of unneeded decompression is happening. TarFile.extractall() behaves the same way as calling TarFile.extract() on each TarInfo object individually. GNU tar seems to skip the extraction of the actual file data when it encounters this situation.
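
The attached tartest.py is not reproduced in this thread. The following is only a minimal sketch of what its "skip" mode might look like, assuming that a self-referencing hard link is a member for which islnk() is true and whose linkname equals its name. The function names are hypothetical and are not taken from the attachment.

```python
import tarfile


def is_self_hard_link(info):
    """Return True for a hard link entry whose target is its own name."""
    return info.islnk() and info.linkname == info.name


def extract_skipping_self_links(archive, dest):
    """Extract every member except self-referencing hard links.

    Returns the names of the members that were skipped.
    """
    skipped = []
    with tarfile.open(archive) as tar_fd:
        for info in tar_fd:
            if is_self_hard_link(info):
                # The regular file of the same name is extracted from its
                # own entry, so there is nothing to link here.
                skipped.append(info.name)
                continue
            tar_fd.extract(info, dest)
    return skipped
```

This only compares the names as stored in the archive. It does not normalize paths, so entries such as ./test.txt and test.txt would not be treated as the same file.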
msg289360 - Author: Thomas Guettler (guettli) Date: 2017-03-10 13:33
I have the same issue on Python 2.7.12 (Ubuntu 16.04).

I tried to run tartest.py, but I could not find a way to create the tarball that it needs.
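
One way to get an archive of this shape without relying on GNU tar is to write the two entries with the tarfile module itself. This is a minimal sketch, assuming that a regular file entry followed by a hard link entry with the same name and link target is enough to reproduce the problem. The helper name make_self_link_tar and its defaults are hypothetical.

```python
import io
import tarfile


def make_self_link_tar(path, name="test.txt", data=b"hello\n"):
    """Write a tar archive holding a file and a hard link to itself."""
    with tarfile.open(path, "w") as tar:
        # First entry: the regular file with its data.
        regular = tarfile.TarInfo(name)
        regular.size = len(data)
        tar.addfile(regular, io.BytesIO(data))

        # Second entry: a hard link with the same name, pointing at itself.
        link = tarfile.TarInfo(name)
        link.type = tarfile.LNKTYPE
        link.linkname = name
        tar.addfile(link)
```

Calling make_self_link_tar("test.tar") and then listing the result with tar tvf test.tar should show a regular file followed by a hard link entry of the same name.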
msg312235 - Author: Larry Cook (Larry Cook) Date: 2018-02-16 15:04
I recently hit this with Python 2.7.5 and 2.7.13.  It has a very simple repro.  Just specify the same file twice on the command line to tar (GNU 1.26):

% tar cvf test.tar test.txt test.txt
test.txt
test.txt

% tar tvf test.tar
-rw-r--r-- root/root        24 2018-02-16 09:35 test.txt
hrw-r--r-- root/root         0 2018-02-16 09:35 test.txt link to test.txt

% python2.7
Python 2.7.5 (default, Aug  4 2017, 00:39:18) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tarfile
>>> tarball = tarfile.open("test.tar")
>>> tarball.extractall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/tarfile.py", line 2047, in extractall
    self.extract(tarinfo, path)
  File "/usr/lib64/python2.7/tarfile.py", line 2084, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "/usr/lib64/python2.7/tarfile.py", line 2168, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "/usr/lib64/python2.7/tarfile.py", line 2252, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory
>>>
msg312345 - Author: Joachim Trouverie (jtrouverie) * Date: 2018-02-19 09:25
Is anybody working on this issue, or can I create a branch for it?
msg312900 - Author: Joachim Trouverie (jtrouverie) * Date: 2018-02-26 09:10
I created a PR for this issue for Python 2.7 (https://github.com/python/cpython/pull/5753/files).

I simply skip the link creation if the target path is equal to the link target. I don't see any corner case where this would be unwanted behavior.

I am not sure whether I should also create a unit test for this behavior.
msg314588 - Author: Joachim Trouverie (jtrouverie) * Date: 2018-03-28 13:20
Could anyone review this?
msg320140 - Author: Joachim Trouverie (jtrouverie) * Date: 2018-06-21 07:47
The Travis build failed for a reason unrelated to my changes. I relaunched it using an empty commit.

If anyone could review my changes, I would then rebase the branch.
History
Date  User  Action  Args
2018-08-07 08:07:14  joack  set  pull_requests: + pull_request8192
2018-06-21 07:47:27  jtrouverie  set  messages: + msg320140
2018-03-28 13:20:40  jtrouverie  set  messages: + msg314588
2018-02-26 09:10:28  jtrouverie  set  messages: + msg312900
2018-02-19 12:47:27  joack  set  keywords: + patch, stage: patch review, pull_requests: + pull_request5532
2018-02-19 09:25:23  jtrouverie  set  nosy: + jtrouverie, messages: + msg312345
2018-02-16 15:04:18  Larry Cook  set  nosy: + Larry Cook, messages: + msg312235
2017-03-10 13:33:08  guettli  set  nosy: + guettli, messages: + msg289360
2017-02-21 15:39:45  ned.deily  set  nosy: + lars.gustaebel
2017-02-21 10:10:46  Jussi Judin  create