Classification
Title: tarfile cannot extract from stdin
Type: crash
Stage: resolved
Components: Library (Lib)
Versions: Python 3.8
Process
Status: closed
Resolution: duplicate
Dependencies:
Superseder:
Assigned To:
Nosy List: Jonathan Hsu, Manjusaka, dtamuc, python-dev, taleinat
Priority: normal
Keywords: patch

Created on 2020-03-23 15:54 by dtamuc, last changed 2020-09-19 15:42 by taleinat. This issue is now closed.

Files
File name: test.tar (uploaded by dtamuc, 2020-03-26 17:46)
Pull Requests
PR 19187: closed (linked by python-dev, 2020-03-27 01:35)
Messages (7)
msg364860 - Author: Danijel (dtamuc) Date: 2020-03-23 15:54
Hi,

I have the following code:

```
import tarfile
import sys

tar = tarfile.open(fileobj=sys.stdin.buffer, mode='r|*')
tar.extractall("tarout")
tar.close()
```

Then I do the following on a Debian 10 system:

```
$ python -m tarfile -c git.tar /usr/share/doc/git
$ python -V
Python 3.8.1
$ cat git.tar | python foo.py
$ cat git.tar | python foo.py
Traceback (most recent call last):
  File "foo.py", line 5, in <module>
    tar.extractall("tarout")
  File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2026, in extractall
    self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2067, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2139, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2178, in makefile
    source.seek(tarinfo.offset_data)
  File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 513, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
```

The second extraction tries to seek backwards, even though the mode is 'r|*'.
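
For illustration, the forward-only restriction of stream mode can be reproduced without stdin. The following is a minimal sketch, not taken from the report: it builds a two-member archive in memory (the names a.txt and b.txt are made up), opens it with mode 'r|*', and then tries to read the first member again after the stream has already moved past it.

```python
import io
import tarfile

# Build a small two-member archive in memory (member names are made up).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ("a.txt", "b.txt"):
        data = name.encode()
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

backward_seek_error = None
with tarfile.open(fileobj=buf, mode="r|*") as tar:
    members = list(tar)  # reads the stream through to the end of the archive
    try:
        # The data of the first member now lies behind the stream position.
        tar.extractfile(members[0]).read()
    except tarfile.StreamError as exc:
        backward_seek_error = exc

print(backward_seek_error)
```

In stream mode every member has to be processed in the order it appears, which is why any code path that revisits an earlier offset fails.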


For reference, if I remove ".buffer" from the code above, I can run
it with python2 without problems:

```
$ cat foo2.py
import tarfile
import sys

tar = tarfile.open(fileobj=sys.stdin, mode='r|*')
tar.extractall("tarout")
tar.close()

$ cat git.tar | python2 foo2.py
$ cat git.tar | python2 foo2.py
$ cat git.tar | python2 foo2.py
$ cat git.tar | python2 foo2.py
$ cat git.tar | python2 foo2.py
```
msg365093 - Author: Manjusaka (Manjusaka) * Date: 2020-03-26 16:38
Hello,

I can't reproduce this issue on my laptop with any version from 3.8.1 to 3.9.0a4.

It may depend on the file you use.

Would you mind uploading the file that triggers the problem?
msg365102 - Author: Danijel (dtamuc) Date: 2020-03-26 17:46
Hi,

Well, it says "entity too large". I've attached a smaller file that throws a similar but slightly different error. (Note: this happens only on the _second_ extraction; it looks like a problem with symlinks.)

You can find larger ones here:

https://data.rbfh.de/issue40049/

The typescript*.txt files show shell sessions with two different Python versions (3.4.2 and 3.8.2).
msg365128 - Author: Jonathan Hsu (Jonathan Hsu) * Date: 2020-03-27 01:49
This happens when tarfile tries to write a symlink that already exists. Any exception raised by os.symlink() is handled as if the platform doesn't support symlinks, so tarfile scans the entire archive to try to find the linked file. When it resumes extraction, it needs to seek backwards to pick up where it left off, which raises the exception.
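
The trigger can be seen in isolation. This is a minimal sketch, assuming a POSIX-like system where the process is allowed to create symlinks; the path names are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "link")

    # First call succeeds; the target does not need to exist.
    os.symlink("some-target", link)

    # Second call on the same path raises instead of overwriting.
    try:
        os.symlink("some-target", link)
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True

print(second_call_failed)
```

This matches the report that only the second extraction into the same directory fails: on the first run no symlink exists yet.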

I've reproduced the error on both Windows 10 and Ubuntu running on WSL. Python 2.7 handled this situation by checking if the symlink exists, but it looks like the entire tarfile library was replaced with an alternate implementation that doesn't check if the symlink exists. I've created a pull request to address this issue.
msg365192 - Author: Danijel (dtamuc) Date: 2020-03-27 21:04
For me, this patch solves my problems. Thank you.
msg377139 - Author: Tal Einat (taleinat) * (Python committer) Date: 2020-09-18 21:00
GNU tar (v1.30, Ubuntu 20.04) does indeed overwrite files with symlinks upon extracting, while both `ln -s` and `os.symlink` do not. Therefore I agree that the appropriate behavior would seem to be to overwrite this way, as in the attached PR.
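
A sketch of the overwrite behavior described above. The helper name replace_symlink is hypothetical and this is not the code from the attached PR; it only illustrates removing an existing entry before calling os.symlink. It assumes a system where the process may create symlinks.

```python
import os
import tempfile


def replace_symlink(target, link_path):
    """Create link_path -> target, removing an existing file or symlink first.

    Hypothetical helper for illustration; it does not handle the case where
    link_path is an existing directory.
    """
    # lexists() is also true for dangling symlinks, unlike exists().
    if os.path.lexists(link_path):
        os.unlink(link_path)
    os.symlink(target, link_path)


with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "link")
    replace_symlink("first-target", link)
    replace_symlink("second-target", link)  # overwrites instead of raising
    final_target = os.readlink(link)

print(final_target)
```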
msg377170 - Author: Tal Einat (taleinat) * (Python committer) Date: 2020-09-19 15:42
This is actually a duplicate of issue12800, which itself describes precisely the same problem as issue10761, which was fixed but then the fix was lost in a bad merge.

I'm closing this, as discussion should happen on the original issues.
History
Date User Action Args
2020-09-19 15:42:22 taleinat set status: open -> closed; resolution: duplicate; messages: + msg377170; stage: patch review -> resolved
2020-09-18 21:00:59 taleinat set nosy: + taleinat; messages: + msg377139
2020-03-27 21:04:33 dtamuc set messages: + msg365192
2020-03-27 01:49:28 Jonathan Hsu set nosy: + Jonathan Hsu; messages: + msg365128
2020-03-27 01:35:20 python-dev set keywords: + patch; nosy: + python-dev; pull_requests: + pull_request18546; stage: patch review
2020-03-26 17:46:41 dtamuc set files: + test.tar; messages: + msg365102
2020-03-26 16:38:52 Manjusaka set nosy: + Manjusaka; messages: + msg365093
2020-03-24 13:44:32 dtamuc set type: crash
2020-03-23 15:54:18 dtamuc create