tarfile cannot extract from stdin #84230

Closed
dtamuc mannequin opened this issue Mar 23, 2020 · 7 comments
Labels
3.8 only security fixes stdlib Python modules in the Lib dir type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@dtamuc
Mannequin

dtamuc mannequin commented Mar 23, 2020

BPO 40049
Nosy @taleinat, @Zheaoli, @jonnyhsu
PRs
  • bpo-40049: Check if symlink exists when extracting from tarfile #19187
  • Files
  • test.tar
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-09-19.15:42:22.307>
    created_at = <Date 2020-03-23.15:54:18.010>
    labels = ['3.8', 'library', 'type-crash']
    title = 'tarfile cannot extract from stdin'
    updated_at = <Date 2020-09-19.15:42:22.307>
    user = 'https://bugs.python.org/dtamuc'

    bugs.python.org fields:

    activity = <Date 2020-09-19.15:42:22.307>
    actor = 'taleinat'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-09-19.15:42:22.307>
    closer = 'taleinat'
    components = ['Library (Lib)']
    creation = <Date 2020-03-23.15:54:18.010>
    creator = 'dtamuc'
    dependencies = []
    files = ['49003']
    hgrepos = []
    issue_num = 40049
    keywords = ['patch']
    message_count = 7.0
    messages = ['364860', '365093', '365102', '365128', '365192', '377139', '377170']
    nosy_count = 5.0
    nosy_names = ['taleinat', 'python-dev', 'Manjusaka', 'Jonathan Hsu', 'dtamuc']
    pr_nums = ['19187']
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue40049'
    versions = ['Python 3.8']

    @dtamuc
    Mannequin Author

    dtamuc mannequin commented Mar 23, 2020

    Hi,

    I have the following code:

    import tarfile
    import sys
    
    tar = tarfile.open(fileobj=sys.stdin.buffer, mode='r|*')
    tar.extractall("tarout")
    tar.close()
    

    then doing the following on a debian 10 system:

    $ python -m tarfile -c git.tar /usr/share/doc/git
    $ python -V
    Python 3.8.1
    $ cat git.tar | python foo.py
    $ cat git.tar | python foo.py
    Traceback (most recent call last):
      File "foo.py", line 5, in <module>
        tar.extractall("tarout")
      File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2026, in extractall
        self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
      File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2067, in extract
        self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
      File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2139, in _extract_member
        self.makefile(tarinfo, targetpath)
      File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 2178, in makefile
        source.seek(tarinfo.offset_data)
      File "/home/danielt/miniconda3/lib/python3.8/tarfile.py", line 513, in seek
        raise StreamError("seeking backwards is not allowed")
    tarfile.StreamError: seeking backwards is not allowed
    

    The second extraction tries to seek, although the mode is 'r|*'.

    For reference if I remove ".buffer" from the code above, I can run
    it with python2 without problems:

    $ cat foo2.py
    import tarfile
    import sys
    
    tar = tarfile.open(fileobj=sys.stdin, mode='r|*')
    tar.extractall("tarout")
    tar.close()
    
    $ cat git.tar | python2 foo2.py
    $ cat git.tar | python2 foo2.py
    $ cat git.tar | python2 foo2.py
    $ cat git.tar | python2 foo2.py
    $ cat git.tar | python2 foo2.py
    

    @dtamuc dtamuc mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir type-crash A hard crash of the interpreter, possibly with a core dump labels Mar 23, 2020
    @Zheaoli
    Mannequin

    Zheaoli mannequin commented Mar 26, 2020

    Hello

    I can't reproduce this issue on my laptop with any version from 3.8.1 to 3.9.0a4.

    I think it may depend on the file you use.

    Would you mind uploading the file that triggers the problem?

    @dtamuc
    Mannequin Author

    dtamuc mannequin commented Mar 26, 2020

    Hi,

    Well, the upload was rejected with "entity too large". I've attached a smaller file that throws a similar but slightly different error. (Note: it fails only on the _second_ extraction; it looks like a problem with symlinks.)

    You can find larger ones here:

    https://data.rbfh.de/issue40049/

    The typescript*.txt files show shell sessions with two different Python versions (3.4.2 and 3.8.2).

    @jonnyhsu
    Mannequin

    jonnyhsu mannequin commented Mar 27, 2020

    This happens when tarfile tries to write a symlink that already exists. Any exception raised by os.symlink() is handled as if the platform doesn't support symlinks, so tarfile scans the entire tar to try to find the linked files. When it resumes extraction, it needs to do a negative seek to pick up where it left off, which causes the exception.
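
The stream constraint behind that last step can be seen in isolation. The sketch below is only an illustration, not the code path from the traceback: it builds a tiny archive in memory (the member names and contents are made up), opens it in the non-seekable 'r|' mode, reads past the first member, and then asks for that member's data again.

```python
import io
import tarfile

# Build a small two-member archive in memory (names and payloads are illustrative).
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    for name, payload in (("a.txt", b"first"), ("b.txt", b"second")):
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
raw.seek(0)

caught = None
with tarfile.open(fileobj=raw, mode="r|") as tar:
    # Listing all members consumes the stream up to the end of the archive.
    members = list(tar)
    try:
        # The data of the first member now lies behind the stream position,
        # so reading it would require a backwards seek.
        tar.extractfile(members[0]).read()
    except tarfile.StreamError as exc:
        caught = exc

print(type(caught).__name__, caught)
```

In stream mode, tarfile can only move forward, so any code path that revisits earlier data fails this way, whatever triggered it.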

    I've reproduced the error on both Windows 10 and Ubuntu running on WSL. Python 2.7 handled this situation by checking if the symlink exists, but it looks like the entire tarfile library was replaced with an alternate implementation that doesn't check if the symlink exists. I've created a pull request to address this issue.
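
A minimal sketch of the "check whether the symlink exists" idea, written as a standalone helper rather than as the actual patch: `replace_symlink` is a hypothetical name, not a tarfile function, and the sketch assumes a platform where the current user may create symlinks.

```python
import os
import tempfile


def replace_symlink(target, linkpath):
    """Create linkpath -> target, replacing an existing file or symlink.

    Hypothetical helper for illustration; not part of the tarfile module.
    """
    # lexists() is also true for dangling symlinks, unlike exists().
    if os.path.lexists(linkpath):
        # unlink() removes files and symlinks; it does not remove directories.
        os.unlink(linkpath)
    os.symlink(target, linkpath)


with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "link")
    replace_symlink("first-target", link)
    # A plain os.symlink() call would fail here because the link exists.
    replace_symlink("second-target", link)
    final_target = os.readlink(link)

print(final_target)
```

With a check like this, a second extraction into the same directory no longer makes os.symlink() raise, so the fallback scan is never entered.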

    @dtamuc
    Mannequin Author

    dtamuc mannequin commented Mar 27, 2020

    For me, this patch solves my problems. Thank you.

    @taleinat
    Contributor

    GNU tar (v1.30, Ubuntu 20.04) does indeed overwrite files with symlinks upon extracting, while both ln -s and os.symlink do not. Therefore I agree that the appropriate behavior would seem to be to overwrite this way, as in the attached PR.
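
The os.symlink side of that comparison is easy to check. The following sketch assumes a POSIX-like system where symlink creation is permitted; the link name and target are placeholders.

```python
import os
import tempfile

second_call_error = None
with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "link")
    os.symlink("target", link)  # first creation succeeds
    try:
        os.symlink("target", link)  # the path already exists
    except FileExistsError as exc:
        second_call_error = exc

print(type(second_call_error).__name__)
```

This is the exception that the second extraction runs into when the symlink from the first run is still in place.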

    @taleinat
    Contributor

    This is actually a duplicate of bpo-12800, which itself describes precisely the same issue as bpo-10761. That issue was fixed, but the fix was then lost in a bad merge.

    I'm closing this, as discussion should happen on the original issues.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022