tarfile fails to extract archive (handled fine by gnu tar and bsdtar) #68702

Closed
pombredanne mannequin opened this issue Jun 26, 2015 · 10 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@pombredanne
Mannequin

pombredanne mannequin commented Jun 26, 2015

BPO 24514
Nosy @gustaebel, @taleinat, @pombredanne
Files
  • commons-logging-1.1.2-src.tar.gz: Problematic archive from http://archive.apache.org/dist/commons/logging/source/commons-logging-1.1.2-src.tar.gz
  • issue24514.diff: Patch for 3.4
  • issue24514.diff: New version of the patch for 3.4
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/gustaebel'
    closed_at = <Date 2015-07-02.17:45:30.823>
    created_at = <Date 2015-06-26.09:18:57.090>
    labels = ['type-bug', 'library']
    title = 'tarfile fails to extract archive (handled fine by gnu tar and bsdtar)'
    updated_at = <Date 2015-07-02.17:45:30.823>
    user = 'https://github.com/pombredanne'

    bugs.python.org fields:

    activity = <Date 2015-07-02.17:45:30.823>
    actor = 'lars.gustaebel'
    assignee = 'lars.gustaebel'
    closed = True
    closed_date = <Date 2015-07-02.17:45:30.823>
    closer = 'lars.gustaebel'
    components = ['Library (Lib)']
    creation = <Date 2015-06-26.09:18:57.090>
    creator = 'pombredanne'
    dependencies = []
    files = ['39814', '39815', '39832']
    hgrepos = []
    issue_num = 24514
    keywords = ['patch']
    message_count = 10.0
    messages = ['245839', '245840', '245844', '245845', '245846', '245847', '245848', '245934', '245936', '246090']
    nosy_count = 4.0
    nosy_names = ['lars.gustaebel', 'taleinat', 'python-dev', 'pombredanne']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue24514'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']

    @pombredanne
    Mannequin Author

    pombredanne mannequin commented Jun 26, 2015

    The extraction fails when calling tarfile.open using this archive: http://archive.apache.org/dist/commons/logging/source/commons-logging-1.1.2-src.tar.gz

    After some investigation: the file can be extracted with GNU tar and bsdtar, and the gzip compression is not the issue. If I gunzip the tar.gz to a tar and call tarfile on the plain tar, the problem is the same.
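A minimal sketch of that check, to rule out the gzip layer before blaming the tar parser. The helper names `gunzip` and `list_members_plain` are made up for illustration; only `gzip.open`, `shutil.copyfileobj` and `tarfile.TarFile.taropen` come from the standard library.

```python
import gzip
import shutil
import tarfile


def gunzip(src_path, dst_path):
    """Decompress a .tar.gz into a plain .tar (hypothetical helper)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def list_members_plain(tar_path):
    """Open an uncompressed tar with TarFile.taropen and list member names.

    taropen skips compression detection, so any ReadError raised here
    comes from the tar headers themselves, not from gzip.
    """
    with tarfile.TarFile.taropen(tar_path) as tf:
        return tf.getnames()
```

Calling `gunzip("archive.tar.gz", "archive.tar")` followed by `list_members_plain("archive.tar")` (paths are placeholders) should raise the same `ReadError` on an affected interpreter if the header is at fault.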

    Also, this archive was most likely created on Windows (based on the file command output) using some Java tools, per http://commons.apache.org/proper/commons-logging/building.html, from these original files: http://svn.apache.org/repos/asf/commons/proper/logging/tags/LOGGING_1_1_2/ ... That's all I could find out.

    The tracebacks on 2.7 and 3.4 differ slightly but end in the same error.
    The problem has been verified on 64-bit Linux with Python 2.7 and 3.4, and on Windows with Python 2.7.

    On 2.7:

    >>> TarFile.taropen(name)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/tarfile.py", line 1705, in taropen
        return cls(name, mode, fileobj, **kwargs)
      File "/usr/lib/python2.7/tarfile.py", line 1574, in __init__
        self.firstmember = self.next()
      File "/usr/lib/python2.7/tarfile.py", line 2335, in next
        raise ReadError(str(e))
    tarfile.ReadError: invalid header

    On 3.4:

    >>> TarFile.taropen(name)
    Traceback (most recent call last):
      File "/usr/lib/python3.4/tarfile.py", line 180, in nti
        n = int(nts(s, "ascii", "strict") or "0", 8)
    ValueError: invalid literal for int() with base 8: '       '
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.4/tarfile.py", line 2248, in next
        tarinfo = self.tarinfo.fromtarfile(self)
      File "/usr/lib/python3.4/tarfile.py", line 1083, in fromtarfile
        obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
      File "/usr/lib/python3.4/tarfile.py", line 1032, in frombuf
        obj.uid = nti(buf[108:116])
      File "/usr/lib/python3.4/tarfile.py", line 182, in nti
        raise InvalidHeaderError("invalid header")
    tarfile.InvalidHeaderError: invalid header
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.4/tarfile.py", line 1595, in taropen
        return cls(name, mode, fileobj, **kwargs)
      File "/usr/lib/python3.4/tarfile.py", line 1469, in __init__
        self.firstmember = self.next()
      File "/usr/lib/python3.4/tarfile.py", line 2260, in next
        raise ReadError(str(e))
    tarfile.ReadError: invalid header

    @pombredanne pombredanne mannequin added the stdlib Python modules in the Lib dir label Jun 26, 2015
    @pombredanne
    Mannequin Author

    pombredanne mannequin commented Jun 26, 2015

    Note: the tracebacks above are from calling taropen on the gunzipped tar.gz.
    The errors are similar but a tad less informative when using the tgz and open.

    @gustaebel
    Mannequin

    gustaebel mannequin commented Jun 26, 2015

    The problem is that the tar archive has empty uid and gid fields, i.e. 7 spaces terminated with a null byte, which tarfile cannot parse as an octal number.

    I attached a patch that solves the problem.
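To illustrate the failure mode, here is a minimal sketch of parsing such a number field. It is not the attached patch: `parse_octal_field` is a hypothetical helper, and it assumes that stripping whitespace before the octal conversion is the essence of the fix. It also ignores the GNU base-256 encoding that the real `nti()` handles.

```python
def parse_octal_field(field: bytes) -> int:
    """Parse a NUL-terminated octal number field from a tar header.

    Hypothetical helper for illustration only; a field that is empty or
    contains only whitespace is treated as zero.
    """
    # Keep everything before the first NUL byte, as tar number fields
    # are NUL-terminated.
    text = field.split(b"\0", 1)[0].decode("ascii")
    # int("       ", 8) raises ValueError, so fall back to "0" when
    # nothing is left after stripping whitespace.
    return int(text.strip() or "0", 8)


blank = b" " * 7 + b"\0"  # the uid/gid field found in the archive

try:
    int(blank[:-1].decode("ascii"), 8)
except ValueError as exc:
    print("strict parse fails:", exc)

print("tolerant parse:", parse_octal_field(blank))
print("regular field:", parse_octal_field(b"0001750\0"))
```

The strict conversion mirrors the `ValueError` at the top of the 3.4 traceback; the tolerant version maps the blank field to 0 instead.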

    @pombredanne
    Mannequin Author

    pombredanne mannequin commented Jun 26, 2015

    lars: you are my hero! you rock. I picture you being able to read through tar binary headers while you sleep. I am in awe.

    @gustaebel
    Mannequin

    gustaebel mannequin commented Jun 26, 2015

    You're welcome :-D

    @gustaebel gustaebel mannequin self-assigned this Jun 26, 2015
    @gustaebel gustaebel mannequin added the type-bug An unexpected behavior, bug, or error label Jun 26, 2015
    @pombredanne
    Mannequin Author

    pombredanne mannequin commented Jun 26, 2015

    I verified that the patch issue24514.diff (adding .rstrip()) also works on Python 2.7, and that it works on Python 3.4.

    I ran it on 2.7 against a fairly large test suite of tar files without problems.

    This is a +1 for me.

    Lars: Do you think you could apply it to 2.7 too?

    @gustaebel
    Mannequin

    gustaebel mannequin commented Jun 26, 2015

    Yes, Python 2.7 still gets bugfixes.

    However, there's still some work to do on the patch (maybe clean up the code, write a test, add a NEWS entry).

    @taleinat
    Contributor

    The patch is very simple, but this needs tests. At the very least, a simple tar file which reproduces this issue could be added to the tests.
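One way to get such a reproducer without shipping a binary fixture is to build it in memory. The sketch below is an assumption about how that could look, not what was committed: it writes a one-member ustar archive, overwrites the uid and gid fields (header offsets 108:116 and 116:124) with 7 spaces plus a NUL, and recomputes the header checksum. `make_tar_with_blank_ids` and `read_members` are hypothetical names.

```python
import io
import tarfile

PAYLOAD = b"hello\n"


def make_tar_with_blank_ids() -> bytes:
    """Return a small ustar archive whose uid/gid fields are blank."""
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(PAYLOAD)
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        tf.addfile(info, io.BytesIO(PAYLOAD))
    data = bytearray(raw.getvalue())

    blank = b" " * 7 + b"\0"
    data[108:116] = blank  # uid field
    data[116:124] = blank  # gid field

    # The checksum is the byte sum of the 512-byte header with the
    # checksum field (148:156) itself counted as eight spaces.
    data[148:156] = b" " * 8
    chksum = sum(data[:512])
    data[148:156] = ("%06o\0 " % chksum).encode("ascii")
    return bytes(data)


def read_members(data: bytes):
    """Open the archive from memory; return (name, uid, gid, content) tuples."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
        return [
            (m.name, m.uid, m.gid, tf.extractfile(m).read())
            for m in tf.getmembers()
        ]


if __name__ == "__main__":
    print(read_members(make_tar_with_blank_ids()))
```

On an interpreter without the fix, `read_members` should fail with `ReadError: invalid header`; with the fix, uid and gid are expected to read back as 0.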

    Taking this a step further would be writing some unit tests for the internal nti() and itn() functions, and perhaps also stn() and nts().

    @gustaebel
    Mannequin

    gustaebel mannequin commented Jun 29, 2015

    I think a simple addition to the existing unittest for nti() will be enough. itn() seems well-tested, and nts() and stn() are not affected, because they don't operate on numbers.
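A rough sketch of what such an addition might look like, written as a standalone test case rather than against the real test module. The class and method names are invented, and `tarfile.nti` is an internal function, so this relies on it staying importable; it is only expected to pass on interpreters that include the fix.

```python
import tarfile
import unittest


class BlankNumberFieldTest(unittest.TestCase):
    """Hypothetical test case; not the test that was committed."""

    def test_whitespace_only_field_is_zero(self):
        # 7 spaces plus a NUL, as in the uid/gid fields of the archive.
        self.assertEqual(tarfile.nti(b"       \0"), 0)

    def test_regular_octal_field(self):
        # A normal zero-padded octal field should be unaffected.
        self.assertEqual(tarfile.nti(b"0000755\0"), 0o755)


# Run the suite explicitly so the snippet also works when pasted into a
# session, without unittest.main() exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BlankNumberFieldTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```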

    @python-dev
    Mannequin

    python-dev mannequin commented Jul 2, 2015

    New changeset 301d7efac3de by Lars Gustäbel in branch '2.7':
    Issue bpo-24514: tarfile now tolerates number fields consisting of only whitespace.
    https://hg.python.org/cpython/rev/301d7efac3de

    New changeset 140b4b7b84bd by Lars Gustäbel in branch '3.4':
    Issue bpo-24514: tarfile now tolerates number fields consisting of only whitespace.
    https://hg.python.org/cpython/rev/140b4b7b84bd

    New changeset 1692065524cc by Lars Gustäbel in branch '3.5':
    Merge with 3.4: Issue bpo-24514: tarfile now tolerates number fields consisting of only whitespace.
    https://hg.python.org/cpython/rev/1692065524cc

    New changeset 08fad9037206 by Lars Gustäbel in branch 'default':
    Merge with 3.5: Issue bpo-24514: tarfile now tolerates number fields consisting of only whitespace.
    https://hg.python.org/cpython/rev/08fad9037206

    @gustaebel gustaebel mannequin closed this as completed Jul 2, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022