Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tar.extractall() does not recognize unexpected EOF #68447

Closed
ThomasGttler mannequin opened this issue May 21, 2015 · 21 comments
Closed

tar.extractall() does not recognize unexpected EOF #68447

ThomasGttler mannequin opened this issue May 21, 2015 · 21 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ThomasGttler
Copy link
Mannequin

ThomasGttler mannequin commented May 21, 2015

BPO 24259
Nosy @gustaebel, @ethanfurman, @vadmium
Files
  • 01-issue24259-test.diff
  • tar_which_is_cut.tar
  • issue24259-3.x.diff
  • issue24259-2.x.diff
  • issue24259-3.x-2.diff
  • issue24259-2.x-2.diff
  • issue24259-3.x-3.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gustaebel'
    closed_at = <Date 2015-07-06.07:34:50.217>
    created_at = <Date 2015-05-21.15:18:16.111>
    labels = ['type-bug', 'library']
    title = 'tar.extractall() does not recognize unexpected EOF'
    updated_at = <Date 2015-07-06.07:34:50.217>
    user = 'https://bugs.python.org/ThomasGttler'

    bugs.python.org fields:

    activity = <Date 2015-07-06.07:34:50.217>
    actor = 'lars.gustaebel'
    assignee = 'lars.gustaebel'
    closed = True
    closed_date = <Date 2015-07-06.07:34:50.217>
    closer = 'lars.gustaebel'
    components = ['Library (Lib)']
    creation = <Date 2015-05-21.15:18:16.111>
    creator = 'Thomas G\xc3\xbcttler'
    dependencies = []
    files = ['39528', '39531', '39543', '39544', '39579', '39580', '39837']
    hgrepos = []
    issue_num = 24259
    keywords = ['patch']
    message_count = 21.0
    messages = ['243755', '244152', '244233', '244235', '244236', '244237', '244239', '244248', '244254', '244282', '244287', '244289', '244290', '244360', '244361', '244433', '244455', '244572', '245584', '245979', '246350']
    nosy_count = 6.0
    nosy_names = ['guettli', 'lars.gustaebel', 'ethan.furman', 'python-dev', 'martin.panter', 'Thomas G\xc3\xbcttler']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue24259'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']

    @ThomasGttler
    Copy link
    Mannequin Author

    ThomasGttler mannequin commented May 21, 2015

    The Python tarfile library does not detect a broken tar.

    user@host$ wc -c good.tar
    143360 good.tar

    user@host$ head -c 130000 good.tar > cut.tar

    user@host$ tar -tf cut.tar
    ...
    tar: Unexpected EOF in archive
    tar: Error is not recoverable: exiting now

    Very nice, the command line tool recognizes an unexpected EOF.

    user@host$ python
    Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
    >>> import tarfile
    >>> tar=tarfile.open('cut.tar')
    >>> tar.extractall()

    Not nice. The Python library decodes the file, but raises no exception.

    Is this a bug or feature?

    Source: http://stackoverflow.com/questions/30302204/tar-extractall-does-not-recognize-unexpected-eof

    @guettli
    Copy link
    Mannequin

    guettli mannequin commented May 27, 2015

    Who has enough knowledge of the tarfile module to create a good patch?

    @vadmium
    Copy link
    Member

    vadmium commented May 28, 2015

    I might be able to make a patch, but what should the patch do exactly?

    • Raise an exception as soon as something wrong is found
    • Defer exceptions until close() is called, to allow partial data recovery
    • Add some sort of defects API that you can optionally inspect, like the email message library has
    • A new tarfile.open(strict=True) mode

    @vadmium vadmium added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels May 28, 2015
    @vadmium
    Copy link
    Member

    vadmium commented May 28, 2015

    Actually, looking closer at the module, perhaps you just need to set the errorlevel=1 option:

    >>> with tarfile.open("truncated.tar", errorlevel=1) as tar:...     tar.extractall("test-dir")
    ... 
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/home/proj/python/cpython/Lib/tarfile.py", line 1996, in extractall
        numeric_owner=numeric_owner)
      File "/home/proj/python/cpython/Lib/tarfile.py", line 2038, in extract
        numeric_owner=numeric_owner)
      File "/home/proj/python/cpython/Lib/tarfile.py", line 2108, in _extract_member
        self.makefile(tarinfo, targetpath)
      File "/home/proj/python/cpython/Lib/tarfile.py", line 2154, in makefile
        copyfileobj(source, target, tarinfo.size)
      File "/home/proj/python/cpython/Lib/tarfile.py", line 242, in copyfileobj
        raise OSError("end of file reached")
    OSError: end of file reached

    @ethanfurman
    Copy link
    Member

    On the SO question [1] the OP stated that he tried errorlevel of both 1 and 2 with no effect... however, he was using Python2.6.

    Martin, can you run that same test with 2.6 to verify that errorlevel did not work there, but does now?

    [1] http://stackoverflow.com/q/30302204/208880

    @ethanfurman
    Copy link
    Member

    Actually, the OP was using 2.7.6.

    @ethanfurman
    Copy link
    Member

    I ran the OP's code in 2.7, 3.3, 3.4, and 3.5, as well as using ubuntu's gnu tar 1.27.1:

    gnu tar did not report any errors.

    Python (all tested versions) did not report any errors (with the errorlevel parameter missing, set to 1, and set to 2).

    @vadmium
    Copy link
    Member

    vadmium commented May 28, 2015

    I guess it depends on the particular tar file and where it gets truncated. Just now I tested with a tar file created from Python’s Tools directory:

    $ tar c Tools/ > good.tar
    $ ls -gG good.tar 
    -rw-r--r-- 1 17397760 May 28 02:43 good.tar
    $ head -c 130000 good.tar > cut.tar
    $ tar -tf cut.tar
    Tools/
    [. . .]
    Tools/hg/hgtouch.py
    tar: Unexpected EOF in archive
    tar: Error is not recoverable: exiting now
    [Exit 2]

    Using Python 2.7.9, I can reproduce Thomas’s behaviour. It extracts the files with no error indication, and actually truncated the last extracted file to 2512 bytes. Adding errorlevel=1 or errorlevel=2 did not make a difference.

    Then I tried Python 3.3, 3.4 and 3.6 in various forms. They all seem to raise OSError, even with no errorlevel specified, and create the last file with zero bytes. So in this case I think it is a Python 2 bug (which I’m personally not so motivated to fix).

    Ethan: how did you create and truncate your tar file? It almost sounds like it is intact, or maybe didn’t get truncated enough to completely wipe out the EOF block.

    @vadmium vadmium added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels May 28, 2015
    @ethanfurman
    Copy link
    Member

    I took an existing tar file and chopped it in half with head -c. I was able to extract half the files, but I didn't check the viability of the last file as I was looking for tar or python error feedback.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented May 28, 2015

    I have written a test for the issue, so that we have a basis for discussion.

    There are four different scenarios where an unexpected eof can occur: inside a metadata block, directly after a metadata block, inside a data segment or directly after a data segment (i.e. missing end of archive marker).

    Case #1 is taken care of (TruncatedHeaderError).

    Case #4 is merely a violation of standard, which is neglectable.

    Case #2 and #3 are essentially the same. If a data segment is empty or incomplete this means data was lost when the archive was created which should not go unnoticed when reading it. (see _FileInFile.read() for the code in question)

    The problem is that, even after we have fixed case #2 and #4, we have no reliable way to detect an incomplete data segment unless we read it and count the bytes. If we simply iterate over the TarFile (e.g. do a TarFile.list()) the archive will appear intact. That is because in the TarFile.next() method we seek from one metadata block to the next, but we cannot simply detect if we seek beyond the end of the archive - except if we insist on the premise that each tar that we read is standards-compliant and comes with an end of archive marker (see case #4), which we probably should not.

    Three possible options come to my mind:

    1. Add a warning to the documentation that in order to test the integrity of an archive the user has to read through all the data segments.
    2. Instead of using seek() in TarFile.next() use read() to advance the file pointer. This is a negative impact on the performance in most cases.
    3. Insist on an end of archive marker. This has the disadvantage that users may get an exception although everything is fine.

    @gustaebel gustaebel mannequin self-assigned this May 28, 2015
    @vadmium
    Copy link
    Member

    vadmium commented May 28, 2015

    If you are already seeking in the file, can’t you seek to the end to determine the length of the file, and then use that to verify if a data segment is truncated? And if you can’t seek, I guess you have to read all the bytes anyway.

    I guess Ethan’s test was an instance of case #4 (EOF directly after data block).

    @guettli
    Copy link
    Mannequin

    guettli mannequin commented May 28, 2015

    I uploaded a broken tar for testing:

    tguettler@aptguettler:~/tmp
    ===> LANG=C tar -xf tar_which_is_cut.tar
    tar: Unexpected EOF in archive
    tar: Unexpected EOF in archive
    tar: Error is not recoverable: exiting now

    tguettler@aptguettler:~/tmp
    ===> python
    Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tarfile
    >>> tarfile.TarFile('tar_which_is_cut.tar', errorlevel=2).extractall()
    >>> 
    (Sad, no error)

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented May 28, 2015

    @martin:

    Yes, that's right, but only for cases where the TarFile.fileobj attribute is an actual file object. But, most of the time it is something special, e.g. GzipFile or sys.stdin, which makes random seeking either impossible or perform very badly.

    But thanks for your objection, I have to withdraw the statement I made under option 2.: compressed archives are much more common than uncompressed ones. We probably wouldn't lose too much if we no longer use seek() but read() in TarFile.next(). Reading in an uncompressed file is fast anyway. I have to think about this.

    @guettli
    Copy link
    Mannequin

    guettli mannequin commented May 29, 2015

    I thought about this again.

    It could be solved with the help of a ByteCountingStreamReader.

    With ByteCountingStreamReader I mean a wrapper around a stream like codescs.StreamReader. But the ByteCountingStreamReader should not changes the content, but just count the bytes it passed.

    The ByteCountingStreamReader would be wrapped around ExFileObject.

    This class could be handy in other situations.

    What do you think?

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented May 29, 2015

    @thomas:

    I think your proposal adds a little too much complexity. Also, ExFileObject is not used during iteration, and we would like to detect broken archives without unpacking all the data segments first.

    I have written patches for Python 2 and 3.

    @vadmium
    Copy link
    Member

    vadmium commented May 30, 2015

    For the record, the difference between Python 2 and 3 is probably a side effect of revision 050f0f7be11e. Python 2 copies data from the ExFileObject returned by extractfile(), while Python 3 copies directly from the underlying file.

    The patches to the file reading class look good.

    I would be a bit hesitant about the bit that reads all the file data in the next() method. I guess if someone had an uncompressed tar file with only a couple of large files, and they just wanted to list the file names or extract a small file at the end, reading all the data would have a significant impact. Perhaps there is a way to seek almost to the end, and then just read one byte or something.

    Also, beware that according to the documentation, os.truncate() only supports Windows in 3.5+, so you might have to adjust the test if applying this to 3.4.

    @guettli
    Copy link
    Mannequin

    guettli mannequin commented May 30, 2015

    With Python 3.4.0 you get an OSError if you try to extractall() the uploaded tar_which_is_cut.tar. That's nice.

    Seems like only 2.7 seems to be buggy.

    ===> python3
    Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
    [GCC 4.8.2] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tarfile
    >>> tarfile.TarFile('tar_which_is_cut.tar').extractall()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.4/tarfile.py", line 1979, in extractall
        self.extract(tarinfo, path, set_attrs=not tarinfo.isdir())
      File "/usr/lib/python3.4/tarfile.py", line 2018, in extract
        set_attrs=set_attrs)
      File "/usr/lib/python3.4/tarfile.py", line 2087, in _extract_member
        self.makefile(tarinfo, targetpath)
      File "/usr/lib/python3.4/tarfile.py", line 2133, in makefile
        copyfileobj(source, target, tarinfo.size)
      File "/usr/lib/python3.4/tarfile.py", line 247, in copyfileobj
        raise OSError("end of file reached")
    OSError: end of file reached

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Jun 1, 2015

    @martin:

    This is actually a nice idea that I hadn't thought of. I updated the Python 3 patch to use a seek() that moves to one byte before the next header block, reads the remaining byte and raises an error if it hits eof. The code looks rather clean compared to the previous patch, and it should perform like it always did.

    I am not quite sure about which exception type to use, ReadError is used in tarfile's header parsing code, but OSError is already used in tarfile.copyfileobj() and might be more like what the user expects.

    @vadmium
    Copy link
    Member

    vadmium commented Jun 21, 2015

    From the current documentation and limited experience with the module, ReadError (or a subclass) sounds best. I would only expect OSError only for OS-level things, like file not found, disk error, etc.

    The patches look good. One last suggestion is to use assertRaisesRegex, to be sure that you are getting exactly the exception you expect. Too many times I have found tests in the test suite that were completely broken, but happened to pass because the error happened to match the exception being tested.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Jun 30, 2015

    Martin, I followed your suggestion to raise ReadError. This needed an additional change in copyfileobj() because it is used both for adding file data to an archive and extracting file data from an archive.

    But I think the patch is in good shape now.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 6, 2015

    New changeset 372aa98eb72e by Lars Gustäbel in branch '2.7':
    Issue bpo-24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
    https://hg.python.org/cpython/rev/372aa98eb72e

    New changeset c7f4f61697b7 by Lars Gustäbel in branch '3.4':
    Issue bpo-24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
    https://hg.python.org/cpython/rev/c7f4f61697b7

    New changeset 59cbdc9eb3d9 by Lars Gustäbel in branch '3.5':
    Merge with 3.4: Issue bpo-24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
    https://hg.python.org/cpython/rev/59cbdc9eb3d9

    New changeset 6be8fa47c002 by Lars Gustäbel in branch 'default':
    Merge with 3.5: Issue bpo-24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
    https://hg.python.org/cpython/rev/6be8fa47c002

    @gustaebel gustaebel mannequin closed this as completed Jul 6, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants