classification
Title: tar.extractall() does not recognize unexpected EOF
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: Thomas Güttler, ethan.furman, guettli, lars.gustaebel, martin.panter, python-dev
Priority: normal Keywords: patch

Created on 2015-05-21 15:18 by Thomas Güttler, last changed 2015-07-06 07:34 by lars.gustaebel. This issue is now closed.

Files
File name Uploaded Description Edit
01-issue24259-test.diff lars.gustaebel, 2015-05-28 07:42 review
tar_which_is_cut.tar guettli, 2015-05-28 08:41
issue24259-3.x.diff lars.gustaebel, 2015-05-29 07:43
issue24259-2.x.diff lars.gustaebel, 2015-05-29 07:43
issue24259-3.x-2.diff lars.gustaebel, 2015-06-01 06:49
issue24259-2.x-2.diff lars.gustaebel, 2015-06-01 06:53
issue24259-3.x-3.diff lars.gustaebel, 2015-06-30 06:54
Messages (21)
msg243755 - (view) Author: Thomas Güttler (Thomas Güttler) * Date: 2015-05-21 15:18
The Python tarfile library does not detect a broken tar.

user@host$ wc -c good.tar
143360 good.tar

user@host$ head -c 130000 good.tar > cut.tar

user@host$ tar -tf cut.tar 
...
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Very nice, the command line tool recognizes an unexpected EOF.

user@host$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
>>> import tarfile
>>> tar=tarfile.open('cut.tar')
>>> tar.extractall()

Not nice. The Python library decodes the file, but raises no exception.

Is this a bug or feature?

Source: http://stackoverflow.com/questions/30302204/tar-extractall-does-not-recognize-unexpected-eof
msg244152 - (view) Author: Thomas Guettler (guettli) Date: 2015-05-27 09:12
Who has enough knowledge of the tarfile module to create a good patch?
msg244233 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-28 00:57
I might be able to make a patch, but what should the patch do exactly?

* Raise an exception as soon as something wrong is found
* Defer exceptions until close() is called, to allow partial data recovery
* Add some sort of defects API that you can optionally inspect, like the email message library has
* A new tarfile.open(strict=True) mode
msg244235 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-28 01:08
Actually, looking closer at the module, perhaps you just need to set the errorlevel=1 option:

>>> with tarfile.open("truncated.tar", errorlevel=1) as tar:...     tar.extractall("test-dir")
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/proj/python/cpython/Lib/tarfile.py", line 1996, in extractall
    numeric_owner=numeric_owner)
  File "/home/proj/python/cpython/Lib/tarfile.py", line 2038, in extract
    numeric_owner=numeric_owner)
  File "/home/proj/python/cpython/Lib/tarfile.py", line 2108, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/home/proj/python/cpython/Lib/tarfile.py", line 2154, in makefile
    copyfileobj(source, target, tarinfo.size)
  File "/home/proj/python/cpython/Lib/tarfile.py", line 242, in copyfileobj
    raise OSError("end of file reached")
OSError: end of file reached
msg244236 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2015-05-28 01:15
On the SO question [1] the OP stated that he tried errorlevel of both 1 and 2 with no effect... however, he was using Python2.6.

Martin, can you run that same test with 2.6 to verify that errorlevel did not work there, but does now?


[1] http://stackoverflow.com/q/30302204/208880
msg244237 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2015-05-28 01:21
Actually, the OP was using 2.7.6.
msg244239 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2015-05-28 01:28
I ran the OP's code in 2.7, 3.3, 3.4, and 3.5, as well as using ubuntu's gnu tar 1.27.1:

gnu tar did not report any errors.

Python (all tested versions) did not report any errors (with the errorlevel parameter missing, set to 1, and set to 2).
msg244248 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-28 03:04
I guess it depends on the particular tar file and where it gets truncated. Just now I tested with a tar file created from Python’s Tools directory:

$ tar c Tools/ > good.tar
$ ls -gG good.tar 
-rw-r--r-- 1 17397760 May 28 02:43 good.tar
$ head -c 130000 good.tar > cut.tar
$ tar -tf cut.tar
Tools/
[. . .]
Tools/hg/hgtouch.py
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
[Exit 2]

Using Python 2.7.9, I can reproduce Thomas’s behaviour. It extracts the files with no error indication, and actually truncated the last extracted file to 2512 bytes. Adding errorlevel=1 or errorlevel=2 did not make a difference.

Then I tried Python 3.3, 3.4 and 3.6 in various forms. They all seem to raise OSError, even with no errorlevel specified, and create the last file with zero bytes. So in this case I think it is a Python 2 bug (which I’m personally not so motivated to fix).

Ethan: how did you create and truncate your tar file? It almost sounds like it is intact, or maybe didn’t get truncated enough to completely wipe out the EOF block.
msg244254 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2015-05-28 03:42
I took an existing tar file and chopped it in half with `head -c`.  I was able to extract half the files, but I didn't check the viability of the last file as I was looking for tar or python error feedback.
msg244282 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-05-28 07:42
I have written a test for the issue, so that we have a basis for discussion.

There are four different scenarios where an unexpected eof can occur: inside a metadata block, directly after a metadata block, inside a data segment or directly after a data segment (i.e. missing end of archive marker).

Case #1 is taken care of (TruncatedHeaderError).

Case #4 is merely a violation of standard, which is neglectable.

Case #2 and #3 are essentially the same. If a data segment is empty or incomplete this means data was lost when the archive was created which should not go unnoticed when reading it. (see _FileInFile.read() for the code in question)

The problem is that, even after we have fixed case #2 and #4, we have no reliable way to detect an incomplete data segment unless we read it and count the bytes. If we simply iterate over the TarFile (e.g. do a TarFile.list()) the archive will appear intact. That is because in the TarFile.next() method we seek from one metadata block to the next, but we cannot simply detect if we seek beyond the end of the archive - except if we insist on the premise that each tar that we read is standards-compliant and comes with an end of archive marker (see case #4), which we probably should not.

Three possible options come to my mind:

1. Add a warning to the documentation that in order to test the integrity of an archive the user has to read through all the data segments.
2. Instead of using seek() in TarFile.next() use read() to advance the file pointer. This is a negative impact on the performance in most cases.
3. Insist on an end of archive marker. This has the disadvantage that users may get an exception although everything is fine.
msg244287 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-28 08:20
If you are already seeking in the file, can’t you seek to the end to determine the length of the file, and then use that to verify if a data segment is truncated? And if you can’t seek, I guess you have to read all the bytes anyway.

I guess Ethan’s test was an instance of case #4 (EOF directly after data block).
msg244289 - (view) Author: Thomas Guettler (guettli) Date: 2015-05-28 08:41
I uploaded a broken tar for testing:

tguettler@aptguettler:~/tmp
===> LANG=C tar -xf tar_which_is_cut.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

tguettler@aptguettler:~/tmp
===> python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tarfile
>>> tarfile.TarFile('tar_which_is_cut.tar', errorlevel=2).extractall()
>>> 
(Sad, no error)
msg244290 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-05-28 09:03
@Martin:

Yes, that's right, but only for cases where the TarFile.fileobj attribute is an actual file object. But, most of the time it is something special, e.g. GzipFile or sys.stdin, which makes random seeking either impossible or perform very badly.

But thanks for your objection, I have to withdraw the statement I made under option 2.: compressed archives are much more common than uncompressed ones. We probably wouldn't lose too much if we no longer use seek() but read() in TarFile.next(). Reading in an uncompressed file is fast anyway. I have to think about this.
msg244360 - (view) Author: Thomas Guettler (guettli) Date: 2015-05-29 06:25
I thought about this again. 

It could be solved with the help of a ByteCountingStreamReader.

With ByteCountingStreamReader I mean a wrapper around a stream like codescs.StreamReader. But the ByteCountingStreamReader should not changes the content, but just count the bytes it passed.

The ByteCountingStreamReader would be wrapped around ExFileObject.

This class could be handy in other situations.

What do you think?
msg244361 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-05-29 07:43
@Thomas:

I think your proposal adds a little too much complexity. Also, ExFileObject is not used during iteration, and we would like to detect broken archives without unpacking all the data segments first.

I have written patches for Python 2 and 3.
msg244433 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-30 01:00
For the record, the difference between Python 2 and 3 is probably a side effect of revision 050f0f7be11e. Python 2 copies data from the ExFileObject returned by extractfile(), while Python 3 copies directly from the underlying file.

The patches to the file reading class look good.

I would be a bit hesitant about the bit that reads all the file data in the next() method. I guess if someone had an uncompressed tar file with only a couple of large files, and they just wanted to list the file names or extract a small file at the end, reading all the data would have a significant impact. Perhaps there is a way to seek almost to the end, and then just read one byte or something.

Also, beware that according to the documentation, os.truncate() only supports Windows in 3.5+, so you might have to adjust the test if applying this to 3.4.
msg244455 - (view) Author: Thomas Guettler (guettli) Date: 2015-05-30 11:00
With Python 3.4.0 you get an OSError if you try to extractall() the uploaded tar_which_is_cut.tar. That's nice.

Seems like only 2.7 seems to be buggy.

===> python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tarfile
>>> tarfile.TarFile('tar_which_is_cut.tar').extractall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/tarfile.py", line 1979, in extractall
    self.extract(tarinfo, path, set_attrs=not tarinfo.isdir())
  File "/usr/lib/python3.4/tarfile.py", line 2018, in extract
    set_attrs=set_attrs)
  File "/usr/lib/python3.4/tarfile.py", line 2087, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/usr/lib/python3.4/tarfile.py", line 2133, in makefile
    copyfileobj(source, target, tarinfo.size)
  File "/usr/lib/python3.4/tarfile.py", line 247, in copyfileobj
    raise OSError("end of file reached")
OSError: end of file reached
msg244572 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-01 06:49
@Martin:

This is actually a nice idea that I hadn't thought of. I updated the Python 3 patch to use a seek() that moves to one byte before the next header block, reads the remaining byte and raises an error if it hits eof. The code looks rather clean compared to the previous patch, and it should perform like it always did.

I am not quite sure about which exception type to use, ReadError is used in tarfile's header parsing code, but OSError is already used in tarfile.copyfileobj() and might be more like what the user expects.
msg245584 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-06-21 08:55
From the current documentation and limited experience with the module, ReadError (or a subclass) sounds best. I would only expect OSError only for OS-level things, like file not found, disk error, etc.

The patches look good. One last suggestion is to use assertRaisesRegex, to be sure that you are getting exactly the exception you expect. Too many times I have found tests in the test suite that were completely broken, but happened to pass because the error happened to match the exception being tested.
msg245979 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-30 06:54
Martin, I followed your suggestion to raise ReadError. This needed an additional change in copyfileobj() because it is used both for adding file data to an archive and extracting file data from an archive.

But I think the patch is in good shape now.
msg246350 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-07-06 07:33
New changeset 372aa98eb72e by Lars Gustäbel in branch '2.7':
Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/372aa98eb72e

New changeset c7f4f61697b7 by Lars Gustäbel in branch '3.4':
Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/c7f4f61697b7

New changeset 59cbdc9eb3d9 by Lars Gustäbel in branch '3.5':
Merge with 3.4: Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/59cbdc9eb3d9

New changeset 6be8fa47c002 by Lars Gustäbel in branch 'default':
Merge with 3.5: Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/6be8fa47c002
History
Date User Action Args
2015-07-06 07:34:50lars.gustaebelsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2015-07-06 07:33:13python-devsetnosy: + python-dev
messages: + msg246350
2015-06-30 06:54:22lars.gustaebelsetfiles: + issue24259-3.x-3.diff

messages: + msg245979
2015-06-21 08:55:37martin.pantersetmessages: + msg245584
2015-06-01 06:53:26lars.gustaebelsetfiles: + issue24259-2.x-2.diff
2015-06-01 06:49:39lars.gustaebelsetfiles: + issue24259-3.x-2.diff

messages: + msg244572
2015-05-30 11:00:26guettlisetmessages: + msg244455
2015-05-30 01:00:38martin.pantersetmessages: + msg244433
2015-05-29 07:43:18lars.gustaebelsetfiles: + issue24259-2.x.diff
2015-05-29 07:43:09lars.gustaebelsetfiles: + issue24259-3.x.diff

stage: patch review
messages: + msg244361
versions: + Python 3.4, Python 3.5, Python 3.6
2015-05-29 06:25:38guettlisetmessages: + msg244360
2015-05-28 09:03:43lars.gustaebelsetmessages: + msg244290
2015-05-28 08:41:20guettlisetfiles: + tar_which_is_cut.tar

messages: + msg244289
2015-05-28 08:20:06martin.pantersetmessages: + msg244287
2015-05-28 07:42:54lars.gustaebelsetfiles: + 01-issue24259-test.diff
assignee: lars.gustaebel
messages: + msg244282

keywords: + patch
2015-05-28 03:42:37ethan.furmansetmessages: + msg244254
2015-05-28 03:04:57martin.pantersettype: enhancement -> behavior
messages: + msg244248
versions: + Python 2.7
2015-05-28 01:28:19ethan.furmansetmessages: + msg244239
2015-05-28 01:21:58ethan.furmansetmessages: + msg244237
2015-05-28 01:15:28ethan.furmansetmessages: + msg244236
2015-05-28 01:08:19martin.pantersetmessages: + msg244235
2015-05-28 00:57:17martin.pantersetnosy: + martin.panter
messages: + msg244233

components: + Library (Lib)
type: enhancement
2015-05-27 09:12:24guettlisetnosy: + guettli
messages: + msg244152
2015-05-22 01:37:55ethan.furmansetnosy: + ethan.furman
2015-05-21 21:26:32ned.deilysetnosy: + lars.gustaebel
2015-05-21 15:18:16Thomas Güttlercreate