msg243755 - (view) |
Author: Thomas Güttler (Thomas Güttler) * |
Date: 2015-05-21 15:18 |
The Python tarfile library does not detect a broken tar.
user@host$ wc -c good.tar
143360 good.tar
user@host$ head -c 130000 good.tar > cut.tar
user@host$ tar -tf cut.tar
...
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Very nice, the command line tool recognizes an unexpected EOF.
user@host$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
>>> import tarfile
>>> tar=tarfile.open('cut.tar')
>>> tar.extractall()
Not nice. The Python library decodes the file, but raises no exception.
Is this a bug or feature?
Source: http://stackoverflow.com/questions/30302204/tar-extractall-does-not-recognize-unexpected-eof
|
msg244152 - (view) |
Author: Thomas Guettler (guettli) * |
Date: 2015-05-27 09:12 |
Who has enough knowledge of the tarfile module to create a good patch?
|
msg244233 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-05-28 00:57 |
I might be able to make a patch, but what should the patch do exactly?
* Raise an exception as soon as something wrong is found
* Defer exceptions until close() is called, to allow partial data recovery
* Add some sort of defects API that you can optionally inspect, like the email message library has
* A new tarfile.open(strict=True) mode
|
msg244235 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-05-28 01:08 |
Actually, looking closer at the module, perhaps you just need to set the errorlevel=1 option:
>>> with tarfile.open("truncated.tar", errorlevel=1) as tar:... tar.extractall("test-dir")
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/proj/python/cpython/Lib/tarfile.py", line 1996, in extractall
numeric_owner=numeric_owner)
File "/home/proj/python/cpython/Lib/tarfile.py", line 2038, in extract
numeric_owner=numeric_owner)
File "/home/proj/python/cpython/Lib/tarfile.py", line 2108, in _extract_member
self.makefile(tarinfo, targetpath)
File "/home/proj/python/cpython/Lib/tarfile.py", line 2154, in makefile
copyfileobj(source, target, tarinfo.size)
File "/home/proj/python/cpython/Lib/tarfile.py", line 242, in copyfileobj
raise OSError("end of file reached")
OSError: end of file reached
|
msg244236 - (view) |
Author: Ethan Furman (ethan.furman) * |
Date: 2015-05-28 01:15 |
On the SO question [1] the OP stated that he tried errorlevel of both 1 and 2 with no effect... however, he was using Python2.6.
Martin, can you run that same test with 2.6 to verify that errorlevel did not work there, but does now?
[1] http://stackoverflow.com/q/30302204/208880
|
msg244237 - (view) |
Author: Ethan Furman (ethan.furman) * |
Date: 2015-05-28 01:21 |
Actually, the OP was using 2.7.6.
|
msg244239 - (view) |
Author: Ethan Furman (ethan.furman) * |
Date: 2015-05-28 01:28 |
I ran the OP's code in 2.7, 3.3, 3.4, and 3.5, as well as using ubuntu's gnu tar 1.27.1:
gnu tar did not report any errors.
Python (all tested versions) did not report any errors (with the errorlevel parameter missing, set to 1, and set to 2).
|
msg244248 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-05-28 03:04 |
I guess it depends on the particular tar file and where it gets truncated. Just now I tested with a tar file created from Python’s Tools directory:
$ tar c Tools/ > good.tar
$ ls -gG good.tar
-rw-r--r-- 1 17397760 May 28 02:43 good.tar
$ head -c 130000 good.tar > cut.tar
$ tar -tf cut.tar
Tools/
[. . .]
Tools/hg/hgtouch.py
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
[Exit 2]
Using Python 2.7.9, I can reproduce Thomas’s behaviour. It extracts the files with no error indication, and actually truncated the last extracted file to 2512 bytes. Adding errorlevel=1 or errorlevel=2 did not make a difference.
Then I tried Python 3.3, 3.4 and 3.6 in various forms. They all seem to raise OSError, even with no errorlevel specified, and create the last file with zero bytes. So in this case I think it is a Python 2 bug (which I’m personally not so motivated to fix).
Ethan: how did you create and truncate your tar file? It almost sounds like it is intact, or maybe didn’t get truncated enough to completely wipe out the EOF block.
|
msg244254 - (view) |
Author: Ethan Furman (ethan.furman) * |
Date: 2015-05-28 03:42 |
I took an existing tar file and chopped it in half with `head -c`. I was able to extract half the files, but I didn't check the viability of the last file as I was looking for tar or python error feedback.
|
msg244282 - (view) |
Author: Lars Gustäbel (lars.gustaebel) * |
Date: 2015-05-28 07:42 |
I have written a test for the issue, so that we have a basis for discussion.
There are four different scenarios where an unexpected eof can occur: inside a metadata block, directly after a metadata block, inside a data segment or directly after a data segment (i.e. missing end of archive marker).
Case #1 is taken care of (TruncatedHeaderError).
Case #4 is merely a violation of standard, which is neglectable.
Case #2 and #3 are essentially the same. If a data segment is empty or incomplete this means data was lost when the archive was created which should not go unnoticed when reading it. (see _FileInFile.read() for the code in question)
The problem is that, even after we have fixed case #2 and #4, we have no reliable way to detect an incomplete data segment unless we read it and count the bytes. If we simply iterate over the TarFile (e.g. do a TarFile.list()) the archive will appear intact. That is because in the TarFile.next() method we seek from one metadata block to the next, but we cannot simply detect if we seek beyond the end of the archive - except if we insist on the premise that each tar that we read is standards-compliant and comes with an end of archive marker (see case #4), which we probably should not.
Three possible options come to my mind:
1. Add a warning to the documentation that in order to test the integrity of an archive the user has to read through all the data segments.
2. Instead of using seek() in TarFile.next() use read() to advance the file pointer. This is a negative impact on the performance in most cases.
3. Insist on an end of archive marker. This has the disadvantage that users may get an exception although everything is fine.
|
msg244287 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-05-28 08:20 |
If you are already seeking in the file, can’t you seek to the end to determine the length of the file, and then use that to verify if a data segment is truncated? And if you can’t seek, I guess you have to read all the bytes anyway.
I guess Ethan’s test was an instance of case #4 (EOF directly after data block).
|
msg244289 - (view) |
Author: Thomas Guettler (guettli) * |
Date: 2015-05-28 08:41 |
I uploaded a broken tar for testing:
tguettler@aptguettler:~/tmp
===> LANG=C tar -xf tar_which_is_cut.tar
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
tguettler@aptguettler:~/tmp
===> python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tarfile
>>> tarfile.TarFile('tar_which_is_cut.tar', errorlevel=2).extractall()
>>>
(Sad, no error)
|
msg244290 - (view) |
Author: Lars Gustäbel (lars.gustaebel) * |
Date: 2015-05-28 09:03 |
@Martin:
Yes, that's right, but only for cases where the TarFile.fileobj attribute is an actual file object. But, most of the time it is something special, e.g. GzipFile or sys.stdin, which makes random seeking either impossible or perform very badly.
But thanks for your objection, I have to withdraw the statement I made under option 2.: compressed archives are much more common than uncompressed ones. We probably wouldn't lose too much if we no longer use seek() but read() in TarFile.next(). Reading in an uncompressed file is fast anyway. I have to think about this.
|
msg244360 - (view) |
Author: Thomas Guettler (guettli) * |
Date: 2015-05-29 06:25 |
I thought about this again.
It could be solved with the help of a ByteCountingStreamReader.
With ByteCountingStreamReader I mean a wrapper around a stream like codescs.StreamReader. But the ByteCountingStreamReader should not changes the content, but just count the bytes it passed.
The ByteCountingStreamReader would be wrapped around ExFileObject.
This class could be handy in other situations.
What do you think?
|
msg244361 - (view) |
Author: Lars Gustäbel (lars.gustaebel) * |
Date: 2015-05-29 07:43 |
@Thomas:
I think your proposal adds a little too much complexity. Also, ExFileObject is not used during iteration, and we would like to detect broken archives without unpacking all the data segments first.
I have written patches for Python 2 and 3.
|
msg244433 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-05-30 01:00 |
For the record, the difference between Python 2 and 3 is probably a side effect of revision 050f0f7be11e. Python 2 copies data from the ExFileObject returned by extractfile(), while Python 3 copies directly from the underlying file.
The patches to the file reading class look good.
I would be a bit hesitant about the bit that reads all the file data in the next() method. I guess if someone had an uncompressed tar file with only a couple of large files, and they just wanted to list the file names or extract a small file at the end, reading all the data would have a significant impact. Perhaps there is a way to seek almost to the end, and then just read one byte or something.
Also, beware that according to the documentation, os.truncate() only supports Windows in 3.5+, so you might have to adjust the test if applying this to 3.4.
|
msg244455 - (view) |
Author: Thomas Guettler (guettli) * |
Date: 2015-05-30 11:00 |
With Python 3.4.0 you get an OSError if you try to extractall() the uploaded tar_which_is_cut.tar. That's nice.
Seems like only 2.7 seems to be buggy.
===> python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tarfile
>>> tarfile.TarFile('tar_which_is_cut.tar').extractall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/tarfile.py", line 1979, in extractall
self.extract(tarinfo, path, set_attrs=not tarinfo.isdir())
File "/usr/lib/python3.4/tarfile.py", line 2018, in extract
set_attrs=set_attrs)
File "/usr/lib/python3.4/tarfile.py", line 2087, in _extract_member
self.makefile(tarinfo, targetpath)
File "/usr/lib/python3.4/tarfile.py", line 2133, in makefile
copyfileobj(source, target, tarinfo.size)
File "/usr/lib/python3.4/tarfile.py", line 247, in copyfileobj
raise OSError("end of file reached")
OSError: end of file reached
|
msg244572 - (view) |
Author: Lars Gustäbel (lars.gustaebel) * |
Date: 2015-06-01 06:49 |
@Martin:
This is actually a nice idea that I hadn't thought of. I updated the Python 3 patch to use a seek() that moves to one byte before the next header block, reads the remaining byte and raises an error if it hits eof. The code looks rather clean compared to the previous patch, and it should perform like it always did.
I am not quite sure about which exception type to use, ReadError is used in tarfile's header parsing code, but OSError is already used in tarfile.copyfileobj() and might be more like what the user expects.
|
msg245584 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-06-21 08:55 |
From the current documentation and limited experience with the module, ReadError (or a subclass) sounds best. I would only expect OSError only for OS-level things, like file not found, disk error, etc.
The patches look good. One last suggestion is to use assertRaisesRegex, to be sure that you are getting exactly the exception you expect. Too many times I have found tests in the test suite that were completely broken, but happened to pass because the error happened to match the exception being tested.
|
msg245979 - (view) |
Author: Lars Gustäbel (lars.gustaebel) * |
Date: 2015-06-30 06:54 |
Martin, I followed your suggestion to raise ReadError. This needed an additional change in copyfileobj() because it is used both for adding file data to an archive and extracting file data from an archive.
But I think the patch is in good shape now.
|
msg246350 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2015-07-06 07:33 |
New changeset 372aa98eb72e by Lars Gustäbel in branch '2.7':
Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/372aa98eb72e
New changeset c7f4f61697b7 by Lars Gustäbel in branch '3.4':
Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/c7f4f61697b7
New changeset 59cbdc9eb3d9 by Lars Gustäbel in branch '3.5':
Merge with 3.4: Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/59cbdc9eb3d9
New changeset 6be8fa47c002 by Lars Gustäbel in branch 'default':
Merge with 3.5: Issue #24259: tarfile now raises a ReadError if an archive is truncated inside a data segment.
https://hg.python.org/cpython/rev/6be8fa47c002
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:17 | admin | set | github: 68447 |
2015-07-06 07:34:50 | lars.gustaebel | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
2015-07-06 07:33:13 | python-dev | set | nosy:
+ python-dev messages:
+ msg246350
|
2015-06-30 06:54:22 | lars.gustaebel | set | files:
+ issue24259-3.x-3.diff
messages:
+ msg245979 |
2015-06-21 08:55:37 | martin.panter | set | messages:
+ msg245584 |
2015-06-01 06:53:26 | lars.gustaebel | set | files:
+ issue24259-2.x-2.diff |
2015-06-01 06:49:39 | lars.gustaebel | set | files:
+ issue24259-3.x-2.diff
messages:
+ msg244572 |
2015-05-30 11:00:26 | guettli | set | messages:
+ msg244455 |
2015-05-30 01:00:38 | martin.panter | set | messages:
+ msg244433 |
2015-05-29 07:43:18 | lars.gustaebel | set | files:
+ issue24259-2.x.diff |
2015-05-29 07:43:09 | lars.gustaebel | set | files:
+ issue24259-3.x.diff
stage: patch review messages:
+ msg244361 versions:
+ Python 3.4, Python 3.5, Python 3.6 |
2015-05-29 06:25:38 | guettli | set | messages:
+ msg244360 |
2015-05-28 09:03:43 | lars.gustaebel | set | messages:
+ msg244290 |
2015-05-28 08:41:20 | guettli | set | files:
+ tar_which_is_cut.tar
messages:
+ msg244289 |
2015-05-28 08:20:06 | martin.panter | set | messages:
+ msg244287 |
2015-05-28 07:42:54 | lars.gustaebel | set | files:
+ 01-issue24259-test.diff assignee: lars.gustaebel messages:
+ msg244282
keywords:
+ patch |
2015-05-28 03:42:37 | ethan.furman | set | messages:
+ msg244254 |
2015-05-28 03:04:57 | martin.panter | set | type: enhancement -> behavior messages:
+ msg244248 versions:
+ Python 2.7 |
2015-05-28 01:28:19 | ethan.furman | set | messages:
+ msg244239 |
2015-05-28 01:21:58 | ethan.furman | set | messages:
+ msg244237 |
2015-05-28 01:15:28 | ethan.furman | set | messages:
+ msg244236 |
2015-05-28 01:08:19 | martin.panter | set | messages:
+ msg244235 |
2015-05-28 00:57:17 | martin.panter | set | nosy:
+ martin.panter messages:
+ msg244233
components:
+ Library (Lib) type: enhancement |
2015-05-27 09:12:24 | guettli | set | nosy:
+ guettli messages:
+ msg244152
|
2015-05-22 01:37:55 | ethan.furman | set | nosy:
+ ethan.furman
|
2015-05-21 21:26:32 | ned.deily | set | nosy:
+ lars.gustaebel
|
2015-05-21 15:18:16 | Thomas Güttler | create | |