classification
Title: tarfile module next() method hides exceptions
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: JieGhost, lars.gustaebel, r.david.murray, rhettinger, socketpair
Priority: normal Keywords:

Created on 2016-07-22 15:11 by JieGhost, last changed 2016-07-28 06:42 by lars.gustaebel.

Messages (9)
msg270990 - Author: Yujie Chen (JieGhost) Date: 2016-07-22 15:11
I have seen a similar ticket, but it was opened 2 years ago and contains nothing more than a brief description, so I opened this new one hoping to get some answers.

A tarfile.TarFile object is iterable and has a next() method. next() parses the next header and saves the parsed information. During parsing, many checks are performed to make sure the header is valid, and an exception is raised if something is wrong with it. next() catches many of these exceptions but does not re-raise them in all cases.

I have a tgz file in which one of the headers is corrupted: its checksum field is wrong. During parsing, InvalidHeaderError is raised; next() catches it and silently hides it. From the source code (https://hg.python.org/cpython/file/2.7/Lib/tarfile.py#l2335), we can see that InvalidHeaderError is ONLY raised if it occurs at the beginning of the tar file. In fact, the tarfile module hides many exceptions and simply treats them as marking the end of the tarball.
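For anyone who wants to reproduce this without the original file, here is a minimal sketch. It assumes Python 3 and an uncompressed archive built in memory; the helper name build_archive, the member names, and the bogus checksum value are made up for illustration and are not taken from the reporter's file.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def build_archive(names):
    """Return the bytes of an uncompressed ustar archive with one small file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name in names:
            payload = ("contents of " + name).encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return bytearray(buf.getvalue())


raw = build_archive(["a.txt", "b.txt", "c.txt"])

# Each member here occupies one header block plus one data block, so the
# header of b.txt starts at offset 2 * BLOCK. The checksum field is bytes
# 148..155 of a header; overwrite it with a value that cannot match.
header_offset = 2 * BLOCK
raw[header_offset + 148:header_offset + 156] = b"7777777\0"

# No exception is raised here, even though the second header is invalid.
with tarfile.open(fileobj=io.BytesIO(bytes(raw)), mode="r:") as tf:
    names = tf.getnames()

print(names)
```

With this input the listing stops after a.txt and neither b.txt nor c.txt is reported, which matches the behaviour described above.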

Why does the tarfile module hide so many exceptions? In other words, why does tarfile treat these exceptions as the end marker of the tarball rather than as errors?

Is it because of this from GNU doc:
"At the end of the archive file there are two 512-byte blocks filled with binary zeros as an end-of-file marker. A reasonable system should write such end-of-file marker at the end of an archive, but must not assume that such a block exists when reading an archive."?
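To illustrate the quoted passage, here is a small sketch, assuming Python 3 and an in-memory archive with a single made-up member a.txt. It checks that tarfile writes zero-filled blocks after the last member, and that an archive cut off right after the last data block is still read without complaint.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    info = tarfile.TarInfo("a.txt")
    info.size = 5
    tf.addfile(info, io.BytesIO(b"hello"))
raw = buf.getvalue()

# One header block plus one data block for the single member.
member_end = 2 * BLOCK

# Everything after the member should be the zero-filled end-of-archive marker
# (at least two blocks, possibly followed by more zero padding).
has_zero_trailer = (
    len(raw) >= member_end + 2 * BLOCK and not any(raw[member_end:])
)

# Drop the marker entirely; the reader must not assume it exists.
with tarfile.open(fileobj=io.BytesIO(raw[:member_end]), mode="r:") as tf:
    names_without_marker = tf.getnames()

print(has_zero_trailer, names_without_marker)
```

So a reader that tolerates a missing marker cannot rely on the marker alone to tell "end of archive" apart from "something went wrong here".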

Thanks!
msg270998 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2016-07-22 16:25
That would be my guess.  If we are reading along and we hit garbage data, we assume we've reached the end of the tar.  That doesn't mean there isn't room for improvement, perhaps by issuing a warning message that explains why we think we hit the end of the tar.

What is the issue number of the other issue?  If it is still open we should consolidate the issues if appropriate.
msg271005 - Author: Yujie Chen (JieGhost) Date: 2016-07-22 17:58
The other issue is http://bugs.python.org/issue16858
msg271011 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2016-07-22 19:41
OK, I've closed #16858 in favor of this one, since we at least had some discussion here.

I see you selected 2.7.  Does Python 3 have the same issue? (I'm guessing it does, though there has been some work done on the module.)
msg271031 - Author: Yujie Chen (JieGhost) Date: 2016-07-22 20:51
Yeah, I just tried on Python 3.5 and it didn't report any errors either.
msg271033 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2016-07-22 21:19
Lars Gustäbel did most of the work on this and it would be nice to get his thoughts.  The exception swallowing is explicit here rather than accidental. See http://bugs.python.org/issue6123
msg271235 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2016-07-25 05:25
The question is what you're trying to accomplish. If you just want to prevent tarfile from stopping at the first invalid header in order to extract everything following it, you may use the ignore_zeros=True keyword argument.
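A hedged sketch of what ignore_zeros=True changes, assuming Python 3 and a three-member in-memory archive whose second header has a deliberately broken checksum. The helper names build_corrupt_archive and list_names, and the member names, are hypothetical.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def build_corrupt_archive():
    """Build a.txt, b.txt, c.txt in memory, then break the checksum of b.txt's header."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name in ("a.txt", "b.txt", "c.txt"):
            payload = name.encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    raw = bytearray(buf.getvalue())
    start = 2 * BLOCK  # header of b.txt: after a.txt's header and data block
    raw[start + 148:start + 156] = b"7777777\0"  # checksum field, wrong value
    return bytes(raw)


def list_names(data, **kwargs):
    """List member names, passing extra keyword arguments to tarfile.open()."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:", **kwargs) as tf:
        return tf.getnames()


data = build_corrupt_archive()
default_names = list_names(data)                      # stops at the bad header
lenient_names = list_names(data, ignore_zeros=True)   # skips invalid blocks

print(default_names, lenient_names)
```

In the default mode only a.txt is listed. With ignore_zeros=True the reader skips the invalid blocks and picks up c.txt as well, but b.txt is still lost and no error is reported in either case.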
msg271261 - Author: Yujie Chen (JieGhost) Date: 2016-07-25 12:54
I do want the tarfile module to stop at the first invalid header. My question is why the tarfile module does NOT raise an exception about the error in the header and instead just hides it silently.
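Until the module reports this itself, one possible workaround is to check what follows the last member that tarfile did return. The following is only a sketch under stated assumptions: Python 3, an uncompressed archive held in memory, and members that are regular files. The function name stopped_on_garbage and the helper build_archive are hypothetical and not part of tarfile. A corrupt header at offset 0 is not handled here, because tarfile.open() already raises ReadError in that case.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def build_archive(corrupt_second_header):
    """Build a.txt, b.txt, c.txt in memory; optionally break b.txt's checksum."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name in ("a.txt", "b.txt", "c.txt"):
            payload = name.encode("ascii")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    raw = bytearray(buf.getvalue())
    if corrupt_second_header:
        start = 2 * BLOCK  # header of b.txt
        raw[start + 148:start + 156] = b"7777777\0"
    return bytes(raw)


def stopped_on_garbage(data):
    """Return the offset of a non-zero block right after the last parsed
    member, or None if the archive ends there or a zero block follows."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
        members = tf.getmembers()
    if not members:
        return None
    last = members[-1]
    # Only regular files are assumed to carry data blocks in this sketch.
    data_len = last.size if last.isreg() else 0
    end = last.offset_data + -(-data_len // BLOCK) * BLOCK  # round up to a block
    block = data[end:end + BLOCK]
    if block and any(block):
        return end
    return None


intact = build_archive(corrupt_second_header=False)
corrupt = build_archive(corrupt_second_header=True)
print(stopped_on_garbage(intact), stopped_on_garbage(corrupt))
```

A caller could raise its own exception when the function returns an offset. This only detects that parsing stopped on non-zero data; it does not say which check failed.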
msg271505 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2016-07-28 06:42
After all these years, it is not that easy to say why the decision to swallow this exception was made. One part surely was a lack of experience with the tar format itself and all of its implementations. The other part, I guess, was that it was supposed to avoid problems in case users did not use TarFile as an iterator. tarfile was developed on Python 2.2, which was the first release to feature iterators. The problem with doing random access on a tarfile or calling TarFile.getmembers() is that all the headers must be collected first. If this fails somewhere in the middle, there is no way to resume the current operation and you get nothing out of the archive.
History
Date User Action Args
2016-07-28 06:42:54 lars.gustaebel set messages: + msg271505
2016-07-25 12:54:17 JieGhost set messages: + msg271261
2016-07-25 05:25:38 lars.gustaebel set messages: + msg271235
2016-07-22 21:19:09 rhettinger set nosy: + rhettinger, lars.gustaebel; messages: + msg271033
2016-07-22 20:51:25 JieGhost set messages: + msg271031
2016-07-22 19:41:07 r.david.murray set nosy: + socketpair; messages: + msg271011
2016-07-22 19:38:45 r.david.murray link issue16858 superseder
2016-07-22 17:58:50 JieGhost set messages: + msg271005
2016-07-22 16:25:29 r.david.murray set nosy: + r.david.murray; messages: + msg270998
2016-07-22 15:11:06 JieGhost create