classification
Title: tarfile fails to extract archive (handled fine by gnu tar and bsdtar)
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: lars.gustaebel
Nosy List: lars.gustaebel, pombreda, python-dev, taleinat
Priority: low
Keywords: patch

Created on 2015-06-26 09:18 by pombreda, last changed 2015-07-02 17:45 by lars.gustaebel. This issue is now closed.

Files
File name Uploaded Description
commons-logging-1.1.2-src.tar.gz pombreda, 2015-06-26 09:18 Problematic archive from http://archive.apache.org/dist/commons/logging/source/commons-logging-1.1.2-src.tar.gz
issue24514.diff lars.gustaebel, 2015-06-26 10:00 Patch for 3.4
issue24514.diff lars.gustaebel, 2015-06-29 13:32 New version of the patch for 3.4
Messages (10)
msg245839 - Author: Philippe (pombreda) Date: 2015-06-26 09:18
The extraction fails when calling tarfile.open using this archive: http://archive.apache.org/dist/commons/logging/source/commons-logging-1.1.2-src.tar.gz

After some investigation: the file extracts fine with GNU tar and bsdtar, and the gzip compression is not the issue. If I gunzip the tar.gz to a plain tar and call tarfile on that, the problem is the same.

Also this archive was created most likely on Windows (based on the `file` command output) using some Java tools per http://commons.apache.org/proper/commons-logging/building.html from these original files: http://svn.apache.org/repos/asf/commons/proper/logging/tags/LOGGING_1_1_2/ ... that's all I could find out.


The error trace is slightly different on 2.7 and 3.4 but similar. 
The problem has been verified on Linux 64 with Python 2.7 and 3.4 and on Windows with Python 2.7.

On 2.7:

>>> TarFile.taropen(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/tarfile.py", line 1705, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1574, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2335, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header


On 3.4:

>>> TarFile.taropen(name)
Traceback (most recent call last):
  File "/usr/lib/python3.4/tarfile.py", line 180, in nti
    n = int(nts(s, "ascii", "strict") or "0", 8)
ValueError: invalid literal for int() with base 8: '       '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/tarfile.py", line 2248, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib/python3.4/tarfile.py", line 1083, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/usr/lib/python3.4/tarfile.py", line 1032, in frombuf
    obj.uid = nti(buf[108:116])
  File "/usr/lib/python3.4/tarfile.py", line 182, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/tarfile.py", line 1595, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python3.4/tarfile.py", line 1469, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python3.4/tarfile.py", line 2260, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header
msg245840 - Author: Philippe (pombreda) Date: 2015-06-26 09:21
Note: the tracebacks above are from calling taropen on the gunzipped tar.gz.
The errors are similar but a tad less informative when using the tgz and open.
msg245844 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-26 10:00
The problem is that the tar archive has empty uid and gid fields, i.e. 7 spaces terminated with a null-byte.

I attached a patch that solves the problem.
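
To illustrate the diagnosis above: a tar number field is an octal string terminated by a NUL or a space, so a field of 7 spaces plus a NUL leaves nothing for int() to parse. The following is a minimal sketch of a tolerant parser, not the attached patch itself; parse_octal_field is a hypothetical helper name, and treating a blank field as 0 is an assumption based on the changeset summary ("tolerates number fields consisting of only whitespace").

```python
def parse_octal_field(field: bytes) -> int:
    """Parse an octal tar header number field; a blank field counts as 0.

    Hypothetical helper for illustration, not the tarfile implementation.
    """
    # Keep only the part before the first NUL byte, if there is one.
    text = field.split(b"\0", 1)[0].decode("ascii", "strict")
    # Strip whitespace so a field of only spaces becomes "" and falls back to "0".
    return int(text.strip() or "0", 8)


# Without the strip, the blank field fails the same way as in the 3.4 traceback.
try:
    int(" " * 7, 8)
except ValueError as exc:
    print("unstripped:", exc)

print(parse_octal_field(b" " * 7 + b"\0"))  # blank uid/gid field
print(parse_octal_field(b"0000644\0"))      # ordinary mode field
```

The blank field parses as 0 here, and the ordinary field parses as 0o644.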
msg245845 - Author: Philippe (pombreda) Date: 2015-06-26 10:03
lars: you are my hero! you rock. I picture you being able to read through tar binary headers while you sleep. I am in awe.
msg245846 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-26 10:10
You're welcome :-D
msg245847 - Author: Philippe (pombreda) Date: 2015-06-26 10:17
I verified that the patch issue24514.diff (adding .rstrip()) also works on Python 2.7, and that it works on Python 3.4.

I ran it on 2.7 against a fairly large test suite of tar files without problems.

This is a +1 for me.

Lars: Do you think you could apply it to 2.7 too?
msg245848 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-26 10:35
Yes, Python 2.7 still gets bugfixes.

However, there's still some work to do on the patch (maybe clean the code, write a test, add a NEWS entry).
msg245934 - Author: Tal Einat (taleinat) * (Python committer) Date: 2015-06-29 12:56
The patch is very simple, but this needs tests. At the very least, a simple tar file which reproduces this issue could be added to the tests.

Taking this a step further would be writing some unit tests for the internal nti() and itn() functions, and perhaps also stn() and nts().
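
A reproducing archive does not have to be a checked-in file; one possible approach is to build it in memory. The sketch below is an assumption-laden illustration, not the test that was committed: make_blank_id_archive is a hypothetical helper, the member name and payload are made up, and the offsets 108:116 (uid) and 116:124 (gid) follow the frombuf line in the 3.4 traceback and the standard ustar layout. It overwrites both fields with 7 spaces plus a NUL and recomputes the header checksum so that only the blank fields are unusual.

```python
import io
import tarfile


def make_blank_id_archive() -> bytes:
    """Build a one-member tar archive whose uid and gid fields are blank."""
    payload = b"hello\n"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)

    # Start from a valid 512-byte ustar header, then blank the two fields.
    header = bytearray(info.tobuf(format=tarfile.USTAR_FORMAT))
    blank = b" " * 7 + b"\0"
    header[108:116] = blank  # uid
    header[116:124] = blank  # gid

    # The checksum is computed with its own field counted as 8 spaces.
    header[148:156] = b" " * 8
    header[148:156] = b"%06o\0 " % sum(header)

    # Pad the data to a full block and end with two zero blocks.
    body = payload + b"\0" * (512 - len(payload))
    return bytes(header) + body + b"\0" * 1024


# On an interpreter that includes the fix, the member reads back with uid 0.
with tarfile.open(fileobj=io.BytesIO(make_blank_id_archive()), mode="r:") as tf:
    member = tf.getmembers()[0]
    print(member.name, member.uid, member.gid)
```

On an unpatched interpreter the tarfile.open call is where the ReadError from the original report would be expected.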
msg245936 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-29 13:32
I think a simple addition to the existing unittest for nti() will be enough. itn() seems well-tested, and nts() and stn() are not affected, because they don't operate on numbers.
msg246090 - Author: Roundup Robot (python-dev) (Python triager) Date: 2015-07-02 17:44
New changeset 301d7efac3de by Lars Gustäbel in branch '2.7':
Issue #24514: tarfile now tolerates number fields consisting of only whitespace.
https://hg.python.org/cpython/rev/301d7efac3de

New changeset 140b4b7b84bd by Lars Gustäbel in branch '3.4':
Issue #24514: tarfile now tolerates number fields consisting of only whitespace.
https://hg.python.org/cpython/rev/140b4b7b84bd

New changeset 1692065524cc by Lars Gustäbel in branch '3.5':
Merge with 3.4: Issue #24514: tarfile now tolerates number fields consisting of only whitespace.
https://hg.python.org/cpython/rev/1692065524cc

New changeset 08fad9037206 by Lars Gustäbel in branch 'default':
Merge with 3.5: Issue #24514: tarfile now tolerates number fields consisting of only whitespace.
https://hg.python.org/cpython/rev/08fad9037206
History
Date User Action Args
2015-12-08 21:40:46 martin.panter link issue15858 superseder
2015-07-02 17:45:30 lars.gustaebel set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2015-07-02 17:44:38 python-dev set nosy: + python-dev; messages: + msg246090
2015-06-29 13:32:23 lars.gustaebel set files: + issue24514.diff; messages: + msg245936
2015-06-29 12:56:45 taleinat set nosy: + taleinat; messages: + msg245934
2015-06-26 10:35:47 lars.gustaebel set messages: + msg245848
2015-06-26 10:17:17 pombreda set messages: + msg245847
2015-06-26 10:10:34 lars.gustaebel set priority: normal -> low; versions: + Python 3.5, Python 3.6; messages: + msg245846; assignee: lars.gustaebel; type: behavior; stage: patch review
2015-06-26 10:03:13 pombreda set messages: + msg245845
2015-06-26 10:00:10 lars.gustaebel set files: + issue24514.diff; keywords: + patch; messages: + msg245844
2015-06-26 09:21:27 pombreda set messages: + msg245840
2015-06-26 09:18:57 pombreda create