Title: tarfile fails to extract archive (handled fine by gnu tar and bsdtar)
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: lars.gustaebel, pombredanne, python-dev, taleinat
Priority: low Keywords: patch

Created on 2015-06-26 09:18 by pombredanne, last changed 2022-04-11 14:58 by admin. This issue is now closed.

File name Uploaded Description
commons-logging-1.1.2-src.tar.gz pombredanne, 2015-06-26 09:18 Problematic archive from
issue24514.diff lars.gustaebel, 2015-06-26 10:00 Patch for 3.4
issue24514.diff lars.gustaebel, 2015-06-29 13:32 New version of the patch for 3.4
Messages (10)
msg245839 - (view) Author: Philippe Ombredanne (pombredanne) * Date: 2015-06-26 09:18
The extraction fails when calling using this archive:

After some investigation, the file can be extracted with gnu tar and bsdtar and the gzip compression is not the issue: if I gunzip the tar.gz to a tar and call tarfile on plain tar, the problem is the same.

Also this archive was created most likely on Windows (based on the `file` command output) using some Java tools per from these original files: ... that's all I could find out.

The error trace is slightly different on 2.7 and 3.4 but similar. 
The problem has been verified on Linux 64 with Python 2.7 and 3.4 and on Windows with Python 2.7.

On 2.7:

>>> TarFile.taropen(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/", line 1705, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python2.7/", line 1574, in __init__
    self.firstmember =
  File "/usr/lib/python2.7/", line 2335, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

On 3.4:

>>> TarFile.taropen(name)
Traceback (most recent call last):
  File "/usr/lib/python3.4/", line 180, in nti
    n = int(nts(s, "ascii", "strict") or "0", 8)
ValueError: invalid literal for int() with base 8: '       '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/", line 2248, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib/python3.4/", line 1083, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/usr/lib/python3.4/", line 1032, in frombuf
    obj.uid = nti(buf[108:116])
  File "/usr/lib/python3.4/", line 182, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/", line 1595, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python3.4/", line 1469, in __init__
    self.firstmember =
  File "/usr/lib/python3.4/", line 2260, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header
msg245840 - (view) Author: Philippe Ombredanne (pombredanne) * Date: 2015-06-26 09:21
Note: the tracebacks above are from calling taropen on the gunzipped tar.gz.
The errors are similar but a tad less informative when using the tgz and open.
msg245844 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-26 10:00
The problem is that the tar archive has empty uid and gid fields, i.e. 7 spaces terminated with a null-byte.

I attached a patch that solves the problem.
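
For illustration only, here is a minimal sketch of why such a field breaks a strict octal parse and how stripping whitespace avoids it. The helper name parse_octal_field is hypothetical and is not the code from the attached patch; it only mirrors the idea of treating a blank number field as 0.

```python
def parse_octal_field(field: bytes) -> int:
    """Parse a NUL-terminated octal header field (hypothetical helper).

    A field made only of whitespace is treated as 0 instead of an error.
    """
    # Keep only the bytes before the first NUL terminator.
    text = field.split(b"\0", 1)[0].decode("ascii")
    # Strip surrounding whitespace; an all-space field becomes "" and maps to "0".
    return int(text.strip() or "0", 8)


blank_uid = b" " * 7 + b"\0"   # 7 spaces terminated with a null byte

# A strict parse of the same field raises ValueError, as in the 3.4 traceback.
try:
    int(blank_uid[:-1].decode("ascii") or "0", 8)
except ValueError as exc:
    print("strict parse fails:", exc)

print(parse_octal_field(blank_uid))       # 0
print(parse_octal_field(b"0000764\0"))    # 500
```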
msg245845 - (view) Author: Philippe Ombredanne (pombredanne) * Date: 2015-06-26 10:03
lars: you are my hero! you rock. I picture you being able to read through tar binary headers while you sleep. I am in awe.
msg245846 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-26 10:10
You're welcome :-D
msg245847 - (view) Author: Philippe Ombredanne (pombredanne) * Date: 2015-06-26 10:17
I verified that the patch issue24514.diff (adding .rstrip()) also works on Python 2.7. I verified it also works on Python 3.4.

I ran it on 2.7 against a fairly large test suite of tar files without problems.

This is a +1 for me.

Lars: Do you think you could apply it to 2.7 too?
msg245848 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-26 10:35
Yes, Python 2.7 still gets bugfixes.

However, there's still some work to do on the patch (maybe clean the code, write a test, add a NEWS entry).
msg245934 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2015-06-29 12:56
The patch is very simple, but this needs tests. At the very least, a simple tar file which reproduces this issue could be added to the tests.
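
As a rough sketch of what such a reproducer could look like, the snippet below builds a tiny in-memory archive whose uid and gid fields are 7 spaces plus a null byte. The function name, the member name hello.txt and the payload are made up for this example; the field offsets (uid at 108:116, gid at 116:124, checksum at 148:156) follow the ustar header layout. On an interpreter that includes the fix, reading it should succeed; on an unpatched one it is expected to raise ReadError.

```python
import io
import tarfile


def blank_uid_gid_archive() -> bytes:
    """Build a one-member tar archive with whitespace-only uid/gid fields."""
    payload = b"hi\n"
    info = tarfile.TarInfo("hello.txt")   # hypothetical member name
    info.size = len(payload)

    # Start from a valid 512-byte ustar header, then blank out uid and gid.
    header = bytearray(info.tobuf(format=tarfile.USTAR_FORMAT))
    header[108:116] = b" " * 7 + b"\0"    # uid
    header[116:124] = b" " * 7 + b"\0"    # gid

    # Recompute the checksum: sum of all header bytes with the checksum
    # field itself counted as 8 spaces.
    header[148:156] = b" " * 8
    header[148:156] = b"%06o\0 " % sum(header)

    # Member data is padded to a 512-byte block; two zero blocks end the archive.
    return bytes(header) + payload.ljust(512, b"\0") + b"\0" * 1024


archive = blank_uid_gid_archive()
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    members = tar.getmembers()
    content = tar.extractfile(members[0]).read()

print(members[0].name, members[0].uid, members[0].gid, content)
```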

Taking this a step further would be writing some unit tests for the internal nti() and itn() functions, and perhaps also stn() and nts().
msg245936 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-29 13:32
I think a simple addition to the existing unittest for nti() will be enough. itn() seems well-tested, and nts() and stn() are not affected, because they don't operate on numbers.
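
A sketch of what such an addition might look like is shown below. It assumes that tarfile.nti() is reachable as a module-level function (it is internal and undocumented) and that the interpreter already contains the fix; the test class and method names are invented for this example and are not taken from the actual test suite.

```python
import tarfile
import unittest


class BlankNumberFieldTest(unittest.TestCase):
    """Hypothetical checks for nti() on whitespace-only number fields."""

    def test_whitespace_only_field(self):
        # 7 spaces terminated with a null byte should parse as 0.
        self.assertEqual(tarfile.nti(b"       \0"), 0)

    def test_regular_octal_field(self):
        # A normal octal field must keep working: 0o764 == 500.
        self.assertEqual(tarfile.nti(b"0000764\0"), 500)


# Run the two checks without calling sys.exit(), so this can be pasted anywhere.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BlankNumberFieldTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```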
msg246090 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-07-02 17:44
New changeset 301d7efac3de by Lars Gustäbel in branch '2.7':
Issue #24514: tarfile now tolerates number fields consisting of only whitespace.

New changeset 140b4b7b84bd by Lars Gustäbel in branch '3.4':
Issue #24514: tarfile now tolerates number fields consisting of only whitespace.

New changeset 1692065524cc by Lars Gustäbel in branch '3.5':
Merge with 3.4: Issue #24514: tarfile now tolerates number fields consisting of only whitespace.

New changeset 08fad9037206 by Lars Gustäbel in branch 'default':
Merge with 3.5: Issue #24514: tarfile now tolerates number fields consisting of only whitespace.
