classification
Title: tarfile missing entries due to omitted uid/gid fields
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder: tarfile fails to extract archive (handled fine by gnu tar and bsdtar)
View: 24514
Assigned To: lars.gustaebel Nosy List: efloehr, lars.gustaebel, martin.panter, tlynn, whitemice
Priority: normal Keywords: patch

Created on 2012-09-03 19:47 by tlynn, last changed 2015-12-08 21:40 by martin.panter. This issue is now closed.

Files
File name Uploaded Description Edit
bad.tar tlynn, 2012-09-05 19:13 Example tarfile with one dir and one unlisted file
patch.py tlynn, 2012-09-05 19:16
tarfile-patch.txt akuchling, 2013-11-11 19:10 review
test_tarfile.patch efloehr, 2014-02-24 03:10 New test to test_tarfile.py to test this issue and its solution review
Messages (11)
msg169799 - (view) Author: Tom Lynn (tlynn) Date: 2012-09-03 19:47
The tarfile module silently truncates the list of entries when reading a tar file if it sees an entry with a uid/gid field containing only spaces/NULs.  I got such a tarball from Java Maven/plexus-archiver.  I don't know whether they write such fields deliberately, but it seems reasonable to me, especially since they were providing the user/group names textually.

I'd like to see two fixes - a None/-1/0 value for the uid/gid and not silently swallowing HeaderErrors in TarFile.next() (or at least documenting why it's being done).  0 would be consistent with the default value when writing, but None seems more honest.  -1 seems hard to defend.

Only tested on silly Python versions (2.6, PyPy-1.8), sorry.  It's what I've got to hand, but I think this issue also applies to recent Python too going by looking at the hg trunk.
msg169822 - (view) Author: Tom Lynn (tlynn) Date: 2012-09-04 10:58
I think the default has to be 0 for consistency with how other empty numeric fields are handled.

In theory spaces and NULs are supposed to be equivalent terminators in numeric fields, but I've just noticed that plexus-archiver is also using leading spaces rather than leading zeros (against the spec), e.g. ' 10422\x00 '.  tarfile currently supports this, which I think is good, so I think the right approach is to lstrip(' ') fields and then treat space as a terminator.
msg169868 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2012-09-05 12:18
Could you provide some sample data and code? I see the problem, but I cannot quite reproduce the behaviour you describe. In all of my testcases tarfile either throws an exception or successfully reads the archive, but never silently stops.
msg169887 - (view) Author: Tom Lynn (tlynn) Date: 2012-09-05 19:13
See attached bad.tar.

$ less bad.tar | cat
drwxr-xr-x 0/0               0 2012-09-05 20:04 foo/
-rw-rw-r-- uname/gname       0 2012-09-05 20:04 foo/a
$ python -c 'import tarfile; print(tarfile.open("bad.tar").getnames())'
['foo']
$ python -c 'import tarfile, patch; patch.patch_tarfile(); print (tarfile.open("bad.tar").getnames())'
['foo', 'foo/a']

I'm only allowed to attach one file via the tracker web UI, so patch.py will follow.

Creation code for bad.tar, largely for my benefit:

import java.io.FileOutputStream;
import java.io.IOException;
import org.codehaus.plexus.archiver.tar.TarOutputStream;
import org.codehaus.plexus.archiver.tar.TarEntry;

class TarTest {
    public static void main(String[] args) throws IOException {
        FileOutputStream fos = new FileOutputStream("bad.tar");
        TarOutputStream tos = new TarOutputStream(fos);

        TarEntry entry = new TarEntry("foo/");
        entry.setMode(16877); // 0o40755
        entry.setUserId(0);
        entry.setGroupId(0);
        entry.setUserName("");
        entry.setGroupName("");
        tos.putNextEntry(entry);

        TarEntry entry2 = new TarEntry("foo/a");
        entry2.setMode(33204); // 0o100664
        entry2.setUserId(-1);  // XXX: dodgy
        entry2.setGroupId(-1); // XXX: dodgy
        entry2.setUserName("uname");
        entry2.setGroupName("gname");
        tos.putNextEntry(entry2);

        tos.close();
        fos.close();
    }
}
msg169888 - (view) Author: Tom Lynn (tlynn) Date: 2012-09-05 19:16
patch.py attached - what I'm using as a workaround at the moment.
msg202641 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2013-11-11 19:10
Here's the changes from patch.py, put into patch format.  I took out the inlining for ntb() and support for different tarfile APIs, and also replaced the use of .split(,1)[0] by .partition().
msg208225 - (view) Author: Tom Lynn (tlynn) Date: 2014-01-16 00:21
The secondary issue, which the patch doesn't address, is that TarFile.next() seems unpythonic; it treats any {Invalid,Empty,Truncated}HeaderError after offset 0 as EOF rather than propagating the exception.  It looks deliberate, but I'm not sure why it would be done like that or if it should be changed.
msg212057 - (view) Author: Eric Floehr (efloehr) Date: 2014-02-24 03:10
I have attached a patch file for test_tarfile.py that uses the attached 'bad.tar' to test this issue.

After applying this test patch, put 'bad.tar' in Lib/test with the name 'issue15858.tar'.

This tests Tarfile.next(), but the issue can be seen via the command line interface, which uses .next() under the covers:

Pre-patch:
$ ./python -m tarfile -l Lib/test/issue15858.tar 
foo/ 

Post-patch:
$ ./python -m tarfile -l Lib/test/issue15858.tar 
foo/ 
foo/a
msg224013 - (view) Author: Adam Tauno Williams (whitemice) Date: 2014-07-25 23:50
test fails for me with provided bad.tar [as described in comment] but test passed after applying patch to tarfile.
msg256115 - (view) Author: Tom Lynn (tlynn) Date: 2015-12-08 15:21
I think issue24514 (fixed in Py2.7.11) is a duplicate of this issue.
msg256128 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-08 21:40
Yes I think you are right. With Python 3.5.0:

$ python3 -c 'import tarfile; print(tarfile.open("bad.tar").getnames())'
['foo', 'foo/a']
$ python3 -m tarfile --list bad.tar
foo/ 
foo/a 

The proposed fix here is slightly different: truncate from the first space, rather than trailing whitespace.
History
Date User Action Args
2015-12-08 21:40:46martin.pantersetstatus: open -> closed

superseder: tarfile fails to extract archive (handled fine by gnu tar and bsdtar)

nosy: + martin.panter
messages: + msg256128
resolution: out of date
stage: test needed -> resolved
2015-12-08 15:21:12tlynnsetmessages: + msg256115
2014-12-31 16:21:35akuchlingsetnosy: - akuchling
2014-07-25 23:50:18whitemicesetnosy: + whitemice
messages: + msg224013
2014-02-24 03:10:48efloehrsetfiles: + test_tarfile.patch

nosy: + efloehr
messages: + msg212057

keywords: + patch
2014-01-16 00:21:47tlynnsetmessages: + msg208225
2014-01-15 16:56:49serhiy.storchakasetstage: test needed
versions: + Python 2.7, Python 3.3, Python 3.4, - Python 2.6, 3rd party
2013-11-11 19:10:36akuchlingsetfiles: + tarfile-patch.txt
nosy: + akuchling
messages: + msg202641

2012-09-05 19:16:29tlynnsetfiles: + patch.py

messages: + msg169888
2012-09-05 19:13:34tlynnsetfiles: + bad.tar

messages: + msg169887
2012-09-05 12:18:50lars.gustaebelsetassignee: lars.gustaebel
messages: + msg169868
2012-09-04 10:58:44tlynnsetmessages: + msg169822
2012-09-03 23:25:05ned.deilysetnosy: + lars.gustaebel
2012-09-03 19:47:40tlynncreate