classification
Title: tarfile: incorrectly treats regular file as directory
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Joe Tsai, lars.gustaebel, nitishch, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-09-22 21:42 by Joe Tsai, last changed 2017-10-04 17:21 by Joe Tsai.

Files
File name Uploaded Description
test.tar Joe Tsai, 2017-10-03 22:07
Messages (5)
msg302778 - Author: Joe Tsai (Joe Tsai) Date: 2017-09-22 21:42
The original V7 header allocates only 100 bytes to store the file path. If a path exceeds this length, then either the PAX or the GNU format must be used, both of which can represent arbitrarily long file paths. When doing so, most tar writers just store the first 100 bytes of the file path in the V7 header.

When reading, a proper reader should disregard the contents of the V7 field if a previous and corresponding PAX or GNU header overrode it.
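
To illustrate the truncation described above, here is a minimal sketch that uses tarfile's own writer and inspects the raw header bytes. The helper name v7_name_field is made up for this example, and the sketch assumes that TarInfo.tobuf() emits any extended header blocks first and the member's own 512-byte header last.

```python
import tarfile

# Path from the reproduction in this report: 257 characters long.
long_path = "123456789/" * 25 + "foo.txt"

info = tarfile.TarInfo(long_path)  # regular file, size 0


def v7_name_field(member, fmt):
    """Hypothetical helper: raw 100-byte name field of the member's own header."""
    raw = member.tobuf(format=fmt)
    # Extended (PAX or GNU long-name) blocks come first; the member's own
    # 512-byte header block is the last one in the buffer.
    own_header = raw[-512:]
    return own_header[:100]


pax_field = v7_name_field(info, tarfile.PAX_FORMAT)
gnu_field = v7_name_field(info, tarfile.GNU_FORMAT)

# Both writers keep only the first 100 bytes of the path in the V7 field,
# and for this particular path the cut happens to land right after a slash.
print(pax_field)
print(pax_field.endswith(b"/"), gnu_field.endswith(b"/"))
```

The full path is still present in the archive, but only inside the preceding PAX or GNU record, which is why a reader must prefer that record over the V7 field.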

This is currently not the case with the tarfile module, which has the following check (https://github.com/python/cpython/blob/c7cc14a825ec156c76329f65bed0d0bd6e03d035/Lib/tarfile.py#L1054-L1057):
    # Old V7 tar format represents a directory as a regular
    # file with a trailing slash.
    if obj.type == AREGTYPE and obj.name.endswith("/"):
        obj.type = DIRTYPE

This check should be further constrained to only activate when no prior PAX or GNU record overrides the value of obj.name. This check was the source of a bug that caused tarfile to report a regular file as a directory: the file path was extra long, and when the tar writer truncated the path to the first 100 bytes, it so happened to end on a slash.
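
The proposed constraint could be sketched roughly as follows. This is not a patch against tarfile.py: resolve_type and its name_overridden parameter are hypothetical names for this illustration, and how the reader would track whether a PAX or GNU record supplied the name is left open.

```python
import tarfile


def resolve_type(typeflag, v7_name, name_overridden):
    """Hypothetical helper: decide the member type from the V7 header fields.

    name_overridden is assumed to be True when a preceding PAX or GNU
    long-name record replaced the (possibly truncated) V7 name.
    """
    # Old V7 tar format represents a directory as a regular file with a
    # trailing slash, but that only means something if the V7 name is the
    # real name and not a 100-byte truncation.
    if (typeflag == tarfile.AREGTYPE
            and v7_name.endswith("/")
            and not name_overridden):
        return tarfile.DIRTYPE
    return typeflag


# A genuine V7 directory entry is still recognized as a directory.
print(resolve_type(tarfile.AREGTYPE, "some/dir/", False))
# A truncated name that merely ends in a slash keeps its regular-file type.
print(resolve_type(tarfile.AREGTYPE, "123456789/" * 10, True))
```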
msg303431 - Author: Nitish (nitishch) * Date: 2017-09-30 21:30
> This check was the source of a bug that caused tarfile to report a regular file as a directory: the file path was extra long, and when the tar writer truncated the path to the first 100 bytes, it so happened to end on a slash.

AFAIK, the '/' character is not allowed as part of a filename on Linux systems. Is this bug platform specific? Can you give the test case you are referring to?
msg303655 - Author: Joe Tsai (Joe Tsai) Date: 2017-10-03 22:07
This bug is not platform specific.

I've attached a reproduction:
$ python
>>> import tarfile
>>> tarfile.open("test.tar", "r").next().isdir()
True

$ tar -tvf test.tar
-rw-rw-r-- 0/0               0 1969-12-31 16:00 123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/123456789/foo.txt
$ tar --version
tar (GNU tar) 1.27.1

For some background, this bug was originally filed against the Go standard library (I am the maintainer of Go's tar implementation). When I investigated the issue, I discovered that Go was doing the right thing, and that the discrepancy was due to the check I pointed to earlier. The GNU tool indicates that this is a regular file as well.
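
For readers without the attachment, a comparable archive can probably be built in memory with tarfile itself. This is a sketch, not the original test.tar: it assumes the entry used the old V7 NUL typeflag (tarfile.AREGTYPE), since that is the only type the check quoted earlier applies to. Whether isdir() then reports True depends on the Python version in use, so no expected value is given for it.

```python
import io
import tarfile

long_path = "123456789/" * 25 + "foo.txt"

info = tarfile.TarInfo(long_path)
# Assumption: the entry carries the old V7 NUL typeflag for regular files.
info.type = tarfile.AREGTYPE

# Write a PAX archive holding the single zero-length entry.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
    tf.addfile(info)

# Read it back the same way as in the reproduction.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tf:
    member = tf.next()

# The full name is recovered from the PAX record either way.
print(member.name == long_path)
# On an affected version this is where the misclassification shows up.
print(member.isdir())
```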
msg303676 - Author: Nitish (nitishch) * Date: 2017-10-04 06:40
Try 'tar xvf test.tar'. On a Linux machine at least, it in fact produces a tree of directories, not a single file. So, in a way, what Python is reporting is correct.
msg303715 - Author: Joe Tsai (Joe Tsai) Date: 2017-10-04 17:21
It creates a number of nested directories only because GNU (and BSD) tar implicitly create missing parent directories. If you cd into the bottom-most directory, you will see "foo.txt".
History
Date  User  Action  Args
2017-10-04 17:21:19  Joe Tsai  set  messages: + msg303715
2017-10-04 06:40:06  nitishch  set  messages: + msg303676
2017-10-03 22:07:23  Joe Tsai  set  files: + test.tar; messages: + msg303655
2017-09-30 21:30:44  nitishch  set  nosy: + nitishch; messages: + msg303431
2017-09-30 07:00:40  serhiy.storchaka  set  versions: + Python 2.7, Python 3.6, Python 3.7; nosy: + serhiy.storchaka; components: + Library (Lib); type: behavior; stage: needs patch
2017-09-29 22:26:36  terry.reedy  set  nosy: + lars.gustaebel
2017-09-22 21:42:01  Joe Tsai  create