classification
Title: tarfile.extractall(readaccess=True)
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.0, Python 3.1, Python 2.7, Python 2.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: lars.gustaebel, r.david.murray, srid
Priority: low Keywords:

Created on 2009-06-04 21:45 by srid, last changed 2009-06-06 07:09 by lars.gustaebel. This issue is now closed.

Messages (10)
msg88908 - (view) Author: Sridhar Ratnakumar (srid) Date: 2009-06-04 21:45
If a tarball has a-x perms set on its root directory, one cannot access
its contents. 

$ tar zxf generator_tools-0.3.5.tar.gz.
$ ls generator_tools-0.3.5/
ls: cannot access generator_tools-0.3.5/README.txt: Permission denied
...
sridharr@double:/tmp/i$ 

This is fine for GNU tar (the user can always do a chmod +x later). But
for the tarfile library, it would be better to have a flag such as
readaccess=True that will force ``extractall`` to enforce *minimum*
permissions required for the basic read access. This means, tarfile
would ignore u-x on directories and u-r on files.

The reason I make this feature request (instead of working around the
issue myself in a verbose way) is that the very reason to write a
program to extract tarball (instead of doing it manually) is to automate
it .. which automation is more effective and simple if ``extractall``
had a flag such as readaccess=True.
msg88909 - (view) Author: Sridhar Ratnakumar (srid) Date: 2009-06-04 21:46
Here's a test data from PyPI:
http://pypi.python.org/packages/source/g/generator_tools/generator_tools-0.3.5.tar.gz
msg88911 - (view) Author: Sridhar Ratnakumar (srid) Date: 2009-06-04 21:52
Considering this bug where tarfile fails to set g+s,

  https://bugs.launchpad.net/pyopenssl/+bug/236190

a more general approach could be:

  tarfile.extractall(safe_perms=True)

where if safe_perms is set, tarfile can 1) ignore +/-s on all files, 2)
ignore u-x on directories and 3) ignore u-r on files.
msg88912 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-06-04 22:12
I don't see why the tarfile case should be different from the tar case.
 You can "always chmod it later" in python, too (with os.walk and
os.chmod).  Perhaps the real need is for a recursive chmod in shutil?
msg88915 - (view) Author: Sridhar Ratnakumar (srid) Date: 2009-06-04 22:28
[David] I don't see why the tarfile case should be different from the
tar case. (...)

As I explained, Viz:

[quote]'(...)the very reason to write a program to extract tarball
(instead of doing it manually) is to automate it .. which automation is
*more effective and simple* if ``extractall`` had a flag such as
readaccess=True'[endquote] (emphasis added)

[David] You can "always chmod it later" in python, too (with os.walk and
os.chmod). (...)

Of course, I can. Or:

        EXECUTE = 0100
        READ = 0400
        dir_perm = EXECUTE
        file_perm = EXECUTE | READ
        for tarinfo in f.getmembers():
            tarinfo.mode |= (dir_perm if tarinfo.isdir() else file_perm)

As you can see, for a tarfile with huge list of files.. this can be a
performance issue.

[David] (...) Perhaps the real need is for a recursive chmod in shutil?

The real need is to fix the weird permissions on some tarballs (such as
generator_tools-0.3.5.tar.gz in PyPI and the above mentioned pyopenssl
tarball).

This need usually leads to designing workarounds. 

I just think it is not simple (as in, keeping the code off from such
hacks that are tangential to the problem being solved) and effective (as
in, not having to deal with potential unintended side effects like bugs
in the post-fix chmoding or in the pre-fix tarinfo mode modifications).

Hence the feature request.
msg88934 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2009-06-05 11:00
I am still not convinced why tarfile needs this kind of a work-around
built in. We talk about a very small number of cases here and the
generator_tools-0.3.5.tar.gz is really broken beyond repair. It is the
only thing that should be fixed here IMO ;-)
I agree with David here. It is easy to manipulate the tarfile in
advance, as you have shown yourself. The performance argument does not
convince me either.
msg88956 - (view) Author: Sridhar Ratnakumar (srid) Date: 2009-06-05 17:45
[Lars] (...) We talk about a very small number of cases here and the
generator_tools-0.3.5.tar.gz is really broken beyond repair. It is the
only thing that should be fixed here IMO ;-)

Sure, that is what the pyopenssl folks did - fix their tarball. However,
it is reasonable expect certain tarballs to be 'broken beyond repair'
when you are running tarfile.extracall over a huge number of tarballs
such as the ones in PyPI.

Indeed, the tarfile module already has several fixes for such 'broken'
cases, Viz:

[quote]'If ignore_zeros is False, treat an empty block as the end of the
archive. If it is True, skip empty (and invalid) blocks and try to get
as many members as possible. This is only useful for reading
concatenated or **damaged** archives.'[endquote] [emphasis added]

[quote]'(...)Directory information like owner, modification time and
permissions are set after all members have been extracted. This is done
to work around two problems: A directory’s modification time is reset
each time a file is created in it. And, if a directory’s **permissions
do not allow writing**, extracting files to it will fail'[endquote]
[emphasis added]

[Lars] I agree with David here. It is easy to manipulate the tarfile in
advance, as you have shown yourself. The performance argument does not
convince me either.

Ok. Can you comment on this argument?

[quote]'(...)the very reason to write a program to extract tarball
(instead of doing it manually) is to automate it .. which automation is
*more effective and simple* if ``extractall`` had a flag such as
readaccess=True'[endquote] (emphasis added)

[quote]'I just think it is not simple (as in, keeping the code off from
such hacks that are tangential to the problem being solved) and
effective (as in, not having to deal with potential unintended side
effects like bugs in the post-fix chmoding or in the pre-fix tarinfo
mode modifications).'[endquote]
msg88967 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2009-06-05 19:18
Sure, tarfile contains numerous work-arounds for quirky and buggy
archives. Otherwise, it would not be usable in real-life.

But we should not mix up different issues here. tarfile reads and
extracts your generator_tools.tar just fine. Formally, the data is okay.
It's the stored information that is useless and that you are not happy
with. But as we both agree it is rather simple to fix this information
in advance:

import tarfile
tar = tarfile.open("generator_tools-0.3.5.tar.gz")
for t in tar:
    if t.isdir():
        t.mode = 0755
    else:
        t.mode = 0644
tar.extractall()
tar.close()

Sure, there is some functionality in extractall() that addresses issues
with inappropriate permissions, but without this functionality the
archive would not even *extract* cleanly. That is very different from
your problem.

In my opinion, the code above illustrates quite well, that tarfile was
designed to be high-level and flexible at the same time. Make use of
that. I honestly think that extractall() can do well without a
readaccess argument.
msg88985 - (view) Author: Sridhar Ratnakumar (srid) Date: 2009-06-06 00:26
[Lars] Sure, there is some functionality in extractall() that addresses
issues with inappropriate permissions, but without this functionality the
archive would not even *extract* cleanly. That is very different from
your problem.

Fair enough.

'tis time to creating a pypi package out of my high-level wrapper:

  http://gist.github.com/124597
msg88990 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2009-06-06 07:09
I close this issue then.
History
Date User Action Args
2009-06-06 07:09:26lars.gustaebelsetstatus: open -> closed
resolution: rejected
messages: + msg88990
2009-06-06 00:26:24sridsetmessages: + msg88985
2009-06-05 19:18:28lars.gustaebelsetmessages: + msg88967
2009-06-05 17:45:07sridsetmessages: + msg88956
2009-06-05 11:00:17lars.gustaebelsetassignee: lars.gustaebel

messages: + msg88934
nosy: + lars.gustaebel
2009-06-04 22:28:16sridsetmessages: + msg88915
2009-06-04 22:12:06r.david.murraysetpriority: low
nosy: + r.david.murray
messages: + msg88912

2009-06-04 21:52:02sridsetmessages: + msg88911
2009-06-04 21:46:30sridsetmessages: + msg88909
2009-06-04 21:45:51sridcreate