classification
Title: tarfile can't add in memory files (reopened)
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: lars.gustaebel, markgrandi, martin.panter, pitrou, python-dev, r.david.murray
Priority: low Keywords:

Created on 2014-08-15 23:29 by markgrandi, last changed 2016-02-20 00:18 by python-dev.

Messages (12)
msg225374 - (view) Author: Mark Grandi (markgrandi) * Date: 2014-08-15 23:29
So I ran into this problem today, where near impossible to create a tarfile.TarFile object, then add files to the archive, when the files are in memory file-like objects (like io.BytesIO, io.StringIO, etc)

code example:

###################
import tarfile, io

tarFileIo = io.BytesIO()

tarFileObj = tarfile.open(fileobj=tarFileIo, mode="w:xz")

fileToAdd = io.BytesIO("hello world!".encode("utf-8"))

# fixes "AttributeError: '_io.BytesIO' object has no attribute 'name'"
fileToAdd.name="helloworld.txt"

# fails with 'io.UnsupportedOperation: fileno'
tarInfo = tarFileObj.gettarinfo(arcname="helloworld.txt", fileobj=fileToAdd)

# never runs
tarFileObj.addfile(tarInfo, fileobj=fileToAdd)
###################

this was previously reported as this bug: http://bugs.python.org/issue10369 but I am unhappy with the resolution of "its not a bug", and the 'hack' that Lars posted as a solution. My reasons:

1: The zipfile module supports writing in memory files / bytes , using the following code (which is weird but it works)

tmp = zipfile.ZipFile("tmp.zip", mode="w")
import io
x = io.BytesIO("hello world!".encode("utf-8"))
tmp.writestr("helloworld.txt", x.getbuffer())
tmp.close()

2: the 'hack' that Lars posted, while it works, this is unintuitive and confusing, and isn't the intended behavior. What happens if your script is cross platform, what file do you open to give to os.stat()? In the code posted it uses open('/etc/passwd/') for the fileobj parameter to gettarinfo(), but that file doesn't exist on windows, now not only are you doing this silly hack, you have to have code that checks platform.system() to get a valid file that is known to exist for every system, or use sys.executable, except the documentation for that says it can return None or an empty string.

3: it is easy to fix (at least to me), in tarfile.gettarinfo(), if fileobj is passed in, and it doesn't have a fileno, then to create the TarInfo object, you set 'name' to be the arcname parameter, size = len(fileobj), then have default (maybe overridden by keyword args to gettarinfo()) values for uid/gid/uname/gname.

On a random tar.gz file that I downloaded from sourceforge, the uid/gid are '500' (when my gid is 20 and uid is 501), and the gname/uname are just empty strings. So its obvious that those don't matter most of the time, and when they do matter, you can modify the TarInfo object after creation or pass in values for them in a theoretical keywords argument to gettarinfo().

If no one wants to code this I can provide a patch, I just want the previous bug report's status of "not a bug" to be reconsidered.
msg225389 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-08-16 10:53
Why overcomplicate things?

import io, tarfile

with tarfile.open("foo.tar", mode="w") as tar:
    b = "hello world!".encode("utf-8")

    t = tarfile.TarInfo("helloworld.txt")
    t.size = len(b) # this is crucial
    tar.addfile(t, io.BytesIO(b))


My answer to issue10369 was never supposed to be used as a reference on how to add file-like objects to a TarFile. I posted it as a simpler but equivalent version of the code of the original poster, which is why it looks "hackish".

I think the documentation on TarFile.gettarinfo() is rather clear on how to use it (i.e. that it needs a file object with a valid file descriptor). Also, I think that the code above is intuitive and simple.
msg225516 - (view) Author: Mark Grandi (markgrandi) * Date: 2014-08-19 06:41
I still don't see why TarFile.add() can't be changed to accept a file like object so users don't have to fumble around with a TarInfo object when they just want add a file like object, and don't care about the permission bits and whatnot. It also says un the TarInfo section that you get those objects from gettarinfo(), and its not clear (at least to me) that you can just manually construct one, and that you only need the size property set.

I still the easiest way to solve this is to have gettarinfo() or add() accept file like objects that don't have file descriptors, as it would go a long way to simplifying the use of this module as well. (in zip file you don't ever have to worry about ZipInfo objects unless you need them...)
msg225524 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-08-19 09:47
tarfile needs to know the size of a file object beforehand because the tar header is written first followed by the file object's data. If the file object is not based on a real file descriptor, tarfile cannot simply use os.fstat() but the user has to pass the size somehow. And I doubt that it's a good idea to add size arguments to TarFile.add() and .addfile() because it might lead to confusion.

I think tarfile is rather good at exposing the important parts of its low-level api to the programmer, in a way that still leaves some work for him to do but without getting in his way.  I don't see why manually creating TarInfo objects is such a big deal. It is the far superior way because it offers the maximum freedom for the programmer - admittedly at the cost of a slightly steeper learning curve. And we have to account for many different use cases that people have. For example, you don't mention what you think creating directories from scratch should be like in your opinion.

With regard to the usage of the size attribute the documentation for TarFile.addfile() says clearly:

"""Add the TarInfo object tarinfo to the archive. If fileobj is given, tarinfo.size bytes are read from it and added to the archive. You can create TarInfo objects using gettarinfo()."""
msg225537 - (view) Author: Mark Grandi (markgrandi) * Date: 2014-08-19 19:02
I was just thinking that if os.stat fails, then you try getting the size by just calling len() on it, as stuff like io.BytesIO and io.StringIO will respond to that.

But if we are not changing the behavior of the API, at the very least there needs to be documentation changes / an example of adding a file that does not have a file descriptor. I can do that if needed.
msg225539 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-08-19 20:02
The example given by Lars shows that it's not that easy to come up with the right code. Why not make it easier?
msg225558 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-08-20 07:37
I don't have an idea how to make it easier and still meet all/most requirements and without cluttering up the api. The way it currently works allows the programmer to control every tiny aspect of a tar member. Maybe it's best to simply add a new entry to the Examples section of the tarfile documentation.


import tarfile, io

with tarfile.open("sample.tar", mode="w") as tar:
    t = tarfile.TarInfo("foo")
    t.type = tarfile.DIRTYPE
    tar.addfile(t)

    b = "Hello world!".encode("ascii")

    t = tarfile.TarInfo("foo/bar")
    t.size = len(b)
    tar.addfile(t, io.BytesIO(b))
msg225591 - (view) Author: Mark Grandi (markgrandi) * Date: 2014-08-20 22:05
> I don't have an idea how to make it easier and still meet all/most requirements and without cluttering up the api.

That is what i mentioned in my original post, a lot of the time users just _don't care_ about a lot of the stuff that a tar archive can store (permission bits, uid/gid, etc).

Say i'm on my mac. I can select a bunch of files and then right click -> compress. Pretending that it saves the resulting archive as a .tar.gz rather then a .zip, that's really it. The user doesn't care about the permission bits, uid/gid or any of that, they just want a compressed archive.

While the api does do a good job of exposing the low level parts of the api with TarInfo, being able to set all the stuff manually or have it figured out through gettarinfo() calling os.stat()

My original reasoning for this bug report is that its way too hard to do it for in-memory files, as those don't have file descriptors so os.stat() fails. But why can't we just make it so:

gettarinfo() is called
    * if it's a regular file, it continues as it does not
    * if it is NOT a regular file (no file descriptor), then it returns a TarInfo object with the 'name' and 'size' set, and the rest of the fields set to default values (the current user's uid and gid, acceptable permission bits and the correct type for a regular file (REGFILE?)
        * if gettarinfo() is called with a non regular file and it's name has a slash, then its assumed to be a folder structure, so then it will add the correct TarInfo with type = DIRTYPE and then insert the file underneath that folder, sorta how zipfile works. I looked at the tarfile.py code and it seems it does this already. 


This just adds the needed "easy use case" for the tarfile module, as the complicated low level api is there, we just need something that users just want to create an archive without worrying too much about the low level stuff. So then they can just:

import tarfile, io

fileToAdd = io.BytesIO("hello world!".encode("utf-8"))
with tarfile.open("sample.tar", mode="w") as tar:

    # this TarInfo object has:
    #    name = 'somefile.txt'
    #    type = REGTYPE (or whatever is 'just a regular file')
    #    uid = 501, gid = 20, gname=staff, uname=markgrandi, mode=644
    tmpTarInfo = tar.gettarinfo("somefile.txt", fileToAdd)
    tar.addfile()



So basically its just having defaults for the TarInfo object when gettarinfo() is given a file that doesn't have a file descriptor.
msg225648 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-08-22 06:58
Please provide a patch which allows easy addition of file-like objects (not only io.BytesIO) and directories, preferably hard and symbolic links, too. It would be nice to still be able to change attributes of a TarInfo before addition. Please also add tests.
msg239090 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-03-24 03:36
If you read the documentation it is clear that gettarfile() requires an OS file, so won’t work with an internal Python-only file object. Maybe the documentation could be tweaked, but I don’t think the gettarfile() implementation should be changed. To me the whole point of it is to call fstat() on the file and fill in the TarInfo attributes appropriately.

Instead, perhaps an enhancement could be made that allowed something like this:

metadata = TarInfo.make_file('helloworld.txt', len(b))
tarFileObj.addfile(metadata, io.BytesIO(b))

The corresponding TarInfo class could grow new presets looking something like:

class TarInfo:
    @classmethod
    def make_file(cls, name, size):  # Name and size are mandatory
        self = cls(name)
        self.type = REGTYPE
        self.size = size
        self.mtime = None  # Force addfile() to set it to some default time.time() value
        self.mode = 0o644
        return self
    
    @classmethod
    def make_executable(cls, name, size):
        ...
        self.mode = 0o755
        ...
    
    @classmethod
    def make_directory(cls, name):
        ...
        self.type = DIRTYPE
        ...
    
    def make_hard_link(cls, name, target)
    def make_symlink(cls, name, target)
    def make_block_device(cls, name, major, minor)  # Set undocumented attributes
    def make_char_device(cls, name, major, minor)
msg241585 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-20 01:34
In Issue 22468, I posted a patch which encourages using TarInfo directly, and hopefully clarifies that gettarinfo() is only for OS files. I think that should cover the documentation aspect of this bug, although an enhancement to synthesize TarInfo objects for regular files etc might still be a good idea.
msg260539 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-02-20 00:18
New changeset 94a94deaf06a by Martin Panter in branch '3.5':
Issues #22468, #21996, #22208: Clarify gettarinfo() and TarInfo usage
https://hg.python.org/cpython/rev/94a94deaf06a

New changeset 9d5217aaea13 by Martin Panter in branch '2.7':
Issues #22468, #21996, #22208: Clarify gettarinfo() and TarInfo usage
https://hg.python.org/cpython/rev/9d5217aaea13
History
Date User Action Args
2016-02-20 00:18:59python-devsetnosy: + python-dev
messages: + msg260539
2015-04-20 01:34:21martin.pantersetmessages: + msg241585
2015-03-24 03:36:29martin.pantersetnosy: + martin.panter
messages: + msg239090
2014-08-22 06:58:24lars.gustaebelsettype: behavior -> enhancement
messages: + msg225648
stage: needs patch
2014-08-20 22:05:27markgrandisetmessages: + msg225591
2014-08-20 07:37:22lars.gustaebelsetmessages: + msg225558
2014-08-19 20:02:32pitrousetnosy: + pitrou
messages: + msg225539
2014-08-19 19:02:13markgrandisetmessages: + msg225537
2014-08-19 09:47:17lars.gustaebelsetmessages: + msg225524
2014-08-19 06:41:32markgrandisetmessages: + msg225516
2014-08-19 06:22:31r.david.murraysetnosy: + r.david.murray
2014-08-16 10:53:45lars.gustaebelsetpriority: normal -> low
assignee: lars.gustaebel
type: behavior
messages: + msg225389
2014-08-16 01:13:33ned.deilysetnosy: + lars.gustaebel
2014-08-15 23:29:17markgrandicreate