Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tarfile can't add in memory files (reopened) #66404

Open
mgrandi mannequin opened this issue Aug 15, 2014 · 12 comments
Open

tarfile can't add in memory files (reopened) #66404

mgrandi mannequin opened this issue Aug 15, 2014 · 12 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@mgrandi
Copy link
Mannequin

mgrandi mannequin commented Aug 15, 2014

BPO 22208
Nosy @gustaebel, @pitrou, @bitdancer, @vadmium, @mgrandi

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = 'https://github.com/gustaebel'
closed_at = None
created_at = <Date 2014-08-15.23:29:17.566>
labels = ['type-feature', 'library']
title = "tarfile can't add in memory files (reopened)"
updated_at = <Date 2016-02-20.00:18:59.202>
user = 'https://github.com/mgrandi'

bugs.python.org fields:

activity = <Date 2016-02-20.00:18:59.202>
actor = 'python-dev'
assignee = 'lars.gustaebel'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2014-08-15.23:29:17.566>
creator = 'markgrandi'
dependencies = []
files = []
hgrepos = []
issue_num = 22208
keywords = []
message_count = 12.0
messages = ['225374', '225389', '225516', '225524', '225537', '225539', '225558', '225591', '225648', '239090', '241585', '260539']
nosy_count = 6.0
nosy_names = ['lars.gustaebel', 'pitrou', 'r.david.murray', 'python-dev', 'martin.panter', 'markgrandi']
pr_nums = []
priority = 'low'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue22208'
versions = ['Python 3.4']

@mgrandi
Copy link
Mannequin Author

mgrandi mannequin commented Aug 15, 2014

So I ran into this problem today, where near impossible to create a tarfile.TarFile object, then add files to the archive, when the files are in memory file-like objects (like io.BytesIO, io.StringIO, etc)

code example:

###################
import tarfile, io

tarFileIo = io.BytesIO()

tarFileObj = tarfile.open(fileobj=tarFileIo, mode="w:xz")

fileToAdd = io.BytesIO("hello world!".encode("utf-8"))

# fixes "AttributeError: '_io.BytesIO' object has no attribute 'name'"
fileToAdd.name="helloworld.txt"

# fails with 'io.UnsupportedOperation: fileno'
tarInfo = tarFileObj.gettarinfo(arcname="helloworld.txt", fileobj=fileToAdd)

# never runs
tarFileObj.addfile(tarInfo, fileobj=fileToAdd)
###################

this was previously reported as this bug: http://bugs.python.org/issue10369 but I am unhappy with the resolution of "its not a bug", and the 'hack' that Lars posted as a solution. My reasons:

1: The zipfile module supports writing in memory files / bytes , using the following code (which is weird but it works)

tmp = zipfile.ZipFile("tmp.zip", mode="w")
import io
x = io.BytesIO("hello world!".encode("utf-8"))
tmp.writestr("helloworld.txt", x.getbuffer())
tmp.close()

2: the 'hack' that Lars posted, while it works, this is unintuitive and confusing, and isn't the intended behavior. What happens if your script is cross platform, what file do you open to give to os.stat()? In the code posted it uses open('/etc/passwd/') for the fileobj parameter to gettarinfo(), but that file doesn't exist on windows, now not only are you doing this silly hack, you have to have code that checks platform.system() to get a valid file that is known to exist for every system, or use sys.executable, except the documentation for that says it can return None or an empty string.

3: it is easy to fix (at least to me), in tarfile.gettarinfo(), if fileobj is passed in, and it doesn't have a fileno, then to create the TarInfo object, you set 'name' to be the arcname parameter, size = len(fileobj), then have default (maybe overridden by keyword args to gettarinfo()) values for uid/gid/uname/gname.

On a random tar.gz file that I downloaded from sourceforge, the uid/gid are '500' (when my gid is 20 and uid is 501), and the gname/uname are just empty strings. So its obvious that those don't matter most of the time, and when they do matter, you can modify the TarInfo object after creation or pass in values for them in a theoretical keywords argument to gettarinfo().

If no one wants to code this I can provide a patch, I just want the previous bug report's status of "not a bug" to be reconsidered.

@mgrandi mgrandi mannequin added the stdlib Python modules in the Lib dir label Aug 15, 2014
@gustaebel
Copy link
Mannequin

gustaebel mannequin commented Aug 16, 2014

Why overcomplicate things?

import io, tarfile

with tarfile.open("foo.tar", mode="w") as tar:
    b = "hello world!".encode("utf-8")

    t = tarfile.TarInfo("helloworld.txt")
    t.size = len(b) # this is crucial
    tar.addfile(t, io.BytesIO(b))

My answer to bpo-10369 was never supposed to be used as a reference on how to add file-like objects to a TarFile. I posted it as a simpler but equivalent version of the code of the original poster, which is why it looks "hackish".

I think the documentation on TarFile.gettarinfo() is rather clear on how to use it (i.e. that it needs a file object with a valid file descriptor). Also, I think that the code above is intuitive and simple.

@gustaebel gustaebel mannequin added the type-bug An unexpected behavior, bug, or error label Aug 16, 2014
@gustaebel gustaebel mannequin self-assigned this Aug 16, 2014
@mgrandi
Copy link
Mannequin Author

mgrandi mannequin commented Aug 19, 2014

I still don't see why TarFile.add() can't be changed to accept a file like object so users don't have to fumble around with a TarInfo object when they just want add a file like object, and don't care about the permission bits and whatnot. It also says un the TarInfo section that you get those objects from gettarinfo(), and its not clear (at least to me) that you can just manually construct one, and that you only need the size property set.

I still the easiest way to solve this is to have gettarinfo() or add() accept file like objects that don't have file descriptors, as it would go a long way to simplifying the use of this module as well. (in zip file you don't ever have to worry about ZipInfo objects unless you need them...)

@gustaebel
Copy link
Mannequin

gustaebel mannequin commented Aug 19, 2014

tarfile needs to know the size of a file object beforehand because the tar header is written first followed by the file object's data. If the file object is not based on a real file descriptor, tarfile cannot simply use os.fstat() but the user has to pass the size somehow. And I doubt that it's a good idea to add size arguments to TarFile.add() and .addfile() because it might lead to confusion.

I think tarfile is rather good at exposing the important parts of its low-level api to the programmer, in a way that still leaves some work for him to do but without getting in his way. I don't see why manually creating TarInfo objects is such a big deal. It is the far superior way because it offers the maximum freedom for the programmer - admittedly at the cost of a slightly steeper learning curve. And we have to account for many different use cases that people have. For example, you don't mention what you think creating directories from scratch should be like in your opinion.

With regard to the usage of the size attribute the documentation for TarFile.addfile() says clearly:

"""Add the TarInfo object tarinfo to the archive. If fileobj is given, tarinfo.size bytes are read from it and added to the archive. You can create TarInfo objects using gettarinfo()."""

@mgrandi
Copy link
Mannequin Author

mgrandi mannequin commented Aug 19, 2014

I was just thinking that if os.stat fails, then you try getting the size by just calling len() on it, as stuff like io.BytesIO and io.StringIO will respond to that.

But if we are not changing the behavior of the API, at the very least there needs to be documentation changes / an example of adding a file that does not have a file descriptor. I can do that if needed.

@pitrou
Copy link
Member

pitrou commented Aug 19, 2014

The example given by Lars shows that it's not that easy to come up with the right code. Why not make it easier?

@gustaebel
Copy link
Mannequin

gustaebel mannequin commented Aug 20, 2014

I don't have an idea how to make it easier and still meet all/most requirements and without cluttering up the api. The way it currently works allows the programmer to control every tiny aspect of a tar member. Maybe it's best to simply add a new entry to the Examples section of the tarfile documentation.

import tarfile, io

with tarfile.open("sample.tar", mode="w") as tar:
    t = tarfile.TarInfo("foo")
    t.type = tarfile.DIRTYPE
    tar.addfile(t)

    b = "Hello world!".encode("ascii")

    t = tarfile.TarInfo("foo/bar")
    t.size = len(b)
    tar.addfile(t, io.BytesIO(b))

@mgrandi
Copy link
Mannequin Author

mgrandi mannequin commented Aug 20, 2014

I don't have an idea how to make it easier and still meet all/most requirements and without cluttering up the api.

That is what i mentioned in my original post, a lot of the time users just _don't care_ about a lot of the stuff that a tar archive can store (permission bits, uid/gid, etc).

Say i'm on my mac. I can select a bunch of files and then right click -> compress. Pretending that it saves the resulting archive as a .tar.gz rather then a .zip, that's really it. The user doesn't care about the permission bits, uid/gid or any of that, they just want a compressed archive.

While the api does do a good job of exposing the low level parts of the api with TarInfo, being able to set all the stuff manually or have it figured out through gettarinfo() calling os.stat()

My original reasoning for this bug report is that its way too hard to do it for in-memory files, as those don't have file descriptors so os.stat() fails. But why can't we just make it so:

gettarinfo() is called
* if it's a regular file, it continues as it does not
* if it is NOT a regular file (no file descriptor), then it returns a TarInfo object with the 'name' and 'size' set, and the rest of the fields set to default values (the current user's uid and gid, acceptable permission bits and the correct type for a regular file (REGFILE?)
* if gettarinfo() is called with a non regular file and it's name has a slash, then its assumed to be a folder structure, so then it will add the correct TarInfo with type = DIRTYPE and then insert the file underneath that folder, sorta how zipfile works. I looked at the tarfile.py code and it seems it does this already.

This just adds the needed "easy use case" for the tarfile module, as the complicated low level api is there, we just need something that users just want to create an archive without worrying too much about the low level stuff. So then they can just:

import tarfile, io

fileToAdd = io.BytesIO("hello world!".encode("utf-8"))
with tarfile.open("sample.tar", mode="w") as tar:

    # this TarInfo object has:
    #    name = 'somefile.txt'
    #    type = REGTYPE (or whatever is 'just a regular file')
    #    uid = 501, gid = 20, gname=staff, uname=markgrandi, mode=644
    tmpTarInfo = tar.gettarinfo("somefile.txt", fileToAdd)
    tar.addfile()

So basically its just having defaults for the TarInfo object when gettarinfo() is given a file that doesn't have a file descriptor.

@gustaebel
Copy link
Mannequin

gustaebel mannequin commented Aug 22, 2014

Please provide a patch which allows easy addition of file-like objects (not only io.BytesIO) and directories, preferably hard and symbolic links, too. It would be nice to still be able to change attributes of a TarInfo before addition. Please also add tests.

@gustaebel gustaebel mannequin added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Aug 22, 2014
@vadmium
Copy link
Member

vadmium commented Mar 24, 2015

If you read the documentation it is clear that gettarfile() requires an OS file, so won’t work with an internal Python-only file object. Maybe the documentation could be tweaked, but I don’t think the gettarfile() implementation should be changed. To me the whole point of it is to call fstat() on the file and fill in the TarInfo attributes appropriately.

Instead, perhaps an enhancement could be made that allowed something like this:

metadata = TarInfo.make_file('helloworld.txt', len(b))
tarFileObj.addfile(metadata, io.BytesIO(b))

The corresponding TarInfo class could grow new presets looking something like:

class TarInfo:
    @classmethod
    def make_file(cls, name, size):  # Name and size are mandatory
        self = cls(name)
        self.type = REGTYPE
        self.size = size
        self.mtime = None  # Force addfile() to set it to some default time.time() value
        self.mode = 0o644
        return self
    
    @classmethod
    def make_executable(cls, name, size):
        ...
        self.mode = 0o755
        ...
    
    @classmethod
    def make_directory(cls, name):
        ...
        self.type = DIRTYPE
        ...
    
    def make_hard_link(cls, name, target)
    def make_symlink(cls, name, target)
    def make_block_device(cls, name, major, minor)  # Set undocumented attributes
    def make_char_device(cls, name, major, minor)

@vadmium
Copy link
Member

vadmium commented Apr 20, 2015

In bpo-22468, I posted a patch which encourages using TarInfo directly, and hopefully clarifies that gettarinfo() is only for OS files. I think that should cover the documentation aspect of this bug, although an enhancement to synthesize TarInfo objects for regular files etc might still be a good idea.

@python-dev
Copy link
Mannequin

python-dev mannequin commented Feb 20, 2016

New changeset 94a94deaf06a by Martin Panter in branch '3.5':
Issues bpo-22468, bpo-21996, bpo-22208: Clarify gettarinfo() and TarInfo usage
https://hg.python.org/cpython/rev/94a94deaf06a

New changeset 9d5217aaea13 by Martin Panter in branch '2.7':
Issues bpo-22468, bpo-21996, bpo-22208: Clarify gettarinfo() and TarInfo usage
https://hg.python.org/cpython/rev/9d5217aaea13

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
Status: No status
Development

No branches or pull requests

2 participants