This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: [RFE] tarfile: support adding file objects without prior known size
Type: enhancement
Stage: patch review
Components: Library (Lib)
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: martin.panter, mgorny, remi.lapeyre
Priority: normal
Keywords: patch

Created on 2018-11-13 08:27 by mgorny, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL       Status  Linked
PR 10714  open    remi.lapeyre, 2018-11-26 10:31
Messages (11)
msg329818 - (view) Author: Michał Górny (mgorny) * Date: 2018-11-13 08:27
Currently, the tarfile module only supports adding a file if its size is known beforehand. However, I think it would be helpful to be able to store large, dynamically generated streams straight into an (uncompressed) .tar file without having to precalculate the final size or use a temporary file.

I'm not really sure what the API should look like (i.e. whether it should be a new method or an extension of addfile()), but the mechanism would be rather simple: write a TarInfo with a size of 0, write data until the end of the stream, write the padding required by the amount of data written, then seek back and update the TarInfo.

Of course, the use of this API would have to be restricted to cases where the underlying file supports seeking back and random writes, i.e. not a stream and not compressed.
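The mechanism described above can be sketched roughly as follows. This is only an illustration, not the implementation from any patch: `add_stream` is a hypothetical helper name, and it assumes the archive was opened with plain mode "w" on a seekable file so that `tf.fileobj` is the underlying file itself. It also updates `tf.offset` and `tf.members`, which are not documented API.

```python
import copy
import io
import tarfile


def add_stream(tf, tarinfo, stream, bufsize=64 * 1024):
    """Hypothetical helper: add `stream` to `tf` without knowing its size.

    Assumes `tf` is an uncompressed archive opened with mode "w" on a
    seekable file object.
    """
    out = tf.fileobj
    tarinfo = copy.copy(tarinfo)

    # 1. Write a placeholder header with size 0 and remember its position.
    tarinfo.size = 0
    header_pos = out.tell()
    placeholder = tarinfo.tobuf(tf.format, tf.encoding, tf.errors)
    out.write(placeholder)

    # 2. Copy data until end of stream, counting the bytes written.
    size = 0
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        out.write(chunk)
        size += len(chunk)

    # 3. Pad the data with NUL bytes up to the next 512-byte block boundary.
    remainder = size % tarfile.BLOCKSIZE
    if remainder:
        out.write(b"\0" * (tarfile.BLOCKSIZE - remainder))
    end_pos = out.tell()

    # 4. Seek back and overwrite the placeholder with the real header.
    tarinfo.size = size
    header = tarinfo.tobuf(tf.format, tf.encoding, tf.errors)
    if len(header) != len(placeholder):
        # E.g. a size that needs an extended header; it cannot be patched
        # in place, and the archive is left inconsistent at this point.
        raise ValueError("header length changed; cannot patch in place")
    out.seek(header_pos)
    out.write(header)
    out.seek(end_pos)

    # Keep TarFile's bookkeeping in step (assumption: undocumented attributes).
    tf.offset = end_pos
    tf.members.append(tarinfo)
    return tarinfo
```

A caller would use it as `add_stream(tf, tarfile.TarInfo("dump.bin"), some_stream)` inside a `with tarfile.open(path, "w") as tf:` block.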
msg330203 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2018-11-21 13:26
Adding this API would require providing a way to set file status fields such as mode, uid, gid, mtime, type, linkname, uname and gname.

Adding a new argument to gettarinfo looks weird to me; adding a new method may be better. I will try to propose a working implementation.
msg330204 - (view) Author: Michał Górny (mgorny) * Date: 2018-11-21 14:24
> Adding this API would require providing a way to set file status fields such as mode, uid, gid, mtime, type, linkname, uname and gname.

That's why I mentioned addfile(): it takes a TarInfo object for that purpose. I suppose the new function should have the same parameters, except that it would set the size in the TarInfo instance instead of reading it.
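For reference, this is roughly how the existing addfile() is used today: the TarInfo carries the metadata, and its size has to be filled in before the call, since addfile() copies exactly that many bytes. The function name `build_archive` and the member name and attribute values are illustrative assumptions.

```python
import io
import tarfile


def build_archive(payload):
    """Return the bytes of a tar archive holding `payload` as hello.txt."""
    info = tarfile.TarInfo("hello.txt")
    info.size = len(payload)   # must be known before addfile() is called
    info.mode = 0o600          # metadata travels in the TarInfo object
    info.uname = "builder"     # illustrative value
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```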
msg330205 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2018-11-21 14:36
Yes, but just as the add method conveniently builds the TarInfo object for the user, shouldn't we add a new convenience method to TarFile to support this (in addition to modifying TarInfo)?
msg330414 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2018-11-26 10:32
Hi @mgorny, the changeset in PR 10714 should do what you are looking for.
msg330427 - (view) Author: Michał Górny (mgorny) * Date: 2018-11-26 12:05
Thanks a lot! I've left a few comments based on an eyeball review. I'm going to test it later today.
msg330501 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-11-27 10:39
The changes at <https://github.com/python/cpython/pull/10714/files/11ca0f0> include various other behaviour changes that are not discussed here. They seem to be there just so that you can use the TCP socket from “urlopen” with “gettarinfo”. But “gettarinfo” is supposed to be for named filesystem objects that could be members of a tar file. See Issue 22208, which discusses how to create TarInfo objects without using the OS filesystem.
msg330532 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2018-11-27 15:57
I came across this thread while working on the PR. Creating the tarinfo as Lars Gustäbel suggests does not work, since you still need to get the size before reading.

Do you think the API should be different?
msg330870 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-12-02 09:04
If something like your “addbuffer” method existed, then you wouldn’t need to get the size first, right? We don’t need the changes in “gettarinfo” for “addbuffer” to be useful.

BTW have you considered returning a file writer rather than accepting a file reader? Similar to ZipFile.open(..., mode='w'). It would be a bit more complicated to implement, but also more flexible for the user:

import gzip
import io
import shutil
import xml.sax.saxutils

# “tf” is an open TarFile; “get_file_writer” is the proposed method, it does not exist yet.

# File downloaded with “urlopen”, also possible with TarFile.addbuffer API:
with tf.get_file_writer(download_tarinfo) as writer:
    shutil.copyfileobj(urlopen_response, writer)

# SVG file generated on the fly, encoded with UTF-8 and Gzip compressed; not possible with “addbuffer”:
# (GzipFile.close() does not close the fileobj it was given, so the writer gets its own “with”.)
with tf.get_file_writer(svgz_tarinfo) as writer:
    gzip_writer = gzip.GzipFile(fileobj=writer, mode='w')
    with io.TextIOWrapper(gzip_writer, 'utf-8') as text_writer:
        svg = xml.sax.saxutils.XMLGenerator(text_writer, 'UTF-8')
        svg.startDocument()
        ...
msg330878 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2018-12-02 10:49
Thanks, this looks interesting.

How will the file writer know that the whole file has been written? Is the tar header rewritten on `close`?

Are `download_tarinfo` and `svgz_tarinfo` built by hand if we don't make changes in `gettarinfo`?
msg330881 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-12-02 12:01
Yeah, the TarFile class would fix up the header when the user calls “close”. I think this is how it was done for ZipFile (implemented in Issue 26039).
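A minimal sketch of such a writer, assuming an uncompressed archive opened with mode "w" on a seekable file. `MemberWriter` is a hypothetical name standing in for whatever the proposed “get_file_writer” would return; it is not an existing tarfile class, and it touches `tf.offset` and `tf.members`, which are not documented API.

```python
import copy
import io
import tarfile


class MemberWriter(io.RawIOBase):
    """Hypothetical writer that fixes up the member header on close()."""

    def __init__(self, tf, tarinfo):
        super().__init__()
        self._tf = tf
        self._out = tf.fileobj
        self._info = copy.copy(tarinfo)
        self._info.size = 0
        self._size = 0
        # Write a placeholder header and remember where it lives.
        self._header_pos = self._out.tell()
        self._placeholder = self._info.tobuf(tf.format, tf.encoding, tf.errors)
        self._out.write(self._placeholder)

    def writable(self):
        return True

    def write(self, data):
        if self.closed:
            raise ValueError("write to closed member writer")
        view = memoryview(data)
        self._out.write(view)
        self._size += view.nbytes
        return view.nbytes

    def close(self):
        if self.closed:
            return
        try:
            out, tf = self._out, self._tf
            # Pad the data to a multiple of the 512-byte block size.
            remainder = self._size % tarfile.BLOCKSIZE
            if remainder:
                out.write(b"\0" * (tarfile.BLOCKSIZE - remainder))
            end_pos = out.tell()
            # Rewrite the header now that the size is known.
            self._info.size = self._size
            header = self._info.tobuf(tf.format, tf.encoding, tf.errors)
            if len(header) != len(self._placeholder):
                raise ValueError("header length changed; cannot patch in place")
            out.seek(self._header_pos)
            out.write(header)
            out.seek(end_pos)
            # Keep TarFile's bookkeeping in step (undocumented attributes).
            tf.offset = end_pos
            tf.members.append(self._info)
        finally:
            super().close()
```

Because the fix-up happens in close(), the writer should be used in a `with` block or closed explicitly before the next member is added.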

Yes, currently you would have to build the tarinfo object by hand. I think a helper function would be nice, but the “TarFile.gettarinfo” method is not a good place for that. In fact, documenting the default settings for the TarInfo constructor may be enough. Then you might get away with the following if you just want a regular file with no specific attributes:

from tarfile import TarInfo

download_tarinfo = TarInfo('index.html')
svgz_tarinfo = TarInfo('image.svg.gz')
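For illustration, the defaults that a bare TarInfo carries can be inspected as below. The helper name `tarinfo_defaults` is an assumption made for this sketch; the attributes are the ones listed earlier in this thread.

```python
import tarfile


def tarinfo_defaults(name):
    """Return the attribute values of a freshly constructed TarInfo."""
    info = tarfile.TarInfo(name)
    fields = ("mode", "uid", "gid", "size", "mtime",
              "type", "linkname", "uname", "gname")
    return {field: getattr(info, field) for field in fields}


defaults = tarinfo_defaults("index.html")
# A regular file (tarfile.REGTYPE) with mode 0o644, uid/gid 0, mtime 0
# and empty uname/gname/linkname.
```

Any of these can be overridden by plain attribute assignment before the object is handed to the archive, e.g. `info.mtime = int(time.time())`.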
History
Date                 User           Action  Args
2022-04-11 14:59:08  admin          set     github: 79408
2018-12-02 12:01:37  martin.panter  set     messages: + msg330881
2018-12-02 10:49:01  remi.lapeyre   set     messages: + msg330878
2018-12-02 09:04:20  martin.panter  set     messages: + msg330870
2018-11-27 15:57:03  remi.lapeyre   set     messages: + msg330532
2018-11-27 10:39:56  martin.panter  set     nosy: + martin.panter
                                            messages: + msg330501
2018-11-26 12:05:59  mgorny         set     messages: + msg330427
2018-11-26 10:32:46  remi.lapeyre   set     messages: + msg330414
2018-11-26 10:31:23  remi.lapeyre   set     keywords: + patch
                                            stage: patch review
                                            pull_requests: + pull_request9963
2018-11-21 14:36:43  remi.lapeyre   set     messages: + msg330205
2018-11-21 14:24:47  mgorny         set     messages: + msg330204
2018-11-21 13:26:44  remi.lapeyre   set     messages: + msg330203
2018-11-21 12:45:50  remi.lapeyre   set     nosy: + remi.lapeyre
2018-11-13 08:27:56  mgorny         create