classification
Title: Multivolume support in tarfile module
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.5
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: christian.heimes, edulix, lars.gustaebel, loewis, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-06-28 08:06 by edulix, last changed 2014-04-14 11:20 by lars.gustaebel. This issue is now closed.

Files
File name Uploaded Description Edit
cpython-tarfile-multivolume.patch edulix, 2013-06-28 08:06 Patch providing the code review
split-xhdtype.tar.gz lars.gustaebel, 2014-04-13 12:11
Messages (17)
msg191979 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2013-06-28 08:06
The patch attached provides implementation for multivolume support for tarfile module. It contains both the changes in the module and a battery of unit tests. It contains support for multivolume for both GNU and PAX formats.

The main idea behind this proposal is that for dealing with new volumes, the user will provide a callback function. In this callback function the user typically calls to TarFile.open_volume(filename) with the appropiate filename. This is quite flexible in the sense that the callback function could even ask the user for the filename of the next volume (as tar command does), or do anything they need before returning or before calling to open_volume. 

Please feel free to comment on how to improve the implementation or the API. Documentation for the new feature will be provided at a later stage of the review process if the patch gets a good review.
msg191993 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-06-28 13:19
Eduardo, can you please fill out the contributor form?
msg193811 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2013-07-28 08:26
Sure, I will fill it out. But is it required?
msg200913 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2013-10-22 10:47
could you please check if my contributor form is already processed?
msg200914 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-22 10:53
You have an asterisk next to your user name. So yes, your form was received and processed.
msg209418 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2014-01-27 08:50
Do we have any news on this patch?
msg209437 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-01-27 11:54
Lars, would you like to take a look?
msg209633 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-01-29 11:26
I cannot yet go into the details, because I have not tested the patch.

The comments, docstrings and quoting are not very consistent with the rest of the module. There are a few spelling mistakes. The open_volume() method is more or less a copy of the open() method which is not optimal.

The patch adds a lot of complexity to the tarfile module for a use case that only a few connoisseurs benefit from. It seems to alter some inherent TarFile mechanics that people might rely on, e.g. the members attribute contains only the members stored in the current volume not the overall entirety of members. Does this patch reliably allow random-access?

Would it be possible/easier to add the same functionality using a separate class MultiVolumeTarFile instead?
msg209635 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2014-01-29 12:15
> I cannot yet go into the details, because I have not tested the patch.

> The comments, docstrings and quoting are not very consistent with the rest of the module. There are a few spelling mistakes. 

I can try to take care of this, though it'd be best if you can point me out concrete examples.

> The open_volume() method is more or less a copy of the open() method which is not optimal.

I know, but trying to do something else might be even worse..

> The patch adds a lot of complexity to the tarfile module for a use case that only a few connoisseurs benefit from. It seems to alter some inherent TarFile mechanics that people might rely on, e.g. the members attribute contains only the members stored in the current volume not the overall entirety of members. 

Well, that doesn't make much sense to me. You say that people rely on something like  "members attribute contains only the members stored in the current volume not the overall entirety of members", but as you know, python tarfile lib didn't support volumes before this patch, so I guess people couldn't be relying on that anyway.

Also please bear in mind that tarfile volumes support is part of the tar standard, and altough not everyone uses this, it has its niche uses (sliced backups etc).

> Does this patch reliably allow random-access?

Yes and no. 

No: when you open a volume for reading, the list of members is cleared as you pointed so you cannot access to them.

Yes: you can open any volume at the begining of a file, and it read the tarfile from there. I do that in my use-case.

> Would it be possible/easier to add the same functionality using a separate class MultiVolumeTarFile instead?

If you find open_volume similar to open() and don't like, I'm quite sure you would like all the needed copy-paste having this a separate class would entail. Also as I said before, might not make much sense as this is part of the standard.
msg209711 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-01-30 11:57
At first, I'd like to take back my comment on this patch being too complex for too little benefit. That is no real argument.

Okay, I gave it a shot and I have a few more remarks:

The patch does not support iterating over a multi-volume tar archive, e.g. for TarFile.list(). You must implement this.

In my opinion, a tar archive is one logical unit even if it spans across multiple volumes. Thus, it is vital to have .getmembers() and .getnames() reflect the entirety of the archive, e.g. to support "if filename in .getnames()". I think it could be a good idea to store the volume number along each TarInfo object for random-access.

By the way, which standard are you referring to? The only one I know of is POSIX pax which doesn't say anything about multiple volumes.
msg209997 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-02-02 15:53
I had the following idea: What about a separate class, let's call it TarVolumeSet for now, that maps a set of (virtual) volumes onto one big file-like object. This TarVolumeSet will be passed to a TarFile constructor as the fileobj argument. It is subclassable for implementing custom behavior.


class MyTarVolumeSet(tarfile.TarVolumeSet):

    def __init__(self, template):
        self.template = template

    def open_volume(self, volume_number):
        return open(self.template % volume_number, "rb")

volumes = MyTarVolumesSet("test.tar.%03d")
with tarfile.open(fileobj=volumes, mode="r:") as tar:
    for t in tar:
        print(t.name)


In my opinion, this approach has a number of advantages: Most importantly, it separates the multi-volume code from the TarFile class, which reduces the invasiveness, complexity and maintenance burden of the original approach. The TarFile class would be totally agnostic about the concept of multiple volumes, TarVolumeSet looks just like another file-object to TarFile. Looks like the cleanest solution to me so far.
msg213141 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2014-03-11 12:11
I guess I got it wrong, it's not part of the POSIX standard, just part of the GNU tar documentation. About the getmembers and getnames not reflecting the entirety of the archive, it's an optimization I needed and I think ccan be quite handy. It's also consistent with how the tar command works afaik, just listing the contents of the current volume. But it's true that could be done.

Regarding the TarVolumeSet idea, it could work but it's not as easy as that. You don't want to directly do a plain open in there, because you want to be able to deal with read/write modes, with gzip/bzip/Stream class. You also want to give the user maximum flexibility. That's why in my implementation new_volume_handler and open_volume functions are separated. Similarly, we could do something like this:

class MyTarVolumeSet(tarfile.TarVolumeSet):

    def __init__(self, template, max_vol_size):
        self.template = template
        self.max_vol_size = max_vol_size

    def new_volume_handler(self, tarfile, volume_number):
        self.open_volume(self.template % volume_number, tarfile)

volumes = MyTarVolumesSet("test.tar.%03d")
with tarfile.open(fileobj=volumes, mode="w:") as tar:
    for t in tar:
        print(t.name)

Note that the new_volume_handler in this example receives more or less the same params as in my patch (not the base name, because it's already stored in the template), and that the open_volume really abstracts which way it will be opened. It could also have, as in my patch, an optional fileobj parameter, for a new indirection/abstraction.

In any case, this kind of abstraction would still really need some kind of hooking with tarfile, because a multivol tarfile is not exactly the same as a normal tarfile chopped up. This might be doable unilateraly by a smart TarVolumeSet getting the information from the tarfile and modifying/patching anything needed. This is one of the reasons why one would probably would still need access to the tarfile inside the open_volume function.
msg213243 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-03-12 10:31
> It's also consistent with how the tar command works afaik, just listing the contents of the current volume.

No, GNU tar operates on the entirety of the archive and asks for the filename of the subsequent volume every time it hits eof in the current volume.

> You don't want to directly do a plain open in there, because you want to be able to deal with read/write modes, with gzip/bzip/Stream class.

The example I gave is based on the idea that there is a TarVolumeSet class in the tarfile module that implements all the required file-object methods (e.g.  read(), write(), seek(), etc.) and acts as if the sequence of volumes is actually one big file. It is passed to tarfile.open() as the fileobj argument. This TarVolumeSet class is supposed to be subclassable to let the user implement her/his own mode of operation. This way the open_volume() method can do whatever the user wants it to do. The TarVolumeSet class might as well have a new_volume() method for writing multivol archives, the example only covered the case of reading a multivol archive.

BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.

> [...] because a multivol tarfile is not exactly the same as a normal tarfile chopped up.

No, I think it is exactly that. The only purpose of the GNUTYPE_MULTIVOL header that is at the start of each subsequent volume is to give GNU tar the ability to detect if it is reading the correct volume. It is not essential and could as well be left out.
msg216007 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2014-04-13 09:27
> The example I gave is based on the idea that there is a TarVolumeSet class in the tarfile module that implements all the required file-object methods (e.g.  read(), write(), seek(), etc.) and acts as if the sequence of volumes is actually one big file. It is passed to tarfile.open() as the fileobj argument. This TarVolumeSet class is supposed to be subclassable to let the user implement her/his own mode of operation. This way the open_volume() method can do whatever the user wants it to do. The TarVolumeSet class might as well have a new_volume() method for writing multivol archives, the example only covered the case of reading a multivol archive.

This is not so easy, because for example if you want to move the logic in addfile() that manages volumes to the write() function, you have some issues:
 * write() will need to take into account blocks (BLOCKSIZE), just to be able to split the volumes correctly. This is something that usually shouldn't belong in a write() function as it has to do to tarfile.. and it can be messy that both layers deal with it (write() splitting volumes, and tarfile adding NUL at the end of a BLOCK..) This can be done I guess, but remember, we split a volume only in the middle of a big file, not in any other case (AFAIK). Hopefully you don't get huge pax headers or anything strange. Usually all other records are either in the begining or just occupy one block, but can we rule this problem out for good?

 * multivolume logic in write() needs read/write access to the current tarinfo being written (look for "tarfile" in addfile() funcion in my patch to see why). How do you propose this object should be accessed from write()? 

> BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.

But it has multivolume support right? Which is what I am proposing here. Also, you can gzip (or encrypt or anything) the volumes after creating the volumes..

>> [...] because a multivol tarfile is not exactly the same as a normal tarfile chopped up.
> No, I think it is exactly that. The only purpose of the GNUTYPE_MULTIVOL header that is at the start of each subsequent volume is to give GNU tar the ability to detect if it is reading the correct volume. It is not essential and could as well be left out.

I'm not going to discuss this, because I think that we can implement it in the way you propose to implement it anyway. But my patch supports it and I think it's an *useful* feature, so I want it in :-P
msg216015 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-04-13 12:11
> [...] but remember, we split a volume only in the middle of a big file, not in any other case (AFAIK). Hopefully you don't get huge pax headers or anything strange. [...] 

Hopefully? Sorry, but have you tested this? I did. I let GNU tar create a two volume archive that is split exactly between the two blocks of an XHDTYPE pax header.

The result is terrifying. At the beginning of the second volume GNU tar creates an XGLTYPE header as the pax replacement for a GNUTYPE_MULTIVOL header, followed by an XHDTYPE header ("GNUFileParts") that somehow decorates the following REGTYPE(!) tar header that contains the continuation of the split XHDTYPE header data from the previous volume. After that comes the REGTYPE file that the split XHDTYPE header was actually meant for as decoration.

I attached the archive to this issue.

What happens if a GNUTYPE_LONGNAME header is split in two? I don't wanna know...


> write() will need to take into account blocks (BLOCKSIZE), just to be able to split the volumes correctly.

It is mandatory to do the split on a block boundary (a multiple of 512).


> * multivolume logic in write() needs read/write access to the current tarinfo being written [...]. How do you propose this object should be accessed from write()?

I don't know and this problem seems to be quite hard to address with my approach. That's too bad.


> > BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.
> But it has multivolume support right? Which is what I am proposing here. Also, you can gzip (or encrypt or anything) the volumes after creating the volumes..

Yeah, it has multivolume support, but a very limited one that is not only weird but isn't even usable together with compression. And sure, I can compress and encrypt the volumes afterwards, but I can also create a compressed archive and pipe it through split(1) to split it into parts. Both ways create tar archives that are not readable by GNU tar because they're non-standard. So what?

Please tell me, what is your actual personal use-case for this feature?
msg216036 - (view) Author: Eduardo Robles Elvira (edulix) * Date: 2014-04-13 21:00
>> [...] but remember, we split a volume only in the middle of a big file, not in any other case (AFAIK). Hopefully you don't get huge pax headers or anything strange. [...] 
> Hopefully? Sorry, but have you tested this? I did. I let GNU tar create a two volume archive that is split exactly between the two blocks of an XHDTYPE pax header.
>
> The result is terrifying. At the beginning of the second volume GNU tar creates an XGLTYPE header as the pax replacement for a GNUTYPE_MULTIVOL header, followed by an XHDTYPE header ("GNUFileParts") that somehow decorates the following REGTYPE(!) tar header that contains the continuation of the split XHDTYPE header data from the previous volume. After that comes the REGTYPE file that the split XHDTYPE header was actually meant for as decoration.
>
> I attached the archive to this issue.
>
> What happens if a GNUTYPE_LONGNAME header is split in two? I don't wanna know...
>
>> write() will need to take into account blocks (BLOCKSIZE), just to be able to split the volumes correctly.
>
> It is mandatory to do the split on a block boundary (a multiple of 512).
>>> BTW, my version of GNU tar refuses to create compressed multiple-volume archives which is why I doubt the usefulness of this feature overall.
>> But it has multivolume support right? Which is what I am proposing here. Also, you can gzip (or encrypt or anything) the volumes after creating the volumes..
>
> Yeah, it has multivolume support, but a very limited one that is not only weird but isn't even usable together with compression. And sure, I can compress and encrypt the volumes afterward, but I can also create a compressed archive and pipe it through split(1) to split it into parts. Both ways create tar archives that are not readable by GNU tar because they're non-standard. So what?
>
> Please tell me, what is your actual personal use-case for this feature?

I'm willing modify the patch to remove the "weirdness" you refer to. I differ on that it's not usable: it might not be useful to you, but it's certainly a feature that covers part of the functionality of GNU tar. Actually, some of the unit tests are like this: use GNU Tar to compress, then extract with tarfile - and viceversa.

Of course you can use split. And I could use Ruby or Perl, but I'm using python and tarfile, and this is a GNU tar feature that is just not supported in python tarfile upstream, and I'm just trying to contribute this feature, if possible :-).

BTW, If I create a multivol tar file and then compress the volumes, that does not make it "non-standard", in the same way that if I create a PNG file and then compress it and then store it in EXTFS, it doesn't make it non-standard. I'm just using multiple layers of standards.

I'm a contractor, and I have been asked by a client to develop a python-based backup tool. The client is technical and had already an idea of what he wanted to do: use python-tarfile and add support to multivolume and some other goodies, and the client also wanted to try to push the changes upstream as we believe it is the correct thing to do.

BTW, when we designed the backup tool, we ruled out the possibility of using split because split wouldn't allow to correctly list all the files in each file-slice separately. We wanted to be able to recover all the files of each "volume" so that if we lose other volumes, we can still recover all the data from the volumes we have. 

Anyway, if you are the maintainer of tarfile and you think it's not possible to push tar-multivolume support upstream in python tarfile for whatever reason, please tell me.
msg216073 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-04-14 11:20
Okay, let me tell you why I reject your contribution at this point.

The patch you submitted may be well-suited for your purposes but it does not meet the requirements of a standard library implementation because it is not generic and comprehensive enough.

It contains duplicate code, spelling mistakes and needless code changes e.g.  in test_tarfile.py.

It does not expose one set of volumes as one tar archive to the user. It is not possible to iterate over all members of all volumes in one go. It does not allow random-access.

Actually, it does not implement complete multivolume support but only the "easy" parts.  For example, it fails to read GNU tar archives that are split in the middle of a pax header block sequence. The other way around, when writing it makes a split only when it is inside the data part of a member. Hence, it is possible that a volume turns out smaller than max_volume_size which is not only inaccurate but also bad on a tape device.

If you decide that you still want multivolume support in tarfile, feel free to reopen this issue with a new and significantly better patch. I gave you a number of clues on what I think is required.
History
Date User Action Args
2014-04-14 11:20:09lars.gustaebelsetstatus: open -> closed
assignee: lars.gustaebel
resolution: rejected
messages: + msg216073
2014-04-13 21:00:16edulixsetmessages: + msg216036
2014-04-13 12:11:17lars.gustaebelsetfiles: + split-xhdtype.tar.gz

messages: + msg216015
2014-04-13 09:27:21edulixsetmessages: + msg216007
2014-03-12 10:31:23lars.gustaebelsetmessages: + msg213243
2014-03-11 12:11:16edulixsetmessages: + msg213141
2014-02-02 15:53:04lars.gustaebelsetmessages: + msg209997
2014-01-30 11:57:45lars.gustaebelsetmessages: + msg209711
2014-01-29 12:15:46edulixsetmessages: + msg209635
2014-01-29 11:26:40lars.gustaebelsetmessages: + msg209633
2014-01-27 22:20:55berker.peksagsetversions: + Python 3.5, - Python 3.4
2014-01-27 11:54:01loewissetmessages: + msg209437
2014-01-27 08:50:31edulixsetmessages: + msg209418
2013-10-22 10:53:34christian.heimessetnosy: + christian.heimes
messages: + msg200914
2013-10-22 10:47:30edulixsetmessages: + msg200913
2013-07-28 18:40:36serhiy.storchakalinkissue18575 dependencies
2013-07-28 08:26:25edulixsetmessages: + msg193811
2013-06-28 15:30:32serhiy.storchakasetnosy: + serhiy.storchaka
2013-06-28 13:19:59loewissetnosy: + loewis
messages: + msg191993
2013-06-28 09:21:17serhiy.storchakasetnosy: + lars.gustaebel

type: enhancement
stage: patch review
2013-06-28 08:06:17edulixcreate