classification
Title: remove/delete method for zipfile/tarfile objects
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Arthur.Darcet, Dave Sawyer, christian.heimes, chroipahtz, denfromufa, eric.araujo, gambl, lars.gustaebel, loewis, rhettinger, rossmclendon, sandro.tosi, serhiy.storchaka, terry.reedy, ubershmekel, victorlee129
Priority: normal
Keywords: patch

Created on 2009-09-02 04:42 by rossmclendon, last changed 2016-05-25 16:52 by denfromufa.

Files
File name | Uploaded | Description
delete.tar.gz | victorlee129, 2009-10-26 07:15 | Includes 2 files; the one named delete.py is the main file.
zipfile.remove.patch | ubershmekel, 2011-03-09 20:54 | bugs, docs and tests
zipfile.remove.2.patch | ubershmekel, 2011-03-15 01:08 | improved patch for zipfile.remove
tricky.zip | serhiy.storchaka, 2015-02-25 10:57 |
mywork.patch | Dave Sawyer, 2015-04-15 23:54 |
zipfile_filter.patch | Dave Sawyer, 2015-04-16 02:15 | renamed, fixed comment, removed some debugging code
Repositories containing patches
https://hg.python.org/cpython
Messages (33)
msg92154 - (view) Author: Ross (rossmclendon) Date: 2009-09-02 04:42
It would be most helpful if a method could be included in the TarFile
class of the tarfile module and the ZipFile class of the zipfile module
that would remove a particular file (either given by a name or a
TarInfo/ZipInfo object) from the archive.

Usage to remove a single file from an archive would be as follows:

import zipfile
zipFileObject = zipfile.ZipFile(archiveName,'a')
zipFileObject.remove(fileToRemove)
zipFileObject.close()

Such a method should probably only apply to archives that are in append
mode as write mode would erase the original archive and read mode should
render the archive immutable.

One possible extra to be included is to allow a list of file names or
ZipInfo/TarInfo objects to be passed into the remove method, such that
all items in the list would be removed from the archive.
msg92155 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-09-02 05:20
+1
msg92156 - (view) Author: Ross (rossmclendon) Date: 2009-09-02 05:42
Slight change to:

"Such a method should probably only apply to archives that are in append
mode as write mode would erase the original archive and read mode should
render the archive immutable."

The method should probably still apply to an archive in write mode.  It
is conceivable that one may need to delete a file from the archive after
it has been written but before the archive object has been closed.
msg92158 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-09-02 07:00
-1. I don't think this can be implemented in a reasonable way, assuming
that you want the file to become smaller as a consequence of removal.
msg92164 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2009-09-02 12:11
-1, although I can only speak for tarfile. Removing members from a tar
archive sounds obvious and easy but it is *not*. A file in an archive is
stored as a header block (that contains the metadata) followed by a
number of data blocks (that contain the file's data). New files are
simply appended to the archive file. There is no central table of
contents whatsoever. To make things worse, a compressed archive is
compressed in one go from the beginning right up to the end, it is not
possible to access a member in the middle of an archive without having
to decompress all data before it.
Deleting files from an uncompressed archive is rather straightforward
implementation-wise but IO intensive and risky. In contrast, there is no
other way to delete files from a *compressed* tarfile than to make a
copy of it omitting the unwanted files.
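
A minimal sketch of the copy-and-omit approach described above, using only the documented tarfile API. The function name `copy_tar_without` and its parameters are hypothetical, and the sketch assumes member names are matched exactly; it does not handle the case where a hard link points at a member that was skipped.

```python
import tarfile


def copy_tar_without(src_path, dst_path, names_to_skip, dst_mode="w:gz"):
    """Write a new archive at dst_path that omits the named members."""
    skip = set(names_to_skip)
    # "r:*" lets tarfile detect the compression of the source transparently.
    with tarfile.open(src_path, "r:*") as src, tarfile.open(dst_path, dst_mode) as dst:
        for member in src:
            if member.name in skip:
                continue
            # Only regular files carry data blocks; directories, links and
            # devices are copied as header-only entries.
            fileobj = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, fileobj)
```

The source archive is read sequentially once, so the cost is one full decompression and one full recompression regardless of how small the removed member is.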
msg92172 - (view) Author: Ross (rossmclendon) Date: 2009-09-02 16:40
In light of Lars's comment on the matter, perhaps this functionality
could be added to zip files only.  Surely it can be done, considering
that numerous utilities and even Windows Explorer provide such
functionality.  I must confess that I am unfamiliar with the inner
workings of file archives and compression, but seeing as it is
implemented in a number of places already, it seems logical that it
could be implemented in ZipFile as well.  I'll spend some time the next
few days educating myself about zip files and how this might be
accomplished.
msg92177 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-09-02 18:54
> In light of Lars's comment on the matter, perhaps this functionality
> could be added to zip files only.  Surely it can be done, considering
> that numerous utilities and even Windows Explorer provide such
> functionality.

Are you sure they are not creating a new file in order to delete
content? I recall that early zip tools (e.g. pkzip) had a mode
where they would merely delete the entry from the directory, but
leave the actual data in the file. Would you consider that a correct
implementation? If so, *that* can be done, for zipfiles, AFAIU.

> I'll spend some time the next
> few days educating myself about zip files and how this might be
> accomplished.

Please do - you'll find that deletion from zipfiles comes in a can
full of worms.
msg94475 - (view) Author: victorlee129 (victorlee129) Date: 2009-10-26 07:15
I did it in a very *violent* way.
Is it OK for you, though?
If so, would anybody please add it to the lib?
msg94478 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-10-26 07:58
> I did it in a very *violent* way.
> Is it OK for you, though?

In the form in which you have done it, it is clearly
unacceptable for inclusion in the library: we don't
want to add two modules "delete" and "classtools".

In addition, notice that code is for tarfile, whereas
the OP was asking for a similar feature for zipfile.

> If so, would anybody please add it to the lib?

This is not how this works. If you want us to take
action, please submit a complete and correct patch.
msg109158 - (view) Author: Troy Potts (chroipahtz) Date: 2010-07-03 05:26
I have attempted to implement a ZipFile.remove function.  It seems to work fine.  I have submitted a patch.

The method of implementation is: find the file's index in the file list, then sum the lengths of the file entries before it to find its location in the archive.  Then simply read in all the bytes after it, write them out at that location, and truncate the file x bytes shorter, where x is the length of the record.  This works because the directory listing is created when the file is closed, so there's no harm in truncating.

I've also made it truncate the zip file after reading in the existing files upon creation, because the directory information is not used after this point.

This could use some testing on large files.

This is my first patch, so let me know if I've done anything wrong.
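
As a cross-check on the "find its location in the archive" step, here is a minimal sketch that locates one member's record using the documented `ZipInfo.header_offset` and `ZipInfo.compress_size` attributes rather than by summing earlier entries. The helper name `member_span` is hypothetical. The sketch assumes the member has no trailing data descriptor (general-purpose flag bit 3) and no encryption header, and it is not the method used in the attached patch.

```python
import struct
import zipfile

# Fixed part of a ZIP local file header: signature, five 16-bit fields,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")  # 30 bytes


def member_span(archive_path, name):
    """Return (start, end) byte offsets of the member's local record."""
    with zipfile.ZipFile(archive_path) as zf:
        info = zf.getinfo(name)
    with open(archive_path, "rb") as fp:
        fp.seek(info.header_offset)
        fields = LOCAL_HEADER.unpack(fp.read(LOCAL_HEADER.size))
    if fields[0] != b"PK\x03\x04":
        raise zipfile.BadZipFile("bad local file header signature")
    name_len, extra_len = fields[9], fields[10]
    start = info.header_offset
    # Assumption: no data descriptor follows the compressed data.
    end = start + LOCAL_HEADER.size + name_len + extra_len + info.compress_size
    return start, end
```

Reading the offset from the central directory avoids assuming that members are stored in the same order as they are listed.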
msg109160 - (view) Author: Troy Potts (chroipahtz) Date: 2010-07-03 05:47
My patch had some bugs, I'll need to do some more testing.  Sorry about that.
msg130299 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2011-03-08 00:10
What's the status with this patch? If nobody's looking at it I can try to see if it works and write the test and documentation for it.
msg130388 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-03-09 00:21
Please feel free to test, revise, and write. Though 'removed', the file is still accessible via the history list. (Click 'zipfile_remove.patch' and then 'download'.)
msg130463 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2011-03-09 20:54
I fixed the bugs I found, added tests and documentation. What do you guys think?
msg130938 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2011-03-15 01:08
Fixed the bugs Martin pointed out and added the relevant tests. Sadly I had to move some stuff around, but I think the changes are all for the better. I wasn't sure about the right convention for the 2 constants I added btw.
msg140680 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-07-19 16:07
Martin did a review of the newer patch; maybe you didn’t get the mail (there’s a Rietveld bug when a user name without email is given to the Cc field).
msg159418 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-04-26 19:44
I'm not sure I understand how http://bugs.python.org/review/6818/show works. I've looked all over and only found remarks for "zipfile.remove.patch" and not for "zipfile.remove.2.patch" which addressed all the aforementioned issues.

Also, I don't understand how to add myself to the CC of this issue's review page.
msg160533 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 17:34
Yuval, can you please submit a contributor agreement? See

http://www.python.org/psf/contrib/
msg160534 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 17:36
As for adding yourself to the CC list: notice the string "ubershmekel" appearing in the "CC" field of http://bugs.python.org/review/6818/show. It means that you are already on the CC list.
msg192574 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-07-07 16:11
Yuval has submitted a CLA. I'm moving the proposal to 3.4 as 3.3 is in feature freeze mode.
msg229801 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2014-10-22 07:00
Ping. Has this been postponed?
msg229893 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-23 19:17
I agree with Martin and Lars: this issue is not as easy as it looks at first glance.

For ZIP files we should distinguish two different operations.

1. Remove the entry from the central directory (and maybe mark the local file header as invalid, if that is possible). This is easy, fast and safe, but it doesn't change the size of the ZIP file.

2. Physically remove the content of the file from the ZIP file. This is as easy as removing a line from a text file: everything after it has to be rewritten. In the worst case its cost is linear in the size of the ZIP file.

2a. The safer way is to create a temporary file in the same directory, copy the content of the original ZIP file excluding the deleted file, and then replace the original ZIP file with the modified copy. Be aware of file and parent directory permissions, owners, and disk space.

2b. The faster but less safe way is to "shift" the content of the ZIP file that follows the deleted file, by reading it and writing it back into the same ZIP file at a different position. This is not safe because if something goes wrong while writing, we can lose all the data. And of course there are crafty ZIP files in which the order of the files doesn't match the order in the central directory, or in which file data even overlaps.

For performance, maybe we should implement (2) not as a method that removes a single file, but as a method that takes a filter function and then leaves in the ZIP file only the files for which it returns true.

Or maybe implement (1) and then add a method that cleans up the ZIP archive by physically removing all files that were removed from the central directory. We should discuss the alternatives.

As for the concrete patch, zipfile.remove.2.patch can read the content of the whole ZIP file into memory. This is not appropriate, because a ZIP file can be very large.
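
A minimal sketch of option 2a combined with the filter-function idea, written against the public zipfile API only. The name `filter_zip` and its `keep` callback are hypothetical. Unlike a raw copy of the compressed bytes, this sketch decompresses and recompresses every kept member and holds one member in memory at a time; the temporary file is also created with the default `mkstemp` permissions, not those of the original.

```python
import os
import tempfile
import zipfile


def filter_zip(path, keep):
    """Rewrite the archive at path, keeping members for which keep(info) is true."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the original, so os.replace() stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(path) as src, zipfile.ZipFile(tmp_path, "w") as dst:
            dst.comment = src.comment
            for info in src.infolist():
                if keep(info):
                    # The data is read before writestr() updates the ZipInfo
                    # object with its position in the new archive.
                    dst.writestr(info, src.read(info))
        # The original is replaced only after the copy was written completely.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

For example, `filter_zip("archive.zip", lambda info: info.filename != "secret.txt")` would drop a single member. Data stored outside the listed members (before the first member or between members) is not carried over by this sketch.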
msg236565 - (view) Author: Dave Sawyer (Dave Sawyer) * Date: 2015-02-25 09:23
I'd be interested in taking up the zip portion at Pycon 2015 this year. I recently had need to delete file(s) from a zipfile.

To do it today with the existing API requires you to unpack the zip and repack it. The unpack is slow and you need enough free disk space for the uncompressed files.

My strategy is essentially what msg229893 2a said: copy binary blobs to a tempfile, then overwrite the original when complete. I would use a name filter function to decide what to delete and an optional parameter for the temp file (falling back to tempfile.tempfile if None). IIRC, this is the same strategy used in the dotNet zip library.
msg236568 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-25 10:57
I think it would be better if the new method copied the filtered content to another ZIP file. You do not always want to modify the original file.

The question is what to do with data before the start of the ZIP file or between files in the ZIP file, or with files in the ZIP file that overlap. Here is a sample of such a file.
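
A minimal sketch of how data before the first listed member could be detected, assuming that the smallest `ZipInfo.header_offset` marks where the listed members begin. The helper name `leading_bytes` is hypothetical. It only reports the size of the leading region (for example a self-extractor stub); it does not detect gaps between members or overlapping members.

```python
import zipfile


def leading_bytes(path):
    """Return the number of bytes stored before the first listed member.

    Returns None for an archive with no members.
    """
    with zipfile.ZipFile(path) as zf:
        offsets = [info.header_offset for info in zf.infolist()]
    return min(offsets) if offsets else None
```

A copy or filter method would have to decide whether to preserve such a region, which is the open question raised in the message above.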
msg240336 - (view) Author: Matthew Gamble (gambl) * Date: 2015-04-09 15:21
Hi,

I've recently been working on a Python module for the Adobe universal container format (UCF) which extends the zip specification - as part of this I wanted to be able to remove and rename files in an archive.

I discovered this issue when writing the module so realised there wasn't currently a solution - so I went down the rabbit hole.

I've attached a patch which supports the removal and renaming of files in a zip archive. You can also look at this Python module in a git repo, which is the same code but separated out into a class that extends ZipFile: https://github.com/gambl/zipextended.

The patch provides 4 main new "public" functions for the zipfile library:

- remove(self, zinfo_or_arcname):
- rename(self, zinfo_or_arcname, filename):
- commit(self):
- clone(self, file, filenames_or_infolist=None, ignore_hidden_files=False)

The patch is in part modelled on the rubyzip solution. Remove and rename will initially only update the ZipFile's infolist. Changes are then persisted via a commit function which can be called manually - or will be called automatically upon close. Commit will then clone the zipfile with the necessary changes to a temporary file and replace the original file when that operation has completed successfully.

An alternative to remove files without modifying the original is via the clone method directly. This is in the spirit of Serhiy's suggestion of filtering the content and not modifying the original. You can pass a list of filenames or fileinfos of the files to be included in the clone.
So that clone can be performed without decompressing and then recompressing the files in the archive I have added two functions write_compressed and read_compressed.

I have also attempted to address Serhiy's concern with respect to the tricky.zip - "hidden files" in between members of the archive. The clone method will by default retain any hidden files and maintain the same relative order in the archive. You can also elect to ignore the hidden files, and clone with just the files listed in the central directory.

I did have to modify the tricky.zip attached to this issue manually as the CRC of file two (with file three embedded) was incorrect - and would therefore fail testzip(). I'm not actually sure how one would create such an archive - but I think that it's valid according to the zip spec. I've actually included the modified version in the patch for a few of the tests.

I appreciate that this is a large-ish patch and may take some time to review - but as suggested in the comments - this wasn't as straightforward as it seems!

Look forward to your comments. 

The signatures of the main functions are described below:

remove(self, zinfo_or_arcname):

    Remove a member from the archive.

    Args:
      zinfo_or_arcname (ZipInfo, str) ZipInfo object or filename of the
        member.

    Raises:
      RuntimeError: If attempting to modify a Zip archive that is closed.
---

rename(self, zinfo_or_arcname, filename):

    Rename a member in the archive.

    Args:
      zinfo_or_arcname (ZipInfo, str): ZipInfo object or filename of the
        member.
      filename (str): the new name for the member.

    Raises:
      RuntimeError: If attempting to modify a Zip archive that is closed.


clone(self, file, filenames_or_infolist=None, ignore_hidden_files=False):

    Clone the zip file using the given file (filename or filepointer).

    Args:
      file (File, str): file-like object or filename of file to write the
        new zip file to.
      filenames_or_infolist (list(str), list(ZipInfo), optional): list of
        members from this zip file to include in the new zip file.
      ignore_hidden_files (boolean): flag to indicate whether hidden files
        (data in between managed members of the archive) should be ignored.

    Returns:
        A new ZipFile object of the cloned zipfile open in append mode.

        If copying hidden files then clone will attempt to maintain the
        relative order between the files and members in the archive

commit(self):
     Commit any inline modifications (removal and rename) to the zip archive.

     This makes use of a temporary file to create a new zip archive with the
     required modifications and then replaces the original.

     This therefore requires write access to either the directory where the
     original zipfile lives, or to python's default tempfile location.
msg240345 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-04-09 15:49
All of you who have submitted or might submit patches -- Victorlee, Troy, Matthew, or anyone else, please sign a PSF contributor agreement.  We should not even look at a patch from you before you do.

Info: https://www.python.org/psf/contrib/
Form: https://www.python.org/psf/contrib/contrib-form/

Receipt is official when a * appears after your name.  This usually takes about a week.
msg240394 - (view) Author: Matthew Gamble (gambl) * Date: 2015-04-09 21:38
Thanks Terry - apologies - I meant to sign before I submitted the patch. I have signed the CLA now.
msg240422 - (view) Author: Matthew Gamble (gambl) * Date: 2015-04-10 13:52
Hi all,

Apologies again, I've had to pull the patch just temporarily whilst I check with my employer (The University of Manchester) that everything is OK with me contributing this. 

Everything should be OK - but I just wanted to do things correctly. 

I'll re-submit the patch for review when I get the OK.

- Matthew
msg241181 - (view) Author: Dave Sawyer (Dave Sawyer) * Date: 2015-04-15 23:54
The zipfile way to delete or rename would be to just change the index. It really doesn't want to be rewritten, as it is designed to span disks. Many old versions of files can be scattered within the zip. In addition, self-extracting zip files will have executable code at the front of the zip. To make a zip smaller, it's desirable to have a filter function. Points were brought up about zips with overlapping data blocks (and you could envision saving space by having many identical files share the same compressed block), but this is not legal zip. A file info header block must precede EACH individual file data block. This is brought home by there being only a length field in the info header.
msg241192 - (view) Author: Dave Sawyer (Dave Sawyer) * Date: 2015-04-16 02:15
I can add some more tests to bring up the coverage, but wanted to get reviewer opinion on the direction of this before doing more work.
msg241201 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2015-04-16 04:56
Around PyCon it might take a little longer than a week. IIRC Ewa does the paperwork. She is also on the organizing committee of PyCon.
msg241346 - (view) Author: Dave Sawyer (Dave Sawyer) * Date: 2015-04-17 17:36
Maybe it takes a little longer than a week. I have a final signed agreement from Ewa (https://secure.echosign.com/public/viewAgreement?aid=X88L4EVP5IXC289&eid=X88M6DGQ93J5K38&)

signed on
04/17/2014 6:48 PM
Wow, exactly one year ago!
msg266374 - (view) Author: Denis Akhiyarov (denfromufa) * Date: 2016-05-25 16:52
has this been merged?
History
Date | User | Action | Args
2016-05-25 16:52:08 | denfromufa | set | nosy: + denfromufa; messages: + msg266374
2015-04-17 17:36:05 | Dave Sawyer | set | messages: + msg241346
2015-04-16 04:56:26 | christian.heimes | set | messages: + msg241201
2015-04-16 02:15:08 | Dave Sawyer | set | files: + zipfile_filter.patch; hgrepos: + hgrepo305; messages: + msg241192
2015-04-15 23:54:39 | Dave Sawyer | set | files: + mywork.patch; messages: + msg241181
2015-04-10 13:52:44 | gambl | set | messages: + msg240422
2015-04-10 13:45:46 | gambl | set | files: - zip_hiddenfiles.zip
2015-04-10 13:45:34 | gambl | set | files: - zipfile.remove_rename.patch
2015-04-09 21:38:31 | gambl | set | messages: + msg240394
2015-04-09 15:49:50 | terry.reedy | set | messages: + msg240345
2015-04-09 15:23:42 | gambl | set | files: + zip_hiddenfiles.zip
2015-04-09 15:21:59 | gambl | set | files: + zipfile.remove_rename.patch; nosy: + gambl; messages: + msg240336
2015-02-25 10:57:18 | serhiy.storchaka | set | files: + tricky.zip; messages: + msg236568
2015-02-25 09:23:36 | Dave Sawyer | set | nosy: + Dave Sawyer; messages: + msg236565
2014-10-23 19:17:10 | serhiy.storchaka | set | messages: + msg229893; stage: patch review ->
2014-10-22 07:45:11 | pitrou | set | nosy: + serhiy.storchaka; versions: + Python 3.5, - Python 3.4
2014-10-22 07:00:46 | ubershmekel | set | messages: + msg229801
2013-07-07 16:11:40 | christian.heimes | set | versions: + Python 3.4, - Python 3.3; nosy: + christian.heimes; messages: + msg192574; stage: patch review
2013-04-11 14:39:21 | Arthur.Darcet | set | nosy: + Arthur.Darcet
2012-05-13 17:36:56 | loewis | set | messages: + msg160534
2012-05-13 17:34:44 | loewis | set | messages: + msg160533
2012-04-26 19:44:56 | ubershmekel | set | messages: + msg159418
2011-07-19 16:07:53 | eric.araujo | set | messages: + msg140680
2011-03-19 05:07:47 | terry.reedy | link | issue11415 superseder
2011-03-15 01:08:09 | ubershmekel | set | files: + zipfile.remove.2.patch; nosy: loewis, rhettinger, terry.reedy, lars.gustaebel, rossmclendon, eric.araujo, ubershmekel, victorlee129, sandro.tosi, chroipahtz; messages: + msg130938
2011-03-09 21:10:18 | eric.araujo | set | nosy: + eric.araujo
2011-03-09 20:54:29 | ubershmekel | set | files: + zipfile.remove.patch; messages: + msg130463; keywords: + patch; nosy: loewis, rhettinger, terry.reedy, lars.gustaebel, rossmclendon, ubershmekel, victorlee129, sandro.tosi, chroipahtz
2011-03-09 00:21:46 | terry.reedy | set | nosy: + terry.reedy; messages: + msg130388
2011-03-08 00:10:01 | ubershmekel | set | nosy: + ubershmekel; messages: + msg130299; versions: + Python 3.3, - Python 3.1
2011-02-02 21:02:50 | sandro.tosi | set | keywords: - patch; nosy: loewis, rhettinger, lars.gustaebel, rossmclendon, victorlee129, sandro.tosi, chroipahtz
2011-02-02 21:02:37 | sandro.tosi | set | nosy: + sandro.tosi
2010-07-03 05:47:17 | chroipahtz | set | messages: + msg109160
2010-07-03 05:46:48 | chroipahtz | set | files: - zipfile_remove.patch
2010-07-03 05:26:41 | chroipahtz | set | files: + zipfile_remove.patch; nosy: + chroipahtz; messages: + msg109158; keywords: + patch
2009-10-26 07:58:57 | loewis | set | messages: + msg94478
2009-10-26 07:15:10 | victorlee129 | set | files: + delete.tar.gz; versions: + Python 3.1, - Python 3.2; nosy: + victorlee129; messages: + msg94475; components: - IO
2009-09-02 18:54:26 | loewis | set | messages: + msg92177
2009-09-02 16:40:04 | rossmclendon | set | messages: + msg92172
2009-09-02 12:11:12 | lars.gustaebel | set | nosy: + lars.gustaebel; messages: + msg92164
2009-09-02 07:00:17 | loewis | set | nosy: + loewis; messages: + msg92158
2009-09-02 05:42:58 | rossmclendon | set | messages: + msg92156
2009-09-02 05:20:39 | rhettinger | set | nosy: + rhettinger; messages: + msg92155
2009-09-02 04:43:08 | rossmclendon | set | components: + IO
2009-09-02 04:42:10 | rossmclendon | create |