classification
Title: zipfile to handle duplicate files in archive
Type: enhancement Stage:
Components: Extension Modules Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BitTorment, Yaniv.Aknin, alanmcintyre, brtzsnr, georg.brandl, iritkatriel, jheiselman, michael.driscoll, obfusk, serhiy.storchaka, techtonik
Priority: normal Keywords: patch

Created on 2008-05-11 20:17 by techtonik, last changed 2021-08-02 22:01 by iritkatriel.

Files
File name Uploaded Description Edit
test2824.py techtonik, 2010-04-06 08:41
low_level_zipfile.diff Yaniv.Aknin, 2010-04-09 14:14 Deprecate the possibility of adding multiple files with the same name to ZipFile
Messages (13)
msg66661 - (view) Author: anatoly techtonik (techtonik) Date: 2008-05-11 20:17
ZipFile allows to add the same file to archive twice. I bet it is not
intended behavior for many users who would like to either replace file
inside of archive or get runtime warning about duplicate file to be
added. http://code.google.com/p/gvn/issues/detail?id=63


from zipfile import ZipFile

zt=ZipFile("ziptest.zip","w")
zt.write("ziptest.py")
zt.write("ziptest.py")
zt.close()
msg66682 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-05-11 21:58
I think a warning would be sensible here. The behavior is certainly not
what I would expect.
msg66728 - (view) Author: anatoly techtonik (techtonik) Date: 2008-05-12 15:03
How about adding optional "replace=True" attribute to the write method?
So that people who are not aware enough to adjust the code and handle
new warning could still get valid archives.
msg66781 - (view) Author: Martin McNickle (BitTorment) Date: 2008-05-13 12:42
The mechanism for throwing an error has been written, however for the
case of a duplicated filename, it appears to have been deliberatly not
been used by the original author.:

 def _writecheck(self, zinfo):
        """Check for errors before writing a file to the archive."""
        if zinfo.filename in self.NameToInfo:
            if self.debug:      # Warning for duplicate names
                print "Duplicate name:", zinfo.filename
        if self.mode not in ("w", "a"):
            raise RuntimeError, 'write() requires mode "w" or "a"'
        ...

Putting a 'replace=True' switch seems a little clumsy, it would be much
better to raise an error, or allow the user a way to globally control
what happens in this case, i.e. overwrite the existing file or drop it.
 Adding a global behaviour switch seems to be the best way to preserve
backwards compatibility.

What do people think is the best way?

-- Martin
msg89468 - (view) Author: Alexandru V. Mosoi (brtzsnr) Date: 2009-06-17 16:21
a similar problem appears when the zip file contains two files as in:
['_test/tree.pl', '_test/'].

for ZipFile.extractall() when _test/tree.pl is extracted _test/ is
created and the the extraction of _test/ fails because OSError: [Errno
17] File exists: '_test/'
msg102439 - (view) Author: anatoly techtonik (techtonik) Date: 2010-04-06 08:41
Still an issue for Python 2.7
msg102578 - (view) Author: Yaniv Aknin (Yaniv.Aknin) Date: 2010-04-08 00:02
This affect 3.x as well.

Regardless of the exact version this will come out in, I think the only proper solution is one with which we eventually, maybe in two versions' time, end up with a behaviour that raises an exception upon double insertion. Also, we should leave some switch to allow double insertions (as someone may /want/ to create a peculiar file). If we agree that's a worthy end goal, let's see what deprecation path we take to get there.

I feel PEP4/PEP5 cover more significant changes to the language/stdlib than this, but in their spirit I propose the following concrete solution: For the nearest possible version let's issue a DeprecationWarning when adding dual files and provide a global switch that will change the DeprecationWarning into raising a relevant exception instead. For the version after that, let's remove the DeprecationWarning warning altogether and choose one of these behaviours:
 (a) a global switch that will raise by default
 (b) a switch in ZipFile's __init__() that will raise by default
 (c) a switch in ZipFile's write() that will raise by default

I vote for (c), but I think the delta between the options is small.

If this is agreeable with the audience, I'll be happy to produce a patch against 2.7 and 3.2 that will provide the chosen behaviour, as well as change tests and documentation accordingly. I realize that time is of the essence if we want this included in 2.7, and I'll be happy to do the work in a timely manner, so if you're interested kindly let me know what you feel while the iron's hot. Either way, I'll be happy to see it in any future version.

Regarding the behaviour brtzsnr observed, I think it's a separate bug. I'll try to reproduce this behaviour, and open a separate ticket if I manage to do so.

blah. Small bug, many words!
msg102600 - (view) Author: anatoly techtonik (techtonik) Date: 2010-04-08 07:51
On the second thought having a switch for a low-level zip format hacking is a good addition. It may also be used for in-place removals from .zip file by erasing directory entry. Otherwise remove operation will require repacking archive into temporary file, which is too high level and vulnerable to disk space/permission matters.

So, I vote for:
(b) a switch in ZipFile's __init__() that will raise by default
and also because you'd need to modify write() and writestr().

class zipfile.ZipFile(file[, mode[, compression[, allowZip64, [low_level]]]]) 

"""Added in version 2.7: If `low_level` compatibility switch is set (False by default), then ZipFile allows low level operations, such as adding two files with the same name without raising an exception. This is for compatibility with previous versions.""" 

BTW, allowZip64 would be nice to be aliased/deprecated in favor of allow_zip64 for consistency with other double word params in module.
msg102721 - (view) Author: Yaniv Aknin (Yaniv.Aknin) Date: 2010-04-09 14:14
Attached is the addition of the 'low_level' parameter to ZipFile. Included are the parameter, a global switch controlling whether the parameter will raise an Exception or trigger a DeprecationWarning (the latter, for now), updated tests and updated documents. I didn't run test_zipfile64, I can't say for sure it works but I have no reason to believe it doesn't.
Changes are against py3k; let me know what you think, I'll backport to trunk.
msg181042 - (view) Author: Michael Driscoll (michael.driscoll) * Date: 2013-01-31 20:08
Whatever became of this? We just stumbled across this bug in our code at work where we accidentally put multiple files of the same name into the zip file and not only does it not overwrite the others, but when you go to access that file name, it grabs the first one that was put in.
msg383102 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-12-15 23:22
This was changed to a warning in issue20262. Is that enough or should it be an exception?
msg385801 - (view) Author: Jerry Heiselman (jheiselman) * Date: 2021-01-27 19:00
This just bit us too. I don't feel like a warning is enough. In our case, we want to prevent it from happening to begin with so that a developer who isn't expecting this doesn't have to know ahead of time to check for a file.

Once a duplicate is in the zip file, it is maintained in the listing of files, but deleting one of the entries using Windows explorer results in both entries being deleted which can (and has) led to a loss of data. So preventing it from happening in the first place seems more prudent. Of course, allow an override with a parameter to .write() perhaps.
msg385802 - (view) Author: Jerry Heiselman (jheiselman) * Date: 2021-01-27 19:03
Further, some tools like zipgrep, seem to iterate over the toc in the zipfile and end up running the grep part once per entry leading to some duplication of data returned without it being obvious that there wasn't actually duplicate data in the zip archive.

Interestingly, the zipfile module does not expose the duplicate toc entries using 'python -m zipfile -l <file.zip>', so it's hiding the issue itself.
History
Date User Action Args
2021-08-02 22:01:33iritkatrielsetversions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.6, Python 2.7, Python 3.2
2021-04-13 11:46:51obfusksetnosy: + obfusk
2021-01-27 19:03:47jheiselmansetmessages: + msg385802
2021-01-27 19:00:31jheiselmansetnosy: + jheiselman
messages: + msg385801
2020-12-15 23:22:12iritkatrielsetnosy: + iritkatriel
messages: + msg383102
2015-05-04 08:50:16wesseljsettype: enhancement
2013-01-31 20:08:47michael.driscollsetnosy: + michael.driscoll
messages: + msg181042
2012-04-07 18:58:59serhiy.storchakasetnosy: + serhiy.storchaka
2010-04-09 14:14:53Yaniv.Akninsetfiles: + low_level_zipfile.diff
keywords: + patch
messages: + msg102721
2010-04-08 07:51:53techtoniksetmessages: + msg102600
2010-04-08 00:02:07Yaniv.Akninsetnosy: + Yaniv.Aknin

messages: + msg102578
versions: + Python 3.2
2010-04-06 08:41:10techtoniksetfiles: + test2824.py

messages: + msg102439
versions: + Python 2.7
2009-06-17 16:21:09brtzsnrsetnosy: + brtzsnr

messages: + msg89468
versions: + Python 2.6, - Python 2.5
2008-05-13 12:42:28BitTormentsetnosy: + BitTorment
messages: + msg66781
2008-05-12 15:04:00techtoniksetmessages: + msg66728
2008-05-11 21:58:29georg.brandlsetnosy: + georg.brandl, alanmcintyre
messages: + msg66682
2008-05-11 20:17:49techtonikcreate