classification
Title: zipfile.py end of central directory detection not robust
Type: behavior Stage: resolved
Components: Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: KevinH, alanmcintyre, amaury.forgeotdarc, bcroq, georg.brandl, haypo, ned.deily, python-dev, r.david.murray, rep, rfk, verm, xuanji
Priority: critical Keywords: patch

Created on 2010-12-13 18:57 by KevinH, last changed 2011-06-09 20:04 by r.david.murray. This issue is now closed.

Files
File name Uploaded Description Edit
more_robust.diff KevinH, 2010-12-14 17:07 one potential patch to address the zipfile.py behaviour
Issue10694.diff xuanji, 2010-12-21 08:12 review
Issue10694_2.diff xuanji, 2010-12-31 05:01 review
reps_patch_and_two_tests.diff xuanji, 2011-02-09 14:55 review
Messages (28)
msg123891 - (view) Author: Kevin Hendricks (KevinH) Date: 2010-12-13 18:57
The current version of zipfile.py is not robust to slight errors at the end of zip archives.  Many file servers **improperly** append a new line to the end of files that do not have a new line when they are uploaded from a browser.  This bug ends up adding 0x0d 0xa to the end of the zip archive.  This in turn makes zipfile.py eventually throw a "Not a zip file" exception when no other zip tools seem to have trouble with them.  Even unzip -t passes these "problem" zip archives with flying colours.

I hate to have to extract and create my own zipfile.py script just to be robust to zip archives that are commonly found on the net and that are handled more robustly by other software.

So please consider changing this code from _EndRecData below to simply ignore any trailing data after the proper stringEndArchive and structEndArchive are found instead of looking for the comment and verifying if the comment is properly formatted and throwing an exception if not correct.  Ignoring the "comment" seems to be more robust in this case as everything needed to unpack the zip archive has been found.


    # Either this is not a ZIP file, or it is a ZIP file with an archive
    # comment.  Search the end of the file for the "end of central directory"
    # record signature. The comment is the last item in the ZIP file and may be
    # up to 64K long.  It is assumed that the "end of central directory" magic
    # number does not appear in the comment.
    maxCommentStart = max(filesize - (1 << 16) - sizeEndCentDir, 0)
    fpin.seek(maxCommentStart, 0)
    data = fpin.read()
    start = data.rfind(stringEndArchive)
    if start >= 0:
        # found the magic number; attempt to unpack and interpret
        recData = data[start:start+sizeEndCentDir]
        endrec = list(struct.unpack(structEndArchive, recData))
        comment = data[start+sizeEndCentDir:]
        # check that comment length is correct
        if endrec[_ECD_COMMENT_SIZE] == len(comment):
            # Append the archive comment and start offset
            endrec.append(comment)
            endrec.append(maxCommentStart + start)
            if endrec[_ECD_OFFSET] == 0xffffffff:
                # There is apparently a "Zip64 end of central directory"
                # structure present, so go look for it
                return _EndRecData64(fpin, start - filesize, endrec)
            return endrec


This will in turn make the Python implementation of zipfile.py more robust to data improperly appended when some zip archives are uploaded or downloaded (similar to how other zip tools handle this issue).

Thank you for your time and consideration.
msg123932 - (view) Author: Xuanji Li (xuanji) Date: 2010-12-14 11:51
Hi, I would like to take a look at whether your code works, can you provide an zip file that is currently not read by zipfile but should be?

Also, I suggest you submit the code changes as a patch (http://www.python.org/dev/patches/)

thanks!
msg123953 - (view) Author: Kevin Hendricks (KevinH) Date: 2010-12-14 15:22
If you read the bug report it explains how to generate a testcase (i.e. append any data to the end of a zip archive)

Here it is as a step by step process 

1. simply take any working zip and call it testcase.zip 

2. do the following:

echo "\r\n" >> testcase.zip 

If you run "unzip -t" on testcase.zip it will pass with flying colors and will properly unzip on every piece of zip software I have tried.

However if you try to use python to copy the zip archive to another zip archive

python ./zipfix.py testcase.zip junk.zip
Error Occurred  File is not a zip file

All because of the appended carriage return / linefeed at the end.
msg123954 - (view) Author: Kevin Hendricks (KevinH) Date: 2010-12-14 15:34
Here is one potential patch.  It simply incorporates and non-comment extraneous data into a final comment so that nothing is lost.  Another solution that might be safer, would be to truncate the zip archive immediately after the endrec is found if the extraneous data is not a properly formatted comment.  The right solution is obviously up to the developers of zipfile.py

--- zipfile_orig.py	2010-12-14 10:23:58.000000000 -0500
+++ zipfile.py	2010-12-14 10:30:21.000000000 -0500
@@ -228,6 +228,13 @@
                 # structure present, so go look for it
                 return _EndRecData64(fpin, start - filesize, endrec)
             return endrec
+        else :
+            # be robust to non-comment extaneous data after endrec
+            # by making it a comment so that nothing is ever lost
+            endrec[_ECD_COMMENT_SIZE] = len(comment)
+            endrec.append(comment)
+            endrec.append(maxCommentStart + start)
+            return endrec
 
     # Unable to find a valid end of central directory structure
     return
msg124291 - (view) Author: Kevin Hendricks (KevinH) Date: 2010-12-18 16:46
Same problem exists in Python 3.1.3 and in later versions as well.

Same patch should also work.
msg124372 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-12-20 00:01
Thanks for the suggested code. As suggested, it would be helpful to supply a complete patch (or patches) ready to be reviewed and applied to the tips of the currently maintained source trees (py3k, 3.1, and 2.7) including adding an appropriate test case to Lib/test/test_zipfile.py.
msg124374 - (view) Author: Kevin Hendricks (KevinH) Date: 2010-12-20 00:21
Final patches against the trees make no sense as no developer has decided which way they want to actually handle the problem. 

My patch is only one way and I note it may not be the way the owners of the code want.

Also, this patch is very straight forward (one hunk) and should apply to 2.6, 2.7, and 3.1 (although I have not tried it with 3.1) with only line offsets.

So if the owner of this code actually looks that the patch and the bug report and makes a decision on how they want to handle this issue and they like the patch I have suggested, then I would be happy to diff it against whatever zipfile.py versions he/she wants.
msg124377 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-12-20 01:14
Not to belabor the point but the most useful part of a ready-to-go patch would be to have a fully automated test case that fits into the existing test_zipfile.py structure and that demonstrates the failure. Somebody has to write a test case one way or another because that's the first step towards getting a fix applied. That can be a big help as there are many more modules in the standard library than there are volunteer core developers to look at problems.
msg124378 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-20 01:22
Actually our normal procedure currently (this will change a bit after the migration to mercurial) is a patch against the py3k branch, and the committer will do the backport to the other active branches.  If the 2.7 code is very different, a separate 2.7 patch is sometimes helpful.  In this case it looks like a patch that applies to py3k is enough.

Alan hasn't spoken up yet, and he's listed as maintainer.  If he doesn't speak up someone else will hopefully make a decision before the end of the beta period for 3.2.  I'm bumping up the priority of this issue because I think not being able to handle commonly produced archives that other zip tools can handle is a relatively serious defect.

My own opinion is that truncation if the extra data doesn't look like a comment makes the most sense.  But absent an opinion from Alan I think I'd be guided by whatever other zip tools do in this case.  Kevin, am I correct in guessing that that is truncation?

Ned is definitely correct that absence of a test case is something that can delay getting a patch applied.  If a patch includes a unit test, a committer can often do a quick review-and-apply, while absent a unit test the committer would first need to write one, and therefore will often move on to some other, easier to process issue :)  In this case it seems to me that the unit test will only differ in one detail depending on the fix chosen, so it may be worth writing it even before a final decision is made, if you are willing.
msg124387 - (view) Author: Kevin Hendricks (KevinH) Date: 2010-12-20 14:06
I have not looked at how other tools handle this.  They could simply ignore what comes after a valid endrecdata is found, they could strip it out (truncate it) or make it into a final comment.  I guess for simply unpacking a zip archive, all of these are equivalent (it goes unused).

But if you are copying a zip archive to another archive then ignoring it and truncating it may be safer in some sense (since you have no idea what this extra data is for and why it might be attached) but then you are not being faithful to the original but at the same time you do not want to create improper zip archives.  If you change the extra data into a final comment, then at least none of the original data is actually lost (just moved slightly in the copied zip and protected as a comment) and could be recovered if it turns out to have been useful.  With so many things using/making the zip archive format (jars, normal zips, epubs, etc) you never know what might have been left over at the end of the zip file and if it was important.

So I am not really sure how to deal with this properly.  Also I know nothing about _EndRecData64 and if it needs to somehow be handled in a different way.

So someone who is very familiar with the code should look at this and tell us what is the right thing to do and even if the approach I took is correct (it works fine for me and I have taken to including my own zipfile.py in my own projects until this gets worked out) but it might not be the right thing to do.

As for a test case, I know nothing about that but will look at test_zipfile.py.  I am a Mac OS X user/developer so all of my code is targeted to running on both Python 2.5 (Mac OS X 10.5.X) and Python 2.6 (Mac OS 10.6.X). Python 3.X and even Python 2.7 are not on my horizon and not even on my build machine (trying to get Mac OS X users to install either would be an experience in frustration). I simply looked at the source in Python 2.7 and Python 3.1.3 (from the official Python releases from python.org) to see that the problem still exists (and it does).
msg124390 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-20 15:52
It's pretty easy, really, to do an SVN checkout of python and compile it on a mac, if you are at all familiar with the unix command line.  If you don't have the time or desire for that, though, someone will eventually get to it, we just don't know when ;)
msg124391 - (view) Author: Kevin Hendricks (KevinH) Date: 2010-12-20 16:24
Been programming on unix/vax and then linux since the mid 80s and on punch cards in the late 70s.  Grew my first beard writing 8080 and Z80 assembler.  All of that was over 30 years ago. 

All I want to do is report a damn bug!

Then I get nudged for a test case (although how to recreate the bug was already described) so I add the two lines to create a test case.  

Then I get nudged for a patch, so I give a patch even though there are many ways to deal with the issue.

Then I get nudged for patches for other branches, then I get nudged for official test_zipfile.py patches.

All of this **before** the damn owner has even bothered to look at it and say if he/she even wants it or if the patch is even correct.

I have my own working code for the epub ebook stuff I am working on so this issue no longer impacts me.

How did I go from a simple bug report to having to build python's latest checkout just to get someone to look at the bug.

You have got to be kidding me!

I am done here,  do what you want with the bug.
msg124396 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-20 18:19
Sorry, I thought I was being clear that if you *wanted* to help further here was how you could, but if you didn't then we'd get to it eventually.  We're all volunteers here, just like you, so every bit of help...helps, and we thank you sincerely for taking the time to report the bug and provide the preliminary patch and test.
msg124418 - (view) Author: Xuanji Li (xuanji) Date: 2010-12-21 08:12
Hi KevinH, sorry if I gave the wrong impression that you must submit a patch before anyone will look at the bug. I thought you wanted to provide a patch for this. If you only want to report a bug that is ok, thanks for that.

I have confirmed your bug on the py3k branch. Attached is a patch that adds a test case to test_zipfile.py (as requested by others) and ports Kevin's patch to py3k. I have confirmed that his patch works: without it the test fails, and with it the test passes. Note: You have to create a file "README.zip" in the Lib/test directory, because svn diff cannot display added files.
msg124419 - (view) Author: Xuanji Li (xuanji) Date: 2010-12-21 08:14
Sorry, forgot to add: README.zip should have a \r\n at the end. Actually I just zipped up README, ran echo "\r\n" >> README.zip and put it in Lib/test.
msg124449 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-12-21 18:54
"High" is not near high enough to get noticed before the release :)
msg124465 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-21 22:44
Xuanji: thanks for taking a crack at the test.

Rather than adding another data file to the test directory, how about creating a zipfile using the zipfile module in the test, closing it, opening it as a file, writing the /r/n to it, and then opening it back up for read as a zipfile for the test.  That's more parallel to how the other zipfile tests work.

Given what we know I'd be inclined to accept Kevin's suggested fix, unless Alan speaks up against it.  In that case the test may as well check that the comment is what we expect, as well.
msg124953 - (view) Author: Xuanji Li (xuanji) Date: 2010-12-31 05:01
Ok, new patch that creates a zipfile (actually I used TESTFN2, all the other tests seem to also use it)
msg124977 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-31 20:56
I finally got around to researching this issue in the tracker.

Issue 10298 is a close relative to this issue.  The fix from that issue make the test that Xuanji added here pass.  That issue contains no tests....it would be ideal to have tests that test the behavior in the face of actual comments in the zipfile, but even if all we have is Xuanji's test IMO we should apply one of the two fixes.

The 10298 patch takes the approach of ignoring the excess data but preserving the comment if any.  The author implies that that is what other tools do, so in the absence of input from Alan or other zipfile experts that's probably what we should go with.

Rep, could you look over this issue and indicate if you agree?

Note also issue 9239, which fixes one way that zipfile could create a zipfile with garbage at the end.

Then there is issue 1757072, where we hear some of Alan's thinking about this: a "non-strict" mode...but it is perhaps too late for a feature request, and there is the fact that ignoring the trailing data appears to be a de-facto standard.

And then we have issue 1757072, which is identical to this one and was closed won't fix, but apparently only because the source of the corrupted zip files wasn't identified, which this issue does do.

Interestingly, issue 669036 claims that zipfile.py supports garbage at the start, which makes tolerating garbage at the end seem sensibly symmetric.

Finally, comment support was added by the patch in issue 611760.  It would be interesting to know if garbage at the end was supported before that patch.  My guess is that it was, by ignoring it, but I haven't tested it.

Summary: if someone can review the actual patch, I think we should apply the issue 10298 patch along with Xuanji's test.  Xuanji, if you have the time and desire to add some additional tests that test comments with trailing data, that would be a bonus.  You could look at the tests in issue 9239 for ideas.
msg126338 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-01-15 16:39
Not important enough to block release.
msg128216 - (view) Author: Xuanji Li (xuanji) Date: 2011-02-09 14:55
I've attached a patch copying rep's patch from issue 10298, as i think it is cleaner (but it seems to do the same thing as kevin's patch) together with 2 test cases. 

David Murray: I'm not really sure what you mean by "test comments with trailing data"; as I understand, the way to fix the problem is simply to remove the excessively strict check that the length the zip file claims to have is equal to the actual amount of comments found by truncating comments to the claimed size; if that's the case my (new) test should be enough to exercise that logic. Please correct me if I misunderstand.
msg128218 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-02-09 16:11
Xuanji: yes, your test_ignores_stuff_appended_past_comments is exactly what I was asking for.  I've put this patch on my review list, but I may not get to it until after 3.2 final.
msg128276 - (view) Author: (rep) Date: 2011-02-10 10:24
Xuanji: Patch looks good now with the test cases.

Regarding the other fix suggested by Kevin I think that we should not introduce custom behaviour here like appending the eofdata to the comment even though the comment size in the record does not say so. In the end we want to make sure that correct zip files end up being processed correctly with the specified comment size and incorrect zip files with eofdata should not be rejected / throw exceptions.

So I agree with Xuanji that his patch (or mine) is the way to go.

You probably agree that we can mark issue 10298 to be superseded by this one here.
msg131634 - (view) Author: Xuanji Li (xuanji) Date: 2011-03-21 11:34
Hi, can someone look at the patch? There doesn't seem to be any objections to it and it'll solve a long-standing issue.

Thanks!
msg131754 - (view) Author: Bertrand Croq (bcroq) Date: 2011-03-22 16:40
The last patch (manually applied to Python 2.6) fixed the problem with a ZIP file used in my project.
msg137729 - (view) Author: Amar Takhar (verm) Date: 2011-06-06 03:53
This patch fixed every BadZipfile exception (~300) I got while testing over 50,000 zipfiles. (Applied to 2.7)

It would be nice to see this bug resolved.
msg138025 - (view) Author: Roundup Robot (python-dev) Date: 2011-06-09 20:02
New changeset 7530b141c20f by R David Murray in branch '3.2':
#10694: zipfile now ignores garbage at the end of a zipfile.
http://hg.python.org/cpython/rev/7530b141c20f

New changeset 2e49722c7263 by R David Murray in branch 'default':
Merge #10694: zipfile now ignores garbage at the end of a zipfile.
http://hg.python.org/cpython/rev/2e49722c7263

New changeset cc3255a707c7 by R David Murray in branch '2.7':
#10694: zipfile now ignores garbage at the end of a zipfile.
http://hg.python.org/cpython/rev/cc3255a707c7
msg138026 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-09 20:04
While I don't understand the zip file structure enough to do a definitive review of the patch, I'm willing to take processing 50K zip files successfully as a definitive test :)
History
Date User Action Args
2011-06-09 20:04:10r.david.murraysetstatus: open -> closed
versions: - Python 3.1
messages: + msg138026

resolution: fixed
stage: patch review -> resolved
2011-06-09 20:02:07python-devsetnosy: + python-dev
messages: + msg138025
2011-06-06 03:53:24vermsetnosy: + verm
messages: + msg137729
2011-03-22 16:54:50pitrousetnosy: + amaury.forgeotdarc

versions: + Python 3.3
2011-03-22 16:40:17bcroqsetnosy: georg.brandl, alanmcintyre, haypo, ned.deily, r.david.murray, rfk, bcroq, rep, xuanji, KevinH
messages: + msg131754
2011-03-22 16:23:18bcroqsetnosy: + bcroq
2011-03-21 11:47:23hayposetnosy: + haypo
2011-03-21 11:34:21xuanjisetnosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, rfk, rep, xuanji, KevinH
messages: + msg131634
2011-02-10 10:24:37repsetnosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, rfk, rep, xuanji, KevinH
messages: + msg128276
2011-02-09 16:11:40r.david.murraysetnosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, rfk, rep, xuanji, KevinH
messages: + msg128218
2011-02-09 14:55:18xuanjisetfiles: + reps_patch_and_two_tests.diff
nosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, rfk, rep, xuanji, KevinH
messages: + msg128216
2011-01-15 16:39:37georg.brandlsetpriority: deferred blocker -> critical
nosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, rfk, rep, xuanji, KevinH
messages: + msg126338
2011-01-14 20:01:48pitrousetnosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, rfk, rep, xuanji, KevinH
stage: test needed -> patch review
2010-12-31 20:56:07r.david.murraysetnosy: + rfk, rep
messages: + msg124977
2010-12-31 05:01:11xuanjisetfiles: + Issue10694_2.diff
nosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124953
2010-12-21 22:44:59r.david.murraysetnosy: georg.brandl, alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124465
2010-12-21 18:54:51georg.brandlsetpriority: high -> deferred blocker
nosy: + georg.brandl
messages: + msg124449

2010-12-21 08:14:14xuanjisetnosy: alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124419
2010-12-21 08:12:16xuanjisetfiles: + Issue10694.diff
nosy: alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124418
2010-12-20 18:19:18r.david.murraysetnosy: alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124396
2010-12-20 16:24:21KevinHsetnosy: alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124391
2010-12-20 15:52:13r.david.murraysetnosy: alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124390
2010-12-20 14:06:33KevinHsetnosy: alanmcintyre, ned.deily, r.david.murray, xuanji, KevinH
messages: + msg124387
2010-12-20 01:22:09r.david.murraysetpriority: normal -> high

nosy: + r.david.murray
messages: + msg124378

stage: needs patch -> test needed
2010-12-20 01:14:11ned.deilysetnosy: alanmcintyre, ned.deily, xuanji, KevinH
messages: + msg124377
2010-12-20 00:21:00KevinHsetnosy: alanmcintyre, ned.deily, xuanji, KevinH
messages: + msg124374
2010-12-20 00:01:47ned.deilysetversions: - Python 2.6, Python 2.5
nosy: + alanmcintyre, ned.deily

messages: + msg124372

stage: needs patch
2010-12-18 16:46:04KevinHsetnosy: xuanji, KevinH
messages: + msg124291
versions: + Python 3.1, Python 3.2
2010-12-14 17:07:44KevinHsetfiles: + more_robust.diff
keywords: + patch
versions: + Python 2.6, Python 2.5
2010-12-14 15:34:20KevinHsetmessages: + msg123954
2010-12-14 15:22:22KevinHsetmessages: + msg123953
2010-12-14 11:51:32xuanjisetnosy: + xuanji
messages: + msg123932
2010-12-13 18:57:49KevinHcreate