Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipfile.py end of central directory detection not robust #54903

Closed
KevinH mannequin opened this issue Dec 13, 2010 · 29 comments
Closed

zipfile.py end of central directory detection not robust #54903

KevinH mannequin opened this issue Dec 13, 2010 · 29 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@KevinH
Copy link
Mannequin

KevinH mannequin commented Dec 13, 2010

BPO 10694
Nosy @birkenfeld, @amauryfa, @vstinner, @ned-deily, @bitdancer, @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll, @serhiy-storchaka
Files
  • more_robust.diff: one potential patch to address the zipfile.py behaviour
  • Issue10694.diff
  • Issue10694_2.diff
  • reps_patch_and_two_tests.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-06-09.20:04:10.196>
    created_at = <Date 2010-12-13.18:57:49.831>
    labels = ['type-bug']
    title = 'zipfile.py end of central directory detection not robust'
    updated_at = <Date 2018-09-25.19:33:43.596>
    user = 'https://bugs.python.org/KevinH'

    bugs.python.org fields:

    activity = <Date 2018-09-25.19:33:43.596>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-06-09.20:04:10.196>
    closer = 'r.david.murray'
    components = []
    creation = <Date 2010-12-13.18:57:49.831>
    creator = 'KevinH'
    dependencies = []
    files = ['20040', '20127', '20200', '20723']
    hgrepos = []
    issue_num = 10694
    keywords = ['patch']
    message_count = 29.0
    messages = ['123891', '123932', '123953', '123954', '124291', '124372', '124374', '124377', '124378', '124387', '124390', '124391', '124396', '124418', '124419', '124449', '124465', '124953', '124977', '126338', '128216', '128218', '128276', '131634', '131754', '137729', '138025', '138026', '326400']
    nosy_count = 14.0
    nosy_names = ['georg.brandl', 'amaury.forgeotdarc', 'alanmcintyre', 'vstinner', 'ned.deily', 'r.david.murray', 'rfk', 'verm', 'bcroq', 'rep', 'xuanji', 'KevinH', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue10694'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @KevinH
    Copy link
    Mannequin Author

    KevinH mannequin commented Dec 13, 2010

    The current version of zipfile.py is not robust to slight errors at the end of zip archives. Many file servers **improperly** append a new line to the end of files that do not have a new line when they are uploaded from a browser. This bug ends up adding 0x0d 0xa to the end of the zip archive. This in turn makes zipfile.py eventually throw a "Not a zip file" exception when no other zip tools seem to have trouble with them. Even unzip -t passes these "problem" zip archives with flying colours.

    I hate to have to extract and create my own zipfile.py script just to be robust to zip archives that are commonly found on the net and that are handled more robustly by other software.

    So please consider changing this code from _EndRecData below to simply ignore any trailing data after the proper stringEndArchive and structEndArchive are found instead of looking for the comment and verifying if the comment is properly formatted and throwing an exception if not correct. Ignoring the "comment" seems to be more robust in this case as everything needed to unpack the zip archive has been found.

        # Either this is not a ZIP file, or it is a ZIP file with an archive
        # comment.  Search the end of the file for the "end of central directory"
        # record signature. The comment is the last item in the ZIP file and may be
        # up to 64K long.  It is assumed that the "end of central directory" magic
        # number does not appear in the comment.
        maxCommentStart = max(filesize - (1 << 16) - sizeEndCentDir, 0)
        fpin.seek(maxCommentStart, 0)
        data = fpin.read()
        start = data.rfind(stringEndArchive)
        if start >= 0:
            # found the magic number; attempt to unpack and interpret
            recData = data[start:start+sizeEndCentDir]
            endrec = list(struct.unpack(structEndArchive, recData))
            comment = data[start+sizeEndCentDir:]
            # check that comment length is correct
            if endrec[_ECD_COMMENT_SIZE] == len(comment):
                # Append the archive comment and start offset
                endrec.append(comment)
                endrec.append(maxCommentStart + start)
                if endrec[_ECD_OFFSET] == 0xffffffff:
                    # There is apparently a "Zip64 end of central directory"
                    # structure present, so go look for it
                    return _EndRecData64(fpin, start - filesize, endrec)
                return endrec

    This will in turn make the Python implementation of zipfile.py more robust to data improperly appended when some zip archives are uploaded or downloaded (similar to how other zip tools handle this issue).

    Thank you for your time and consideration.

    @KevinH KevinH mannequin added the type-bug An unexpected behavior, bug, or error label Dec 13, 2010
    @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll
    Copy link
    Mannequin

    Hi, I would like to take a look at whether your code works, can you provide an zip file that is currently not read by zipfile but should be?

    Also, I suggest you submit the code changes as a patch (http://www.python.org/dev/patches/)

    thanks!

    @KevinH
    Copy link
    Mannequin Author

    KevinH mannequin commented Dec 14, 2010

    If you read the bug report it explains how to generate a testcase (i.e. append any data to the end of a zip archive)

    Here it is as a step by step process

    1. simply take any working zip and call it testcase.zip

    2. do the following:

    echo "\r\n" >> testcase.zip

    If you run "unzip -t" on testcase.zip it will pass with flying colors and will properly unzip on every piece of zip software I have tried.

    However if you try to use python to copy the zip archive to another zip archive

    python ./zipfix.py testcase.zip junk.zip
    Error Occurred File is not a zip file

    All because of the appended carriage return / linefeed at the end.

    @KevinH
    Copy link
    Mannequin Author

    KevinH mannequin commented Dec 14, 2010

    Here is one potential patch. It simply incorporates and non-comment extraneous data into a final comment so that nothing is lost. Another solution that might be safer, would be to truncate the zip archive immediately after the endrec is found if the extraneous data is not a properly formatted comment. The right solution is obviously up to the developers of zipfile.py

    --- zipfile_orig.py	2010-12-14 10:23:58.000000000 -0500
    +++ zipfile.py	2010-12-14 10:30:21.000000000 -0500
    @@ -228,6 +228,13 @@
                     # structure present, so go look for it
                     return _EndRecData64(fpin, start - filesize, endrec)
                 return endrec
    +        else :
    +            # be robust to non-comment extaneous data after endrec
    +            # by making it a comment so that nothing is ever lost
    +            endrec[_ECD_COMMENT_SIZE] = len(comment)
    +            endrec.append(comment)
    +            endrec.append(maxCommentStart + start)
    +            return endrec
     
         # Unable to find a valid end of central directory structure
         return

    @KevinH
    Copy link
    Mannequin Author

    KevinH mannequin commented Dec 18, 2010

    Same problem exists in Python 3.1.3 and in later versions as well.

    Same patch should also work.

    @ned-deily
    Copy link
    Member

    Thanks for the suggested code. As suggested, it would be helpful to supply a complete patch (or patches) ready to be reviewed and applied to the tips of the currently maintained source trees (py3k, 3.1, and 2.7) including adding an appropriate test case to Lib/test/test_zipfile.py.

    @KevinH
    Copy link
    Mannequin Author

    KevinH mannequin commented Dec 20, 2010

    Final patches against the trees make no sense as no developer has decided which way they want to actually handle the problem.

    My patch is only one way and I note it may not be the way the owners of the code want.

    Also, this patch is very straight forward (one hunk) and should apply to 2.6, 2.7, and 3.1 (although I have not tried it with 3.1) with only line offsets.

    So if the owner of this code actually looks that the patch and the bug report and makes a decision on how they want to handle this issue and they like the patch I have suggested, then I would be happy to diff it against whatever zipfile.py versions he/she wants.

    @ned-deily
    Copy link
    Member

    Not to belabor the point but the most useful part of a ready-to-go patch would be to have a fully automated test case that fits into the existing test_zipfile.py structure and that demonstrates the failure. Somebody has to write a test case one way or another because that's the first step towards getting a fix applied. That can be a big help as there are many more modules in the standard library than there are volunteer core developers to look at problems.

    @bitdancer
    Copy link
    Member

    Actually our normal procedure currently (this will change a bit after the migration to mercurial) is a patch against the py3k branch, and the committer will do the backport to the other active branches. If the 2.7 code is very different, a separate 2.7 patch is sometimes helpful. In this case it looks like a patch that applies to py3k is enough.

    Alan hasn't spoken up yet, and he's listed as maintainer. If he doesn't speak up someone else will hopefully make a decision before the end of the beta period for 3.2. I'm bumping up the priority of this issue because I think not being able to handle commonly produced archives that other zip tools can handle is a relatively serious defect.

    My own opinion is that truncation if the extra data doesn't look like a comment makes the most sense. But absent an opinion from Alan I think I'd be guided by whatever other zip tools do in this case. Kevin, am I correct in guessing that that is truncation?

    Ned is definitely correct that absence of a test case is something that can delay getting a patch applied. If a patch includes a unit test, a committer can often do a quick review-and-apply, while absent a unit test the committer would first need to write one, and therefore will often move on to some other, easier to process issue :) In this case it seems to me that the unit test will only differ in one detail depending on the fix chosen, so it may be worth writing it even before a final decision is made, if you are willing.

    @KevinH
    Copy link
    Mannequin Author

    KevinH mannequin commented Dec 20, 2010

    I have not looked at how other tools handle this. They could simply ignore what comes after a valid endrecdata is found, they could strip it out (truncate it) or make it into a final comment. I guess for simply unpacking a zip archive, all of these are equivalent (it goes unused).

    But if you are copying a zip archive to another archive then ignoring it and truncating it may be safer in some sense (since you have no idea what this extra data is for and why it might be attached) but then you are not being faithful to the original but at the same time you do not want to create improper zip archives. If you change the extra data into a final comment, then at least none of the original data is actually lost (just moved slightly in the copied zip and protected as a comment) and could be recovered if it turns out to have been useful. With so many things using/making the zip archive format (jars, normal zips, epubs, etc) you never know what might have been left over at the end of the zip file and if it was important.

    So I am not really sure how to deal with this properly. Also I know nothing about _EndRecData64 and if it needs to somehow be handled in a different way.

    So someone who is very familiar with the code should look at this and tell us what is the right thing to do and even if the approach I took is correct (it works fine for me and I have taken to including my own zipfile.py in my own projects until this gets worked out) but it might not be the right thing to do.

    As for a test case, I know nothing about that but will look at test_zipfile.py. I am a Mac OS X user/developer so all of my code is targeted to running on both Python 2.5 (Mac OS X 10.5.X) and Python 2.6 (Mac OS 10.6.X). Python 3.X and even Python 2.7 are not on my horizon and not even on my build machine (trying to get Mac OS X users to install either would be an experience in frustration). I simply looked at the source in Python 2.7 and Python 3.1.3 (from the official Python releases from python.org) to see that the problem still exists (and it does).

    @bitdancer
    Copy link
    Member

    It's pretty easy, really, to do an SVN checkout of python and compile it on a mac, if you are at all familiar with the unix command line. If you don't have the time or desire for that, though, someone will eventually get to it, we just don't know when ;)

    @KevinH
    Copy link
    Mannequin Author

    KevinH mannequin commented Dec 20, 2010

    Been programming on unix/vax and then linux since the mid 80s and on punch cards in the late 70s. Grew my first beard writing 8080 and Z80 assembler. All of that was over 30 years ago.

    All I want to do is report a damn bug!

    Then I get nudged for a test case (although how to recreate the bug was already described) so I add the two lines to create a test case.

    Then I get nudged for a patch, so I give a patch even though there are many ways to deal with the issue.

    Then I get nudged for patches for other branches, then I get nudged for official test_zipfile.py patches.

    All of this **before** the damn owner has even bothered to look at it and say if he/she even wants it or if the patch is even correct.

    I have my own working code for the epub ebook stuff I am working on so this issue no longer impacts me.

    How did I go from a simple bug report to having to build python's latest checkout just to get someone to look at the bug.

    You have got to be kidding me!

    I am done here, do what you want with the bug.

    @bitdancer
    Copy link
    Member

    Sorry, I thought I was being clear that if you *wanted* to help further here was how you could, but if you didn't then we'd get to it eventually. We're all volunteers here, just like you, so every bit of help...helps, and we thank you sincerely for taking the time to report the bug and provide the preliminary patch and test.

    @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll
    Copy link
    Mannequin

    Hi KevinH, sorry if I gave the wrong impression that you must submit a patch before anyone will look at the bug. I thought you wanted to provide a patch for this. If you only want to report a bug that is ok, thanks for that.

    I have confirmed your bug on the py3k branch. Attached is a patch that adds a test case to test_zipfile.py (as requested by others) and ports Kevin's patch to py3k. I have confirmed that his patch works: without it the test fails, and with it the test passes. Note: You have to create a file "README.zip" in the Lib/test directory, because svn diff cannot display added files.

    @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll
    Copy link
    Mannequin

    Sorry, forgot to add: README.zip should have a \r\n at the end. Actually I just zipped up README, ran echo "\r\n" >> README.zip and put it in Lib/test.

    @birkenfeld
    Copy link
    Member

    "High" is not near high enough to get noticed before the release :)

    @bitdancer
    Copy link
    Member

    Xuanji: thanks for taking a crack at the test.

    Rather than adding another data file to the test directory, how about creating a zipfile using the zipfile module in the test, closing it, opening it as a file, writing the /r/n to it, and then opening it back up for read as a zipfile for the test. That's more parallel to how the other zipfile tests work.

    Given what we know I'd be inclined to accept Kevin's suggested fix, unless Alan speaks up against it. In that case the test may as well check that the comment is what we expect, as well.

    @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll
    Copy link
    Mannequin

    Ok, new patch that creates a zipfile (actually I used TESTFN2, all the other tests seem to also use it)

    @bitdancer
    Copy link
    Member

    I finally got around to researching this issue in the tracker.

    bpo-10298 is a close relative to this issue. The fix from that issue make the test that Xuanji added here pass. That issue contains no tests....it would be ideal to have tests that test the behavior in the face of actual comments in the zipfile, but even if all we have is Xuanji's test IMO we should apply one of the two fixes.

    The 10298 patch takes the approach of ignoring the excess data but preserving the comment if any. The author implies that that is what other tools do, so in the absence of input from Alan or other zipfile experts that's probably what we should go with.

    Rep, could you look over this issue and indicate if you agree?

    Note also bpo-9239, which fixes one way that zipfile could create a zipfile with garbage at the end.

    Then there is bpo-1757072, where we hear some of Alan's thinking about this: a "non-strict" mode...but it is perhaps too late for a feature request, and there is the fact that ignoring the trailing data appears to be a de-facto standard.

    And then we have bpo-1757072, which is identical to this one and was closed won't fix, but apparently only because the source of the corrupted zip files wasn't identified, which this issue does do.

    Interestingly, bpo-669036 claims that zipfile.py supports garbage at the start, which makes tolerating garbage at the end seem sensibly symmetric.

    Finally, comment support was added by the patch in bpo-611760. It would be interesting to know if garbage at the end was supported before that patch. My guess is that it was, by ignoring it, but I haven't tested it.

    Summary: if someone can review the actual patch, I think we should apply the bpo-10298 patch along with Xuanji's test. Xuanji, if you have the time and desire to add some additional tests that test comments with trailing data, that would be a bonus. You could look at the tests in bpo-9239 for ideas.

    @birkenfeld
    Copy link
    Member

    Not important enough to block release.

    @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll
    Copy link
    Mannequin

    I've attached a patch copying rep's patch from bpo-10298, as i think it is cleaner (but it seems to do the same thing as kevin's patch) together with 2 test cases.

    David Murray: I'm not really sure what you mean by "test comments with trailing data"; as I understand, the way to fix the problem is simply to remove the excessively strict check that the length the zip file claims to have is equal to the actual amount of comments found by truncating comments to the claimed size; if that's the case my (new) test should be enough to exercise that logic. Please correct me if I misunderstand.

    @bitdancer
    Copy link
    Member

    Xuanji: yes, your test_ignores_stuff_appended_past_comments is exactly what I was asking for. I've put this patch on my review list, but I may not get to it until after 3.2 final.

    @rep
    Copy link
    Mannequin

    rep mannequin commented Feb 10, 2011

    Xuanji: Patch looks good now with the test cases.

    Regarding the other fix suggested by Kevin I think that we should not introduce custom behaviour here like appending the eofdata to the comment even though the comment size in the record does not say so. In the end we want to make sure that correct zip files end up being processed correctly with the specified comment size and incorrect zip files with eofdata should not be rejected / throw exceptions.

    So I agree with Xuanji that his patch (or mine) is the way to go.

    You probably agree that we can mark bpo-10298 to be superseded by this one here.

    @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll
    Copy link
    Mannequin

    Hi, can someone look at the patch? There doesn't seem to be any objections to it and it'll solve a long-standing issue.

    Thanks!

    @bcroq
    Copy link
    Mannequin

    bcroq mannequin commented Mar 22, 2011

    The last patch (manually applied to Python 2.6) fixed the problem with a ZIP file used in my project.

    @verm
    Copy link
    Mannequin

    verm mannequin commented Jun 6, 2011

    This patch fixed every BadZipfile exception (~300) I got while testing over 50,000 zipfiles. (Applied to 2.7)

    It would be nice to see this bug resolved.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jun 9, 2011

    New changeset 7530b141c20f by R David Murray in branch '3.2':
    bpo-10694: zipfile now ignores garbage at the end of a zipfile.
    http://hg.python.org/cpython/rev/7530b141c20f

    New changeset 2e49722c7263 by R David Murray in branch 'default':
    Merge bpo-10694: zipfile now ignores garbage at the end of a zipfile.
    http://hg.python.org/cpython/rev/2e49722c7263

    New changeset cc3255a707c7 by R David Murray in branch '2.7':
    bpo-10694: zipfile now ignores garbage at the end of a zipfile.
    http://hg.python.org/cpython/rev/cc3255a707c7

    @bitdancer
    Copy link
    Member

    While I don't understand the zip file structure enough to do a definitive review of the patch, I'm willing to take processing 50K zip files successfully as a definitive test :)

    @serhiy-storchaka
    Copy link
    Member

    This change significantly increased the probability of false positive detection in is_zipfile(). It was around 1/2**32, now it is around 1/2**16. It looks small, but not enough small if you test hundreds of thousands files. See bpo-28494.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants