Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in zipfile writing in 2.7.13 #73280

Closed
peterebden mannequin opened this issue Dec 28, 2016 · 21 comments
Closed

Regression in zipfile writing in 2.7.13 #73280

peterebden mannequin opened this issue Dec 28, 2016 · 21 comments
Assignees
Labels
3.7 (EOL) end of life release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@peterebden
Copy link
Mannequin

peterebden mannequin commented Dec 28, 2016

BPO 29094
Nosy @Yhg1s, @gpshead, @vstinner, @larryhastings, @benjaminp, @ned-deily, @serhiy-storchaka, @peterebden
PRs
  • Revert "Issue #29094: Offsets in a ZIP file created with extern file object and modes" #1467
  • Revert bpo-26293 for zipfile breakage. See also bpo-29094. #1484
  • [3.6] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). #1485
  • [3.5] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). #1486
  • Files
  • zipfile_write_absolute_offsets.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2017-05-18.15:15:26.153>
    created_at = <Date 2016-12-28.13:26:31.940>
    labels = ['3.7', 'type-bug', 'library', 'release-blocker']
    title = 'Regression in zipfile writing in 2.7.13'
    updated_at = <Date 2017-05-18.15:15:26.152>
    user = 'https://github.com/peterebden'

    bugs.python.org fields:

    activity = <Date 2017-05-18.15:15:26.152>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2017-05-18.15:15:26.153>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2016-12-28.13:26:31.940>
    creator = 'Peter Ebden'
    dependencies = []
    files = ['46107']
    hgrepos = []
    issue_num = 29094
    keywords = ['patch']
    message_count = 21.0
    messages = ['284172', '284421', '284423', '284424', '284425', '284432', '284436', '292803', '292826', '292827', '292828', '292838', '292842', '292896', '292897', '292914', '293025', '293047', '293154', '293156', '293158']
    nosy_count = 9.0
    nosy_names = ['twouters', 'gregory.p.smith', 'vstinner', 'larry', 'benjamin.peterson', 'ned.deily', 'python-dev', 'serhiy.storchaka', 'Peter Ebden']
    pr_nums = ['1467', '1484', '1485', '1486']
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue29094'
    versions = ['Python 3.5', 'Python 3.6', 'Python 3.7']

    @peterebden
    Copy link
    Mannequin Author

    peterebden mannequin commented Dec 28, 2016

    In Python 2.7.13, using zipfile.ZipFile to write into a file with some initial preamble produces a zip file that cannot be read again by some zip implementations.
    Our use case is using pex (https://github.com/pantsbuild/pex) which writes a zip that begins with a shebang, and later attempting to manipulate that using Go's standard archive/zip package. In 2.7.12 that works OK, but in 2.7.13 the .pex file is rejected on reading. Linux's command-line unzip tool will read the archive, but issues a warning ("4 extra bytes at beginning or within zipfile") which wasn't present previously.
    zipfile.ZipFile does read the files OK.

    I assume this is related to https://bugs.python.org/issue26293 since that's the most obvious zipfile change in 2.7.13. It's pretty easy to reproduce using the example in that issue:

    from zipfile import ZipFile
    with open('a.zip', 'wb') as base:
        base.write(b'old\n')
        with ZipFile(base, 'a') as myzip:
            myzip.write('eggs.txt')

    unzip -t a.zip
    Archive: a.zip
    warning [a.zip]: 4 extra bytes at beginning or within zipfile
    (attempting to process anyway)
    ...

    @peterebden peterebden mannequin added the stdlib Python modules in the Lib dir label Dec 28, 2016
    @serhiy-storchaka
    Copy link
    Member

    The specification of ZIP files is not pretty clear, but the unzip utility first try to interpret offsets relatively to the start of the archive that can be different from the start of the file. Go's standard archive/zip package interprets offsets as absolute file positions and doesn't support concatenated ZIP files.

    In any case there is a regression. Proposed patch partially reverts bpo-26293. Offsets in created ZIP file are now absolute when create with modes 'w' and 'x'. But they are relative when create with mode 'a'. This should fix a regression in pex, but keep the case of bpo-26293.

    I'm going to push the patch before tagging 3.5.3 rc1.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life release-blocker type-bug An unexpected behavior, bug, or error labels Jan 1, 2017
    @benjaminp
    Copy link
    Contributor

    Is there a reason to ever not use relative offsets? It seems that's strictly more general the absolute.

    @serhiy-storchaka
    Copy link
    Member

    Actually all offsets are relative to some point, just this point not always at the start of the file. If concatenate the ZIP file to other file, the unzip utility and the zipfile module are able to infer the starting point and correct offsets. Thus there is no matter what is the starting point of offsets. But Go's standard archive/zip package works only if offsets are relative to the start of the file. I don't know how common such interpretation however.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 1, 2017

    New changeset f5aa1c9c2b7e by Serhiy Storchaka in branch '3.5':
    Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes
    https://hg.python.org/cpython/rev/f5aa1c9c2b7e

    New changeset 342bc734f523 by Serhiy Storchaka in branch '2.7':
    Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes
    https://hg.python.org/cpython/rev/342bc734f523

    New changeset f36f9bce997d by Serhiy Storchaka in branch '3.6':
    Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes
    https://hg.python.org/cpython/rev/f36f9bce997d

    @larryhastings
    Copy link
    Contributor

    If this is fixed, can we close this issue? This release blocker is one of two issues blocking 3.5.3 rc1.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 1, 2017

    New changeset a80c14ace927 by Serhiy Storchaka in branch 'default':
    Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes
    https://hg.python.org/cpython/rev/a80c14ace927

    @Yhg1s
    Copy link
    Member

    Yhg1s commented May 2, 2017

    For the record, this new behaviour is wrong. It's not immediately obvious from the ZIP spec, but offsets have to be from the start of the file, not the archive, or ZIP64 can't work. And yes, that means you can't blindly concatenate ZIP files to other files, you have to rewrite them.

    The way to create an 'embedded' ZIP archive like requested in issue bpo-26293 is to write it to a separate file (or file-like object). Making "a" have this magical and wrong-for-almost-anyone meaning is... confusing, to say the least. The change certainly doesn't belong in a bugfix release, but I guess it's too late for that now.

    It's also not documented; the docs mention this:

    If mode is 'a' and file refers to an existing ZIP file, then
    additional files are added to it. If file does not refer to a ZIP
    file, then a new ZIP archive is appended to the file. This is
    meant for adding a ZIP archive to another file (such as
    python.exe).

    which is ambiguous at best and certainly suggests the resulting file would be usable as-is, which is not the case if you use mode "a" (but *is* the case if you use mode "w", since 342bc734f523).

    @gpshead
    Copy link
    Member

    gpshead commented May 3, 2017

    re-opening since, though released, this behavior change broke existing code.

    @gpshead gpshead reopened this May 3, 2017
    @gpshead
    Copy link
    Member

    gpshead commented May 3, 2017

    release managers need to decide.

    @larryhastings
    Copy link
    Contributor

    What's to decide? If the new behavior is also broken, we should fix it. I'd like a fix in the next 3.5.

    @benjaminp
    Copy link
    Contributor

    Anything breaking 2.7 should be reverted. I take it, the change from bpo-26293 should be as well as this one?

    Yhg1s, the commit message for bpo-26293 included the sentence "Offsets in ZIP file now are relative to the start of the archive in conforming to the specification." Can someone point to the line in the spec that covers this?

    @serhiy-storchaka
    Copy link
    Member

    https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

    There is no explicit line, but offsets always are named relative: "relative offset of local header", "offset of start of central directory with respect to the starting disk number". This looks not pretty clear, but open source utilities (unzip, p7zip) support ZIP archives started not from the start of the file.

    @Yhg1s
    Copy link
    Member

    Yhg1s commented May 3, 2017

    The spec isn't very explicit about it, yes, but it does say this:

    4.4.16 relative offset of local header: (4 bytes)

       This is the offset from the start of the first disk on
       which this file appears, to where the local header should
       be found.
    

    "the start of the first disk" could be construed to mean "the start of the ZIP archive embedded in this file". However, if you consider the information that's available, the only way to make ZIP archives work in the face of ZIP64 and other extensions that add information between the end-of-central-directory record and the end of the central directory, it's obvious that you can't correctly handle ZIP archives that start at an arbitrary point in a file.

    ZIP archives have both a 4-byte magic number at the start, and a central directory at the end. The end-of-central-directory record is the very last thing in the file, and it records both the offset of the start of the central directory and the size of the central directory. In absense of any ZIP extensions that add records between the end-of-central-directory record and the end of the central directory, you can use those to correct all offsets in the ZIP archive. But as soon as you add (for example) ZIP64 records, this no longer works: ZIP64 has an end-of-zip64-central-directory locator, and variable-sized end-of-zip64-central-directory record. The locator is fixed size right before the end-of-central-directory record and records the offset (from the start of the file) to the end-of-zip64-central-directory record, but *not* the size of that record or any other information you can use to determine the offset of the start of the archive in the file.

    Only by assuming the central directory record comes right before the end-of-central-directory record, or assuming fixed sizes for the ZIP64 record, can you deal with ZIP archives with offsets not from the start of the file. This assumption is not only *not* guaranteed by the ZIP spec, it's explicitly invalidated by ZIP64's variable sized records, and possibly other extensions (like encryption, compression and digital signatures, although I don't remember if those actually affect this).

    It's true that many ZIP tools try to deal with these kinds of archives, although they *do* realise it's wrong and they usually *do* warn about it. They still can't deal with it if it uses variable-sized ZIP64 features (other than trawling through the file looking for the 4-byte magic numbers).

    Here's an example of code that breaks because of this: https://github.com/Yhg1s/zipfile-hacks. I tried to convince zipfile to create Zip64 files with extra fields (the variable-sized parts) but unfortunately the *cough* "design" of the zipfile module doesn't allow that -- feel free to ignore the force_zip64 parts of the script.

    (I'm using two python installations I had laying around here; I could've used 2.7.12 vs 2.7.13 instead, and the results would be the same.)

    # Python 2.7.12 -- so old behaviour
    % python create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py2-preamble-w.zip create_small_zip64.py
    % python create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py2-preamble-a.zip create_small_zip64.py

    # Python 3.6.0+ -- after this change, so new behaviour
    % ~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py3-preamble-w.zip create_small_zip64.py
    % ~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py3-preamble-a.zip create_small_zip64.py

    The old zipfiles are fine:
    % zip -T py2-preamble-w.zip
    test of py2-preamble-w.zip OK
    % zip -T py2-preamble-a.zip
    test of py2-preamble-a.zip OK

    The new one using 'w' is also fine (as expected):
    % zip -T py3-preamble-w.zip
    test of py3-preamble-w.zip OK

    The new one using 'a' is broken:
    % zip -T py3-preamble-a.zip
    warning [py3-preamble-a.zip]: 17 extra bytes at beginning or within zipfile
    (attempting to process anyway)
    test of py3-preamble-a.zip FAILED

    zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)

    The 'unzip' tool does work, but it also prints a warning:
    % unzip -l py3-preamble-a.zip
    Archive: py3-preamble-a.zip
    warning [py3-preamble-a.zip]: 17 extra bytes at beginning or within zipfile
    (attempting to process anyway)
    Length Date Time Name
    --------- ---------- ----- ----
    4016 2017-05-03 14:23 create_small_zip64.py
    --------- -------
    4016 1 file

    Whether other tools try to compensate for the error depends greatly on the tool; there's quite a few that don't.

    For the record, we had two different bits of code that created zipfiles with preambles using mode='a', created by (at least) two different people. I don't think it's unreasonable to assume that if you have a file with existing data you don't want the ZipFile to overwrite, it should be using mode 'a' :P

    @Yhg1s
    Copy link
    Member

    Yhg1s commented May 3, 2017

    ... and yes, the changes to bpo-26293 should be reverted; the changes attached to this issue merely revert part of bpo-26293's changes.

    @serhiy-storchaka
    Copy link
    Member

    Thank you for raising this again and for your research Thomas. I now see that my understanding of the specification was wrong. I'll revert changes to zipfile.

    There is one question. If the zip file was concatenated to other file (and looks incorrect from the view of other tools), and we opened it in append mode and added new entries, should the central directory count offsets from the start of the file or from the start of embedded archive? In the first case offsets of first entries (which was in the original concatenated archive) in the central directory will be different from offsets in local headers. In the second case the zip file will look incorrect from the view of other tools (but it already looked incorrect before appending).

    @Yhg1s
    Copy link
    Member

    Yhg1s commented May 4, 2017

    Well, what should the zipfile module do when you open a broken zipfile for appending in the first place? :) There are many ways in which it could be broken. I don't think a zipfile with incorrect offsets should be treated any differently.

    (I don't know *how* the zipfile module should treat broken zipfiles, and I don't know how it treats them now -- but I think the most important thing is consistency and documenting the choice :P)

    @benjaminp
    Copy link
    Contributor

    New changeset ef4c6ba by Benjamin Peterson in branch '2.7':
    Revert "Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes" (bpo-1467)
    ef4c6ba

    @serhiy-storchaka
    Copy link
    Member

    New changeset 3763ea8 by Serhiy Storchaka in branch 'master':
    Revert bpo-26293 for zipfile breakage. See also bpo-29094. (bpo-1484)
    3763ea8

    @serhiy-storchaka
    Copy link
    Member

    New changeset 70dc6a7 by Serhiy Storchaka in branch '3.6':
    [3.6] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). (bpo-1485)
    70dc6a7

    @serhiy-storchaka
    Copy link
    Member

    New changeset c8faabc by Serhiy Storchaka in branch '3.5':
    [3.5] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). (bpo-1486)
    c8faabc

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants