classification
Title: Regression in zipfile writing in 2.7.13
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Peter Ebden, benjamin.peterson, gregory.p.smith, larry, ned.deily, python-dev, serhiy.storchaka, twouters, vstinner
Priority: release blocker Keywords: patch

Created on 2016-12-28 13:26 by Peter Ebden, last changed 2017-05-18 15:15 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
zipfile_write_absolute_offsets.patch serhiy.storchaka, 2017-01-01 07:34 review
Pull Requests
URL Status Linked Edit
PR 1467 merged benjamin.peterson, 2017-05-05 06:07
PR 1484 merged serhiy.storchaka, 2017-05-06 11:13
PR 1485 merged serhiy.storchaka, 2017-05-06 11:54
PR 1486 merged serhiy.storchaka, 2017-05-06 11:57
Messages (21)
msg284172 - (view) Author: Peter Ebden (Peter Ebden) * Date: 2016-12-28 13:26
In Python 2.7.13, using zipfile.ZipFile to write into a file with some initial preamble produces a zip file that cannot be read again by some zip implementations.
Our use case is using pex (https://github.com/pantsbuild/pex) which writes a zip that begins with a shebang, and later attempting to manipulate that using Go's standard archive/zip package. In 2.7.12 that works OK, but in 2.7.13 the .pex file is rejected on reading. Linux's command-line unzip tool will read the archive, but issues a warning ("4 extra bytes at beginning or within zipfile") which wasn't present previously. 
zipfile.ZipFile does read the files OK.

I assume this is related to https://bugs.python.org/issue26293 since that's the most obvious zipfile change in 2.7.13. It's pretty easy to reproduce using the example in that issue:

from zipfile import ZipFile
with open('a.zip', 'wb') as base:
    base.write(b'old\n')
    with ZipFile(base, 'a') as myzip:
        myzip.write('eggs.txt')

unzip -t a.zip
Archive:  a.zip
warning [a.zip]:  4 extra bytes at beginning or within zipfile
  (attempting to process anyway)
...
msg284421 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-01 07:34
The specification of ZIP files is not pretty clear, but the unzip utility first try to interpret offsets relatively to the start of the archive that can be different from the start of the file. Go's standard archive/zip package interprets offsets as absolute file positions and doesn't support concatenated ZIP files.

In any case there is a regression. Proposed patch partially reverts issue26293. Offsets in created ZIP file are now absolute when create with modes 'w' and 'x'. But they are relative when create with mode 'a'. This should fix a regression in pex, but keep the case of issue26293.

I'm going to push the patch before tagging 3.5.3 rc1.
msg284423 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2017-01-01 15:10
Is there a reason to ever not use relative offsets? It seems that's strictly more general the absolute.
msg284424 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-01 15:54
Actually all offsets are relative to some point, just this point not always at the start of the file. If concatenate the ZIP file to other file, the unzip utility and the zipfile module are able to infer the starting point and correct offsets. Thus there is no matter what is the starting point of offsets. But Go's standard archive/zip package works only if offsets are relative to the start of the file. I don't know how common such interpretation however.
msg284425 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017-01-01 17:08
New changeset f5aa1c9c2b7e by Serhiy Storchaka in branch '3.5':
Issue #29094: Offsets in a ZIP file created with extern file object and modes
https://hg.python.org/cpython/rev/f5aa1c9c2b7e

New changeset 342bc734f523 by Serhiy Storchaka in branch '2.7':
Issue #29094: Offsets in a ZIP file created with extern file object and modes
https://hg.python.org/cpython/rev/342bc734f523

New changeset f36f9bce997d by Serhiy Storchaka in branch '3.6':
Issue #29094: Offsets in a ZIP file created with extern file object and modes
https://hg.python.org/cpython/rev/f36f9bce997d
msg284432 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2017-01-01 22:28
If this is fixed, can we close this issue?  This release blocker is one of two issues blocking 3.5.3 rc1.
msg284436 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017-01-01 23:43
New changeset a80c14ace927 by Serhiy Storchaka in branch 'default':
Issue #29094: Offsets in a ZIP file created with extern file object and modes
https://hg.python.org/cpython/rev/a80c14ace927
msg292803 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2017-05-02 20:37
For the record, this new behaviour is wrong. It's not immediately obvious from the ZIP spec, but offsets have to be from the start of the file, not the archive, or ZIP64 can't work. And yes, that means you can't blindly concatenate ZIP files to other files, you have to rewrite them.

The way to create an 'embedded' ZIP archive like requested in issue #26293 is to write it to a separate file (or file-like object). Making "a" have this magical and wrong-for-almost-anyone meaning is... confusing, to say the least. The change certainly doesn't belong in a bugfix release, but I guess it's too late for that now. 

It's also not documented; the docs mention this:

  If mode is 'a' and file refers to an existing ZIP file, then
  additional files are added to it. If file does not refer to a ZIP
  file, then a new ZIP archive is appended to the file. This is
  meant for adding a ZIP archive to another file (such as
  python.exe).

which is ambiguous at best and certainly suggests the resulting file would be usable as-is, which is not the case if you use mode "a" (but *is* the case if you use mode "w", since 342bc734f523).
msg292826 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-05-03 00:34
re-opening since, though released, this behavior change broke existing code.
msg292827 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-05-03 00:34
release managers need to decide.
msg292828 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2017-05-03 00:40
What's to decide?  If the new behavior is also broken, we should fix it.  I'd like a fix in the next 3.5.
msg292838 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2017-05-03 05:02
Anything breaking 2.7 should be reverted. I take it, the change from #26293 should be as well as this one?

Yhg1s, the commit message for #26293 included the sentence "Offsets in ZIP file now are relative to the start of the archive in conforming to the specification." Can someone point to the line in the spec that covers this?
msg292842 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-03 05:22
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

There is no explicit line, but offsets always are named relative: "relative offset of local header", "offset of start of central directory with respect to the starting disk number". This looks not pretty clear, but open source utilities (unzip, p7zip) support ZIP archives started not from the start of the file.
msg292896 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2017-05-03 12:45
The spec isn't very explicit about it, yes, but it does say this:

4.4.16 relative offset of local header: (4 bytes)

       This is the offset from the start of the first disk on
       which this file appears, to where the local header should
       be found.

"the start of the first disk" could be construed to mean "the start of the ZIP archive embedded in this file". However, if you consider the information that's available, the only way to make ZIP archives work in the face of ZIP64 and other extensions that add information between the end-of-central-directory record and the end of the central directory, it's obvious that you can't correctly handle ZIP archives that start at an arbitrary point in a file.

ZIP archives have both a 4-byte magic number at the start, and a central directory at the end. The end-of-central-directory record is the very last thing in the file, and it records both the offset of the start of the central directory and the size of the central directory. In absense of any ZIP extensions that add records between the end-of-central-directory record and the end of the central directory, you can use those to correct all offsets in the ZIP archive. But as soon as you add (for example) ZIP64 records, this no longer works: ZIP64 has an end-of-zip64-central-directory locator, and variable-sized end-of-zip64-central-directory record. The locator is fixed size right before the end-of-central-directory record and records the offset (from the start of the file) to the end-of-zip64-central-directory record, but *not* the size of that record or any other information you can use to determine the offset of the start of the archive in the file.

Only by assuming the central directory record comes right before the end-of-central-directory record, or assuming fixed sizes for the ZIP64 record, can you deal with ZIP archives with offsets not from the start of the file. This assumption is not only *not* guaranteed by the ZIP spec, it's explicitly invalidated by ZIP64's variable sized records, and possibly other extensions (like encryption, compression and digital signatures, although I don't remember if those actually affect this).

It's true that many ZIP tools try to deal with these kinds of archives, although they *do* realise it's wrong and they usually *do* warn about it. They still can't deal with it if it uses variable-sized ZIP64 features (other than trawling through the file looking for the 4-byte magic numbers).

Here's an example of code that breaks because of this: https://github.com/Yhg1s/zipfile-hacks. I tried to convince zipfile to create Zip64 files with extra fields (the variable-sized parts) but unfortunately the *cough* "design" of the zipfile module doesn't allow that -- feel free to ignore the force_zip64 parts of the script.

(I'm using two python installations I had laying around here; I could've used 2.7.12 vs 2.7.13 instead, and the results would be the same.)

# Python 2.7.12 -- so old behaviour
% python create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py2-preamble-w.zip create_small_zip64.py
% python create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py2-preamble-a.zip create_small_zip64.py

# Python 3.6.0+ -- after this change, so new behaviour
% ~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py3-preamble-w.zip create_small_zip64.py
% ~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py3-preamble-a.zip create_small_zip64.py

The old zipfiles are fine:
% zip -T py2-preamble-w.zip
test of py2-preamble-w.zip OK
% zip -T py2-preamble-a.zip
test of py2-preamble-a.zip OK

The new one using 'w' is also fine (as expected):
% zip -T py3-preamble-w.zip
test of py3-preamble-w.zip OK

The new one using 'a' is broken:
% zip -T py3-preamble-a.zip
warning [py3-preamble-a.zip]:  17 extra bytes at beginning or within zipfile
  (attempting to process anyway)
test of py3-preamble-a.zip FAILED

zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)

The 'unzip' tool does work, but it also prints a warning:
% unzip -l py3-preamble-a.zip
Archive:  py3-preamble-a.zip
warning [py3-preamble-a.zip]:  17 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
     4016  2017-05-03 14:23   create_small_zip64.py
---------                     -------
     4016                     1 file

Whether other tools try to compensate for the error depends greatly on the tool; there's quite a few that don't.

For the record, we had two different bits of code that created zipfiles with preambles using mode='a', created by (at least) two different people. I don't think it's unreasonable to assume that if you have a file with existing data you don't want the ZipFile to overwrite, it should be using mode 'a' :P
msg292897 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2017-05-03 12:50
... and yes, the changes to issue 26293 should be reverted; the changes attached to this issue merely revert part of issue 26293's changes.
msg292914 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-03 16:24
Thank you for raising this again and for your research Thomas. I now see that my understanding of the specification was wrong. I'll revert changes to zipfile.

There is one question. If the zip file was concatenated to other file (and looks incorrect from the view of other tools), and we opened it in append mode and added new entries, should the central directory count offsets from the start of the file or from the start of embedded archive? In the first case offsets of first entries (which was in the original concatenated archive) in the central directory will be different from offsets in local headers. In the second case the zip file will look incorrect from the view of other tools (but it already looked incorrect before appending).
msg293025 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2017-05-04 22:42
Well, what should the zipfile module do when you open a broken zipfile for appending in the first place? :) There are many ways in which it could be broken. I don't think a zipfile with incorrect offsets should be treated any differently.

(I don't know *how* the zipfile module should treat broken zipfiles, and I don't know how it treats them now -- but I think the most important thing is consistency and documenting the choice :P)
msg293047 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2017-05-05 06:54
New changeset ef4c6ba169ff59215442fc4047d83f7a3d39bf39 by Benjamin Peterson in branch '2.7':
Revert "Issue #29094: Offsets in a ZIP file created with extern file object and modes" (#1467)
https://github.com/python/cpython/commit/ef4c6ba169ff59215442fc4047d83f7a3d39bf39
msg293154 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-06 11:46
New changeset 3763ea865cee5bbabcce11cd577811135e0fc747 by Serhiy Storchaka in branch 'master':
Revert bpo-26293 for zipfile breakage. See also bpo-29094. (#1484)
https://github.com/python/cpython/commit/3763ea865cee5bbabcce11cd577811135e0fc747
msg293156 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-06 12:10
New changeset 70dc6a7a0b7f104d87512556fca242c2ca96a010 by Serhiy Storchaka in branch '3.6':
[3.6] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). (#1485)
https://github.com/python/cpython/commit/70dc6a7a0b7f104d87512556fca242c2ca96a010
msg293158 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-06 12:11
New changeset c8faabce6ef318f3b425c6defd846e274d61e2ef by Serhiy Storchaka in branch '3.5':
[3.5] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). (#1486)
https://github.com/python/cpython/commit/c8faabce6ef318f3b425c6defd846e274d61e2ef
History
Date User Action Args
2017-12-04 07:29:43serhiy.storchakaunlinkissue30213 superseder
2017-06-15 18:30:21serhiy.storchakalinkissue30213 superseder
2017-05-18 15:15:26serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: backport needed -> resolved
2017-05-06 12:11:22serhiy.storchakasetmessages: + msg293158
2017-05-06 12:10:52serhiy.storchakasetmessages: + msg293156
2017-05-06 11:57:37serhiy.storchakasetpull_requests: + pull_request1589
2017-05-06 11:54:20serhiy.storchakasetpull_requests: + pull_request1587
2017-05-06 11:46:03serhiy.storchakasetmessages: + msg293154
2017-05-06 11:13:01serhiy.storchakasetpull_requests: + pull_request1585
2017-05-05 07:17:39serhiy.storchakasetstage: commit review -> backport needed
versions: - Python 2.7
2017-05-05 06:54:45benjamin.petersonsetmessages: + msg293047
2017-05-05 06:07:51benjamin.petersonsetpull_requests: + pull_request1566
2017-05-04 22:42:56twouterssetmessages: + msg293025
2017-05-03 16:24:48serhiy.storchakasetmessages: + msg292914
2017-05-03 12:50:33twouterssetmessages: + msg292897
2017-05-03 12:45:17twouterssetmessages: + msg292896
2017-05-03 05:44:50serhiy.storchakasetpull_requests: - pull_request911
2017-05-03 05:22:53serhiy.storchakasetmessages: + msg292842
2017-05-03 05:02:39benjamin.petersonsetmessages: + msg292838
2017-05-03 00:40:00larrysetmessages: + msg292828
2017-05-03 00:39:29vstinnersetnosy: + vstinner
2017-05-03 00:34:59gregory.p.smithsetpriority: normal -> release blocker

messages: + msg292827
2017-05-03 00:34:06gregory.p.smithsetstatus: closed -> open

nosy: + gregory.p.smith
messages: + msg292826

resolution: fixed -> (no value)
stage: resolved -> commit review
2017-05-02 20:37:24twouterssetnosy: + twouters
messages: + msg292803
2017-03-31 16:36:17dstufftsetpull_requests: + pull_request911
2017-01-01 23:43:26python-devsetmessages: + msg284436
2017-01-01 22:40:58serhiy.storchakasetpriority: release blocker -> normal
status: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-01-01 22:28:06larrysetmessages: + msg284432
2017-01-01 17:08:29python-devsetnosy: + python-dev
messages: + msg284425
2017-01-01 15:54:13serhiy.storchakasetmessages: + msg284424
2017-01-01 15:10:54benjamin.petersonsetmessages: + msg284423
2017-01-01 07:34:49serhiy.storchakasetfiles: + zipfile_write_absolute_offsets.patch
priority: normal -> release blocker
type: behavior

versions: + Python 3.5, Python 3.6, Python 3.7
keywords: + patch
nosy: + ned.deily, benjamin.peterson, larry

messages: + msg284421
stage: patch review
2016-12-29 04:09:08benjamin.petersonsetassignee: serhiy.storchaka

nosy: + serhiy.storchaka
2016-12-28 13:26:31Peter Ebdencreate