Author twouters
Recipients Peter Ebden, benjamin.peterson, gregory.p.smith, larry, ned.deily, python-dev, serhiy.storchaka, twouters, vstinner
Date 2017-05-03.12:45:17
Message-id <1493815517.4.0.378393806225.issue29094@psf.upfronthosting.co.za>
Content
The spec isn't very explicit about it, yes, but it does say this:

4.4.16 relative offset of local header: (4 bytes)

       This is the offset from the start of the first disk on
       which this file appears, to where the local header should
       be found.

"the start of the first disk" could be construed to mean "the start of the ZIP archive embedded in this file". However, if you consider the information that's actually available once ZIP64 and other extensions add records between the end of the central directory and the end-of-central-directory record, it's obvious that you can't correctly handle ZIP archives that start at an arbitrary point in a file.

ZIP archives have both a 4-byte magic number at the start and a central directory at the end. The end-of-central-directory record is the very last thing in the file, and it records both the offset of the start of the central directory and the size of the central directory. In the absence of any ZIP extensions that add records between the end of the central directory and the end-of-central-directory record, you can use those to correct all offsets in the ZIP archive: the central directory must end right where the end-of-central-directory record begins, so any difference between its actual and its recorded position is the offset of the archive within the file. But as soon as you add (for example) ZIP64 records, this no longer works: ZIP64 has an end-of-zip64-central-directory locator and a variable-sized end-of-zip64-central-directory record. The locator is fixed-size, sits right before the end-of-central-directory record, and records the offset (from the start of the file) to the end-of-zip64-central-directory record, but *not* the size of that record or any other information you can use to determine the offset of the start of the archive in the file.
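
To make the no-extensions case concrete, here is a minimal Python sketch of that correction. It assumes a plain (non-ZIP64) archive whose last "PK\x05\x06" signature really is the end-of-central-directory record; the helper name preamble_length is made up for illustration and is not part of the zipfile module.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
# Fixed part of the end-of-central-directory record (22 bytes):
# signature, 4 x uint16, central directory size, central directory offset,
# comment length.
EOCD_FMT = "<4s4H2LH"


def preamble_length(data: bytes) -> int:
    """Estimate how many bytes precede the archive (hypothetical helper).

    Assumption: no ZIP64 or other records sit between the end of the
    central directory and the end-of-central-directory record.
    """
    eocd_pos = data.rfind(EOCD_SIG)  # naive search, fine for a sketch
    if eocd_pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FMT, data, eocd_pos)
    cd_size, cd_offset = fields[5], fields[6]
    # The central directory really starts at eocd_pos - cd_size; the record
    # claims it starts at cd_offset. The difference is the preamble.
    return eocd_pos - cd_size - cd_offset


# Build a small archive in memory, then prepend a preamble to its bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")
archive = buf.getvalue()
preamble = b"#!/usr/bin/python"

plain_shift = preamble_length(archive)
prefixed_shift = preamble_length(preamble + archive)
```

For this toy archive the computed shift is the length of the prepended bytes. The subtraction only works because nothing sits between the central directory and the end-of-central-directory record.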

Only by assuming the central directory record comes right before the end-of-central-directory record, or assuming fixed sizes for the ZIP64 record, can you deal with ZIP archives with offsets not from the start of the file. This assumption is not only *not* guaranteed by the ZIP spec, it's explicitly invalidated by ZIP64's variable-sized records, and possibly other extensions (like encryption, compression and digital signatures, although I don't remember if those actually affect this).
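
For comparison, a sketch of what the ZIP64 locator holds, assuming the 20-byte layout from the spec (signature, disk number, 8-byte offset of the ZIP64 end-of-central-directory record, total number of disks). The helper name parse_zip64_locator and the sample values are hypothetical.

```python
import struct

ZIP64_LOCATOR_SIG = b"PK\x06\x07"
# signature, disk holding the ZIP64 end record (uint32),
# offset of the ZIP64 end record (uint64), total number of disks (uint32)
ZIP64_LOCATOR_FMT = "<4sLQL"


def parse_zip64_locator(raw: bytes) -> dict:
    """Unpack a 20-byte ZIP64 locator into its fields (hypothetical helper)."""
    sig, disk_no, record_offset, total_disks = struct.unpack(ZIP64_LOCATOR_FMT, raw)
    if sig != ZIP64_LOCATOR_SIG:
        raise ValueError("not a ZIP64 end-of-central-directory locator")
    return {
        "disk_with_record": disk_no,
        "record_offset": record_offset,
        "total_disks": total_disks,
    }


# Hand-packed locator with made-up values, purely for illustration.
example = struct.pack(ZIP64_LOCATOR_FMT, ZIP64_LOCATOR_SIG, 0, 4096, 1)
locator = parse_zip64_locator(example)
```

Nothing in those fields says how long the ZIP64 record is or where the central directory ends, so the locator alone gives a reader no way to work out where the archive starts within the file.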

It's true that many ZIP tools try to deal with these kinds of archives, although they *do* realise it's wrong and they usually *do* warn about it. They still can't deal with it if it uses variable-sized ZIP64 features (other than trawling through the file looking for the 4-byte magic numbers).

Here's an example of code that breaks because of this: https://github.com/Yhg1s/zipfile-hacks. I tried to convince zipfile to create Zip64 files with extra fields (the variable-sized parts) but unfortunately the *cough* "design" of the zipfile module doesn't allow that -- feel free to ignore the force_zip64 parts of the script.

(I'm using two Python installations I had lying around here; I could've used 2.7.12 vs 2.7.13 instead, and the results would be the same.)

# Python 2.7.12 -- so old behaviour
% python create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py2-preamble-w.zip create_small_zip64.py
% python create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py2-preamble-a.zip create_small_zip64.py

# Python 3.6.0+ -- after this change, so new behaviour
% ~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py3-preamble-w.zip create_small_zip64.py
% ~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py3-preamble-a.zip create_small_zip64.py

The old zipfiles are fine:
% zip -T py2-preamble-w.zip
test of py2-preamble-w.zip OK
% zip -T py2-preamble-a.zip
test of py2-preamble-a.zip OK

The new one using 'w' is also fine (as expected):
% zip -T py3-preamble-w.zip
test of py3-preamble-w.zip OK

The new one using 'a' is broken:
% zip -T py3-preamble-a.zip
warning [py3-preamble-a.zip]:  17 extra bytes at beginning or within zipfile
  (attempting to process anyway)
test of py3-preamble-a.zip FAILED

zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)

The 'unzip' tool does work, but it also prints a warning:
% unzip -l py3-preamble-a.zip
Archive:  py3-preamble-a.zip
warning [py3-preamble-a.zip]:  17 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
     4016  2017-05-03 14:23   create_small_zip64.py
---------                     -------
     4016                     1 file

Whether other tools try to compensate for the error depends greatly on the tool; there are quite a few that don't.

For the record, we had two different bits of code that created zipfiles with preambles using mode='a', created by (at least) two different people. I don't think it's unreasonable to assume that if you have a file with existing data you don't want the ZipFile to overwrite, it should be using mode 'a' :P