classification
Title: zipfile.ZipFile() unable to open zip File with a short extra header
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.5, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: alanmcintyre, eric.araujo, gregory.p.smith, loewis, mark.dickinson, meador.inge, pleed, python-dev, serhiy.storchaka, terry.reedy, ubershmekel, void.sender
Priority: normal Keywords: patch

Created on 2012-03-15 11:15 by pleed, last changed 2014-05-30 06:59 by gregory.p.smith. This issue is now closed.

Files
File name Uploaded Description Edit
bla.apk pleed, 2012-03-15 11:15 APK File (Zip Format) that unzips but crashes when opened with zipfile
fix_zipfile_extra.patch serhiy.storchaka, 2012-03-17 09:52 FIX: Ignore extra padding bytes in zip file review
zipfile_extra_test.patch serhiy.storchaka, 2012-05-13 17:55 Test for zipfile with padding. review
Messages (30)
msg155874 - (view) Author: pleed (pleed) Date: 2012-03-15 11:15
zipfile.ZipFile("bla.apk") crashes, but should not since other tools work perfectly with this file.

Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import zipfile
>>> zipfile.ZipFile("bla.apk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/zipfile.py", line 710, in __init__
    self._GetContents()
  File "/usr/lib/python2.7/zipfile.py", line 744, in _GetContents
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 803, in _RealGetContents
    x._decodeExtra()
  File "/usr/lib/python2.7/zipfile.py", line 369, in _decodeExtra
    tp, ln = unpack('<HH', extra[:4])
struct.error: unpack requires a string argument of length 4
msg156089 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-03-16 21:53
.apk is primarily used for Android Package files
https://en.wikipedia.org/wiki/APK_%28file_format%29
which are zipped JAR-based archives. By changing .apk to .zip, I can open the file in Win7 explorer, which has zip built in.

A 'crash' is a segfault or equvalent. A traceback is a graceful shutdown and not a crash. I reproduced the traceback on 3.3.0a1. It appears to say that there are not enough bytes.

import struct
struct.unpack('<HH', b'abc')
#produces the same error
struct.error: unpack requires a bytes object of length 4

The 3.x message is clearer in that the second arg is a bytestring, not a char string. It would not hurt it it said how many bytes it did get.

However, the behavior that leads to the message baffles me. The _decodeExtra code looks like this, with some prints that I added.

    def _decodeExtra(self):
        # Try to decode the extra field.
        extra = self.extra
        unpack = struct.unpack
        while extra:
            print(extra)
            tp, ln = unpack('<HH', extra[:4])
            print(tp,ln)
            if tp == 1: 
                pass  # substitute for actual code not called
            extra = extra[ln+4:]
            print(extra)

The output is
b'\xfe\xca\x00\x00'
51966 0
b''
b'\x00'
Traceback (most recent call last): ...

So it looks like extra was properly formatted and properly snipped to 0 bytes, but somehow gained a null byte before being tested again. 1 byte is not 4 bytes and hence the error, but how?
msg156110 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-17 00:13
This is print from *other* _decodeExtra call. Add print('start') at start of _decodeExtra and you will see it.

start
b'\xfe\xca\x00\x00'
51966 0
b''
start
start
b'\x00'
msg156135 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-03-17 04:18
Thanks. There must be multiple extra fields allowed. This output suggests that a) zipfile mis-parses the extra field in some peculiar case (which I doubt) or b) the file either gives a bad field length or third extra field is mal-formed and that other tools catch the error and ignore the bad extra field whereas zipfile does not.
msg156147 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-17 09:52
This is padding. Other implimentations just ignore it and Python must.
msg159584 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-29 06:44
Anyone can look at the patch? This same problem reported in http://bugs.python.org/msg73317 .
msg159602 - (view) Author: Yuval Greenfield (ubershmekel) * Date: 2012-04-29 11:45
If we're modifying zipfile, please consider adding the reviewed patch for ZipFile.remove at http://bugs.python.org/issue6818
msg160538 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 17:58
Serhiy: can you please provide a unit test? The OP's test case is unsuitable because of both size and copyright.

As for the actual issue: the extra data (type 0xcafe) is apparently added by jar tools to make the jar file executable:

http://www.androidadb.com/source/commons-compress-1.1-src/src/main/java/org/apache/commons/compress/archivers/zip/JarMarker.java.html

ISTM that the extra 0 byte is in clear violation of the ZIP specification, which says that there always be type+length headers, followed by length data. So if the length is 0, there ought not to be any data. I don't buy the "padding" argument, since a) the spec is silent on any padding, and b) for alignment reasons, it's not plausible to add any padding, since it is aligned fine without this padding, and unaligned with the padding.

I can sympathize with the desire to accept the zipfile, anyway (i.e. despite it being broken). At the same time, I also think that Python should not let errors pass silently.

So as a way out, I propose that the ZipFile class gains a "strict" attribute, indicating whether "acceptable" violations of the spec are ignored or reported as exceptions.

So -1 on the patch as it stands (but thanks for the analysis).
msg160539 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 18:06
> Serhiy: can you please provide a unit test? The OP's test case is unsuitable because of both size and copyright.

I provided the test several minutes ago.
msg160541 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 18:27
> I can sympathize with the desire to accept the zipfile, anyway (i.e. despite it being broken). At the same time, I also think that Python should not let errors pass silently.

I do not know other implementation of ZIP, which output an error or a
warning on such files. The fact is that such files exist in the wild
world.

> So as a way out, I propose that the ZipFile class gains a "strict" attribute, indicating whether "acceptable" violations of the spec are ignored or reported as exceptions.

It is a not easy task (and unnecessary, I suppose). Now zipfile ignores
many errors (for example, it completely ignores local file headers).
msg160550 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 19:03
>> So as a way out, I propose that the ZipFile class gains a "strict"
>> attribute, indicating whether "acceptable" violations of the spec
>> are ignored or reported as exceptions.
>
> It is a not easy task (and unnecessary, I suppose). Now zipfile
> ignores many errors (for example, it completely ignores local file
> headers).

Why do you consider this difficult? Just add a "strict" flag, make
it false by default, and raise an error if there is any unparsable
extra data.

I'm not asking that you implement a complete validity check for all
aspects of the zip spec - just that there is a mode where it continues
to raise an exception as it currently does (but with a better
exception).
msg160553 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 19:18
> Just add a "strict" flag, make
> it false by default, and raise an error if there is any unparsable
> extra data.

If the module does not actually checks the correctness of all
(including local file headers, which now it ignores), it will be a false
promise. And I don't understand who these checks are needed at all.

Note that the parameter "strict" in htmlparser is considered to be
obsolete.
msg160590 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2012-05-14 01:20
This is definitely *not* a padding issue.
The problem is that one of the records has an extra field of length 1:

$ zipinfo -v bla.apk

...

Central directory entry #3:
---------------------------

  There are an extra 16 bytes preceding this file.

  res/raw/ice.png

  offset of local header from start of archive:     1190 (000004A6h) bytes
  file system or operating system of origin:        MS-DOS, OS/2 or NT FAT
  version of encoding software:                     1.0
  minimum file system compatibility required:       MS-DOS, OS/2 or NT FAT
  minimum software version required to extract:     1.0
  compression method:                               none (stored)
  file security status:                             not encrypted
  extended local header:                            no
  file last modified on (DOS date/time):            2012 Jan 13 15:30:08
  32-bit CRC value (hex):                           f969abab
  compressed size:                                  26250 bytes
  uncompressed size:                                26250 bytes
  length of filename:                               15 characters
  length of extra field:                            1 bytes
  length of file comment:                           0 characters
  disk number on which file begins:                 disk 1
  apparent file type:                               binary
  non-MSDOS external file attributes:               000000 hex
  MS-DOS file attributes (00 hex):                  none

  The central-directory extra field contains:

  There is no file comment.


Our implementation can't deal with that because of the line that Terry 
pointed out:

            tp, ln = unpack('<HH', extra[:4])

As Martin pointed out, the standard says that things must be in 
multiples of 4-bytes.  So the record is non-portable.
msg160591 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2012-05-14 01:52
FWIW, I think it will OK to just ignore extra fields that we can't 
interpret as according to the standard.  The only one we currently
care about is the "Zip64 extended information extra field".  Also,
other programs (including the Info-Zip tools) seem to mostly ignore
these fields.
msg160645 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-14 17:31
> This is definitely *not* a padding issue.

This is definitely a padding issue. All uncompressed files are located
so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
+30+17+3=27536, etc). This is, probably, allows the use of mmap for the
resources.

> As Martin pointed out, the standard says that things must be in 
> multiples of 4-bytes.

More precisely, the extra field must have at least 4-bytes length to fit
a header. The standard is insufficiently defined in terms of what would
happen if the rest of the field is less than 4 bytes (this is hidden
behind by ellipsis).

>   So the record is non-portable.

De jure the record is non-portable. De facto the record is portable
(many other tools supports it). But even if it does not portable, we are
dealing with the expansion of the zip format, which is very easy support
for reading.
msg160655 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2012-05-14 18:50
On Mon, May 14, 2012 at 12:31 PM, Serhiy Storchaka
<report@bugs.python.org> wrote:

> Serhiy Storchaka <storchaka@gmail.com> added the comment:
>
>> This is definitely *not* a padding issue.
>
> This is definitely a padding issue. All uncompressed files are located
> so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
> +30+17+3=27536, etc). This is, probably, allows the use of mmap for the
> resources.

So?  Someone may be using the extra fields to pad things, but for the purpose
of this issue that is completely irrelevant.  We only care about the
proper structure
of the file.  Besides, without clear reference to source code or a
specification any
hypothesis of padding is hearsay.

Did you look at the decoding I sent?  The extra length field length is clearly
reported as a size of one and the contents of the extra field are set to '\x00'.
The extra field of size one is the actual problem, not padding.

>> As Martin pointed out, the standard says that things must be in
>> multiples of 4-bytes.
>
> More precisely, the extra field must have at least 4-bytes length to fit
> a header. The standard is insufficiently defined in terms of what would
> happen if the rest of the field is less than 4 bytes (this is hidden
> behind by ellipsis).

How is it insufficiently defined at all?  It says [1]:

          In order to allow different programs and different types
          of information to be stored in the 'extra' field in .ZIP
          files, the following structure should be used for all
          programs storing data in this field:

          header1+data1 + header2+data2 . . .

          Each header should consist of:

            Header ID - 2 bytes
            Data Size - 2 bytes

          Note: all fields stored in Intel low-byte/high-byte order.

The ellipsis is just a standard convention for indicating a repeating
pattern.  Extra fields which are not multiples of four bytes are not
properly formed.

>>   So the record is non-portable.
>
> De jure the record is non-portable. De facto the record is portable
> (many other tools supports it). But even if it does not portable, we are
> dealing with the expansion of the zip format, which is very easy support
> for reading.

Like I said before, I am all for dropping extra fields we can not
interpret.  However,
let us be clear that with respect to the standard we are implementing
that zip files
constructed like this are ill-formed.

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT
msg160663 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-14 20:36
> This is definitely a padding issue. All uncompressed files are located
> so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
> +30+17+3=27536, etc). This is, probably, allows the use of mmap for the
> resources.

That can't possibly be the reason. mmap requires 4k (4096) alignment (on 
x86; more than that on SPARC).

Perhaps there is a reason for 4-alignment - but it can't be mmap-
msg160664 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-14 20:38
> Like I said before, I am all for dropping extra fields we can not
> interpret.  However,
> let us be clear that with respect to the standard we are implementing
> that zip files > constructed like this are ill-formed.

I would still like to see the zipfile module to be able to detect 
ill-formed zipfiles - certainly not by default. This can be useful
to check zipfiles that somebody produces. As the Python check wouldn't
be complete, one would certainly need to apply other checks as well.
msg160673 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-14 22:22
> That can't possibly be the reason. mmap requires 4k (4096) alignment (on 
> x86; more than that on SPARC).

This may be the reason to mmap the entire file and then read aligned
binary data.
msg160684 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2012-05-15 05:24
> I would still like to see the zipfile module to be able to detect 
> ill-formed zipfiles - certainly not by default.

Agreed.  We already do produce some informational messages via the 
documented 'ZipFile.debug' attribute.  Just checking that flag and
producing a message to stdout like we do in other cases is one option.
Another, as you suggested already, is to provide a strict flag
that indicates an exception should be thrown upon encountering an 
ill-formed file.

I like the idea of reusing what we already have.  Although, when it 
fails, it doesn't fail loudly.
msg163743 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-24 07:26
Any chance to commit the patch before final feature freeze?
msg163749 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-24 07:41
Strictly speaking, the feature freeze has passed already, so you'll need the time machine.

However, this is not the *final* feature freeze - there will be a 3.4 release of Python also.
msg163813 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-06-24 17:48
This issue is currently marked as a bugfix for current releases also. Since the proposal seems to be to introduce a new flag, should it be re-classified as enhancement (for 3.4)?
msg163831 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-24 19:53
I think this is a bugfix. I also don't think that "strict" attribute
makes sense if we can't guarantee strong validity (see issue15114).
msg163835 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-24 20:01
I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

For reasons discussed, I also disagree with the proposed patch (file24902). If this helps to advance this issue, I can formally reject it if desired.
msg163855 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-06-24 20:42
Given that the overall trend in the stdlib seems to be to be more lenient in parsing, as with htmlparser, and that the not-really-strict 'strict' parameter for that is being removed, (#15114) I do not understand the objection to accepting one more malformation in zip files and the desire to add a not-really-strict 'strict' parameter here. Perhaps this philosophical issue should be discussed on pydev.
msg163857 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-24 20:48
> I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

I agree. The module is behaving correctly.
msg187177 - (view) Author: Void (void.sender) Date: 2013-04-17 16:14
> FWIW, I think it will OK to just ignore extra fields that we can't 
> interpret as according to the standard.  The only one we currently
> care about is the "Zip64 extended information extra field".  Also,
> other programs (including the Info-Zip tools) seem to mostly ignore
> these fields.

Yes, please.

> The ellipsis is just a standard convention for indicating a repeating
> pattern.  Extra fields which are not multiples of four bytes are not
> properly formed.

. . .

[;pause;]

Totally agree.  But at the very least it should check the size and raise a proper exception (e.g. BadZipFile).

FWIW, the code is already avoiding proper bounds checking using the built-in behavior of slicing -- 

    tp, ln = unpack('<HH', extra[:4])
    ...
    extra = extra[ln+4:]

-- which would otherwise throw exceptions if slice didn't innately and silently fail (due to an index being out-of-bounds).


> I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

You realize that you are calling user-controlled data buggy, right?
msg219377 - (view) Author: Roundup Robot (python-dev) Date: 2014-05-30 06:43
New changeset 6dd5e9556a60 by Gregory P. Smith in branch '2.7':
Fix issue #14315: The zipfile module now ignores extra fields in the central
http://hg.python.org/cpython/rev/6dd5e9556a60

New changeset 33843896ce4e by Gregory P. Smith in branch '3.4':
Fix issue #14315: The zipfile module now ignores extra fields in the central
http://hg.python.org/cpython/rev/33843896ce4e

New changeset 89be1419472c by Gregory P. Smith in branch 'default':
Fix issue #14315: The zipfile module now ignores extra fields in the central
http://hg.python.org/cpython/rev/89be1419472c
msg219378 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2014-05-30 06:59
This was never an enhancement.  zipfile was failing to properly deal with real world data that other zip file tools on the planet were perfectly happy to deal with.  That's a bug.  Fixed.

 Practicality beats purity.
 Be lenient in what you accept.

The zipfile module is not meant to be a zip "standard" validation tool.

The other discussions in this bug about adding actual features such as a "strict" mode flag could be done but should really go in feature requests of their own.  The zipfile module is... not a wonderful body of code (understatement).  Example: It is still quite possible to get non zipfile.BadZipFile exceptions such as struct.error based on various arrangements of input data.
History
Date User Action Args
2014-05-30 06:59:38gregory.p.smithsetstatus: open -> closed

resolution: fixed

assignee: gregory.p.smith
stage: test needed -> resolved
title: zipfile.ZipFile() unable to open zip File -> zipfile.ZipFile() unable to open zip File with a short extra header
nosy: + gregory.p.smith
versions: + Python 2.7, Python 3.4, Python 3.5, - Python 3.3
messages: + msg219378
type: enhancement -> behavior
2014-05-30 06:43:02python-devsetnosy: + python-dev
messages: + msg219377
2013-04-17 16:14:53void.sendersetnosy: + void.sender
messages: + msg187177
2012-06-24 20:48:24serhiy.storchakasetmessages: + msg163857
2012-06-24 20:42:07terry.reedysetmessages: + msg163855
versions: - Python 2.7, Python 3.2
2012-06-24 20:01:06loewissettype: behavior -> enhancement
messages: + msg163835
2012-06-24 19:53:15serhiy.storchakasetmessages: + msg163831
2012-06-24 17:48:35terry.reedysetmessages: + msg163813
2012-06-24 07:41:51loewissetmessages: + msg163749
2012-06-24 07:26:27serhiy.storchakasetmessages: + msg163743
2012-05-15 05:24:05meador.ingesetmessages: + msg160684
2012-05-14 22:22:14serhiy.storchakasetmessages: + msg160673
2012-05-14 20:38:42loewissetmessages: + msg160664
2012-05-14 20:36:35loewissetmessages: + msg160663
2012-05-14 18:50:40meador.ingesetmessages: + msg160655
2012-05-14 17:31:14serhiy.storchakasetmessages: + msg160645
2012-05-14 01:52:21meador.ingesetmessages: + msg160591
2012-05-14 01:20:40meador.ingesetmessages: + msg160590
2012-05-13 19:18:00serhiy.storchakasetmessages: + msg160553
2012-05-13 19:03:38loewissetmessages: + msg160550
2012-05-13 18:27:06serhiy.storchakasetmessages: + msg160541
2012-05-13 18:06:52serhiy.storchakasetmessages: + msg160539
2012-05-13 17:58:53loewissetnosy: + loewis
messages: + msg160538
2012-05-13 17:55:27serhiy.storchakasetfiles: + zipfile_extra_test.patch
2012-05-13 15:30:04eric.araujosetstage: test needed
2012-04-29 11:45:32ubershmekelsetnosy: + ubershmekel
messages: + msg159602
2012-04-29 06:44:54serhiy.storchakasetmessages: + msg159584
2012-03-17 09:52:43serhiy.storchakasetfiles: + fix_zipfile_extra.patch
keywords: + patch
messages: + msg156147
2012-03-17 04:18:35terry.reedysetmessages: + msg156135
2012-03-17 00:13:47serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg156110
2012-03-16 23:35:27eric.araujosetnosy: + eric.araujo
2012-03-16 21:53:49terry.reedysetversions: + Python 3.2, Python 3.3
nosy: + mark.dickinson, alanmcintyre, terry.reedy, meador.inge

messages: + msg156089

components: + Library (Lib), - None
type: crash -> behavior
2012-03-15 11:15:45pleedcreate