zipfile.ZipFile() unable to open zip File with a short extra header #58523

pleed · 2012-03-15T11:15:46Z

BPO	14315
Nosy	@loewis, @terryjreedy, @gpshead, @mdickinson, @merwok, @meadori, @serhiy-storchaka
Files	bla.apk: APK File (Zip Format) that unzips but crashes when opened with zipfile fix_zipfile_extra.patch: FIX: Ignore extra padding bytes in zip file zipfile_extra_test.patch: Test for zipfile with padding.

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/gpshead'
closed_at = <Date 2014-05-30.06:59:38.064>
created_at = <Date 2012-03-15.11:15:45.559>
labels = ['type-bug', 'library']
title = 'zipfile.ZipFile() unable to open zip File with a short extra header'
updated_at = <Date 2018-08-06.06:43:30.242>
user = 'https://bugs.python.org/pleed'

bugs.python.org fields:

activity = <Date 2018-08-06.06:43:30.242>
actor = 'serhiy.storchaka'
assignee = 'gregory.p.smith'
closed = True
closed_date = <Date 2014-05-30.06:59:38.064>
closer = 'gregory.p.smith'
components = ['Library (Lib)']
creation = <Date 2012-03-15.11:15:45.559>
creator = 'pleed'
dependencies = []
files = ['24860', '24902', '25564']
hgrepos = []
issue_num = 14315
keywords = ['patch']
message_count = 31.0
messages = ['155874', '156089', '156110', '156135', '156147', '159584', '159602', '160538', '160539', '160541', '160550', '160553', '160590', '160591', '160645', '160655', '160663', '160664', '160673', '160684', '163743', '163749', '163813', '163831', '163835', '163855', '163857', '187177', '219377', '219378', '323179']
nosy_count = 12.0
nosy_names = ['loewis', 'terry.reedy', 'gregory.p.smith', 'mark.dickinson', 'alanmcintyre', 'eric.araujo', 'ubershmekel', 'meador.inge', 'python-dev', 'serhiy.storchaka', 'pleed', 'void.sender']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue14315'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

pleed · 2012-03-15T11:15:45Z

zipfile.ZipFile("bla.apk") crashes, but should not since other tools work perfectly with this file.

Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import zipfile
>>> zipfile.ZipFile("bla.apk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/zipfile.py", line 710, in __init__
    self._GetContents()
  File "/usr/lib/python2.7/zipfile.py", line 744, in _GetContents
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 803, in _RealGetContents
    x._decodeExtra()
  File "/usr/lib/python2.7/zipfile.py", line 369, in _decodeExtra
    tp, ln = unpack('<HH', extra[:4])
struct.error: unpack requires a string argument of length 4

terryjreedy · 2012-03-16T21:53:49Z

.apk is primarily used for Android Package files
https://en.wikipedia.org/wiki/APK_%28file_format%29
which are zipped JAR-based archives. By changing .apk to .zip, I can open the file in Win7 explorer, which has zip built in.

A 'crash' is a segfault or equvalent. A traceback is a graceful shutdown and not a crash. I reproduced the traceback on 3.3.0a1. It appears to say that there are not enough bytes.

import struct
struct.unpack('<HH', b'abc')
#produces the same error
struct.error: unpack requires a bytes object of length 4

The 3.x message is clearer in that the second arg is a bytestring, not a char string. It would not hurt it it said how many bytes it did get.

However, the behavior that leads to the message baffles me. The _decodeExtra code looks like this, with some prints that I added.

    def _decodeExtra(self):
        # Try to decode the extra field.
        extra = self.extra
        unpack = struct.unpack
        while extra:
            print(extra)
            tp, ln = unpack('<HH', extra[:4])
            print(tp,ln)
            if tp == 1: 
                pass  # substitute for actual code not called
            extra = extra[ln+4:]
            print(extra)

The output is
b'\xfe\xca\x00\x00'
51966 0
b''
b'\x00'
Traceback (most recent call last): ...

So it looks like extra was properly formatted and properly snipped to 0 bytes, but somehow gained a null byte before being tested again. 1 byte is not 4 bytes and hence the error, but how?

serhiy-storchaka · 2012-03-17T00:13:48Z

This is print from *other* _decodeExtra call. Add print('start') at start of _decodeExtra and you will see it.

start
b'\xfe\xca\x00\x00'
51966 0
b''
start
start
b'\x00'

terryjreedy · 2012-03-17T04:18:35Z

Thanks. There must be multiple extra fields allowed. This output suggests that a) zipfile mis-parses the extra field in some peculiar case (which I doubt) or b) the file either gives a bad field length or third extra field is mal-formed and that other tools catch the error and ignore the bad extra field whereas zipfile does not.

serhiy-storchaka · 2012-03-17T09:52:43Z

This is padding. Other implimentations just ignore it and Python must.

serhiy-storchaka · 2012-04-29T06:44:54Z

Anyone can look at the patch? This same problem reported in http://bugs.python.org/msg73317 .

ubershmekel · 2012-04-29T11:45:33Z

If we're modifying zipfile, please consider adding the reviewed patch for ZipFile.remove at http://bugs.python.org/issue6818

loewis · 2012-05-13T17:58:53Z

Serhiy: can you please provide a unit test? The OP's test case is unsuitable because of both size and copyright.

As for the actual issue: the extra data (type 0xcafe) is apparently added by jar tools to make the jar file executable:

http://www.androidadb.com/source/commons-compress-1.1-src/src/main/java/org/apache/commons/compress/archivers/zip/JarMarker.java.html

ISTM that the extra 0 byte is in clear violation of the ZIP specification, which says that there always be type+length headers, followed by length data. So if the length is 0, there ought not to be any data. I don't buy the "padding" argument, since a) the spec is silent on any padding, and b) for alignment reasons, it's not plausible to add any padding, since it is aligned fine without this padding, and unaligned with the padding.

I can sympathize with the desire to accept the zipfile, anyway (i.e. despite it being broken). At the same time, I also think that Python should not let errors pass silently.

So as a way out, I propose that the ZipFile class gains a "strict" attribute, indicating whether "acceptable" violations of the spec are ignored or reported as exceptions.

So -1 on the patch as it stands (but thanks for the analysis).

serhiy-storchaka · 2012-05-13T18:06:53Z

Serhiy: can you please provide a unit test? The OP's test case is unsuitable because of both size and copyright.

I provided the test several minutes ago.

serhiy-storchaka · 2012-05-13T18:27:06Z

I can sympathize with the desire to accept the zipfile, anyway (i.e. despite it being broken). At the same time, I also think that Python should not let errors pass silently.

I do not know other implementation of ZIP, which output an error or a
warning on such files. The fact is that such files exist in the wild
world.

So as a way out, I propose that the ZipFile class gains a "strict" attribute, indicating whether "acceptable" violations of the spec are ignored or reported as exceptions.

It is a not easy task (and unnecessary, I suppose). Now zipfile ignores
many errors (for example, it completely ignores local file headers).

loewis · 2012-05-13T19:03:38Z

> So as a way out, I propose that the ZipFile class gains a "strict"
> attribute, indicating whether "acceptable" violations of the spec
> are ignored or reported as exceptions.

It is a not easy task (and unnecessary, I suppose). Now zipfile
ignores many errors (for example, it completely ignores local file
headers).

Why do you consider this difficult? Just add a "strict" flag, make
it false by default, and raise an error if there is any unparsable
extra data.

I'm not asking that you implement a complete validity check for all
aspects of the zip spec - just that there is a mode where it continues
to raise an exception as it currently does (but with a better
exception).

serhiy-storchaka · 2012-05-13T19:18:00Z

Just add a "strict" flag, make
it false by default, and raise an error if there is any unparsable
extra data.

If the module does not actually checks the correctness of all
(including local file headers, which now it ignores), it will be a false
promise. And I don't understand who these checks are needed at all.

Note that the parameter "strict" in htmlparser is considered to be
obsolete.

meadori · 2012-05-14T01:20:40Z

This is definitely *not* a padding issue.
The problem is that one of the records has an extra field of length 1:

$ zipinfo -v bla.apk

...

Central directory entry #3:
---------------------------

There are an extra 16 bytes preceding this file.

res/raw/ice.png

offset of local header from start of archive: 1190 (000004A6h) bytes
file system or operating system of origin: MS-DOS, OS/2 or NT FAT
version of encoding software: 1.0
minimum file system compatibility required: MS-DOS, OS/2 or NT FAT
minimum software version required to extract: 1.0
compression method: none (stored)
file security status: not encrypted
extended local header: no
file last modified on (DOS date/time): 2012 Jan 13 15:30:08
32-bit CRC value (hex): f969abab
compressed size: 26250 bytes
uncompressed size: 26250 bytes
length of filename: 15 characters
length of extra field: 1 bytes
length of file comment: 0 characters
disk number on which file begins: disk 1
apparent file type: binary
non-MSDOS external file attributes: 000000 hex
MS-DOS file attributes (00 hex): none

The central-directory extra field contains:

There is no file comment.

Our implementation can't deal with that because of the line that Terry
pointed out:

        tp, ln = unpack('<HH', extra[:4])

As Martin pointed out, the standard says that things must be in
multiples of 4-bytes. So the record is non-portable.

meadori · 2012-05-14T01:52:22Z

FWIW, I think it will OK to just ignore extra fields that we can't
interpret as according to the standard. The only one we currently
care about is the "Zip64 extended information extra field". Also,
other programs (including the Info-Zip tools) seem to mostly ignore
these fields.

serhiy-storchaka · 2012-05-14T17:31:14Z

This is definitely *not* a padding issue.

This is definitely a padding issue. All uncompressed files are located
so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
+30+17+3=27536, etc). This is, probably, allows the use of mmap for the
resources.

As Martin pointed out, the standard says that things must be in
multiples of 4-bytes.

More precisely, the extra field must have at least 4-bytes length to fit
a header. The standard is insufficiently defined in terms of what would
happen if the rest of the field is less than 4 bytes (this is hidden
behind by ellipsis).

So the record is non-portable.

De jure the record is non-portable. De facto the record is portable
(many other tools supports it). But even if it does not portable, we are
dealing with the expansion of the zip format, which is very easy support
for reading.

meadori · 2012-05-14T18:50:41Z

On Mon, May 14, 2012 at 12:31 PM, Serhiy Storchaka
<report@bugs.python.org> wrote:

Serhiy Storchaka <storchaka@gmail.com> added the comment:

> This is definitely *not* a padding issue.

This is definitely a padding issue. All uncompressed files are located
so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
+30+17+3=27536, etc). This is, probably, allows the use of mmap for the
resources.

So? Someone may be using the extra fields to pad things, but for the purpose
of this issue that is completely irrelevant. We only care about the
proper structure
of the file. Besides, without clear reference to source code or a
specification any
hypothesis of padding is hearsay.

Did you look at the decoding I sent? The extra length field length is clearly
reported as a size of one and the contents of the extra field are set to '\x00'.
The extra field of size one is the actual problem, not padding.

> As Martin pointed out, the standard says that things must be in
> multiples of 4-bytes.

More precisely, the extra field must have at least 4-bytes length to fit
a header. The standard is insufficiently defined in terms of what would
happen if the rest of the field is less than 4 bytes (this is hidden
behind by ellipsis).

How is it insufficiently defined at all? It says [1]:

      In order to allow different programs and different types
      of information to be stored in the 'extra' field in .ZIP
      files, the following structure should be used for all
      programs storing data in this field:

      header1+data1 + header2+data2 . . .

      Each header should consist of:

        Header ID - 2 bytes
        Data Size - 2 bytes

      Note: all fields stored in Intel low-byte/high-byte order.

The ellipsis is just a standard convention for indicating a repeating
pattern. Extra fields which are not multiples of four bytes are not
properly formed.

> So the record is non-portable.

De jure the record is non-portable. De facto the record is portable
(many other tools supports it). But even if it does not portable, we are
dealing with the expansion of the zip format, which is very easy support
for reading.

Like I said before, I am all for dropping extra fields we can not
interpret. However,
let us be clear that with respect to the standard we are implementing
that zip files
constructed like this are ill-formed.

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT

loewis · 2012-05-14T20:36:36Z

This is definitely a padding issue. All uncompressed files are located
so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
+30+17+3=27536, etc). This is, probably, allows the use of mmap for the
resources.

That can't possibly be the reason. mmap requires 4k (4096) alignment (on
x86; more than that on SPARC).

Perhaps there is a reason for 4-alignment - but it can't be mmap-

loewis · 2012-05-14T20:38:43Z

Like I said before, I am all for dropping extra fields we can not
interpret. However,
let us be clear that with respect to the standard we are implementing
that zip files > constructed like this are ill-formed.

I would still like to see the zipfile module to be able to detect
ill-formed zipfiles - certainly not by default. This can be useful
to check zipfiles that somebody produces. As the Python check wouldn't
be complete, one would certainly need to apply other checks as well.

serhiy-storchaka · 2012-05-14T22:22:15Z

That can't possibly be the reason. mmap requires 4k (4096) alignment (on
x86; more than that on SPARC).

This may be the reason to mmap the entire file and then read aligned
binary data.

meadori · 2012-05-15T05:24:05Z

I would still like to see the zipfile module to be able to detect
ill-formed zipfiles - certainly not by default.

Agreed. We already do produce some informational messages via the
documented 'ZipFile.debug' attribute. Just checking that flag and
producing a message to stdout like we do in other cases is one option.
Another, as you suggested already, is to provide a strict flag
that indicates an exception should be thrown upon encountering an
ill-formed file.

I like the idea of reusing what we already have. Although, when it
fails, it doesn't fail loudly.

serhiy-storchaka · 2012-06-24T07:26:27Z

Any chance to commit the patch before final feature freeze?

loewis · 2012-06-24T07:41:51Z

Strictly speaking, the feature freeze has passed already, so you'll need the time machine.

However, this is not the *final* feature freeze - there will be a 3.4 release of Python also.

terryjreedy · 2012-06-24T17:48:35Z

This issue is currently marked as a bugfix for current releases also. Since the proposal seems to be to introduce a new flag, should it be re-classified as enhancement (for 3.4)?

serhiy-storchaka · 2012-06-24T19:53:15Z

I think this is a bugfix. I also don't think that "strict" attribute
makes sense if we can't guarantee strong validity (see bpo-15114).

loewis · 2012-06-24T20:01:06Z

I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

For reasons discussed, I also disagree with the proposed patch (file24902). If this helps to advance this issue, I can formally reject it if desired.

terryjreedy · 2012-06-24T20:42:08Z

Given that the overall trend in the stdlib seems to be to be more lenient in parsing, as with htmlparser, and that the not-really-strict 'strict' parameter for that is being removed, (bpo-15114) I do not understand the objection to accepting one more malformation in zip files and the desire to add a not-really-strict 'strict' parameter here. Perhaps this philosophical issue should be discussed on pydev.

serhiy-storchaka · 2012-06-24T20:48:24Z

I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

I agree. The module is behaving correctly.

voidsender · 2013-04-17T16:14:53Z

FWIW, I think it will OK to just ignore extra fields that we can't
interpret as according to the standard. The only one we currently
care about is the "Zip64 extended information extra field". Also,
other programs (including the Info-Zip tools) seem to mostly ignore
these fields.

Yes, please.

The ellipsis is just a standard convention for indicating a repeating
pattern. Extra fields which are not multiples of four bytes are not
properly formed.

. . .

[;pause;]

Totally agree. But at the very least it should check the size and raise a proper exception (e.g. BadZipFile).

FWIW, the code is already avoiding proper bounds checking using the built-in behavior of slicing --

tp, ln = unpack('<HH', extra[:4])
...
extra = extra[ln+4:]

-- which would otherwise throw exceptions if slice didn't innately and silently fail (due to an index being out-of-bounds).

I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

You realize that you are calling user-controlled data buggy, right?

python-dev · 2014-05-30T06:43:01Z

New changeset 6dd5e9556a60 by Gregory P. Smith in branch '2.7':
Fix issue bpo-14315: The zipfile module now ignores extra fields in the central
http://hg.python.org/cpython/rev/6dd5e9556a60

New changeset 33843896ce4e by Gregory P. Smith in branch '3.4':
Fix issue bpo-14315: The zipfile module now ignores extra fields in the central
http://hg.python.org/cpython/rev/33843896ce4e

New changeset 89be1419472c by Gregory P. Smith in branch 'default':
Fix issue bpo-14315: The zipfile module now ignores extra fields in the central
http://hg.python.org/cpython/rev/89be1419472c

gpshead · 2014-05-30T06:59:36Z

This was never an enhancement. zipfile was failing to properly deal with real world data that other zip file tools on the planet were perfectly happy to deal with. That's a bug. Fixed.

Practicality beats purity.
Be lenient in what you accept.

The zipfile module is not meant to be a zip "standard" validation tool.

The other discussions in this bug about adding actual features such as a "strict" mode flag could be done but should really go in feature requests of their own. The zipfile module is... not a wonderful body of code (understatement). Example: It is still quite possible to get non zipfile.BadZipFile exceptions such as struct.error based on various arrangements of input data.

serhiy-storchaka · 2018-08-06T06:43:30Z

Just for information: the padding in APK files is added by the zipalign utility [1].

[1] https://developer.android.com/studio/command-line/zipalign

pleed mannequin added the type-crash A hard crash of the interpreter, possibly with a core dump label Mar 15, 2012

terryjreedy added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Mar 16, 2012

loewis mannequin removed the type-bug An unexpected behavior, bug, or error label Jun 24, 2012

loewis mannequin added the type-feature A feature request or enhancement label Jun 24, 2012

gpshead closed this as completed May 30, 2014

gpshead changed the title ~~zipfile.ZipFile() unable to open zip File~~ zipfile.ZipFile() unable to open zip File with a short extra header May 30, 2014

gpshead self-assigned this May 30, 2014

gpshead added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels May 30, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zipfile.ZipFile() unable to open zip File with a short extra header #58523

zipfile.ZipFile() unable to open zip File with a short extra header #58523

pleed mannequin commented Mar 15, 2012

pleed mannequin commented Mar 15, 2012

terryjreedy commented Mar 16, 2012

serhiy-storchaka commented Mar 17, 2012

terryjreedy commented Mar 17, 2012

serhiy-storchaka commented Mar 17, 2012

serhiy-storchaka commented Apr 29, 2012

ubershmekel mannequin commented Apr 29, 2012

loewis mannequin commented May 13, 2012

serhiy-storchaka commented May 13, 2012

serhiy-storchaka commented May 13, 2012

loewis mannequin commented May 13, 2012

serhiy-storchaka commented May 13, 2012

meadori commented May 14, 2012

meadori commented May 14, 2012

serhiy-storchaka commented May 14, 2012

meadori commented May 14, 2012

loewis mannequin commented May 14, 2012

loewis mannequin commented May 14, 2012

serhiy-storchaka commented May 14, 2012

meadori commented May 15, 2012

serhiy-storchaka commented Jun 24, 2012

loewis mannequin commented Jun 24, 2012

terryjreedy commented Jun 24, 2012

serhiy-storchaka commented Jun 24, 2012

loewis mannequin commented Jun 24, 2012

terryjreedy commented Jun 24, 2012

serhiy-storchaka commented Jun 24, 2012

voidsender mannequin commented Apr 17, 2013

python-dev mannequin commented May 30, 2014

gpshead commented May 30, 2014

serhiy-storchaka commented Aug 6, 2018

zipfile.ZipFile() unable to open zip File with a short extra header #58523

zipfile.ZipFile() unable to open zip File with a short extra header #58523

Comments

pleed mannequin commented Mar 15, 2012

pleed mannequin commented Mar 15, 2012

terryjreedy commented Mar 16, 2012

serhiy-storchaka commented Mar 17, 2012

terryjreedy commented Mar 17, 2012

serhiy-storchaka commented Mar 17, 2012

serhiy-storchaka commented Apr 29, 2012

ubershmekel mannequin commented Apr 29, 2012

loewis mannequin commented May 13, 2012

serhiy-storchaka commented May 13, 2012

serhiy-storchaka commented May 13, 2012

loewis mannequin commented May 13, 2012

serhiy-storchaka commented May 13, 2012

meadori commented May 14, 2012

meadori commented May 14, 2012

serhiy-storchaka commented May 14, 2012

meadori commented May 14, 2012

loewis mannequin commented May 14, 2012

loewis mannequin commented May 14, 2012

serhiy-storchaka commented May 14, 2012

meadori commented May 15, 2012

serhiy-storchaka commented Jun 24, 2012

loewis mannequin commented Jun 24, 2012

terryjreedy commented Jun 24, 2012

serhiy-storchaka commented Jun 24, 2012

loewis mannequin commented Jun 24, 2012

terryjreedy commented Jun 24, 2012

serhiy-storchaka commented Jun 24, 2012

voidsender mannequin commented Apr 17, 2013

python-dev mannequin commented May 30, 2014

gpshead commented May 30, 2014

serhiy-storchaka commented Aug 6, 2018