Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipfile.ZipFile() unable to open zip File with a short extra header #58523

Closed
pleed mannequin opened this issue Mar 15, 2012 · 31 comments
Closed

zipfile.ZipFile() unable to open zip File with a short extra header #58523

pleed mannequin opened this issue Mar 15, 2012 · 31 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@pleed
Copy link
Mannequin

pleed mannequin commented Mar 15, 2012

BPO 14315
Nosy @loewis, @terryjreedy, @gpshead, @mdickinson, @merwok, @meadori, @serhiy-storchaka
Files
  • bla.apk: APK File (Zip Format) that unzips but crashes when opened with zipfile
  • fix_zipfile_extra.patch: FIX: Ignore extra padding bytes in zip file
  • zipfile_extra_test.patch: Test for zipfile with padding.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2014-05-30.06:59:38.064>
    created_at = <Date 2012-03-15.11:15:45.559>
    labels = ['type-bug', 'library']
    title = 'zipfile.ZipFile() unable to open zip File with a short extra header'
    updated_at = <Date 2018-08-06.06:43:30.242>
    user = 'https://bugs.python.org/pleed'

    bugs.python.org fields:

    activity = <Date 2018-08-06.06:43:30.242>
    actor = 'serhiy.storchaka'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2014-05-30.06:59:38.064>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2012-03-15.11:15:45.559>
    creator = 'pleed'
    dependencies = []
    files = ['24860', '24902', '25564']
    hgrepos = []
    issue_num = 14315
    keywords = ['patch']
    message_count = 31.0
    messages = ['155874', '156089', '156110', '156135', '156147', '159584', '159602', '160538', '160539', '160541', '160550', '160553', '160590', '160591', '160645', '160655', '160663', '160664', '160673', '160684', '163743', '163749', '163813', '163831', '163835', '163855', '163857', '187177', '219377', '219378', '323179']
    nosy_count = 12.0
    nosy_names = ['loewis', 'terry.reedy', 'gregory.p.smith', 'mark.dickinson', 'alanmcintyre', 'eric.araujo', 'ubershmekel', 'meador.inge', 'python-dev', 'serhiy.storchaka', 'pleed', 'void.sender']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14315'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @pleed
    Copy link
    Mannequin Author

    pleed mannequin commented Mar 15, 2012

    zipfile.ZipFile("bla.apk") crashes, but should not since other tools work perfectly with this file.

    Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
    [GCC 4.6.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import zipfile
    >>> zipfile.ZipFile("bla.apk")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/zipfile.py", line 710, in __init__
        self._GetContents()
      File "/usr/lib/python2.7/zipfile.py", line 744, in _GetContents
        self._RealGetContents()
      File "/usr/lib/python2.7/zipfile.py", line 803, in _RealGetContents
        x._decodeExtra()
      File "/usr/lib/python2.7/zipfile.py", line 369, in _decodeExtra
        tp, ln = unpack('<HH', extra[:4])
    struct.error: unpack requires a string argument of length 4

    @pleed pleed mannequin added the type-crash A hard crash of the interpreter, possibly with a core dump label Mar 15, 2012
    @terryjreedy
    Copy link
    Member

    .apk is primarily used for Android Package files
    https://en.wikipedia.org/wiki/APK_%28file_format%29
    which are zipped JAR-based archives. By changing .apk to .zip, I can open the file in Win7 explorer, which has zip built in.

    A 'crash' is a segfault or equvalent. A traceback is a graceful shutdown and not a crash. I reproduced the traceback on 3.3.0a1. It appears to say that there are not enough bytes.

    import struct
    struct.unpack('<HH', b'abc')
    #produces the same error
    struct.error: unpack requires a bytes object of length 4

    The 3.x message is clearer in that the second arg is a bytestring, not a char string. It would not hurt it it said how many bytes it did get.

    However, the behavior that leads to the message baffles me. The _decodeExtra code looks like this, with some prints that I added.

        def _decodeExtra(self):
            # Try to decode the extra field.
            extra = self.extra
            unpack = struct.unpack
            while extra:
                print(extra)
                tp, ln = unpack('<HH', extra[:4])
                print(tp,ln)
                if tp == 1: 
                    pass  # substitute for actual code not called
                extra = extra[ln+4:]
                print(extra)
    The output is
    b'\xfe\xca\x00\x00'
    51966 0
    b''
    b'\x00'
    Traceback (most recent call last): ...

    So it looks like extra was properly formatted and properly snipped to 0 bytes, but somehow gained a null byte before being tested again. 1 byte is not 4 bytes and hence the error, but how?

    @terryjreedy terryjreedy added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Mar 16, 2012
    @serhiy-storchaka
    Copy link
    Member

    This is print from *other* _decodeExtra call. Add print('start') at start of _decodeExtra and you will see it.

    start
    b'\xfe\xca\x00\x00'
    51966 0
    b''
    start
    start
    b'\x00'

    @terryjreedy
    Copy link
    Member

    Thanks. There must be multiple extra fields allowed. This output suggests that a) zipfile mis-parses the extra field in some peculiar case (which I doubt) or b) the file either gives a bad field length or third extra field is mal-formed and that other tools catch the error and ignore the bad extra field whereas zipfile does not.

    @serhiy-storchaka
    Copy link
    Member

    This is padding. Other implimentations just ignore it and Python must.

    @serhiy-storchaka
    Copy link
    Member

    Anyone can look at the patch? This same problem reported in http://bugs.python.org/msg73317 .

    @ubershmekel
    Copy link
    Mannequin

    ubershmekel mannequin commented Apr 29, 2012

    If we're modifying zipfile, please consider adding the reviewed patch for ZipFile.remove at http://bugs.python.org/issue6818

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 13, 2012

    Serhiy: can you please provide a unit test? The OP's test case is unsuitable because of both size and copyright.

    As for the actual issue: the extra data (type 0xcafe) is apparently added by jar tools to make the jar file executable:

    http://www.androidadb.com/source/commons-compress-1.1-src/src/main/java/org/apache/commons/compress/archivers/zip/JarMarker.java.html

    ISTM that the extra 0 byte is in clear violation of the ZIP specification, which says that there always be type+length headers, followed by length data. So if the length is 0, there ought not to be any data. I don't buy the "padding" argument, since a) the spec is silent on any padding, and b) for alignment reasons, it's not plausible to add any padding, since it is aligned fine without this padding, and unaligned with the padding.

    I can sympathize with the desire to accept the zipfile, anyway (i.e. despite it being broken). At the same time, I also think that Python should not let errors pass silently.

    So as a way out, I propose that the ZipFile class gains a "strict" attribute, indicating whether "acceptable" violations of the spec are ignored or reported as exceptions.

    So -1 on the patch as it stands (but thanks for the analysis).

    @serhiy-storchaka
    Copy link
    Member

    Serhiy: can you please provide a unit test? The OP's test case is unsuitable because of both size and copyright.

    I provided the test several minutes ago.

    @serhiy-storchaka
    Copy link
    Member

    I can sympathize with the desire to accept the zipfile, anyway (i.e. despite it being broken). At the same time, I also think that Python should not let errors pass silently.

    I do not know other implementation of ZIP, which output an error or a
    warning on such files. The fact is that such files exist in the wild
    world.

    So as a way out, I propose that the ZipFile class gains a "strict" attribute, indicating whether "acceptable" violations of the spec are ignored or reported as exceptions.

    It is a not easy task (and unnecessary, I suppose). Now zipfile ignores
    many errors (for example, it completely ignores local file headers).

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 13, 2012

    > So as a way out, I propose that the ZipFile class gains a "strict"
    > attribute, indicating whether "acceptable" violations of the spec
    > are ignored or reported as exceptions.

    It is a not easy task (and unnecessary, I suppose). Now zipfile
    ignores many errors (for example, it completely ignores local file
    headers).

    Why do you consider this difficult? Just add a "strict" flag, make
    it false by default, and raise an error if there is any unparsable
    extra data.

    I'm not asking that you implement a complete validity check for all
    aspects of the zip spec - just that there is a mode where it continues
    to raise an exception as it currently does (but with a better
    exception).

    @serhiy-storchaka
    Copy link
    Member

    Just add a "strict" flag, make
    it false by default, and raise an error if there is any unparsable
    extra data.

    If the module does not actually checks the correctness of all
    (including local file headers, which now it ignores), it will be a false
    promise. And I don't understand who these checks are needed at all.

    Note that the parameter "strict" in htmlparser is considered to be
    obsolete.

    @meadori
    Copy link
    Member

    meadori commented May 14, 2012

    This is definitely *not* a padding issue.
    The problem is that one of the records has an extra field of length 1:

    $ zipinfo -v bla.apk

    ...

    Central directory entry #3:
    ---------------------------

    There are an extra 16 bytes preceding this file.

    res/raw/ice.png

    offset of local header from start of archive: 1190 (000004A6h) bytes
    file system or operating system of origin: MS-DOS, OS/2 or NT FAT
    version of encoding software: 1.0
    minimum file system compatibility required: MS-DOS, OS/2 or NT FAT
    minimum software version required to extract: 1.0
    compression method: none (stored)
    file security status: not encrypted
    extended local header: no
    file last modified on (DOS date/time): 2012 Jan 13 15:30:08
    32-bit CRC value (hex): f969abab
    compressed size: 26250 bytes
    uncompressed size: 26250 bytes
    length of filename: 15 characters
    length of extra field: 1 bytes
    length of file comment: 0 characters
    disk number on which file begins: disk 1
    apparent file type: binary
    non-MSDOS external file attributes: 000000 hex
    MS-DOS file attributes (00 hex): none

    The central-directory extra field contains:

    There is no file comment.

    Our implementation can't deal with that because of the line that Terry
    pointed out:

            tp, ln = unpack('<HH', extra[:4])
    

    As Martin pointed out, the standard says that things must be in
    multiples of 4-bytes. So the record is non-portable.

    @meadori
    Copy link
    Member

    meadori commented May 14, 2012

    FWIW, I think it will OK to just ignore extra fields that we can't
    interpret as according to the standard. The only one we currently
    care about is the "Zip64 extended information extra field". Also,
    other programs (including the Info-Zip tools) seem to mostly ignore
    these fields.

    @serhiy-storchaka
    Copy link
    Member

    This is definitely *not* a padding issue.

    This is definitely a padding issue. All uncompressed files are located
    so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
    +30+17+3=27536, etc). This is, probably, allows the use of mmap for the
    resources.

    As Martin pointed out, the standard says that things must be in
    multiples of 4-bytes.

    More precisely, the extra field must have at least 4-bytes length to fit
    a header. The standard is insufficiently defined in terms of what would
    happen if the rest of the field is less than 4 bytes (this is hidden
    behind by ellipsis).

    So the record is non-portable.

    De jure the record is non-portable. De facto the record is portable
    (many other tools supports it). But even if it does not portable, we are
    dealing with the expansion of the zip format, which is very easy support
    for reading.

    @meadori
    Copy link
    Member

    meadori commented May 14, 2012

    On Mon, May 14, 2012 at 12:31 PM, Serhiy Storchaka
    <report@bugs.python.org> wrote:

    Serhiy Storchaka <storchaka@gmail.com> added the comment:

    > This is definitely *not* a padding issue.

    This is definitely a padding issue. All uncompressed files are located
    so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
    +30+17+3=27536, etc). This is, probably, allows the use of mmap for the
    resources.

    So? Someone may be using the extra fields to pad things, but for the purpose
    of this issue that is completely irrelevant. We only care about the
    proper structure
    of the file. Besides, without clear reference to source code or a
    specification any
    hypothesis of padding is hearsay.

    Did you look at the decoding I sent? The extra length field length is clearly
    reported as a size of one and the contents of the extra field are set to '\x00'.
    The extra field of size one is the actual problem, not padding.

    > As Martin pointed out, the standard says that things must be in
    > multiples of 4-bytes.

    More precisely, the extra field must have at least 4-bytes length to fit
    a header. The standard is insufficiently defined in terms of what would
    happen if the rest of the field is less than 4 bytes (this is hidden
    behind by ellipsis).

    How is it insufficiently defined at all? It says [1]:

          In order to allow different programs and different types
          of information to be stored in the 'extra' field in .ZIP
          files, the following structure should be used for all
          programs storing data in this field:
    
          header1+data1 + header2+data2 . . .
    
          Each header should consist of:
    
            Header ID - 2 bytes
            Data Size - 2 bytes
    
          Note: all fields stored in Intel low-byte/high-byte order.
    

    The ellipsis is just a standard convention for indicating a repeating
    pattern. Extra fields which are not multiples of four bytes are not
    properly formed.

    >   So the record is non-portable.

    De jure the record is non-portable. De facto the record is portable
    (many other tools supports it). But even if it does not portable, we are
    dealing with the expansion of the zip format, which is very easy support
    for reading.

    Like I said before, I am all for dropping extra fields we can not
    interpret. However,
    let us be clear that with respect to the standard we are implementing
    that zip files
    constructed like this are ill-formed.

    [1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 14, 2012

    This is definitely a padding issue. All uncompressed files are located
    so that the data starts with a 4-byte boundary (1190+30+15+1=1236, 27486
    +30+17+3=27536, etc). This is, probably, allows the use of mmap for the
    resources.

    That can't possibly be the reason. mmap requires 4k (4096) alignment (on
    x86; more than that on SPARC).

    Perhaps there is a reason for 4-alignment - but it can't be mmap-

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 14, 2012

    Like I said before, I am all for dropping extra fields we can not
    interpret. However,
    let us be clear that with respect to the standard we are implementing
    that zip files > constructed like this are ill-formed.

    I would still like to see the zipfile module to be able to detect
    ill-formed zipfiles - certainly not by default. This can be useful
    to check zipfiles that somebody produces. As the Python check wouldn't
    be complete, one would certainly need to apply other checks as well.

    @serhiy-storchaka
    Copy link
    Member

    That can't possibly be the reason. mmap requires 4k (4096) alignment (on
    x86; more than that on SPARC).

    This may be the reason to mmap the entire file and then read aligned
    binary data.

    @meadori
    Copy link
    Member

    meadori commented May 15, 2012

    I would still like to see the zipfile module to be able to detect
    ill-formed zipfiles - certainly not by default.

    Agreed. We already do produce some informational messages via the
    documented 'ZipFile.debug' attribute. Just checking that flag and
    producing a message to stdout like we do in other cases is one option.
    Another, as you suggested already, is to provide a strict flag
    that indicates an exception should be thrown upon encountering an
    ill-formed file.

    I like the idea of reusing what we already have. Although, when it
    fails, it doesn't fail loudly.

    @serhiy-storchaka
    Copy link
    Member

    Any chance to commit the patch before final feature freeze?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jun 24, 2012

    Strictly speaking, the feature freeze has passed already, so you'll need the time machine.

    However, this is not the *final* feature freeze - there will be a 3.4 release of Python also.

    @terryjreedy
    Copy link
    Member

    This issue is currently marked as a bugfix for current releases also. Since the proposal seems to be to introduce a new flag, should it be re-classified as enhancement (for 3.4)?

    @serhiy-storchaka
    Copy link
    Member

    I think this is a bugfix. I also don't think that "strict" attribute
    makes sense if we can't guarantee strong validity (see bpo-15114).

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jun 24, 2012

    I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

    For reasons discussed, I also disagree with the proposed patch (file24902). If this helps to advance this issue, I can formally reject it if desired.

    @loewis loewis mannequin removed the type-bug An unexpected behavior, bug, or error label Jun 24, 2012
    @loewis loewis mannequin added the type-feature A feature request or enhancement label Jun 24, 2012
    @terryjreedy
    Copy link
    Member

    Given that the overall trend in the stdlib seems to be to be more lenient in parsing, as with htmlparser, and that the not-really-strict 'strict' parameter for that is being removed, (bpo-15114) I do not understand the objection to accepting one more malformation in zip files and the desire to add a not-really-strict 'strict' parameter here. Perhaps this philosophical issue should be discussed on pydev.

    @serhiy-storchaka
    Copy link
    Member

    I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

    I agree. The module is behaving correctly.

    @voidsender
    Copy link
    Mannequin

    voidsender mannequin commented Apr 17, 2013

    FWIW, I think it will OK to just ignore extra fields that we can't
    interpret as according to the standard. The only one we currently
    care about is the "Zip64 extended information extra field". Also,
    other programs (including the Info-Zip tools) seem to mostly ignore
    these fields.

    Yes, please.

    The ellipsis is just a standard convention for indicating a repeating
    pattern. Extra fields which are not multiples of four bytes are not
    properly formed.

    . . .

    [;pause;]

    Totally agree. But at the very least it should check the size and raise a proper exception (e.g. BadZipFile).

    FWIW, the code is already avoiding proper bounds checking using the built-in behavior of slicing --

    tp, ln = unpack('<HH', extra[:4])
    ...
    extra = extra[ln+4:]
    

    -- which would otherwise throw exceptions if slice didn't innately and silently fail (due to an index being out-of-bounds).

    I still don't see the bug; the module is behaving correctly - it is the zipfile that is buggy. Supporting this specific kind of buggy zipfiles is a new feature.

    You realize that you are calling user-controlled data buggy, right?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 30, 2014

    New changeset 6dd5e9556a60 by Gregory P. Smith in branch '2.7':
    Fix issue bpo-14315: The zipfile module now ignores extra fields in the central
    http://hg.python.org/cpython/rev/6dd5e9556a60

    New changeset 33843896ce4e by Gregory P. Smith in branch '3.4':
    Fix issue bpo-14315: The zipfile module now ignores extra fields in the central
    http://hg.python.org/cpython/rev/33843896ce4e

    New changeset 89be1419472c by Gregory P. Smith in branch 'default':
    Fix issue bpo-14315: The zipfile module now ignores extra fields in the central
    http://hg.python.org/cpython/rev/89be1419472c

    @gpshead
    Copy link
    Member

    gpshead commented May 30, 2014

    This was never an enhancement. zipfile was failing to properly deal with real world data that other zip file tools on the planet were perfectly happy to deal with. That's a bug. Fixed.

    Practicality beats purity.
    Be lenient in what you accept.

    The zipfile module is not meant to be a zip "standard" validation tool.

    The other discussions in this bug about adding actual features such as a "strict" mode flag could be done but should really go in feature requests of their own. The zipfile module is... not a wonderful body of code (understatement). Example: It is still quite possible to get non zipfile.BadZipFile exceptions such as struct.error based on various arrangements of input data.

    @gpshead gpshead closed this as completed May 30, 2014
    @gpshead gpshead changed the title zipfile.ZipFile() unable to open zip File zipfile.ZipFile() unable to open zip File with a short extra header May 30, 2014
    @gpshead gpshead self-assigned this May 30, 2014
    @gpshead gpshead added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels May 30, 2014
    @serhiy-storchaka
    Copy link
    Member

    Just for information: the padding in APK files is added by the zipalign utility [1].

    [1] https://developer.android.com/studio/command-line/zipalign

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants