This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: zipfile module: new feature (two lines of code), useful for test, security and forensics
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: alanmcintyre, dhillier, massimosala, serhiy.storchaka, steven.daprano, twouters
Priority: normal Keywords: patch

Created on 2020-04-16 13:59 by massimosala, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description Edit
py27_zipfile.patch massimosala, 2020-04-16 13:59 patch for Python 2.7.17
Messages (13)
msg366597 - (view) Author: Massimo Sala (massimosala) * Date: 2020-04-16 13:59
module zipfile

Tag "Components": I am not sure "Library (Lib)" is the correct one. If it isn't, please fix.

I use python to check zip files against malware.
In these files the are binary blobs outside the ZIP archive.
The malware payload isn't inside the ZIP file structure.
Example: a file "openme.zip" with this content :
[blob from offset 0 to offset 5678] 
[ZIP archive from offset 5679 to end of file]

zipfile already handles this, finding the ZIP structure inside the file.

My change is just to add a new public property, to expose an internal variable: the file offset of the ZIP structure.

I know, I am after the code freeze of Python 2.7.18.
But the change is really trivial, see the diff.
I hope you can approve this patch for all the Python versions, also for 2.7, to have consistency. For 2.7 this is the last call.
msg366601 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-04-16 14:43
This is a new feature and cannot be added to older versions which are in feature-freeze.

Adding the feature to (say) Python 2.7.18 would be inconsistent, because it wouldn't exist in 2.7.0 through .17. Likewise for all the other versions before 3.9.

Personally, this sounds like a nice feature to have, and your use-case sounds convincing to me.
msg366685 - (view) Author: Massimo Sala (massimosala) * Date: 2020-04-17 23:09
Hi Steven

Every software "ecosystem" has its guidelines and I am a newbie about
python development.

Mmh I see your concerns. I agree about your deletions of all py 3 versions
before the latest 3.9.

About Py 2, I remark these facts:
- there are a lot of forensics tools still written for py 2;
- python 2.7.18 will be forever the last python 2 and I think is it fine to
end-users to have zipfile with this feature both in py 2.7 and py 3.9;
- in the code there isn't any new routine to test: the change is just to
expose one internal variable.

I agree my request is an exception but I think you have to agree this
situation is exceptional.
IMHO rules must exist to help us and I think this request doesn't carry any
burden.

I ask you please
- to reconsider my request
- anyway, to put me in contact with zipfile mainteners, I don't know how to
reach them but I want to hear them about this.

Many thanks, Massimo
msg366689 - (view) Author: Daniel Hillier (dhillier) * Date: 2020-04-18 01:26
Could something similar be achieved by looking for the earliest file header offset?

def find_earliest_header_offset(zf):
    earliest_offset = None
    for zinfo in zf.infolist():
        if earliest_offset is None:
            earliest_offset = zinfo.header_offset
        else:
            earliest_offset = min(zinfo.header_offset, earliest_offset)
    return earliest_offset


You could also adapt this using

    zinfo.compress_size + len(zinfo.FileHeader())

to see if there were any sections inside the archive which were not referenced from the central directory. Not sure if zip files with arbitrary bytes inside the archive would be valid everywhere, but I think they are using zipfile.

You can also have zipped content inside an archive which has a valid fileheader but no reference from the central directory. Those entries are discoverable by implementations which process content serially from the start of the file but not implementations which rely on the central directory.
msg366690 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-04-18 02:36
Sorry Massimo, there are no new features being added to 2.7, not even 
critical security fixes. That's not my decision.

https://www.python.org/doc/sunset-python-2/

Python 2 is effectively now a dead project from the point of view of us 
here at CPython. The very final bug fix release, 2.7.18, is due out any 
time soon, but it only includes fixes up to 1st of January.

You could try submitting your feature request to third-party bundlers of 
2.7, such as Red Hat, but I expect they will say no to new features.

For what it is worth, I don't agree that this situation is exceptional. 
Even if 2.7 wasn't obsolete, this is still a new feature. If we made an 
exception for you, then people using Python 2.7 still couldn't use this 
feature: `myzipfile.offset` would fail on code using Python 2.7, 2.7.1, 
2.7.2, 2.7.3, ... 2.7.17 and only work with 2.7.18. Nobody could use it 
unless their application required 2.7.18.

If you want this in 2.7 for your own personal use, wait for the 2.7.18 
final release, then add it into your personal copy. It is open source 
and you are completely permitted!
msg366695 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-18 07:36
I am not sure it would help you. There are legitimate files which contain a payload followed by the ZIP archive (self-extracting archives, programs with embedded ZIP archives). And the malware can make the offset of the ZIP archive be zero.

If you want to check whether the file looks like an executable, analyze first few bytes of the file. All executable files should start by one of well recognized signatures, otherwise the OS would not know how to execute them and they would not be malware.
msg366709 - (view) Author: Massimo Sala (massimosala) * Date: 2020-04-18 12:58
On Sat, 18 Apr 2020 at 04:37, Steven D'Aprano <report@bugs.python.org>
wrote:
    If we made an exception for you, then people using Python 2.7 still
couldn't use this feature:
    `myzipfile.offset` would fail on code using Python 2.7, 2.7.1, 2.7.2,
2.7.3, ... 2.7.17 and only work with 2.7.18.
    Nobody could use it unless their application required 2.7.18.

Yes, it seems to me obvious it will work only with Python 2.7.18, and I see
no problem.
If you need new features, you have always to update (to a new MINOR version
or, like you said, MAJOR version).

I am used to other softwares where some features are backported to older
versions and IMHO it is very useful.
Sometimes you just need a specific feature and it isn't possible to update
to a MAJOR version.
You have to consider there are many legacy softwares, also in business, and
a version leap means a lot of work and tests.

Speaking in general, not only python: if the maintainers backport that
specific feature, bingo! you have only to update to the same MAJOR new
MINOR version. And this is good for the user base, there isn't "one size
fits all".
I shot my bullet but I cannot change python.org way of life.

Steven many thanks for your answers and patience to explain.
BTW yes I will patch python 2.7 sources and compile it... also on legacy,
intranet, centos 5 servers we cannot update :-)
msg366711 - (view) Author: Massimo Sala (massimosala) * Date: 2020-04-18 13:31
Hi Serhiy

Thanks for the suggestion but I don't need to analyse different
self-extraction payloads (and I think it is always unreliable, there are
too many self-extractors in the wild).

I spend two words about my work.

I analyze ZIP archives because they are the "incarnation" also of microsoft
OOXML and openoffice OASIS ODF documents.

I always find these kind of files with not zero offset aren't strictly
compliant documents (by their respective file formats specifications).
Sometimes there is a self-extrator, sometimes there are pieces of malware
blobs (outside the ZIP structure or inside it, into the compressed files),
sometimes other errors.

For us checking the offset is very effective: we discard "bad" documents at
maximum speed before any other checks and it is more reliable than
antivirus (checking against specific blobs signatures, everytime changing).
With just a single test we have a 100% go/nogo result. Every colleague
grasp this check, there aren't hard to read and maintain routines.

Massimo

On Sat, 18 Apr 2020 at 09:36, Serhiy Storchaka <report@bugs.python.org>
wrote:

>
> Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:
>
> I am not sure it would help you. There are legitimate files which contain
> a payload followed by the ZIP archive (self-extracting archives, programs
> with embedded ZIP archives). And the malware can make the offset of the ZIP
> archive be zero.
>
> If you want to check whether the file looks like an executable, analyze
> first few bytes of the file. All executable files should start by one of
> well recognized signatures, otherwise the OS would not know how to execute
> them and they would not be malware.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue40301>
> _______________________________________
>
msg366712 - (view) Author: Massimo Sala (massimosala) * Date: 2020-04-18 13:36
Hi Daniel

Could you please elaborate the advantages of your loop versus my two lines
of code?
I don't grasp...

Thanks, Massimo

On Sat, 18 Apr 2020 at 03:26, Daniel Hillier <report@bugs.python.org> wrote:

>
> Daniel Hillier <daniel.hillier@gmail.com> added the comment:
>
> Could something similar be achieved by looking for the earliest file
> header offset?
>
> def find_earliest_header_offset(zf):
>     earliest_offset = None
>     for zinfo in zf.infolist():
>         if earliest_offset is None:
>             earliest_offset = zinfo.header_offset
>         else:
>             earliest_offset = min(zinfo.header_offset, earliest_offset)
>     return earliest_offset
>
>
> You could also adapt this using
>
>     zinfo.compress_size + len(zinfo.FileHeader())
>
> to see if there were any sections inside the archive which were not
> referenced from the central directory. Not sure if zip files with arbitrary
> bytes inside the archive would be valid everywhere, but I think they are
> using zipfile.
>
> You can also have zipped content inside an archive which has a valid
> fileheader but no reference from the central directory. Those entries are
> discoverable by implementations which process content serially from the
> start of the file but not implementations which rely on the central
> directory.
>
> ----------
> nosy: +dhillier
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue40301>
> _______________________________________
>
msg366713 - (view) Author: Daniel Hillier (dhillier) * Date: 2020-04-18 13:45
Hi Massimo,

Unless I'm missing something about your requirements, the advantage is that
it already works in python 2.7 so there is no need to patch Python. Just
bundle the above function with your analysis tool and you're good to go.

Cheers,
Dan

On Sat, Apr 18, 2020 at 11:36 PM Massimo Sala <report@bugs.python.org>
wrote:

>
> Massimo Sala <massimo.sala.71@gmail.com> added the comment:
>
> Hi Daniel
>
> Could you please elaborate the advantages of your loop versus my two lines
> of code?
> I don't grasp...
>
> Thanks, Massimo
>
> On Sat, 18 Apr 2020 at 03:26, Daniel Hillier <report@bugs.python.org>
> wrote:
>
> >
> > Daniel Hillier <daniel.hillier@gmail.com> added the comment:
> >
> > Could something similar be achieved by looking for the earliest file
> > header offset?
> >
> > def find_earliest_header_offset(zf):
> >     earliest_offset = None
> >     for zinfo in zf.infolist():
> >         if earliest_offset is None:
> >             earliest_offset = zinfo.header_offset
> >         else:
> >             earliest_offset = min(zinfo.header_offset, earliest_offset)
> >     return earliest_offset
> >
> >
> > You could also adapt this using
> >
> >     zinfo.compress_size + len(zinfo.FileHeader())
> >
> > to see if there were any sections inside the archive which were not
> > referenced from the central directory. Not sure if zip files with
> arbitrary
> > bytes inside the archive would be valid everywhere, but I think they are
> > using zipfile.
> >
> > You can also have zipped content inside an archive which has a valid
> > fileheader but no reference from the central directory. Those entries are
> > discoverable by implementations which process content serially from the
> > start of the file but not implementations which rely on the central
> > directory.
> >
> > ----------
> > nosy: +dhillier
> >
> > _______________________________________
> > Python tracker <report@bugs.python.org>
> > <https://bugs.python.org/issue40301>
> > _______________________________________
> >
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue40301>
> _______________________________________
>
msg366714 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-18 13:45
Just check the first 4 bytes of the file. In "normal" ZIP archive they are b'PK\3\4' (or b'PK\5\6' if it is empty). It is so reliable as checking the offset, and more efficient. It is even more reliable, because a malware can have zero ZIP archive offset, but it cannot start with b'PK\3\4'.
msg366717 - (view) Author: Massimo Sala (massimosala) * Date: 2020-04-18 14:16
Sorry I forgot to mention one specific case.
We have valid archives with a starting "blob": digitally signed zip files,
their filename extension is ".zip.p7m".

I agree your tip can be useful to other readers.
Best regards, Sala

On Sat, 18 Apr 2020 at 15:45, Serhiy Storchaka <report@bugs.python.org>
wrote:

>
> Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:
>
> Just check the first 4 bytes of the file. In "normal" ZIP archive they are
> b'PK\3\4' (or b'PK\5\6' if it is empty). It is so reliable as checking the
> offset, and more efficient. It is even more reliable, because a malware can
> have zero ZIP archive offset, but it cannot start with b'PK\3\4'.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue40301>
> _______________________________________
>
msg366720 - (view) Author: Massimo Sala (massimosala) * Date: 2020-04-18 14:35
I choosed to use the internal variable *concat* because
- if I recollect correctly, it is calculated before successive routines;
- I didn't see your solution (!), there is a very nice computed variable in
front of my eyes.

Mmh
1) Reliability
Cannot be sure this always run with malformed files :
        for zinfo in zf.infolist():

We can try / except but we loose the computation.
If *concat* is already computed (unless completely damaged files), IMHO my
solution is better.

2) Performance
What are the performance for big files?
Are there file seeks due to traversing zf.infolist() ?

> Daniel wrote:
> the advantage is that it already works in python 2.7 so there is no need
to patch Python

Yes, indeed.

If I am right about the pros of my patch, I stand for it.

Many thanks for you attention.

On Sat, 18 Apr 2020 at 15:45, Daniel Hillier <report@bugs.python.org> wrote:

>
> Daniel Hillier <daniel.hillier@gmail.com> added the comment:
>
> Hi Massimo,
>
> Unless I'm missing something about your requirements, the advantage is that
> it already works in python 2.7 so there is no need to patch Python. Just
> bundle the above function with your analysis tool and you're good to go.
>
> Cheers,
> Dan
>
> On Sat, Apr 18, 2020 at 11:36 PM Massimo Sala <report@bugs.python.org>
> wrote:
>
> >
> > Massimo Sala <massimo.sala.71@gmail.com> added the comment:
> >
> > Hi Daniel
> >
> > Could you please elaborate the advantages of your loop versus my two
> lines
> > of code?
> > I don't grasp...
> >
> > Thanks, Massimo
> >
> > On Sat, 18 Apr 2020 at 03:26, Daniel Hillier <report@bugs.python.org>
> > wrote:
> >
> > >
> > > Daniel Hillier <daniel.hillier@gmail.com> added the comment:
> > >
> > > Could something similar be achieved by looking for the earliest file
> > > header offset?
> > >
> > > def find_earliest_header_offset(zf):
> > >     earliest_offset = None
> > >     for zinfo in zf.infolist():
> > >         if earliest_offset is None:
> > >             earliest_offset = zinfo.header_offset
> > >         else:
> > >             earliest_offset = min(zinfo.header_offset, earliest_offset)
> > >     return earliest_offset
> > >
> > >
> > > You could also adapt this using
> > >
> > >     zinfo.compress_size + len(zinfo.FileHeader())
> > >
> > > to see if there were any sections inside the archive which were not
> > > referenced from the central directory. Not sure if zip files with
> > arbitrary
> > > bytes inside the archive would be valid everywhere, but I think they
> are
> > > using zipfile.
> > >
> > > You can also have zipped content inside an archive which has a valid
> > > fileheader but no reference from the central directory. Those entries
> are
> > > discoverable by implementations which process content serially from the
> > > start of the file but not implementations which rely on the central
> > > directory.
> > >
> > > ----------
> > > nosy: +dhillier
> > >
> > > _______________________________________
> > > Python tracker <report@bugs.python.org>
> > > <https://bugs.python.org/issue40301>
> > > _______________________________________
> > >
> >
> > ----------
> >
> > _______________________________________
> > Python tracker <report@bugs.python.org>
> > <https://bugs.python.org/issue40301>
> > _______________________________________
> >
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue40301>
> _______________________________________
>
History
Date User Action Args
2022-04-11 14:59:29adminsetgithub: 84481
2020-04-18 14:35:44massimosalasetmessages: + msg366720
2020-04-18 14:16:52massimosalasetmessages: + msg366717
2020-04-18 13:45:40serhiy.storchakasetmessages: + msg366714
2020-04-18 13:45:35dhilliersetmessages: + msg366713
2020-04-18 13:36:10massimosalasetmessages: + msg366712
2020-04-18 13:31:36massimosalasetmessages: + msg366711
2020-04-18 12:58:58massimosalasetmessages: + msg366709
2020-04-18 07:36:23serhiy.storchakasetmessages: + msg366695
2020-04-18 02:36:53steven.dapranosetmessages: + msg366690
2020-04-18 01:26:18dhilliersetnosy: + dhillier
messages: + msg366689
2020-04-17 23:09:55massimosalasetmessages: + msg366685
2020-04-16 16:02:57zach.waresetnosy: + twouters, alanmcintyre, serhiy.storchaka
2020-04-16 14:43:14steven.dapranosetnosy: + steven.daprano

messages: + msg366601
versions: - Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8
2020-04-16 14:15:13massimosalasettitle: zipfile module: new feature (two lines of code) -> zipfile module: new feature (two lines of code), useful for test, security and forensics
2020-04-16 13:59:27massimosalacreate