Issue 40301: zipfile module: new feature (two lines of code), useful for test, security and forensics

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84481

classification

Title:	zipfile module: new feature (two lines of code), useful for test, security and forensics
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	alanmcintyre, dhillier, massimosala, serhiy.storchaka, steven.daprano, twouters
Priority:	normal	Keywords:	patch

Created on 2020-04-16 13:59 by massimosala, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description	Edit
py27_zipfile.patch	massimosala, 2020-04-16 13:59	patch for Python 2.7.17

Messages (13)
msg366597 - (view)	Author: Massimo Sala (massimosala) *	Date: 2020-04-16 13:59
module zipfile Tag "Components": I am not sure "Library (Lib)" is the correct one. If it isn't, please fix. I use python to check zip files against malware. In these files the are binary blobs outside the ZIP archive. The malware payload isn't inside the ZIP file structure. Example: a file "openme.zip" with this content : [blob from offset 0 to offset 5678] [ZIP archive from offset 5679 to end of file] zipfile already handles this, finding the ZIP structure inside the file. My change is just to add a new public property, to expose an internal variable: the file offset of the ZIP structure. I know, I am after the code freeze of Python 2.7.18. But the change is really trivial, see the diff. I hope you can approve this patch for all the Python versions, also for 2.7, to have consistency. For 2.7 this is the last call.
msg366601 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2020-04-16 14:43
This is a new feature and cannot be added to older versions which are in feature-freeze. Adding the feature to (say) Python 2.7.18 would be inconsistent, because it wouldn't exist in 2.7.0 through .17. Likewise for all the other versions before 3.9. Personally, this sounds like a nice feature to have, and your use-case sounds convincing to me.
msg366685 - (view)	Author: Massimo Sala (massimosala) *	Date: 2020-04-17 23:09
Hi Steven Every software "ecosystem" has its guidelines and I am a newbie about python development. Mmh I see your concerns. I agree about your deletions of all py 3 versions before the latest 3.9. About Py 2, I remark these facts: - there are a lot of forensics tools still written for py 2; - python 2.7.18 will be forever the last python 2 and I think is it fine to end-users to have zipfile with this feature both in py 2.7 and py 3.9; - in the code there isn't any new routine to test: the change is just to expose one internal variable. I agree my request is an exception but I think you have to agree this situation is exceptional. IMHO rules must exist to help us and I think this request doesn't carry any burden. I ask you please - to reconsider my request - anyway, to put me in contact with zipfile mainteners, I don't know how to reach them but I want to hear them about this. Many thanks, Massimo
msg366689 - (view)	Author: Daniel Hillier (dhillier) *	Date: 2020-04-18 01:26
Could something similar be achieved by looking for the earliest file header offset? def find_earliest_header_offset(zf): earliest_offset = None for zinfo in zf.infolist(): if earliest_offset is None: earliest_offset = zinfo.header_offset else: earliest_offset = min(zinfo.header_offset, earliest_offset) return earliest_offset You could also adapt this using zinfo.compress_size + len(zinfo.FileHeader()) to see if there were any sections inside the archive which were not referenced from the central directory. Not sure if zip files with arbitrary bytes inside the archive would be valid everywhere, but I think they are using zipfile. You can also have zipped content inside an archive which has a valid fileheader but no reference from the central directory. Those entries are discoverable by implementations which process content serially from the start of the file but not implementations which rely on the central directory.
msg366690 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2020-04-18 02:36
Sorry Massimo, there are no new features being added to 2.7, not even critical security fixes. That's not my decision. https://www.python.org/doc/sunset-python-2/ Python 2 is effectively now a dead project from the point of view of us here at CPython. The very final bug fix release, 2.7.18, is due out any time soon, but it only includes fixes up to 1st of January. You could try submitting your feature request to third-party bundlers of 2.7, such as Red Hat, but I expect they will say no to new features. For what it is worth, I don't agree that this situation is exceptional. Even if 2.7 wasn't obsolete, this is still a new feature. If we made an exception for you, then people using Python 2.7 still couldn't use this feature: `myzipfile.offset` would fail on code using Python 2.7, 2.7.1, 2.7.2, 2.7.3, ... 2.7.17 and only work with 2.7.18. Nobody could use it unless their application required 2.7.18. If you want this in 2.7 for your own personal use, wait for the 2.7.18 final release, then add it into your personal copy. It is open source and you are completely permitted!
msg366695 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-04-18 07:36
I am not sure it would help you. There are legitimate files which contain a payload followed by the ZIP archive (self-extracting archives, programs with embedded ZIP archives). And the malware can make the offset of the ZIP archive be zero. If you want to check whether the file looks like an executable, analyze first few bytes of the file. All executable files should start by one of well recognized signatures, otherwise the OS would not know how to execute them and they would not be malware.
msg366709 - (view)	Author: Massimo Sala (massimosala) *	Date: 2020-04-18 12:58
On Sat, 18 Apr 2020 at 04:37, Steven D'Aprano <report@bugs.python.org> wrote: If we made an exception for you, then people using Python 2.7 still couldn't use this feature: `myzipfile.offset` would fail on code using Python 2.7, 2.7.1, 2.7.2, 2.7.3, ... 2.7.17 and only work with 2.7.18. Nobody could use it unless their application required 2.7.18. Yes, it seems to me obvious it will work only with Python 2.7.18, and I see no problem. If you need new features, you have always to update (to a new MINOR version or, like you said, MAJOR version). I am used to other softwares where some features are backported to older versions and IMHO it is very useful. Sometimes you just need a specific feature and it isn't possible to update to a MAJOR version. You have to consider there are many legacy softwares, also in business, and a version leap means a lot of work and tests. Speaking in general, not only python: if the maintainers backport that specific feature, bingo! you have only to update to the same MAJOR new MINOR version. And this is good for the user base, there isn't "one size fits all". I shot my bullet but I cannot change python.org way of life. Steven many thanks for your answers and patience to explain. BTW yes I will patch python 2.7 sources and compile it... also on legacy, intranet, centos 5 servers we cannot update :-)
msg366711 - (view)	Author: Massimo Sala (massimosala) *	Date: 2020-04-18 13:31
Hi Serhiy Thanks for the suggestion but I don't need to analyse different self-extraction payloads (and I think it is always unreliable, there are too many self-extractors in the wild). I spend two words about my work. I analyze ZIP archives because they are the "incarnation" also of microsoft OOXML and openoffice OASIS ODF documents. I always find these kind of files with not zero offset aren't strictly compliant documents (by their respective file formats specifications). Sometimes there is a self-extrator, sometimes there are pieces of malware blobs (outside the ZIP structure or inside it, into the compressed files), sometimes other errors. For us checking the offset is very effective: we discard "bad" documents at maximum speed before any other checks and it is more reliable than antivirus (checking against specific blobs signatures, everytime changing). With just a single test we have a 100% go/nogo result. Every colleague grasp this check, there aren't hard to read and maintain routines. Massimo On Sat, 18 Apr 2020 at 09:36, Serhiy Storchaka <report@bugs.python.org> wrote: > > Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment: > > I am not sure it would help you. There are legitimate files which contain > a payload followed by the ZIP archive (self-extracting archives, programs > with embedded ZIP archives). And the malware can make the offset of the ZIP > archive be zero. > > If you want to check whether the file looks like an executable, analyze > first few bytes of the file. All executable files should start by one of > well recognized signatures, otherwise the OS would not know how to execute > them and they would not be malware. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue40301> > _______________________________________ >
msg366712 - (view)	Author: Massimo Sala (massimosala) *	Date: 2020-04-18 13:36
Hi Daniel Could you please elaborate the advantages of your loop versus my two lines of code? I don't grasp... Thanks, Massimo On Sat, 18 Apr 2020 at 03:26, Daniel Hillier <report@bugs.python.org> wrote: > > Daniel Hillier <daniel.hillier@gmail.com> added the comment: > > Could something similar be achieved by looking for the earliest file > header offset? > > def find_earliest_header_offset(zf): > earliest_offset = None > for zinfo in zf.infolist(): > if earliest_offset is None: > earliest_offset = zinfo.header_offset > else: > earliest_offset = min(zinfo.header_offset, earliest_offset) > return earliest_offset > > > You could also adapt this using > > zinfo.compress_size + len(zinfo.FileHeader()) > > to see if there were any sections inside the archive which were not > referenced from the central directory. Not sure if zip files with arbitrary > bytes inside the archive would be valid everywhere, but I think they are > using zipfile. > > You can also have zipped content inside an archive which has a valid > fileheader but no reference from the central directory. Those entries are > discoverable by implementations which process content serially from the > start of the file but not implementations which rely on the central > directory. > > ---------- > nosy: +dhillier > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue40301> > _______________________________________ >
msg366713 - (view)	Author: Daniel Hillier (dhillier) *	Date: 2020-04-18 13:45
Hi Massimo, Unless I'm missing something about your requirements, the advantage is that it already works in python 2.7 so there is no need to patch Python. Just bundle the above function with your analysis tool and you're good to go. Cheers, Dan On Sat, Apr 18, 2020 at 11:36 PM Massimo Sala <report@bugs.python.org> wrote: > > Massimo Sala <massimo.sala.71@gmail.com> added the comment: > > Hi Daniel > > Could you please elaborate the advantages of your loop versus my two lines > of code? > I don't grasp... > > Thanks, Massimo > > On Sat, 18 Apr 2020 at 03:26, Daniel Hillier <report@bugs.python.org> > wrote: > > > > > Daniel Hillier <daniel.hillier@gmail.com> added the comment: > > > > Could something similar be achieved by looking for the earliest file > > header offset? > > > > def find_earliest_header_offset(zf): > > earliest_offset = None > > for zinfo in zf.infolist(): > > if earliest_offset is None: > > earliest_offset = zinfo.header_offset > > else: > > earliest_offset = min(zinfo.header_offset, earliest_offset) > > return earliest_offset > > > > > > You could also adapt this using > > > > zinfo.compress_size + len(zinfo.FileHeader()) > > > > to see if there were any sections inside the archive which were not > > referenced from the central directory. Not sure if zip files with > arbitrary > > bytes inside the archive would be valid everywhere, but I think they are > > using zipfile. > > > > You can also have zipped content inside an archive which has a valid > > fileheader but no reference from the central directory. Those entries are > > discoverable by implementations which process content serially from the > > start of the file but not implementations which rely on the central > > directory. > > > > ---------- > > nosy: +dhillier > > > > _______________________________________ > > Python tracker <report@bugs.python.org> > > <https://bugs.python.org/issue40301> > > _______________________________________ > > > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue40301> > _______________________________________ >
msg366714 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-04-18 13:45
Just check the first 4 bytes of the file. In "normal" ZIP archive they are b'PK\3\4' (or b'PK\5\6' if it is empty). It is so reliable as checking the offset, and more efficient. It is even more reliable, because a malware can have zero ZIP archive offset, but it cannot start with b'PK\3\4'.
msg366717 - (view)	Author: Massimo Sala (massimosala) *	Date: 2020-04-18 14:16
Sorry I forgot to mention one specific case. We have valid archives with a starting "blob": digitally signed zip files, their filename extension is ".zip.p7m". I agree your tip can be useful to other readers. Best regards, Sala On Sat, 18 Apr 2020 at 15:45, Serhiy Storchaka <report@bugs.python.org> wrote: > > Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment: > > Just check the first 4 bytes of the file. In "normal" ZIP archive they are > b'PK\3\4' (or b'PK\5\6' if it is empty). It is so reliable as checking the > offset, and more efficient. It is even more reliable, because a malware can > have zero ZIP archive offset, but it cannot start with b'PK\3\4'. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue40301> > _______________________________________ >
msg366720 - (view)	Author: Massimo Sala (massimosala) *	Date: 2020-04-18 14:35
I choosed to use the internal variable concat because - if I recollect correctly, it is calculated before successive routines; - I didn't see your solution (!), there is a very nice computed variable in front of my eyes. Mmh 1) Reliability Cannot be sure this always run with malformed files : for zinfo in zf.infolist(): We can try / except but we loose the computation. If concat is already computed (unless completely damaged files), IMHO my solution is better. 2) Performance What are the performance for big files? Are there file seeks due to traversing zf.infolist() ? > Daniel wrote: > the advantage is that it already works in python 2.7 so there is no need to patch Python Yes, indeed. If I am right about the pros of my patch, I stand for it. Many thanks for you attention. On Sat, 18 Apr 2020 at 15:45, Daniel Hillier <report@bugs.python.org> wrote: > > Daniel Hillier <daniel.hillier@gmail.com> added the comment: > > Hi Massimo, > > Unless I'm missing something about your requirements, the advantage is that > it already works in python 2.7 so there is no need to patch Python. Just > bundle the above function with your analysis tool and you're good to go. > > Cheers, > Dan > > On Sat, Apr 18, 2020 at 11:36 PM Massimo Sala <report@bugs.python.org> > wrote: > > > > > Massimo Sala <massimo.sala.71@gmail.com> added the comment: > > > > Hi Daniel > > > > Could you please elaborate the advantages of your loop versus my two > lines > > of code? > > I don't grasp... > > > > Thanks, Massimo > > > > On Sat, 18 Apr 2020 at 03:26, Daniel Hillier <report@bugs.python.org> > > wrote: > > > > > > > > Daniel Hillier <daniel.hillier@gmail.com> added the comment: > > > > > > Could something similar be achieved by looking for the earliest file > > > header offset? > > > > > > def find_earliest_header_offset(zf): > > > earliest_offset = None > > > for zinfo in zf.infolist(): > > > if earliest_offset is None: > > > earliest_offset = zinfo.header_offset > > > else: > > > earliest_offset = min(zinfo.header_offset, earliest_offset) > > > return earliest_offset > > > > > > > > > You could also adapt this using > > > > > > zinfo.compress_size + len(zinfo.FileHeader()) > > > > > > to see if there were any sections inside the archive which were not > > > referenced from the central directory. Not sure if zip files with > > arbitrary > > > bytes inside the archive would be valid everywhere, but I think they > are > > > using zipfile. > > > > > > You can also have zipped content inside an archive which has a valid > > > fileheader but no reference from the central directory. Those entries > are > > > discoverable by implementations which process content serially from the > > > start of the file but not implementations which rely on the central > > > directory. > > > > > > ---------- > > > nosy: +dhillier > > > > > > _______________________________________ > > > Python tracker <report@bugs.python.org> > > > <https://bugs.python.org/issue40301> > > > _______________________________________ > > > > > > > ---------- > > > > _______________________________________ > > Python tracker <report@bugs.python.org> > > <https://bugs.python.org/issue40301> > > _______________________________________ > > > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue40301> > _______________________________________ >

History
Date	User	Action	Args
2022-04-11 14:59:29	admin	set	github: 84481
2020-04-18 14:35:44	massimosala	set	messages: + msg366720
2020-04-18 14:16:52	massimosala	set	messages: + msg366717
2020-04-18 13:45:40	serhiy.storchaka	set	messages: + msg366714
2020-04-18 13:45:35	dhillier	set	messages: + msg366713
2020-04-18 13:36:10	massimosala	set	messages: + msg366712
2020-04-18 13:31:36	massimosala	set	messages: + msg366711
2020-04-18 12:58:58	massimosala	set	messages: + msg366709
2020-04-18 07:36:23	serhiy.storchaka	set	messages: + msg366695
2020-04-18 02:36:53	steven.daprano	set	messages: + msg366690
2020-04-18 01:26:18	dhillier	set	nosy: + dhillier messages: + msg366689
2020-04-17 23:09:55	massimosala	set	messages: + msg366685
2020-04-16 16:02:57	zach.ware	set	nosy: + twouters, alanmcintyre, serhiy.storchaka
2020-04-16 14:43:14	steven.daprano	set	nosy: + steven.daprano messages: + msg366601 versions: - Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8
2020-04-16 14:15:13	massimosala	set	title: zipfile module: new feature (two lines of code) -> zipfile module: new feature (two lines of code), useful for test, security and forensics
2020-04-16 13:59:27	massimosala	create