classification
Title: zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: closed Resolution: duplicate
Dependencies: Superseder: is_zipfile false positives
View: 28494
Assigned To: serhiy.storchaka Nosy List: aroussel, bckohan, gregory.p.smith, iritkatriel, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2020-10-20 08:51 by aroussel, last changed 2020-10-28 08:02 by serhiy.storchaka. This issue is now closed.

Messages (13)
msg379105 - (view) Author: Alex Roussel (aroussel) Date: 2020-10-20 08:51
Hello, 

I've come across an issue that seems similar to the false positives problem outlined in this ticket (https://bugs.python.org/issue28494), however this issue relates to a single gzipped json file which is incorrectly identified as a .zip archive because (I suspect) is_zipfile is mistaking bytes in the file's data for the ending bytes that correspond to a .zip archive.

I'm afraid I'm not well versed on the way is_zipfile 'seeks' the bytes of a file to compare its magic number, so I apologise if my description isn't completely accurate. 

Here's my attempt at a summary of the problem:

My .json.gz file includes the correct magic number (1f8b) to identify it as a gzipped file, however when zipfile.is_zipfile is called on the filepath, it returns True.

I'm going to ask if I'm allowed to upload the file directly for you (it's work related), in the meantime I've included a head and tail of the file's hexdump below.

I am running Python 3.6.9 on Ubuntu 18.04.

----------

~/Téléchargements » python3                                                   
Python 3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zipfile
>>> zipfile.is_zipfile('2020-10-18-1602979256-http_get_7549.json.gz')
True

----------

~/Téléchargements » xxd 2020-10-18-1602979256-http_get_7549.json.gz | head                   
00000000: 1f8b 0800 7090 8b5f 0003 ecbd 5993 e338  ....p.._....Y..8
00000010: 922e fa7e 7fc6 bcea 588a 8b28 8ad7 ec3c  ...~....X..(...<
00000020: 883b 292e e2be bc71 91b8 5394 b8f3 d8fd  .;)....q..S.....
00000030: ef17 9422 3223 22ab 2ab3 7ba6 67a6 e774  ..."2#".*.{.g..t
00000040: 5bab 4209 6175 7738 3e77 3880 fff3 6f71  [.B.auw8>w8...oq
00000050: d005 fff6 fffe 9bc1 ea96 451d 2629 6712  ..........E.&)g.
00000060: 853e 4202 830d 3145 7221 6af7 fe11 3ae9  .>B...1Er!j...:.
00000070: 1c0b f9e6 2db1 c0bf 25ea 38a9 1479 f650  ....-...%.8..y.P

----------

~/Téléchargements » xxd 2020-10-18-1602979256-http_get_7549.json.gz | tail
01d492b0: 98a2 3d5c 25e1 c5b8 d9c5 3287 c5d8 3d7c  ..=\%.....2...=|
01d492c0: 968f 3652 6fd4 4a0c 243b 166d 5640 97b5  ..6Ro.J.$;.mV@..
01d492d0: 9308 8376 fe17 1fac 0c90 0fdb b3e3 4e4a  ...v..........NJ
01d492e0: 605c a870 5120 955b 6267 e318 406f e1e2  `\.pQ .[bg..@o..
01d492f0: 2c50 12ec 5eb0 43cc 8d97 4daf 6017 3412  ,P..^.C...M.`.4.
01d49300: 3bdb 40ce 743f 7aa8 6ff9 f30d 796f f784  ;.@.t?z.o...yo..
01d49310: cec2 c45d b012 7e07 c70c dafd e16e fee2  ...]..~......n..
01d49320: c8a6 c01c 627f e004 f9c7 4770 e5e7 6bbf  ....b.....Gp..k.
01d49330: 44d5 97bb ffdf ffe7 ff03 2263 b46f 3d5c  D........."c.o=\
01d49340: cb04                                     ..
msg379246 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-10-21 20:31
Are you able to attach a file with which you see this problem?
Have you tried with newer Python versions?
msg379281 - (view) Author: Alex Roussel (aroussel) Date: 2020-10-22 08:19
Hi Irit, 

Thank you for the response, I'm afraid I'm not allowed to upload the file myself, however the file in question is '2020-10-18-1602979256-http_get_7549.json.gz', which is available at this link https://opendata.rapid7.com/sonar.http/?page=1. It becomes free to download one month after release so I'll make sure to come back on the 18th Nov and upload it once it is made freely available.

In the meantime, I've tested on python 3.8.5 and 3.9 and I get the same issue.
msg379317 - (view) Author: Brian Kohan (bckohan) Date: 2020-10-22 17:33
Hi all,

I'm experiencing the same issue. I took a look at the is_zipfile code - seems like its not checking the start of the file for the magic numbers, but looking deeper in. I presume because the magic numbers at the start are considered unreliable for some reason? Seems like this opens the check up to unlucky random false positives though.

Offending file:

https://www.dropbox.com/s/t2kafn6ek1m2huy/CHPI_Rinex.crx?dl=1
msg379410 - (view) Author: Alex Roussel (aroussel) Date: 2020-10-23 07:36
The impression I got from reading https://bugs.python.org/issue28494 was that this was fixed in python 3.7 ? Or perhaps as you say it's just a matter of stumbling across the rare files that generate false positives.
msg379534 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-24 19:05
for what it's worth: false positives are always going to be possible in any such "magic" check as is_zipfile is.

we don't check the start of the file because zip files are defined by their end of file central directory which contains length information to determine where within the file the zip archive actually starts.

The issue28494 tests are a demonstration of this; It is somewhat common practice to append a zipfile to an executable of various forms for use as application specific data.

If you need more more reliable determination of file type not tied to a specific Python release, you might look at what the various file type sniffing magic libraries do for you, some examples include:
 https://pypi.org/project/filetype/
 https://pypi.org/project/puremagic/
 https://pypi.org/project/python-magic/

I _can_ reproduce this issue with the testdata @bckohan provided.

But I can't promise there is anything to fix here.  Even if we make the test slightly more robust by looking at another byte or two, it is always possible for files to appear to be a bunch of things at once based on small data signatures.

If nothing else we should reinforce in the documentation that is_zipfile is at best a guess.  False means it is not as far as the zipfile module is concerned.  True cannot guarantee that it is.
msg379652 - (view) Author: Alex Roussel (aroussel) Date: 2020-10-26 08:20
OK understood. Thanks for the explanation, I wasn't aware of those and will take a look.
msg379720 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-27 03:07
ZipFile.open() checks the first 4 bytes:

            # Skip the file header:
            fheader = zef_file.read(sizeFileHeader)
            if len(fheader) != sizeFileHeader:
                raise BadZipFile("Truncated file header")
            fheader = struct.unpack(structFileHeader, fheader)
            if fheader[_FH_SIGNATURE] != stringFileHeader:
                raise BadZipFile("Bad magic number for file header")

But is_zipfile() does not. Code could be shared for that.

.gz and .zip files don't start by the same bytes, so this check should reduce the number of false positives.

--

You may have a look at the validate() methods of my old Hachoir project, they check a few bytes to check if a file looks a valid gzip or ZIP archive.

gzip:

https://github.com/vstinner/hachoir/blob/0f56883d7cea7082e784bfbdd2882e0f2dd2f34b/hachoir/parser/archive/gzip_parser.py#L51-L62

zip:

https://github.com/vstinner/hachoir/blob/0f56883d7cea7082e784bfbdd2882e0f2dd2f34b/hachoir/parser/archive/zip.py#L411-L430
msg379741 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-27 07:24
ZipFile.open() is not the code for opening a zip file. :)

That's the code for opening a file embedded within an already constructed mode='r' archive as done the ZipFile.__init__() constructor.  By the time you've gotten to the open() method, you've loaded the entire unbounded in size central directory into memory as ZipInfo objects [constructor] and are checking signature of an individual file header you are attempting to read out of the archive.

Follow the ZipFile() constructor, it calls ZipFile._RealGetContents() which is the true start of parsing the archive.  https://github.com/python/cpython/blob/master/Lib/zipfile.py#L1317

Sure, more and more steps can be done.  But if you want to do that, you may as well just get rid of is_zipfile() entirely - a functions who's point is to be fast and not consume an amount of memory determined by the input data - and have people just call `zipfile.ZipFile(path_in_question, mode='r')` and live with the consequences of attempting to load and parse the whole thing.  If that doesn't raise an exception, it is more likely to be a zip file.  But that could still raise an exception when trying to open each of the files inside, so you'd need to iterate over this and open those and make sure they're valid.

is_zipfile() isn't a verify_zipfile_integrity() routine.  Just a quick best guess.

is_zipfile() cannot be perfect and is not intended to be.  There is always going to be yet another thing it _could_ try.  It isn't worth chasing the impossible goal and making it not be fast.

Just update the is_zipfile() docs.
msg379791 - (view) Author: Brian Kohan (bckohan) Date: 2020-10-27 18:53
I concur with Gregory. It seems that the action here is to just make it apparent in the docs the very real possibility of false positives.

In my experience processing data from the wild, I see a pretty high rate of about 1/1000. I'm sure the probability is a function of the types of files I'm working with. But in any case, is_zipfile can't be made to be sufficient in and of itself for reliably identifying zip files. It still has utility in weeding out true negatives though. In my case I don't ever expect to see a self extracting file or a file compounded into an executable so I use the results of is_zipfile as well as a manual check of the magic bytes at the start. So far so good.
msg379802 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-27 20:47
>  functions who's point is to be fast and not consume an amount of memory determined by the input data

I proposed to read the first 4 bytes of the file. It's a fixed length.
msg379814 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-28 03:17
The first four bytes of the file do not identify a zip file.  A zip file is identified by the end of file central directory.  Which you then must read entries of to determine where the start of the archive may be... often not at position zero.
msg379819 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-10-28 08:02
This is a duplicate of issue28494.

I concur with Gregory. What we can do -- improve the documentation.
History
Date User Action Args
2020-10-28 08:02:36serhiy.storchakasetstatus: open -> closed
superseder: is_zipfile false positives
messages: + msg379819

resolution: duplicate
stage: resolved
2020-10-28 03:17:59gregory.p.smithsetmessages: + msg379814
2020-10-27 20:47:25vstinnersetmessages: + msg379802
2020-10-27 18:53:33bckohansetmessages: + msg379791
2020-10-27 07:24:36gregory.p.smithsetmessages: + msg379741
2020-10-27 06:50:19serhiy.storchakasetassignee: serhiy.storchaka
2020-10-27 06:50:04serhiy.storchakasetnosy: + serhiy.storchaka
2020-10-27 03:07:16vstinnersetmessages: + msg379720
2020-10-26 08:20:24arousselsetmessages: + msg379652
2020-10-24 19:05:16gregory.p.smithsetmessages: + msg379534
2020-10-24 09:26:39iritkatrielsetnosy: + gregory.p.smith, vstinner
2020-10-23 07:36:52arousselsetmessages: + msg379410
2020-10-22 17:33:31bckohansetnosy: + bckohan

messages: + msg379317
versions: + Python 3.7, Python 3.8
2020-10-22 08:19:37arousselsetmessages: + msg379281
2020-10-21 20:31:03iritkatrielsetnosy: + iritkatriel
messages: + msg379246
2020-10-20 08:51:47arousselcreate