classification
Title: zipfile cannot handle files larger than 2GB (inside archive)
Type: compile error
Stage:
Components: Library (Lib)
Versions: Python 2.5
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: Kevin Ar18, gregory.p.smith, loewis
Priority: normal
Keywords:

Created on 2007-08-29 21:34 by Kevin Ar18, last changed 2015-01-23 23:16 by gregory.p.smith. This issue is now closed.

Messages (8)
msg55444 - (view) Author: (Kevin Ar18) Date: 2007-08-29 21:34
Summary:
If you have a zip file that contains a file inside of it greater than
2GB, then the zipfile module is unable to read that file.

Steps to Reproduce:
1. Create a zip file several GB in size with a file inside of it that is
over 2GB in size.
2. Attempt to read the large file inside the zip file.  Here's some
sample code:
import zipfile
import re

dataObj = zipfile.ZipFile("zip.zip", "r")

for i in dataObj.namelist():
    if i[-1] == "/":
        print "dir"
    else:
        fileName = re.split(r".*/", i, 0)[1]
        fileData = dataObj.read(i)


Result:
Python raises the following error:
File "...\zipfile.py", line 491, in read
    bytes = self.fp.read(zinfo.compress_size)
OverflowError: long int too large to convert to int

Expected Result:
It should copy the data into the variable fileData...

I'll try to post more info in a follow-up.
msg55461 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2007-08-30 04:48
i'll take care of it.  any more info in the interim will be appreciated.
msg55482 - (view) Author: (Kevin Ar18) Date: 2007-08-30 14:52
Here's another bug report that talks about a 2GB file limit:
http://bugs.python.org/issue1189216
The diff offered there does not solve this problem; in fact, the diff may
not be related to fixing it at all (though I'm not certain) and may just
be a readability change.

I tried to program a solution based on other things I saw and read on the
internet, but ran into different problems.

I took the line:
bytes = self.fp.read(zinfo.compress_size)
and made it read a little bit at a time, appending each chunk to bytes as
it went. This was really slow, since each iteration had to append the
result to an ever-larger bytes string. I then tried a list, but couldn't
find how to join the list back into a string when done (similar to the
JavaScript join() method). Even with the list approach, however, I ran
into an odd "memory error" as the loop reached higher counts, which I
couldn't explain, so I gave up at that point.
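For reference, the chunked approach described above can be sketched as follows (the helper name and chunk size are illustrative, not from the original report; written in modern Python with bytes objects). The missing piece the reporter was looking for is `str.join` / `bytes.join`, the Python equivalent of JavaScript's `Array.join`:

```python
import io


def read_in_chunks(fileobj, total_size, chunk_size=64 * 1024):
    """Read total_size bytes from fileobj in fixed-size chunks.

    Collecting chunks in a list and joining once at the end avoids the
    quadratic cost of repeatedly concatenating onto a growing string.
    """
    chunks = []
    remaining = total_size
    while remaining > 0:
        chunk = fileobj.read(min(chunk_size, remaining))
        if not chunk:  # premature EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    # b"".join(...) is the Python counterpart of JavaScript's join("")
    return b"".join(chunks)
```

Note that even with the join fixed, this still builds the whole result in memory, so it cannot get past the 2GB address-space limit discussed below; it only removes the slowdown.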

Also, I have no idea if this one line in the zipfile module is the only
problem or if there are others that will pop up once you get that part
fixed.
msg55485 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-08-30 15:33
I now see the problem. What you want to do cannot possibly work.

You are trying to create a string object that is larger than 2GB; this
is not possible on a 32-bit system (which I assume you are using). No
matter how you modify the read() function, it would always return a
string that is so large it cannot fit into the address space.

This will be fixed in Python 2.6, which has a separate .open method that
allows you to read the individual files in the zipfile as streams.
msg55486 - (view) Author: (Kevin Ar18) Date: 2007-08-30 15:43
Just some thoughts....
In posting about this problem elsewhere, it has been argued that you
shouldn't be copying that much data into memory anyway (though there are
legitimate cases where you might need to). Still, the question is what
the zipfile module should do. At the very least it should account for
this 2GB limitation and report that it can't do it. How it should
interact with the programmer is another question. In one of the replies
I am told that strings have a 2GB limitation, which means the zipfile
module can't be used in its current form even if fixed. Does this mean
the zipfile module needs additional methods for incrementally reading
and writing data? Or should programmers build an incremental system
themselves when they need it... Or?
msg55487 - (view) Author: (Kevin Ar18) Date: 2007-08-30 15:45
So, just add an error to the module (so it won't crash)?

BTW, is Python 2.6 ready for use?  I could use that feature now. :)
msg55488 - (view) Author: (Kevin Ar18) Date: 2007-08-30 15:46
Maybe a message that says that strings on 32-bit CPUs cannot handle more
than 2GB of data; use the stream instead?
msg60248 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-01-19 23:39
The issue here was that reading more data than will fit into an
in-memory string fails. While the zipfile module could detect this in
some cases, it is not really worth such a runtime check. This is just a
fact of Python, and of sane programming: if you're reading data from a
file-like object, you should never do an unbounded read without first
sanity-checking your input.
History
Date  User  Action  Args
2015-01-23 23:16:44  gregory.p.smith  set  assignee: gregory.p.smith ->
2008-01-19 23:39:56  gregory.p.smith  set  status: open -> closed
    resolution: wont fix
    messages: + msg60248
2007-09-17 06:33:42  jafo  set  priority: normal
2007-08-30 15:46:31  Kevin Ar18  set  messages: + msg55488
2007-08-30 15:45:24  Kevin Ar18  set  messages: + msg55487
2007-08-30 15:43:53  Kevin Ar18  set  messages: + msg55486
2007-08-30 15:33:29  loewis  set  nosy: + loewis
    messages: + msg55485
    versions: + Python 2.5, - Python 2.6
2007-08-30 14:52:28  Kevin Ar18  set  messages: + msg55482
2007-08-30 04:48:04  gregory.p.smith  set  assignee: gregory.p.smith
    messages: + msg55461
    nosy: + gregory.p.smith
2007-08-29 21:34:55  Kevin Ar18  create