classification
Title: zipfile simultaneous open broken and/or needlessly(?) consumes unreasonable number of file descriptors
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Preventing errors of simultaneous access in zipfile
View: 16569
Assigned To: Nosy List: dw, r.david.murray
Priority: normal Keywords:

Created on 2014-11-10 23:58 by dw, last changed 2014-11-11 01:37 by r.david.murray. This issue is now closed.

Files
File name Uploaded Description
mymy.zip dw, 2014-11-10 23:58 test case
Messages (3)
msg230987 - Author: David Wilson (dw) Date: 2014-11-10 23:58
There is some really funky behaviour in the zipfile module, where, depending on whether zipfile.ZipFile() is passed a string filename or a file-like object, one of two things happens:

a) Given a file-like object, zipfile does not (since it cannot) consume extra file descriptors on each call to '.open()'. However, simultaneous calls to .open() on the zip file's members (from the same thread) produce file-like objects for each member that appear intertwingled in some unfortunate manner:

Traceback (most recent call last):
  File "my.py", line 23, in <module>
    b()
  File "my.py", line 18, in b
    m.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 689, in readline
    return io.BufferedIOBase.readline(self, limit)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 727, in peek
    chunk = self.read(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 763, in read
    data = self._read1(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 839, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths
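
The access pattern behind case (a) can be sketched roughly as follows. This is not the attached my.py, which is not reproduced in this report; build_zip and interleaved_readlines are hypothetical helper names, and the member sizes are arbitrary. Whether the interleaved reads fail or succeed depends on the Python version in use.

```python
import io
import zipfile

def build_zip(member_count=2, lines_per_member=20000):
    """Return the bytes of a deflated zip whose members each hold many text lines."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(member_count):
            body = "".join(
                "member %d line %d\n" % (i, n) for n in range(lines_per_member)
            )
            zf.writestr("m%d.txt" % i, body)
    return buf.getvalue()

def interleaved_readlines(fileobj, rounds=1000):
    """Open every member at once, then read one line from each in turn."""
    zf = zipfile.ZipFile(fileobj)
    members = [zf.open(name) for name in zf.namelist()]
    rows = []
    try:
        for _ in range(rounds):
            # All members share the single underlying file object here.
            rows.append([m.readline() for m in members])
    finally:
        for m in members:
            m.close()
        zf.close()
    return rows

rows = interleaved_readlines(io.BytesIO(build_zip()))
```

Passing a filename instead of the io.BytesIO object exercises case (b) with the same helper.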



b) Given a string filename, simultaneous use of .open() produces a new file descriptor for each opened member, which does not result in the above error, but triggers an even worse one: file descriptor exhaustion given a sufficiently large zip file.


This tripped me up rather badly last week during consulting work, and I'd like to see both these behaviours fixed somehow. The ticket is more an RFC to see if anyone has thoughts on how this fix should happen. It seems to me a no-brainer that, since the ZIP file format fundamentally always requires a seekable file, in both the "constructed using file-like object" case and the "constructed using filename" case we should somehow reuse the sole file object passed to us to satisfy all reads of compressed member data.

It seems the problems can be fixed in both cases without damaging interface semantics by simply tracking the expected 'current' read offset in each ZipExtFile instance. Prior to performing any .read(), we simply call .seek() on the shared file object to restore that offset.
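
A minimal sketch of that idea, outside zipfile itself: OffsetTrackingReader is a hypothetical name, not a class in the module, and it reads a plain byte range rather than compressed member data. It only illustrates the "remember my offset, seek before every read" bookkeeping.

```python
import io

class OffsetTrackingReader:
    """Hypothetical helper: reads one byte range of a shared seekable file.

    Each instance remembers its own position and seeks the shared file
    there before every read, so several readers can be interleaved.
    Not thread-safe: seek() and read() are two separate calls.
    """

    def __init__(self, shared, start, length):
        self._shared = shared
        self._pos = start            # this reader's expected 'current' offset
        self._end = start + length   # first byte past this reader's range

    def read(self, n=-1):
        remaining = self._end - self._pos
        if n < 0 or n > remaining:
            n = remaining
        self._shared.seek(self._pos)   # restore this reader's offset
        data = self._shared.read(n)
        self._pos += len(data)         # remember where this reader stopped
        return data

shared = io.BytesIO(b"AAAAAAAABBBBBBBB")
first = OffsetTrackingReader(shared, 0, 8)
second = OffsetTrackingReader(shared, 8, 8)
chunks = [first.read(4), second.read(4), first.read(4), second.read(4)]
```

Each reader sees only its own range even though the reads alternate, at the cost of one extra seek() per read.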

Of course the result would not be thread-safe, but in the current code a ZipExtFile belonging to a zipfile constructed from a file-like object is already not thread-safe. With some additional work we could make the module thread-safe in both cases; however, this is not the current semantic and doesn't appear to be guaranteed by the module documentation.

---

Finally as to why you'd want to simultaneously open huge numbers of ZIP members, well, ZIP itself easily supports streamy reads, and ZIP files can be quite large, even larger than RAM. So it should be possible, as I needed last week, to read streamily from a large number of members.

---

The attached my.zip is sufficient to demonstrate both problems.

The attached my.py has function a() to demonstrate the FD leak and b() to demonstrate the intertwingled state.
msg230990 - Author: David Wilson (dw) Date: 2014-11-11 00:04
As a random side-note, this is another case where I really wish Python had a .pread() function. It's uniquely valuable for coordinating positioned reads in a threaded app without synchronization (at user level anyway) or extraneous system calls.
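
For context, a positioned read is available at the file-descriptor level on Unix as os.pread (documented as new in Python 3.3); it is not a method on file objects, which appears to be what the note above is asking for. A small sketch, where read_at is a hypothetical wrapper name:

```python
import os
import tempfile

def read_at(fd, offset, length):
    """Read up to `length` bytes at `offset` without moving the fd's position."""
    return os.pread(fd, length, offset)  # Unix-only

with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    fd = f.fileno()
    before = os.lseek(fd, 0, os.SEEK_CUR)  # current position of the descriptor
    tail = read_at(fd, 6, 4)
    head = read_at(fd, 0, 3)
    after = os.lseek(fd, 0, os.SEEK_CUR)   # unchanged by the positioned reads
```

Because the descriptor's position never moves, no seek-then-read pairing has to be protected at user level.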
msg230995 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2014-11-11 01:37
This is a duplicate of issue 16569 and issue 14099.  Since the former links to the latter I'm using that as the superseder.
History
Date User Action Args
2014-11-11 01:37:58 r.david.murray set status: open -> closed
superseder: Preventing errors of simultaneous access in zipfile
nosy: + r.david.murray
messages: + msg230995
resolution: duplicate
stage: resolved
2014-11-11 00:04:43 dw set messages: + msg230990
2014-11-10 23:58:48 dw create