Message 230987 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dw
Recipients	dw
Date	2014-11-10.23:58:47
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1415663928.91.0.265332652866.issue22842@psf.upfronthosting.co.za>
In-reply-to

Content
There is some really funky behaviour in the zipfile module, where, depending on whether zipfile.ZipFile() is passed a string filename or a file-like object, one of two things happens: a) Given a file-like object, zipfile does not (since it cannot) consume excess file descriptors on each call to '.open()', however simultaneous calls to .open() the zip file's members (from the same thread) will produce file-like objects for each member that appear intertwingled in some unfortunate manner: Traceback (most recent call last): File "my.py", line 23, in <module> b() File "my.py", line 18, in b m.readline() File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 689, in readline return io.BufferedIOBase.readline(self, limit) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 727, in peek chunk = self.read(n) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 763, in read data = self._read1(n) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 839, in _read1 data = self._decompressor.decompress(data, n) zlib.error: Error -3 while decompressing data: invalid stored block lengths b) Given a string filename, simultaneous use of .open() produces a new file descriptor for each opened member, which does not result in the above error, but triggers an even worse one: file descriptor exhaustion given a sufficiently large zip file. This tripped me up rather badly last week during consulting work, and I'd like to see both these behaviours fixed somehow. The ticket is more an RFC to see if anyone has thoughts on how this fix should happen; it seems to me a no-brainer that, since the ZIP file format fundamentally always requires a seekable file, that in both the "constructed using file-like object" case, and the "constructed using filename" case, we should somehow reuse the sole file object passed to us to satisfy all reads of compressed member data. It seems the problems can be fixed in both cases without damaging interface semantics by simply tracking the expected 'current' read offset in each ZipExtFile instance. Prior to any read, we simply call .seek() on the file object prior to performing any .read(). Of course the result would not be thread safe, but at least in the current code, ZipExtFile for a "constructed from a file-like object" edition zipfile is already not thread-safe. With some additional work, we could make the module thread-safe in both cases, however this is not the current semantic and doesn't appear to be guaranteed by the module documentation. --- Finally as to why you'd want to simultaneously open huge numbers of ZIP members, well, ZIP itself easily supports streamy reads, and ZIP files can be quite large, even larger than RAM. So it should be possible, as I needed last week, to read streamily from a large number of members. --- The attached my.zip is sufficient to demonstrate both problems. The attached my.py has function a() to demonstrate the FD leak and b() to demonstrate the interwingly state.

There is some really funky behaviour in the zipfile module, where, depending on whether zipfile.ZipFile() is passed a string filename or a file-like object, one of two things happens:

a) Given a file-like object, zipfile does not (since it cannot) consume excess file descriptors on each call to '.open()', however simultaneous calls to .open() the zip file's members (from the same thread) will produce file-like objects for each member that appear intertwingled in some unfortunate manner:

Traceback (most recent call last):
  File "my.py", line 23, in <module>
    b()
  File "my.py", line 18, in b
    m.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 689, in readline
    return io.BufferedIOBase.readline(self, limit)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 727, in peek
    chunk = self.read(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 763, in read
    data = self._read1(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 839, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths



b) Given a string filename, simultaneous use of .open() produces a new file descriptor for each opened member, which does not result in the above error, but triggers an even worse one: file descriptor exhaustion given a sufficiently large zip file.


This tripped me up rather badly last week during consulting work, and I'd like to see both these behaviours fixed somehow. The ticket is more an RFC to see if anyone has thoughts on how this fix should happen; it seems to me a no-brainer that, since the ZIP file format fundamentally always requires a seekable file, that in both the "constructed using file-like object" case, and the "constructed using filename" case, we should somehow reuse the sole file object passed to us to satisfy all reads of compressed member data.

It seems the problems can be fixed in both cases without damaging interface semantics by simply tracking the expected 'current' read offset in each ZipExtFile instance. Prior to any read, we simply call .seek() on the file object prior to performing any .read().

Of course the result would not be thread safe, but at least in the current code, ZipExtFile for a "constructed from a file-like object" edition zipfile is already not thread-safe. With some additional work, we could make the module thread-safe in both cases, however this is not the current semantic and doesn't appear to be guaranteed by the module documentation.

---

Finally as to why you'd want to simultaneously open huge numbers of ZIP members, well, ZIP itself easily supports streamy reads, and ZIP files can be quite large, even larger than RAM. So it should be possible, as I needed last week, to read streamily from a large number of members.

---

The attached my.zip is sufficient to demonstrate both problems.

The attached my.py has function a() to demonstrate the FD leak and b() to demonstrate the interwingly state.

History
Date	User	Action	Args
2014-11-10 23:58:49	dw	set	recipients: + dw
2014-11-10 23:58:48	dw	set	messageid: <1415663928.91.0.265332652866.issue22842@psf.upfronthosting.co.za>
2014-11-10 23:58:48	dw	link	issue22842 messages
2014-11-10 23:58:47	dw	create