This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: ZipFileExt.read() can be incredibly slow; patch included
Type: performance
Stage:
Components: Extension Modules
Versions: Python 3.2

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: andreb, lightstruk, pitrou
Priority: normal
Keywords: patch

Created on 2008-09-26 19:23 by lightstruk, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
zipfile_read_perf.patch lightstruk, 2008-09-26 19:23 zipfile.py extraction performance improvement
zeroes.zip lightstruk, 2008-09-26 19:24 demonstration ZIP, explodes from 100 KiB to 100 MiB
zipperf.patch pitrou, 2008-12-13 19:32
Messages (8)
msg73880 - (view) Author: James Athey (lightstruk) Date: 2008-09-26 19:23
I've created a patch that improves the decompression performance of
zipfile.py by up to two orders of magnitude.

In ZipFileExt.read(), decompressed bytes waiting to be read() sit in a
string buffer, self.readbuffer.  When a piece of that string is read,
the string is split in two, with the first piece being returned, and the
second piece becoming the new self.readbuffer.  Each of these two pieces
must be allocated space and have their contents copied into them.  When
the length of the readbuffer far exceeds the number of bytes requested,
allocating space for the two substrings and copying in their contents
becomes very expensive.
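The pattern described above can be sketched as follows (a minimal illustration of the slicing cost, not the actual zipfile.py code):

```python
# Stand-in for many MiB of decompressed bytes sitting in self.readbuffer.
data = b"\x00" * (1024 * 1024)
chunk = 16 * 1024

readbuffer = data
out = []
while readbuffer:
    # Both slices allocate new strings and copy their contents; the
    # second copies the entire unread remainder, so the whole loop
    # does work quadratic in len(data).
    head, readbuffer = readbuffer[:chunk], readbuffer[chunk:]
    out.append(head)

assert b"".join(out) == data
```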

The attached zeroes.zip demonstrates a worst-case scenario for this
problem.  It contains one 100 MiB file filled with zeroes.  This file
compresses to just 100 KiB, however, because it is so repetitive.  This
repetitive data means that the zlib decompressor returns many MiBs of
uncompressed data when fed just 64 KiB of compressed data.  Each call to
read() requests only 16 KiB, so each call must reallocate and copy many
MiBs.

The attached patch makes the read buffer a StringIO instead of a string.
Each call to the decompressor creates a new StringIO buffer.  Reading
from the StringIO does not create a new string for the unread data.
When the buffer has been exhausted, a new StringIO is created with the
next batch of uncompressed bytes.
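The idea can be sketched like this (using io.BytesIO, Python 3's byte-oriented equivalent of the StringIO in the 2008 patch; this is an illustration of the technique, not the patch itself):

```python
import io
import zlib

class StreamReader:
    """Partial reads advance a file position inside a BytesIO instead
    of re-copying the unread remainder into a new string."""

    def __init__(self, compressed_chunks):
        self._chunks = iter(compressed_chunks)
        self._decomp = zlib.decompressobj()
        self._buf = io.BytesIO()

    def read(self, n):
        data = self._buf.read(n)  # does not copy the unread tail
        while len(data) < n:
            try:
                raw = next(self._chunks)
            except StopIteration:
                break
            # Buffer exhausted: wrap the next decompressed batch
            # in a fresh BytesIO.
            self._buf = io.BytesIO(self._decomp.decompress(raw))
            data += self._buf.read(n - len(data))
        return data

# Demo: decompress a highly repetitive payload in small reads.
payload = b"\x00" * 100_000
comp = zlib.compress(payload)
reader = StreamReader([comp[i:i + 1024] for i in range(0, len(comp), 1024)])
out = b""
while True:
    piece = reader.read(4096)
    if not piece:
        break
    out += piece
assert out == payload
```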

The patch also fixes the behavior of zipfile.py when called as a script
with the -e flag.  Before, to extract a file, it decompressed the entire
file to memory, and then wrote the entire file to disk.  This behavior
is undesirable if the decompressed file is even remotely large.  Now, it
uses ZipFile.extractall(), which correctly streams the decompressed data
to disk.
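For reference, the streaming extraction path looks like this (a minimal usage sketch; a small synthetic archive stands in for the zeroes.zip attached to this issue):

```python
import io
import os
import tempfile
import zipfile

# Build a small stand-in archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("zeroes", b"\x00" * 100_000)
buf.seek(0)

dest = tempfile.mkdtemp()
with zipfile.ZipFile(buf) as zf:
    # Streams each member to disk in chunks rather than building
    # the whole decompressed file as one in-memory string first.
    zf.extractall(dest)

with open(os.path.join(dest, "zeroes"), "rb") as f:
    extracted = f.read()
assert extracted == b"\x00" * 100_000
```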

unzip vs. Python's zipfile.py vs. patched zipfile.py:

$ time unzip -e zeroes.zip
Archive:  zeroes.zip
  inflating: zeroes_unzip/zeroes

real    0m0.707s
user    0m0.463s
sys     0m0.244s

$ time python zipfileold.py -e zeroes.zip zeroes_old

real    3m42.012s
user    0m57.670s
sys     2m43.510s

$ time python zipfile.py -e zeroes.zip zeroes_patched

real    0m0.986s
user    0m0.409s
sys     0m0.490s

In this test, the patched version is 246x faster than the unpatched
version, and is not far off the pace of the C version.

Incidentally, this patch also improves performance when the data is not
repetitive.  I tested a ZIP containing a single compressed file filled
with random data, created by running
$ dd if=/dev/urandom of=random bs=1024 count=1024
$ zip random.zip random
This archive demonstrates the opposite scenario - where compression has
next to no impact on file size, and the read buffer will never be
dramatically larger than the amount of data fed to the zlib decompressor.

$ time python zipfileold.py -e random.zip random_old

real    0m0.063s
user    0m0.053s
sys     0m0.010s

$ time python zipfile.py -e random.zip random_patched

real    0m0.059s
user    0m0.047s
sys     0m0.012s
msg73921 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-27 12:38
Very interesting, but it will have to wait for 2.7/3.1. 2.6 and 3.0 are
in the final phases of the release process.
msg74135 - (view) Author: James Athey (lightstruk) Date: 2008-10-01 16:17
Why not include this in 2.6.1 or 3.0.1?  The patch fixes several bugs;
it does not provide any new functionality.
msg77761 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-13 19:32
Attaching a cleanup of the proposed patch. The funny thing is that for
me, both the unpatched and patched versions are as fast as the unzip binary.
msg115643 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-05 12:48
The patch has been outdated by other independent performance work on the zipfile module. In Python 3.2, the zipfile module is actually slightly faster than the "unzip" program:

- first with the supplied "zeroes.zip" file:

$ rm -f zeroes && time -p unzip -e zeroes.zip
Archive:  zeroes.zip
  inflating: zeroes                  
real 0.56
user 0.50
sys 0.06

$ time -p ./python -m zipfile -e zeroes.zip .
real 0.45
user 0.34
sys 0.10

- Then with a 100MB random file:

$ rm -f random && time -p unzip -e random.zip
Archive:  random.zip
  inflating: random                  
real 0.69
user 0.61
sys 0.07

$ rm -f random && time -p ./python -m zipfile -e random.zip .
real 0.33
user 0.18
sys 0.14
msg126260 - (view) Author: Andre Berg (andreb) Date: 2011-01-14 13:39
If I may chime in, as I don't know where else to put this.

I am still seeing the same performance as the OP when I use extractall() with a password protected ZIP of size 287 MB (containing one compressed movie file of size 297 MB).

The total running time for extractall.py was
real    35m24.448s
user    34m52.423s
sys    0m1.448s

For a bash script using unzip -P the running time on the same file was

real	0m19.026s
user	0m8.359s
sys	0m0.414s

extractall.py loops over the contents of a directory using os.walk, identifies zip files by file extension, and extracts a certain portion of the filename as the password using a regex. If I leave the ZipFile.extractall part out of it and run it, it takes 0.15 s.
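A script like the one described might look like this (a hypothetical reconstruction: the real extractall.py is not attached, and the password-in-brackets naming convention is invented here for illustration):

```python
import os
import re
import zipfile

# Hypothetical pattern: password enclosed in brackets in the file name,
# e.g. "movie [secret].zip".
PASSWORD_RE = re.compile(r"\[([^\]]+)\]")

def extract_tree(root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".zip"):
                continue
            m = PASSWORD_RE.search(name)
            pwd = m.group(1).encode() if m else None
            with zipfile.ZipFile(os.path.join(dirpath, name)) as zf:
                # Decryption (when pwd is set) runs in pure Python,
                # which is where the 35-minute runtime comes from.
                zf.extractall(dirpath, pwd=pwd)
```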

This is with Python 2.7.1 and Python 3.1.2 on Mac OS X 10.6.4 on an 8-core MacPro with 16 GB of RAM. The file is read from an attached USB drive. Maybe that makes a difference. I wish I could tell you more.

This is just for the record. I don't expect this to be fixed.
msg126261 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-14 13:48
> I am still seeing the same performance as the OP when I use
> extractall() with a password protected ZIP of size 287 MB (containing
> one compressed movie file of size 297 MB).

Please try with a non-password protected file.
msg126275 - (view) Author: Andre Berg (andreb) Date: 2011-01-14 16:39
"Decryption is extremely slow as it is implemented in native Python rather than C"

Right, of course, I missed this when reading the docs.
I have a habit of jumping straight to the point.

As I was asked to try it with a non-password protected zip file here's the numbers for comparison.

Same file, re-zipped without encryption; extractall.py now finishes in 16 s.
History
Date User Action Args
2022-04-11 14:56:39adminsetgithub: 48228
2011-01-14 16:39:32andrebsetnosy: pitrou, lightstruk, andreb
messages: + msg126275
2011-01-14 13:48:07pitrousetnosy: pitrou, lightstruk, andreb
messages: + msg126261
2011-01-14 13:39:35andrebsetnosy: + andreb
messages: + msg126260
2010-09-05 12:48:59pitrousetstatus: open -> closed
resolution: out of date
messages: + msg115643

versions: + Python 3.2, - Python 3.1, Python 2.7
2008-12-13 19:32:39pitrousetfiles: + zipperf.patch
messages: + msg77761
2008-12-05 12:49:48lightstruksettitle: ZipFileExt.read() can be incredibly slow -> ZipFileExt.read() can be incredibly slow; patch included
2008-10-01 16:17:41lightstruksetmessages: + msg74135
2008-09-27 12:38:46pitrousetpriority: normal
nosy: + pitrou
messages: + msg73921
versions: + Python 3.1, - Python 2.6
2008-09-26 19:24:27lightstruksetfiles: + zeroes.zip
2008-09-26 19:23:38lightstrukcreate