
Author lightstruk
Recipients lightstruk
Date 2008-09-26.19:23:35
SpamBayes Score 1.110223e-16
Marked as misclassified No
Message-id <1222457019.27.0.267576306603.issue3978@psf.upfronthosting.co.za>
In-reply-to
Content
I've created a patch that improves the decompression performance of
zipfile.py by up to two orders of magnitude.

In ZipFileExt.read(), decompressed bytes waiting to be read() sit in a
string buffer, self.readbuffer.  When a piece of that string is read,
the string is split in two: the first piece is returned, and the second
piece becomes the new self.readbuffer.  Space must be allocated for both
pieces and their contents copied into them.  When the length of the
readbuffer far exceeds the number of bytes requested, allocating and
copying the two substrings becomes very expensive.
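
For illustration, the pattern looks roughly like this (a simplified
sketch, not the literal zipfile.py source):

    # Simplified sketch of the old buffering.  Both slices allocate new
    # strings and copy their contents, so when self.readbuffer holds many
    # MiBs, every small read() pays for copying nearly the whole buffer.
    def read(self, size):
        data = self.readbuffer[:size]
        self.readbuffer = self.readbuffer[size:]
        return data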

The attached zeroes.zip demonstrates a worst-case scenario for this
problem.  It contains one 100 MiB file filled with zeroes.  Because the
data is so repetitive, the file compresses to just 100 KiB.  That same
repetitiveness means the zlib decompressor returns many MiBs of
uncompressed data when fed just 64 KiB of compressed data.  Each call to
read() requests only 16 KiB, so each call must reallocate and copy many
MiBs.
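
For reference, an equivalent archive can be generated with something
along these lines (sizes taken from the description above):

    import zipfile

    zf = zipfile.ZipFile('zeroes.zip', 'w', zipfile.ZIP_DEFLATED)
    # 100 MiB of zero bytes; deflate shrinks this to roughly 100 KiB
    zf.writestr('zeroes', '\0' * (100 * 1024 * 1024))
    zf.close()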

The attached patch makes the read buffer a StringIO instead of a string.
Each call to the decompressor creates a new StringIO buffer.  Reading
from the StringIO does not create a new string for the unread data.
When the buffer has been exhausted, a new StringIO is created with the
next batch of uncompressed bytes.
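
A minimal sketch of that buffering scheme (simplified to a single
refill; self.decompressor and self.fileobj are placeholder names, not
necessarily those used in the patch):

    from cStringIO import StringIO

    def read(self, size):
        # StringIO.read() just advances an internal offset; the unread
        # tail is never copied into a new string.
        data = self.readbuffer.read(size)
        if len(data) < size:
            # Buffer exhausted: decompress the next batch of compressed
            # bytes and wrap it in a fresh StringIO.
            chunk = self.decompressor.decompress(self.fileobj.read(65536))
            self.readbuffer = StringIO(chunk)
            data += self.readbuffer.read(size - len(data))
        return data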

The patch also fixes the behavior of zipfile.py when called as a script
with the -e flag.  Before, to extract a file, it decompressed the entire
file to memory, and then wrote the entire file to disk.  This behavior
is undesirable if the decompressed file is even remotely large.  Now, it
uses ZipFile.extractall(), which correctly streams the decompressed data
to disk.
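
The same streaming extraction can be done directly from Python
(ZipFile.extractall() is available as of Python 2.6; the target
directory name here is just for illustration):

    import zipfile

    zf = zipfile.ZipFile('zeroes.zip')
    # extractall() copies each member to disk in chunks instead of
    # holding a whole decompressed file in memory.
    zf.extractall('zeroes_extracted')
    zf.close()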

unzip vs. Python's zipfile.py vs. patched zipfile.py:

$ time unzip -e zeroes.zip
Archive:  zeroes.zip
  inflating: zeroes_unzip/zeroes

real    0m0.707s
user    0m0.463s
sys     0m0.244s

$ time python zipfileold.py -e zeroes.zip zeroes_old

real    3m42.012s
user    0m57.670s
sys     2m43.510s

$ time python zipfile.py -e zeroes.zip zeroes_patched

real    0m0.986s
user    0m0.409s
sys     0m0.490s

In this test, the patched version is 246x faster than the unpatched
version when measured in CPU time (user + sys), and is not far off the
pace of the C implementation in unzip.

Incidentally, this patch also gives a small improvement when the data is
not repetitive.  I tested a ZIP containing a single compressed file
filled with random data, created by running
$ dd if=/dev/urandom of=random bs=1024 count=1024
$ zip random.zip random
This archive demonstrates the opposite scenario: compression has next to
no impact on file size, so the read buffer never grows dramatically
larger than the amount of data fed to the zlib decompressor.

$ time python zipfileold.py -e random.zip random_old

real    0m0.063s
user    0m0.053s
sys     0m0.010s

$ time python zipfile.py -e random.zip random_patched

real    0m0.059s
user    0m0.047s
sys     0m0.012s