classification
Title: ZipFile unzip is unbuffered
Type: performance Stage: resolved
Components: IO, Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Jimbofbx, docs@python, nadeem.vawda, pitrou, python-dev, serhiy.storchaka, xuanji
Priority: normal Keywords: patch

Created on 2010-11-09 15:51 by Jimbofbx, last changed 2012-06-23 14:51 by pitrou. This issue is now closed.

Files
File name Uploaded Description
zipfiletest.py Jimbofbx, 2012-05-01 20:26
zipfile_optimize_read.patch serhiy.storchaka, 2012-05-10 22:24
zipfile_optimize_read.patch loewis, 2012-05-13 18:32 regenerate patch for review (without manually deleted chunks)
zipfile_optimize_read_2.patch serhiy.storchaka, 2012-05-31 07:44
Messages (12)
msg120871 - (view) Author: James Hutchison (Jimbofbx) Date: 2010-11-09 15:51
The zipfile module is always unbuffered (tested with v3.1.2 on Windows XP, 32-bit). This means that doing many small reads is a lot slower than reading a chunk of data into a buffer and then reading from that buffer. It seems logical that the zipfile module should default to buffered reading and/or have a buffered argument. Likewise, the documentation should clarify that there is no buffering involved when doing a read, which runs contrary to the default behavior of a normal read.

start Zipfile read
done
27432 reads done
took 0.859 seconds
start buffered Zipfile read
done
27432 reads done
took 0.072 seconds
start normal read (default buffer)
done
27432 reads done
took 0.139 seconds
start buffered normal read
done
27432
took 0.137 seconds
msg120873 - (view) Author: James Hutchison (Jimbofbx) Date: 2010-11-09 15:55
I should clarify that this is the zipfile constructor I am using:

zipfile.ZipFile(filename, mode='r', allowZip64=True)
msg159603 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-29 11:56
Actually reading from the zip file is buffered (at least 4 KiB of uncompressed data at a time).

Can you give tests, scripts and data, which show the problem?
msg159767 - (view) Author: James Hutchison (Jimbofbx) Date: 2012-05-01 20:26
See the attached script, which opens a zipfile that contains one file and reads it a number of times using unbuffered and buffered idioms. This was tested on Windows using Python 3.2.

You're in charge of coming up with a file to test it on. Sorry.

Example output:

Enter filename: test.zip
Timing unbuffered read, 5 bytes at a time. 10 loops
took 6.671999931335449
Timing buffered read, 5 bytes at a time (4000 byte buffer). 10 loops
took 0.7350001335144043
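
For readers without the attachment, the two idioms being timed might look roughly like the sketch below. This is not the attached zipfiletest.py: the helper names (`make_test_zip`, `read_direct`, `read_via_block`) and the in-memory archive are assumptions made for illustration, and only the chunk size (5 bytes) and block size (4000 bytes) come from the output above.

```python
import io
import time
import zipfile


def make_test_zip(payload, name="data.bin"):
    """Build a one-member zip archive in memory (hypothetical test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    buf.seek(0)
    return buf


def read_direct(zf, name, chunk=5):
    """'Unbuffered' idiom: one read() call on the zip member per small chunk."""
    chunks = []
    with zf.open(name) as f:
        while True:
            piece = f.read(chunk)
            if not piece:
                break
            chunks.append(piece)
    return chunks


def read_via_block(zf, name, chunk=5, block=4000):
    """'Buffered' idiom: read large blocks, then slice small chunks from them."""
    chunks = []
    with zf.open(name) as f:
        while True:
            data = f.read(block)
            if not data:
                break
            for i in range(0, len(data), chunk):
                chunks.append(data[i:i + chunk])
    return chunks


if __name__ == "__main__":
    archive = make_test_zip(bytes(range(256)) * 4000)
    with zipfile.ZipFile(archive) as zf:
        for fn in (read_direct, read_via_block):
            start = time.perf_counter()
            fn(zf, "data.bin")
            print(fn.__name__, "took", time.perf_counter() - start)
```

Both functions return the same bytes overall; the second simply replaces most method calls on the zip member with slicing of an in-memory bytes object.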
msg160377 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-10 22:24
This is not because the zipfile module is unbuffered. It is the difference between an expensive function call and cheap bytes slicing. Replace `zf.open(namelist[0])` with `io.BufferedReader(zf.open(namelist[0]))` to see the effect of good buffering. In 3.2 the zipfile read() implementation is not optimal, so it is about twice as slow, but in 3.3 it will be almost as fast as using io.BufferedReader. It is still several times slower than bytes slicing, but nothing can be done about that.

Here is a patch that speeds up reading from a zip file in small chunks by about 20%. Microbenchmark:

./python -m zipfile -c test.zip python
./python -m timeit -n 1 -s "import zipfile;zf=zipfile.ZipFile('test.zip')"  "with zf.open('python') as f:"  "  while f.read(1):pass"

Python 3.3 (vanilla):  1 loops, best of 3: 36.4 sec per loop
Python 3.3 (patched):  1 loops, best of 3: 30.1 sec per loop
Python 3.3 (with io.BufferedReader):  1 loops, best of 3: 30.2 sec per loop
And, for comparison, Python 3.2:  1 loops, best of 3: 74.5 sec per loop
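
The `io.BufferedReader` substitution described in this message can be sketched as follows. This is a minimal illustration, not the benchmark actually run here: the helper names (`make_archive`, `count_reads`, `read_plain`, `read_wrapped`), the member name, and the payload size are assumptions, and the timings it prints will depend on the Python version and machine.

```python
import io
import timeit
import zipfile


def make_archive(payload, name="member"):
    """Build a one-member zip archive in memory (hypothetical test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    buf.seek(0)
    return buf


def count_reads(fileobj, size=1):
    """Read size bytes at a time until EOF; return the number of non-empty reads."""
    n = 0
    while fileobj.read(size):
        n += 1
    return n


def read_plain(zf, name, size=1):
    """Read directly from the object returned by ZipFile.open()."""
    with zf.open(name) as f:
        return count_reads(f, size)


def read_wrapped(zf, name, size=1):
    """Read through an io.BufferedReader wrapped around the zip member."""
    with io.BufferedReader(zf.open(name)) as f:
        return count_reads(f, size)


if __name__ == "__main__":
    archive = make_archive(b"x" * 200_000)
    with zipfile.ZipFile(archive) as zf:
        for fn in (read_plain, read_wrapped):
            seconds = timeit.timeit(lambda: fn(zf, "member"), number=3)
            print(f"{fn.__name__}: {seconds:.3f} s for 3 runs")
```

Closing the `io.BufferedReader` also closes the underlying zip member, so the `with` block in `read_wrapped` releases both.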
msg160542 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 18:36
Thank you, Martin, now I understand why the Rietveld review did not work.
msg161985 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-31 07:44
The patch has been updated to reflect Martin's stylistic comments.

Sorry for the delay, Martin. I did not receive an email with your review from 2012-05-13, and only today discovered your comments in Rietveld by accident. It seems there was some bug in Rietveld.
msg162831 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-14 21:26
Martin, is the patch good now?
msg163582 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-23 11:28
Any chance of committing the patch before the final feature freeze?
msg163603 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-06-23 13:15
Patch looks fine to me.

Antoine, can you commit this? I'm currently away from the computer that
has my SSH key on it.
msg163616 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-06-23 14:48
New changeset 0e8285321659 by Antoine Pitrou in branch 'default':
On behalf of Nadeem Vawda: issue #10376: micro-optimize reading from a Zipfile.
http://hg.python.org/cpython/rev/0e8285321659
msg163618 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-23 14:51
> Antoine, can you commit this?

Ok, done.
History
Date User Action Args
2012-06-23 14:51:21  pitrou  set  status: open -> closed; resolution: fixed; messages: + msg163618; stage: patch review -> resolved
2012-06-23 14:48:24  python-dev  set  nosy: + python-dev; messages: + msg163616
2012-06-23 13:15:59  nadeem.vawda  set  messages: + msg163603
2012-06-23 11:34:20  pitrou  set  assignee: docs@python ->; nosy: + nadeem.vawda; stage: patch review
2012-06-23 11:28:37  serhiy.storchaka  set  messages: + msg163582
2012-06-14 21:26:34  serhiy.storchaka  set  messages: + msg162831
2012-05-31 07:44:31  serhiy.storchaka  set  files: + zipfile_optimize_read_2.patch; messages: + msg161985
2012-05-13 18:36:01  serhiy.storchaka  set  messages: + msg160542
2012-05-13 18:32:04  loewis  set  files: + zipfile_optimize_read.patch
2012-05-10 22:26:06  vstinner  set  nosy: + pitrou
2012-05-10 22:24:06  serhiy.storchaka  set  files: + zipfile_optimize_read.patch; versions: - Python 2.7, Python 3.2; messages: + msg160377; components: - Documentation; keywords: + patch
2012-05-01 20:26:59  Jimbofbx  set  files: + zipfiletest.py; messages: + msg159767
2012-04-29 11:56:57  serhiy.storchaka  set  messages: + msg159603
2012-04-07 17:59:01  serhiy.storchaka  set  nosy: + serhiy.storchaka
2011-06-06 11:27:14  xuanji  set  nosy: + xuanji
2011-06-01 06:23:11  terry.reedy  set  versions: + Python 3.2, Python 3.3, - Python 2.6, Python 2.5, Python 3.1
2010-11-09 15:55:12  Jimbofbx  set  messages: + msg120873
2010-11-09 15:51:48  Jimbofbx  create