ZipFile unzip is unbuffered #54585

Jimbofbx · 2010-11-09T15:51:48Z

BPO	10376
Nosy	@pitrou, @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll, @serhiy-storchaka
Files	zipfiletest.py zipfile_optimize_read.patch zipfile_optimize_read.patch: regenerate patch for review (without manually deleted chunks) zipfile_optimize_read_2.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2012-06-23.14:51:21.731>
created_at = <Date 2010-11-09.15:51:48.085>
labels = ['library', 'expert-IO', 'performance']
title = 'ZipFile unzip is unbuffered'
updated_at = <Date 2012-06-23.14:51:21.730>
user = 'https://bugs.python.org/Jimbofbx'

bugs.python.org fields:

activity = <Date 2012-06-23.14:51:21.730>
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date = <Date 2012-06-23.14:51:21.731>
closer = 'pitrou'
components = ['Library (Lib)', 'IO']
creation = <Date 2010-11-09.15:51:48.085>
creator = 'Jimbofbx'
dependencies = []
files = ['25432', '25530', '25565', '25769']
hgrepos = []
issue_num = 10376
keywords = ['patch']
message_count = 12.0
messages = ['120871', '120873', '159603', '159767', '160377', '160542', '161985', '162831', '163582', '163603', '163616', '163618']
nosy_count = 7.0
nosy_names = ['pitrou', 'nadeem.vawda', 'docs@python', 'Jimbofbx', 'xuanji', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue10376'
versions = ['Python 3.3']

Jimbofbx · 2010-11-09T15:51:47Z

The Unzip module is always unbuffered (tested v.3.1.2 Windows XP, 32-bit). This means that if one has to do many small reads it is a lot slower than reading a chunk of data to a buffer and then reading from that buffer. It seems logical that the unzip module should default to buffered reading and/or have a buffered argument. Likewise, the documentation should clarify that there is no buffering involved when doing a read, which runs contrary to the default behavior of a normal read.

start Zipfile read
done
27432 reads done
took 0.859 seconds
start buffered Zipfile read
done
27432 reads done
took 0.072 seconds
start normal read (default buffer)
done
27432 reads done
took 0.139 seconds
start buffered normal read
done
27432
took 0.137 seconds

Jimbofbx · 2010-11-09T15:55:12Z

I should clarify that this is the zipfile constructor I am using:

zipfile.ZipFile(filename, mode='r', allowZip64=True);

serhiy-storchaka · 2012-04-29T11:56:58Z

Actually reading from the zip file is buffered (at least 4 KiB of uncompressed data at a time).

Can you give tests, scripts and data, which show the problem?

Jimbofbx · 2012-05-01T20:26:59Z

See attached, which will open a zipfile that contains one file and reads it a bunch of times using unbuffered and buffered idioms. This was tested on windows using python 3.2

You're in charge of coming up with a file to test it on. Sorry.

Example output:

Enter filename: test.zip
Timing unbuffered read, 5 bytes at a time. 10 loops
took 6.671999931335449
Timing buffered read, 5 bytes at a time (4000 byte buffer). 10 loops
took 0.7350001335144043

serhiy-storchaka · 2012-05-10T22:24:06Z

This is not because zipfile module is unbuffered. This is the difference between expensive function call and cheap bytes slicing. Replace zf.open(namelist [0]) to io.BufferedReader(zf.open(namelist [0])) to see the effect of a good buffering. In 3.2 zipfile read() implemented not optimal, so it slower (twice), but in 3.3 it will be almost as fast as using io.BufferedReader. It is still several times more slowly than bytes slicing, but there's nothing you can do with it.

Here is a patch, which is speeds up (+20%) the reading from a zip file by small chunks. Microbenchmark:

./python -m zipfile -c test.zip python
./python -m timeit -n 1 -s "import zipfile;zf=zipfile.ZipFile('test.zip')" "with zf.open('python') as f:" " while f.read(1):pass"

Python 3.3 (vanilla): 1 loops, best of 3: 36.4 sec per loop
Python 3.3 (patched): 1 loops, best of 3: 30.1 sec per loop
Python 3.3 (with io.BufferedReader): 1 loops, best of 3: 30.2 sec per loop
And, for comparison, Python 3.2: 1 loops, best of 3: 74.5 sec per loop

serhiy-storchaka · 2012-05-13T18:36:02Z

Thank you, Martin, now I understood why not work Rietveld review.

serhiy-storchaka · 2012-05-31T07:44:31Z

The patch updated to reflect Martin's stylistic comments.

Sorry for the delay, Martin. I have not received an email with your review from 2012-05-13, and only today accidentally discovered your comments in Rietveld. It seems to have been some bug in Rietveld.

serhiy-storchaka · 2012-06-14T21:26:35Z

Martin, now the patch is good?

serhiy-storchaka · 2012-06-23T11:28:34Z

Any chance to commit the patch before final feature freeze?

nadeemvawda · 2012-06-23T13:15:59Z

Patch looks fine to me.

Antoine, can you commit this? I'm currently away from the computer that
has my SSH key on it.

python-dev · 2012-06-23T14:48:24Z

New changeset 0e8285321659 by Antoine Pitrou in branch 'default':
On behalf of Nadeem Vawda: issue bpo-10376: micro-optimize reading from a Zipfile.
http://hg.python.org/cpython/rev/0e8285321659

pitrou · 2012-06-23T14:51:22Z

Antoine, can you commit this?

Ok, done.

Jimbofbx mannequin added the docs Documentation in the Doc dir label Nov 9, 2010

Jimbofbx mannequin assigned docspython Nov 9, 2010

Jimbofbx mannequin added stdlib Python modules in the Lib dir topic-IO performance Performance or resource usage labels Nov 9, 2010

serhiy-storchaka removed the docs Documentation in the Doc dir label May 10, 2012

pitrou unassigned docspython Jun 23, 2012

pitrou closed this as completed Jun 23, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZipFile unzip is unbuffered #54585

ZipFile unzip is unbuffered #54585

Jimbofbx mannequin commented Nov 9, 2010

Jimbofbx mannequin commented Nov 9, 2010

Jimbofbx mannequin commented Nov 9, 2010

serhiy-storchaka commented Apr 29, 2012

Jimbofbx mannequin commented May 1, 2012

serhiy-storchaka commented May 10, 2012

serhiy-storchaka commented May 13, 2012

serhiy-storchaka commented May 31, 2012

serhiy-storchaka commented Jun 14, 2012

serhiy-storchaka commented Jun 23, 2012

nadeemvawda mannequin commented Jun 23, 2012

python-dev mannequin commented Jun 23, 2012

pitrou commented Jun 23, 2012

ZipFile unzip is unbuffered #54585

ZipFile unzip is unbuffered #54585

Comments

Jimbofbx mannequin commented Nov 9, 2010

Jimbofbx mannequin commented Nov 9, 2010

Jimbofbx mannequin commented Nov 9, 2010

serhiy-storchaka commented Apr 29, 2012

Jimbofbx mannequin commented May 1, 2012

serhiy-storchaka commented May 10, 2012

serhiy-storchaka commented May 13, 2012

serhiy-storchaka commented May 31, 2012

serhiy-storchaka commented Jun 14, 2012

serhiy-storchaka commented Jun 23, 2012

nadeemvawda mannequin commented Jun 23, 2012

python-dev mannequin commented Jun 23, 2012

pitrou commented Jun 23, 2012