Message 231132 - Python tracker


Author	dw
Recipients	Arfrever, alanmcintyre, dw, eric.araujo, kasal, loewis, mcepl, ocean-city, pitrou, r.david.murray, serhiy.storchaka
Date	2014-11-13.18:53:02
Message-id	<1415904785.79.0.669595612591.issue14099@psf.upfronthosting.co.za>

Content

Per my comment on issue16569, the overhead of performing one seek before each (raw file data) read is quite minimal. I have attached a new (but incomplete) patch, on which the following microbenchmarks are based. The patch is essentially identical to Stepan's 2012 patch, except I haven't yet decided how best to preserve the semantics of ZipFile.close().
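
To illustrate the seek-before-read idea, here is a minimal sketch. It is not the attached patch: the SharedFile class, its attributes and the sample data are all hypothetical, and it assumes a lock is an acceptable way to keep each seek and read pair atomic.

```python
import io
import threading


class SharedFile:
    """One independent read cursor over a file object shared by many readers.

    Hypothetical helper for illustration; not taken from zipfile.py.
    """

    def __init__(self, fileobj, start, lock):
        self._file = fileobj
        self._pos = start    # this reader's private offset into the file
        self._lock = lock    # serializes seek+read pairs on the shared file

    def read(self, n=-1):
        with self._lock:
            self._file.seek(self._pos)    # the one extra seek per raw read
            data = self._file.read(n)
            self._pos = self._file.tell()
            return data


raw = io.BytesIO(b"AAAABBBBCCCCDDDD")
lock = threading.Lock()
first = SharedFile(raw, 0, lock)
second = SharedFile(raw, 8, lock)

# Interleave reads from both cursors over the same underlying file object.
chunks = [first.read(4), second.read(4), first.read(4), second.read(4)]
print(chunks)
```

Each reader remembers its own offset, so interleaved reads do not disturb one another. The cost is the single seek() per read that the benchmarks below try to measure.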

"my.zip" is the same my.zip from issue22842. It contains 10,000 files each containing 10 bytes over 2 lines.

"my2.zip" contains 8,000 files, each containing the same 64 KB of /dev/urandom output. The resulting ZIP is 500 MB.
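
For anyone wanting comparable inputs, a rough sketch of a generator follows. The member names, the 10-byte payload and the use of ZIP_STORED are assumptions (the messages do not say how the archives were built), and the counts are scaled down so it runs quickly.

```python
import os
import tempfile
import zipfile


def make_archive(path, count, payload):
    """Write `count` members, each holding the same `payload` bytes."""
    # ZIP_STORED is an assumption; the original compression method is unknown.
    with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as zf:
        for i in range(count):
            # Hypothetical member naming scheme.
            zf.writestr("file%05d.txt" % i, payload)


# Scaled-down counts; the archives described above would correspond to
# count=10000 with a 10-byte payload and count=8000 with a 64 KB payload.
tmpdir = tempfile.mkdtemp()
small = os.path.join(tmpdir, "small.zip")
large = os.path.join(tmpdir, "large.zip")

make_archive(small, 100, b"abcd\nefgh\n")          # 10 bytes over 2 lines
make_archive(large, 10, os.urandom(64 * 1024))     # same random block in every member

with zipfile.ZipFile(small) as zf:
    names = zf.namelist()
    first_member = zf.read(names[0])
```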

For each test, the first run is the existing zipfile module, and the second run is with the patch. In summary:

* There is a 35% perf increase in str mode when handling many small files (on OS X at least).
* There is a 6% perf decrease in file mode when handling small sequential reads.
* There is a 2.4% perf decrease in file mode when handling large sequential reads.


From my reading of zipfile.py, it is clear there are _many_ ways to improve its performance (probably starting with readline()), and rejection of a functional fix should almost certainly be at the bottom of that list.


For each of the tests below, the functions used were:

    import zipfile

    def a():
        """
        Test concurrent line reads to a str mode ZipFile.
        """
        zf = zipfile.ZipFile('my2.zip')
        members = [zf.open(n) for n in zf.namelist()]
        for m in members:
            m.readline()
        for m in members:
            m.readline()

    def c():
        """
        Test sequential small reads to a str mode ZipFile.
        """
        zf = zipfile.ZipFile('my2.zip')
        for name in zf.namelist():
            with zf.open(name) as zfp:
                zfp.read(1000)

    def d():
        """
        Test sequential small reads to a file mode ZipFile.
        """
        fp = open('my2.zip', 'rb')
        zf = zipfile.ZipFile(fp)
        for name in zf.namelist():
            with zf.open(name) as zfp:
                zfp.read(1000)

    def e():
        """
        Test sequential large reads to a file mode ZipFile.
        """
        fp = open('my2.zip', 'rb')
        zf = zipfile.ZipFile(fp)
        for name in zf.namelist():
            with zf.open(name) as zfp:
                zfp.read()


---- my.zip ----

$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.47 sec per loop

$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 950 msec per loop

---

$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 1.3 sec per loop

$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 865 msec per loop

---

$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 800 msec per loop

$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 851 msec per loop


---- my2.zip ----

$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.46 sec per loop

$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.16 sec per loop

---

$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 1.13 sec per loop

$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 892 msec per loop

---

$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 842 msec per loop

$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 882 msec per loop

---

$ python3.4 -m timeit -s 'import my' 'my.e()'
10 loops, best of 3: 1.65 sec per loop

$ python3.4 -m timeit -s 'import my' 'my.e()'
10 loops, best of 3: 1.69 sec per loop
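
The summary percentages can be re-derived from these timings. The sketch below is only arithmetic on three of the before/after pairs listed above; the percent_change helper is a hypothetical name, and a negative result means the patched run took less time.

```python
def percent_change(before, after):
    """Change in elapsed time relative to the unpatched run, in percent."""
    return (after - before) / before * 100.0


# (before, after) in seconds, copied from the timeit output above.
timings = {
    "my.zip a()": (1.47, 0.950),
    "my.zip d()": (0.800, 0.851),
    "my2.zip e()": (1.65, 1.69),
}

changes = {
    name: round(percent_change(before, after), 1)
    for name, (before, after) in timings.items()
}
for name, change in changes.items():
    print("%-12s %+.1f%%" % (name, change))
```

These correspond to the 35%, 6% and 2.4% figures in the summary.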

History
Date	User	Action	Args
2014-11-13 18:53:05	dw	set	recipients: + dw, loewis, alanmcintyre, pitrou, ocean-city, mcepl, eric.araujo, Arfrever, r.david.murray, kasal, serhiy.storchaka
2014-11-13 18:53:05	dw	set	messageid: <1415904785.79.0.669595612591.issue14099@psf.upfronthosting.co.za>
2014-11-13 18:53:05	dw	link	issue14099 messages
2014-11-13 18:53:04	dw	create