Message 244582 - Python tracker


Author	martin.panter
Recipients	Arfrever, Michael.Fox, eric.araujo, martin.panter, nadeem.vawda, pitrou, rhettinger, serhiy.storchaka, vstinner
Date	2015-06-01.13:51:13
Message-id	<1433166675.09.0.664536742806.issue18003@psf.upfronthosting.co.za>
In-reply-to

Content

This bug was originally raised against Python 3.3, and the speed has improved a lot since then. Perhaps this bug can be closed as it is, or maybe people would like to consider my decomp-optim.patch which squeezes a bit more speed out. I don’t actually have a strong opinion either way.

Python 3.4 was apparently much faster than 3.3, courtesy of Issue 16034. In Python 3.5, all three decompression modules (lzma, gzip and bz2) now use a BufferedReader internally, due to my work in Issue 23529. For backwards compatibility, the modules delegate method calls to the internal BufferedReader rather than returning an instance directly.
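To illustrate what delegation means here, the following is a minimal sketch and not the standard library code. `DelegatingReader` and its `readline_calls` counter are hypothetical names made up for this example. It shows that plain iteration over such a wrapper goes through a Python-level `readline()` call for every line.

```python
import io


class DelegatingReader(io.BufferedIOBase):
    """Hypothetical stand-in for a compressed-file class (not stdlib code)."""

    def __init__(self, payload):
        # The BufferedReader is kept internal rather than handed to the caller.
        self._buffer = io.BufferedReader(io.BytesIO(payload))
        self.readline_calls = 0  # illustration only: counts delegated calls

    def readable(self):
        return True

    def read(self, size=-1):
        # Delegate to the internal buffer.
        return self._buffer.read(size)

    def readline(self, size=-1):
        # Every line fetched by iteration passes through this method.
        self.readline_calls += 1
        return self._buffer.readline(size)


reader = DelegatingReader(b"one\ntwo\nthree\n")
lines = list(reader)  # the default iterator calls readline() repeatedly
print(lines, reader.readline_calls)
```

The extra Python-level call per line is cheap in isolation, but it adds up over hundreds of thousands of lines.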

I found that bypassing the readline() delegation speeds things up significantly, and adding a custom “closed” property on the underlying raw reader class also helps. However, I did not think it would be wise to bypass the locking in the “bz2” module, so I didn’t bypass BZ2File.readline() in the patch. Timing results and the test script I used to investigate different options are below:

                         lzma     gzip      bz2
                         =======  ========  ========
Unpatched                3.2 s    2.513 s   5.180 s
Custom __iter__()        1.31 s   1.317 s   2.433 s
__iter__() and closed    0.53 s*  0.543 s*  1.650 s
closed change only                          4.047 s*
External BufferedReader  0.64 s   0.597 s   1.750 s
Direct from BytesIO      0.33 s   0.370 s   1.280 s
Command-line tool        0.063 s  0.053 s   0.993 s

* Option implemented in decomp-optim.patch
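For a sense of scale, the speedups implied by the table can be worked out directly. This sketch only does arithmetic on the figures reported above; the variable names are illustrative, and the "patched" values are the rows marked with an asterisk.

```python
# Timings in seconds, copied from the table above.
unpatched = {"lzma": 3.2, "gzip": 2.513, "bz2": 5.180}
# Rows marked "*" (the option implemented in decomp-optim.patch).
patched = {"lzma": 0.53, "gzip": 0.543, "bz2": 4.047}

# Speedup factor = unpatched time / patched time, per module.
speedups = {name: unpatched[name] / patched[name] for name in unpatched}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```

On these figures the patch helps lzma and gzip by a much larger factor than bz2, which is consistent with BZ2File.readline() not being bypassed.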

---

import lzma, io
filename = "pacman.log.xz"  # 256206 lines; 389 kB -> 13 MB

# Basic case
reader = lzma.LZMAFile(filename)  # 3.2 s

# Add __iter__() optimization
def lzma_iter(self):
    self._check_can_read()
    return iter(self._buffer)
lzma.LZMAFile.__iter__ = lzma_iter  # 1.31 s

# Add “closed” optimization
def decompressor_closed(self):
    return self._decompressor is None
import _compression
_compression.DecompressReader.closed = property(decompressor_closed)  # 0.53 s

#~ # External BufferedReader baseline
#~ reader = io.BufferedReader(lzma.LZMAFile(filename))  # 0.64 s

#~ # Direct from BytesIO baseline
#~ with open(filename, "rb") as file:
#~     data = file.read()
#~ reader = io.BytesIO(lzma.decompress(data))  # 0.33 s

for line in reader:
    pass
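
The script above needs the pacman.log.xz file and manual commenting and uncommenting to switch variants. The following is a rough self-contained sketch of the same comparison, under some assumptions: it generates its own sample data in memory, uses a subclass (`IterLZMAFile`, a made-up name) instead of monkeypatching, and relies on the private `_check_can_read()` and `_buffer` attributes exactly as the script above does. The helper names `make_sample`, `count_lines` and `time_variants` are hypothetical, and the absolute timings will not match the table.

```python
import io
import lzma
import time


def make_sample(lines=20000):
    """Build a small xz-compressed log in memory (stand-in for a real file)."""
    text = b"".join(b"log entry %d\n" % i for i in range(lines))
    return lzma.compress(text)


class IterLZMAFile(lzma.LZMAFile):
    """LZMAFile with the __iter__() shortcut; depends on private attributes."""

    def __iter__(self):
        self._check_can_read()
        return iter(self._buffer)


def count_lines(reader):
    count = 0
    for _ in reader:
        count += 1
    return count


def time_variants(compressed):
    """Return {variant name: (line count, seconds)} for each reading strategy."""
    variants = {
        "LZMAFile": lambda: lzma.LZMAFile(io.BytesIO(compressed)),
        "LZMAFile + __iter__": lambda: IterLZMAFile(io.BytesIO(compressed)),
        "external BufferedReader": lambda: io.BufferedReader(
            lzma.LZMAFile(io.BytesIO(compressed))),
        "direct from BytesIO": lambda: io.BytesIO(lzma.decompress(compressed)),
    }
    results = {}
    for name, open_reader in variants.items():
        start = time.perf_counter()
        with open_reader() as reader:
            lines = count_lines(reader)
        results[name] = (lines, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    for name, (lines, seconds) in time_variants(make_sample()).items():
        print(f"{name:25} {lines} lines in {seconds:.3f} s")
```

Every variant should report the same line count; only the elapsed time differs. For steadier numbers, repeat each variant several times and take the minimum.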

History
Date	User	Action	Args
2015-06-01 13:51:15	martin.panter	set	recipients: + martin.panter, rhettinger, pitrou, vstinner, nadeem.vawda, eric.araujo, Arfrever, serhiy.storchaka, Michael.Fox
2015-06-01 13:51:15	martin.panter	set	messageid: <1433166675.09.0.664536742806.issue18003@psf.upfronthosting.co.za>
2015-06-01 13:51:15	martin.panter	link	issue18003 messages
2015-06-01 13:51:13	martin.panter	create