Author martin.panter
Recipients Arfrever, Michael.Fox, eric.araujo, martin.panter, nadeem.vawda, pitrou, rhettinger, serhiy.storchaka, vstinner
Date 2015-06-01.13:51:13
Message-id <1433166675.09.0.664536742806.issue18003@psf.upfronthosting.co.za>
Content
This bug was originally raised against Python 3.3, and the speed has improved a lot since then. Perhaps this bug can be closed as it is, or maybe people would like to consider my decomp-optim.patch which squeezes a bit more speed out. I don’t actually have a strong opinion either way.

Python 3.4 was apparently much faster than 3.3 courtesy of Issue 16034. In Python 3.5, all three decompression modules (lzma, gzip and bz2) now use a BufferedReader internally, due to my work in Issue 23529. For backwards compatibility, the modules delegate method calls to the internal BufferedReader rather than returning a BufferedReader instance directly.
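To illustrate the delegation described above, here is a minimal sketch. DelegatingFile is a hypothetical class, not the actual LZMAFile, GzipFile or BZ2File code; it only shows the general shape of a wrapper that owns a BufferedReader and forwards calls to it. Iterating over such an object goes through the wrapper's readline() for every line, which is the extra per-line cost at issue here.

```python
import io

class DelegatingFile(io.BufferedIOBase):
    """Hypothetical wrapper that owns a BufferedReader and forwards calls to it."""

    def __init__(self, raw):
        # The wrapper keeps the BufferedReader private instead of returning it.
        self._buffer = io.BufferedReader(raw)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._buffer.read(size)

    def readline(self, size=-1):
        # Every line costs one extra Python-level call on top of the
        # BufferedReader's own readline().
        return self._buffer.readline(size)

    def close(self):
        self._buffer.close()
        super().close()

# The default iterator inherited from IOBase calls self.readline() per line.
f = DelegatingFile(io.BytesIO(b"one\ntwo\nthree\n"))
lines = list(f)
f.close()
```

Callers still see the wrapper's own type and methods, which is the backwards-compatibility reason for delegating rather than handing out the BufferedReader itself.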

I found that bypassing the readline() delegation speeds things up significantly, and adding a custom “closed” property on the underlying raw reader class also helps. However, I did not think it would be wise to bypass the locking in the “bz2” module, so I didn’t bypass BZ2File.readline() in the patch. Timing results, and the test script I used to investigate the different options, are below:

                         lzma     gzip      bz2
                         =======  ========  ========
Unpatched                3.2 s    2.513 s   5.180 s
Custom __iter__()        1.31 s   1.317 s   2.433 s
__iter__() and closed    0.53 s*  0.543 s*  1.650 s
closed change only                          4.047 s*
External BufferedReader  0.64 s   0.597 s   1.750 s
Direct from BytesIO      0.33 s   0.370 s   1.280 s
Command-line tool        0.063 s  0.053 s   0.993 s

* Option implemented in decomp-optim.patch

---

import lzma, io
filename = "pacman.log.xz"  # 256206 lines; 389 kB -> 13 MB

# Basic case
reader = lzma.LZMAFile(filename)  # 3.2 s

# Add __iter__() optimization
def lzma_iter(self):
    self._check_can_read()
    return iter(self._buffer)
lzma.LZMAFile.__iter__ = lzma_iter  # 1.31 s

# Add “closed” optimization
def decompressor_closed(self):
    return self._decompressor is None
import _compression
_compression.DecompressReader.closed = property(decompressor_closed)  # 0.53 s

#~ # External BufferedReader baseline
#~ reader = io.BufferedReader(lzma.LZMAFile(filename))  # 0.64 s

#~ # Direct from BytesIO baseline
#~ with open(filename, "rb") as file:
    #~ data = file.read()
#~ reader = io.BytesIO(lzma.decompress(data))  # 0.33 s

for line in reader:
    pass
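
For anyone without a suitable log file, the following is a rough, self-contained variant of the comparison. It is only a sketch: the synthetic data, the LINES constant and the count_lines() helper are my own stand-ins, not part of the script above, and the timings it prints will not match the table.

```python
import io
import lzma
import time

# Assumption: synthetic log lines compressed in memory stand in for pacman.log.xz.
LINES = 20000
raw = b"".join(b"[2015-06-01 13:51] synthetic log line %d\n" % i
               for i in range(LINES))
compressed = lzma.compress(raw)

def count_lines(reader):
    """Iterate line by line; return (line count, elapsed seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in reader)
    return count, time.perf_counter() - start

# The same three read paths as in the script above, fed from memory.
readers = [
    ("LZMAFile", lzma.LZMAFile(io.BytesIO(compressed))),
    ("External BufferedReader",
     io.BufferedReader(lzma.LZMAFile(io.BytesIO(compressed)))),
    ("Direct from BytesIO", io.BytesIO(lzma.decompress(compressed))),
]

results = {}
for name, reader in readers:
    results[name] = count_lines(reader)
    print("%-24s %d lines in %.3f s" % ((name,) + results[name]))
```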