🔵 bz2/lzma module's current growth algorithm
The bz2/lzma modules' initial output buffer size is 8 KB [1][2], and they use this output buffer growth algorithm [3][4]:
newsize = size + (size >> 3) + 6
[1] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_bz2module.c#L109
[2] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_lzmamodule.c#L124
[3] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_lzmamodule.c#L133
[4] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_bz2module.c#L121
In many cases, the output buffer is resized too many times.
You can paste this code into a REPL to see the growth steps:
size = 8*1024
for i in range(1, 120):
    print('Step %d ' % i, format(size, ','), 'bytes')
    size = size + (size >> 3) + 6
Step 1 8,192 bytes
Step 2 9,222 bytes
Step 3 10,380 bytes
Step 4 11,683 bytes
Step 5 13,149 bytes
Step 6 14,798 bytes
...
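A quick back-of-the-envelope check with the same formula shows roughly how many resize operations a large output costs:

```python
# Count how many resize operations the bz2/lzma growth algorithm
# needs before the output buffer reaches 1 GB, starting from 8 KB.
size = 8 * 1024
steps = 0
while size < 1024**3:
    size = size + (size >> 3) + 6
    steps += 1
print(steps)  # roughly 100 resize operations
```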
🔵 zlib module's current growth algorithm
The zlib module's initial output buffer size is 16 KB [5]; on each growth the buffer size doubles [6].
[5] https://github.com/python/cpython/blob/v3.9.0b4/Modules/zlibmodule.c#L32
[6] https://github.com/python/cpython/blob/v3.9.0b4/Modules/zlibmodule.c#L174
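The doubling steps can be reproduced with a snippet analogous to the one above:

```python
# Reproduce zlib's output buffer growth: 16 KB initial size,
# doubling on every resize.
size = 16 * 1024
for i in range(1, 23):
    size *= 2
    print('Step %d ' % i, format(size, ','), 'bytes')
```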
This algorithm has a higher risk of running out of memory:
...
Step 14 256 MB
Step 15 512 MB
Step 16 1 GB
Step 17 2 GB
Step 18 4 GB
Step 19 8 GB
Step 20 16 GB
Step 21 32 GB
Step 22 64 GB
...
🔵 Add _BlocksOutputBuffer for bz2/lzma/zlib module
The proposed PR uses a list of bytes objects to represent the output buffer.
This eliminates the resizing overhead (bz2/lzma) and prevents an excessive memory footprint (zlib).
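The idea can be sketched in pure Python (the function name and block-size policy here are illustrative only; the actual implementation manages the block list at the C level):

```python
import bz2

def decompress_with_blocks(data, block_size=8 * 1024):
    # Decompress a *complete* bz2 stream into a list of blocks, each
    # produced by one capped decompress() call; no block is ever
    # resized, and the blocks are concatenated exactly once at the end.
    decomp = bz2.BZ2Decompressor()
    blocks = []  # list of bytes objects
    while not decomp.eof:
        blocks.append(decomp.decompress(data, max_length=block_size))
        data = b''  # remaining input stays buffered inside the decompressor
    return b''.join(blocks)

payload = b'a' * 1_000_000
assert decompress_with_blocks(bz2.compress(payload)) == payload
```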
I only tested decompression, because the effect is more obvious than for compression.
For a synthetic-data benchmark (all data consists of b'a'), see the attached pictures; _BlocksOutputBuffer shows linear performance:
(Benchmark by attached file benchmark.py)
0to2GB_step30MB.png (Decompress from 0 to 2GB, 30MB step)
0to200MB_step2MB.png (Decompress from 0 to 200MB, 2MB step)
0to20MB_step64KB.png (Decompress from 0 to 20MB, 64KB step)
After switching to _BlocksOutputBuffer, the bz2/lzma code is more concise; the zlib code is basically translated statement by statement, so IMO it is safe and easy to review.
🔵 Real data benchmark
For real data, resizing the output buffer accounts for a smaller share of the total time, so the improvement is not as large as in the pictures above:
(Benchmark by attached file benchmark_real.py)
----------------- bz2 -----------------
linux-2.6.39.4.tar.bz2
input size: 76,097,195, output size: 449,638,400
best of 5: [baseline_raw] 12.954 sec -> [patched_raw] 11.600 sec, 1.12x faster (-10%)
firefox-79.0.linux-i686.tar.bz2
input size: 74,109,706, output size: 228,055,040
best of 5: [baseline_raw] 8.511 sec -> [patched_raw] 7.829 sec, 1.09x faster (-8%)
ffmpeg-4.3.1.tar.bz2
input size: 11,301,038, output size: 74,567,680
best of 5: [baseline_raw] 1.915 sec -> [patched_raw] 1.671 sec, 1.15x faster (-13%)
gimp-2.10.20.tar.bz2
input size: 33,108,938, output size: 214,179,840
best of 5: [baseline_raw] 5.794 sec -> [patched_raw] 4.964 sec, 1.17x faster (-14%)
sde-external-8.56.0-2020-07-05-lin.tar.bz2
input size: 26,746,086, output size: 92,129,280
best of 5: [baseline_raw] 3.153 sec -> [patched_raw] 2.835 sec, 1.11x faster (-10%)
----------------- lzma -----------------
linux-5.7.10.tar.xz
input size: 112,722,840, output size: 966,062,080
best of 5: [baseline_raw] 9.813 sec -> [patched_raw] 7.434 sec, 1.32x faster (-24%)
linux-2.6.39.4.tar.xz
input size: 63,243,812, output size: 449,638,400
best of 5: [baseline_raw] 5.256 sec -> [patched_raw] 4.200 sec, 1.25x faster (-20%)
gcc-9.3.0.tar.xz
input size: 70,533,868, output size: 618,608,640
best of 5: [baseline_raw] 6.398 sec -> [patched_raw] 4.878 sec, 1.31x faster (-24%)
Python-3.8.5.tar.xz
input size: 18,019,640, output size: 87,531,520
best of 5: [baseline_raw] 1.315 sec -> [patched_raw] 1.098 sec, 1.20x faster (-16%)
firefox-79.0.source.tar.xz
input size: 333,220,776, output size: 2,240,573,440
best of 5: [baseline_raw] 25.339 sec -> [patched_raw] 19.661 sec, 1.29x faster (-22%)
----------------- zlib -----------------
linux-5.7.10.tar.gz
input size: 175,493,557, output size: 966,062,080
best of 5: [baseline_raw] 2.360 sec -> [patched_raw] 2.401 sec, 1.02x slower (+2%)
linux-2.6.39.4.tar.gz
input size: 96,011,459, output size: 449,638,400
best of 5: [baseline_raw] 1.215 sec -> [patched_raw] 1.216 sec, 1.00x slower (+0%)
gcc-9.3.0.tar.gz
input size: 124,140,228, output size: 618,608,640
best of 5: [baseline_raw] 1.668 sec -> [patched_raw] 1.555 sec, 1.07x faster (-7%)
Python-3.8.5.tgz
input size: 24,149,093, output size: 87,531,520
best of 5: [baseline_raw] 0.263 sec -> [patched_raw] 0.253 sec, 1.04x faster (-4%)
openjdk-14.0.2_linux-x64_bin.tar.gz
input size: 198,606,190, output size: 335,175,680
best of 5: [baseline_raw] 1.273 sec -> [patched_raw] 1.221 sec, 1.04x faster (-4%)
postgresql-10.12-1-linux-x64-binaries.tar.gz
input size: 159,090,037, output size: 408,678,400
best of 5: [baseline_raw] 1.415 sec -> [patched_raw] 1.401 sec, 1.01x faster (-1%)
-------------- benchmark end --------------
🔵 Add .readall() function to _compression.DecompressReader class
The last commit adds a .readall() method to _compression.DecompressReader; it optimizes .read(-1) for BZ2File/LZMAFile/GzipFile objects.
The call chain is a bit complicated:
1. A BZ2File/LZMAFile/GzipFile object has a `self._buffer` attribute [7]:
raw = _compression.DecompressReader(fp, BZ2Decompressor/LZMADecompressor/zlib.decompressobj)
self._buffer = io.BufferedReader(raw)
[7] https://github.com/python/cpython/blob/v3.9.0b4/Lib/lzma.py#L130-L132
2. Calling `.read(-1)` (read all data) on a BZ2File/LZMAFile/GzipFile
object dispatches to `self._buffer`'s .read(-1) method [8].
[8] https://github.com/python/cpython/blob/v3.9.0b4/Lib/lzma.py#L200
3. `self._buffer` is an io.BufferedReader object; it dispatches to the
underlying stream's readall() method [9].
[9] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_io/bufferedio.c#L1538-L1542
4. The underlying stream is a _compression.DecompressReader object, which is a
subclass of io.RawIOBase [10].
[10] https://github.com/python/cpython/blob/v3.9.0b4/Lib/_compression.py#L33
5. io.RawIOBase's readall() method is then called, but it is very inefficient:
it reads only DEFAULT_BUFFER_SIZE (8 KB) of data at a time [11].
[11] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_io/iobase.c#L968-L969
6. So each time, the decompressor tries to decompress 8 KB of input data into
an 8 KB output buffer [12]:
data = self._decompressor.decompress(input_8KB, max_length=8192)
[12] https://github.com/python/cpython/blob/v3.9.0b4/Lib/_compression.py#L103
In most cases, the input buffer is not fully decompressed, which brings
unnecessary overhead.
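The capped call can be illustrated directly (the 8 KB sizes are the assumed defaults):

```python
import bz2

# Highly compressible input: the whole compressed stream is tiny,
# yet the 8 KB output cap forces many decompress() calls.
compressed = bz2.compress(b'a' * 1_000_000)
d = bz2.BZ2Decompressor()
out = d.decompress(compressed, max_length=8 * 1024)
assert len(out) == 8 * 1024  # output capped at 8 KB
assert not d.needs_input     # undecompressed data remains buffered
assert not d.eof             # the stream is far from finished
```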
After adding a .readall() method to _compression.DecompressReader, the
limit on the output buffer size is removed.
This change requires _BlocksOutputBuffer; otherwise, in extreme cases (a small
input buffer decompressed to 100 MB of data), it would be slower due to the
overhead of buffer resizing.
Benchmark with real data, calling .read(-1) on the XXXXFile objects:
(Benchmark by attached file benchmark_real.py)
----------------- BZ2File -----------------
linux-2.6.39.4.tar.bz2
input size: 76,097,195, output size: 449,638,400
best of 5: [baseline_file] 12.371 sec -> [patched_file] 12.035 sec, 1.03x faster (-3%)
firefox-79.0.linux-i686.tar.bz2
input size: 74,109,706, output size: 228,055,040
best of 5: [baseline_file] 8.233 sec -> [patched_file] 8.124 sec, 1.01x faster (-1%)
ffmpeg-4.3.1.tar.bz2
input size: 11,301,038, output size: 74,567,680
best of 5: [baseline_file] 1.747 sec -> [patched_file] 1.724 sec, 1.01x faster (-1%)
gimp-2.10.20.tar.bz2
input size: 33,108,938, output size: 214,179,840
best of 5: [baseline_file] 5.220 sec -> [patched_file] 5.162 sec, 1.01x faster (-1%)
sde-external-8.56.0-2020-07-05-lin.tar.bz2
input size: 26,746,086, output size: 92,129,280
best of 5: [baseline_file] 2.935 sec -> [patched_file] 2.899 sec, 1.01x faster (-1%)
----------------- LZMAFile -----------------
linux-5.7.10.tar.xz
input size: 112,722,840, output size: 966,062,080
best of 5: [baseline_file] 7.712 sec -> [patched_file] 7.670 sec, 1.01x faster (-1%)
linux-2.6.39.4.tar.xz
input size: 63,243,812, output size: 449,638,400
best of 5: [baseline_file] 4.342 sec -> [patched_file] 4.274 sec, 1.02x faster (-2%)
gcc-9.3.0.tar.xz
input size: 70,533,868, output size: 618,608,640
best of 5: [baseline_file] 5.077 sec -> [patched_file] 4.974 sec, 1.02x faster (-2%)
Python-3.8.5.tar.xz
input size: 18,019,640, output size: 87,531,520
best of 5: [baseline_file] 1.093 sec -> [patched_file] 1.087 sec, 1.01x faster (-1%)
firefox-79.0.source.tar.xz
input size: 333,220,776, output size: 2,240,573,440
best of 5: [baseline_file] 20.748 sec -> [patched_file] 20.276 sec, 1.02x faster (-2%)
----------------- GzipFile -----------------
linux-5.7.10.tar.gz
input size: 175,493,567, output size: 966,062,080
best of 5: [baseline_file] 4.422 sec -> [patched_file] 3.803 sec, 1.16x faster (-14%)
linux-2.6.39.4.tar.gz
input size: 96,011,469, output size: 449,638,400
best of 5: [baseline_file] 2.119 sec -> [patched_file] 1.863 sec, 1.14x faster (-12%)
gcc-9.3.0.tar.gz
input size: 124,140,238, output size: 618,608,640
best of 5: [baseline_file] 2.825 sec -> [patched_file] 2.409 sec, 1.17x faster (-15%)
Python-3.8.5.tgz
input size: 24,149,103, output size: 87,531,520
best of 5: [baseline_file] 0.410 sec -> [patched_file] 0.349 sec, 1.18x faster (-15%)
openjdk-14.0.2_linux-x64_bin.tar.gz
input size: 198,606,200, output size: 335,175,680
best of 5: [baseline_file] 1.885 sec -> [patched_file] 1.702 sec, 1.11x faster (-10%)
postgresql-10.12-1-linux-x64-binaries.tar.gz
input size: 159,090,047, output size: 408,678,400
best of 5: [baseline_file] 2.236 sec -> [patched_file] 2.002 sec, 1.12x faster (-10%)
-------------- benchmark end --------------