classification
Title: Add _BlocksOutputBuffer for bz2/lzma/zlib module
Type: performance Stage: patch review
Components: Library (Lib) Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: malin, vstinner
Priority: normal Keywords: patch

Created on 2020-08-05 09:55 by malin, last changed 2020-10-28 06:28 by malin.

Files
File name Uploaded Description Edit
0to2GB_step30MB.png malin, 2020-08-05 09:58 Decompress from 0 to 2GB, 30MB step
0to200MB_step2MB.png malin, 2020-08-05 09:58 Decompress from 0 to 200MB, 2MB step
0to20MB_step64KB.png malin, 2020-08-05 09:59 Decompress from 0 to 20MB, 64KB step
benchmark.py malin, 2020-08-05 09:59 benchmark special data (all data consists of b'a')
benchmark_real.py malin, 2020-08-05 09:59 benchmark real data
different_factors.png malin, 2020-10-28 06:28
Pull Requests
URL Status Linked Edit
PR 21740 open malin, 2020-08-05 10:02
Messages (4)
msg374867 - (view) Author: Ma Lin (malin) * Date: 2020-08-05 09:55
🔵  bz2/lzma module's current growth algorithm

The bz2/lzma modules' initial output buffer size is 8 KB [1][2], and they grow the output buffer with this algorithm [3][4]:

    newsize = size + (size >> 3) + 6

[1] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_bz2module.c#L109
[2] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_lzmamodule.c#L124
[3] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_lzmamodule.c#L133
[4] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_bz2module.c#L121

In many cases, the output buffer is resized too many times.
You can paste this code into a REPL to see the growth steps:

size = 8*1024
for i in range(1, 120):
    print('Step %d ' % i, format(size, ','), 'bytes')
    size = size + (size >> 3) + 6

Step 1  8,192 bytes
Step 2  9,222 bytes
Step 3  10,380 bytes
Step 4  11,683 bytes
Step 5  13,149 bytes
Step 6  14,798 bytes
...

🔵  zlib module's current growth algorithm

The zlib module's initial output buffer size is 16 KB [5]; at each growth the buffer size doubles [6].

[5] https://github.com/python/cpython/blob/v3.9.0b4/Modules/zlibmodule.c#L32
[6] https://github.com/python/cpython/blob/v3.9.0b4/Modules/zlibmodule.c#L174

This algorithm has a higher risk of running out of memory:

...
Step 14  256 MB
Step 15  512 MB
Step 16  1 GB
Step 17  2 GB
Step 18  4 GB
Step 19  8 GB
Step 20  16 GB
Step 21  32 GB
Step 22  64 GB
...
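The doubling steps quoted above can be reproduced with a loop analogous to the earlier bz2/lzma snippet (16 KB initial size per [5], doubling per [6]; sizes are printed after each resize):

```python
# Reproduce zlib's doubling growth: a 16 KB initial buffer, doubled on
# each resize; by step 22 the requested buffer size reaches 64 GB.
size = 16 * 1024
for step in range(1, 23):
    size *= 2
    print('Step %d ' % step, format(size, ','), 'bytes')
```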

🔵  Add _BlocksOutputBuffer for bz2/lzma/zlib module

The proposed PR uses a list of bytes objects to represent the output buffer.
This eliminates the overhead of resizing (bz2/lzma) and prevents an excessive memory footprint (zlib).
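As a rough illustration of the idea (the PR implements this in C; the class and parameter names below are made up for this sketch): output is collected in a list of independently allocated blocks, and joined once at the end, so no block is ever reallocated or copied while decompressing.

```python
# Illustrative pure-Python model of a blocks output buffer (names are
# hypothetical, not the PR's actual C API).  Completed blocks are never
# resized; only one concatenation happens at the very end.
class BlocksBuffer:
    def __init__(self, first_block=32 * 1024, max_block=32 * 1024 * 1024):
        self.blocks = []
        self.next_size = first_block
        self.max_block = max_block

    def new_block(self):
        # Allocate a fresh block for the decompressor to fill; earlier
        # blocks are left untouched, so their contents are never moved.
        block = bytearray(self.next_size)
        self.blocks.append(block)
        self.next_size = min(self.next_size * 2, self.max_block)
        return block

    def finish(self, used_in_last):
        # Trim the unused tail of the last block, then join once.
        del self.blocks[-1][used_in_last:]
        return b''.join(self.blocks)
```

A decompression loop would request a new block whenever the current one fills up, then call finish() with the number of bytes used in the final block.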

I only tested decompression, because the effect is more obvious there than for compression.

For the special-data benchmark (all data consists of b'a'), see the attached pictures; _BlocksOutputBuffer has linear performance:
(Benchmarked with the attached file benchmark.py)

    0to2GB_step30MB.png    (Decompress from 0 to 2GB, 30MB step)
    0to200MB_step2MB.png   (Decompress from 0 to 200MB, 2MB step)
    0to20MB_step64KB.png   (Decompress from 0 to 20MB, 64KB step)

After switching to _BlocksOutputBuffer, the bz2/lzma code is more concise, and the zlib code is basically translated statement by statement; IMO it's safe and easy to review.

🔵  Real data benchmark

For real data, resizing the output buffer accounts for a smaller share of the total time, so the improvement is not as large as in the pictures above:
(Benchmarked with the attached file benchmark_real.py)

----------------- bz2 -----------------

linux-2.6.39.4.tar.bz2
input size: 76,097,195, output size: 449,638,400
best of 5: [baseline_raw] 12.954 sec -> [patched_raw] 11.600 sec, 1.12x faster (-10%)

firefox-79.0.linux-i686.tar.bz2
input size: 74,109,706, output size: 228,055,040
best of 5: [baseline_raw] 8.511 sec -> [patched_raw] 7.829 sec, 1.09x faster (-8%)

ffmpeg-4.3.1.tar.bz2
input size: 11,301,038, output size: 74,567,680
best of 5: [baseline_raw] 1.915 sec -> [patched_raw] 1.671 sec, 1.15x faster (-13%)

gimp-2.10.20.tar.bz2
input size: 33,108,938, output size: 214,179,840
best of 5: [baseline_raw] 5.794 sec -> [patched_raw] 4.964 sec, 1.17x faster (-14%)

sde-external-8.56.0-2020-07-05-lin.tar.bz2
input size: 26,746,086, output size: 92,129,280
best of 5: [baseline_raw] 3.153 sec -> [patched_raw] 2.835 sec, 1.11x faster (-10%)

----------------- lzma -----------------

linux-5.7.10.tar.xz
input size: 112,722,840, output size: 966,062,080
best of 5: [baseline_raw] 9.813 sec -> [patched_raw] 7.434 sec, 1.32x faster (-24%)

linux-2.6.39.4.tar.xz
input size: 63,243,812, output size: 449,638,400
best of 5: [baseline_raw] 5.256 sec -> [patched_raw] 4.200 sec, 1.25x faster (-20%)

gcc-9.3.0.tar.xz
input size: 70,533,868, output size: 618,608,640
best of 5: [baseline_raw] 6.398 sec -> [patched_raw] 4.878 sec, 1.31x faster (-24%)

Python-3.8.5.tar.xz
input size: 18,019,640, output size: 87,531,520
best of 5: [baseline_raw] 1.315 sec -> [patched_raw] 1.098 sec, 1.20x faster (-16%)

firefox-79.0.source.tar.xz
input size: 333,220,776, output size: 2,240,573,440
best of 5: [baseline_raw] 25.339 sec -> [patched_raw] 19.661 sec, 1.29x faster (-22%)

----------------- zlib -----------------

linux-5.7.10.tar.gz
input size: 175,493,557, output size: 966,062,080
best of 5: [baseline_raw] 2.360 sec -> [patched_raw] 2.401 sec, 1.02x slower (+2%)

linux-2.6.39.4.tar.gz
input size: 96,011,459, output size: 449,638,400
best of 5: [baseline_raw] 1.215 sec -> [patched_raw] 1.216 sec, 1.00x slower (+0%)

gcc-9.3.0.tar.gz
input size: 124,140,228, output size: 618,608,640
best of 5: [baseline_raw] 1.668 sec -> [patched_raw] 1.555 sec, 1.07x faster (-7%)

Python-3.8.5.tgz
input size: 24,149,093, output size: 87,531,520
best of 5: [baseline_raw] 0.263 sec -> [patched_raw] 0.253 sec, 1.04x faster (-4%)

openjdk-14.0.2_linux-x64_bin.tar.gz
input size: 198,606,190, output size: 335,175,680
best of 5: [baseline_raw] 1.273 sec -> [patched_raw] 1.221 sec, 1.04x faster (-4%)

postgresql-10.12-1-linux-x64-binaries.tar.gz
input size: 159,090,037, output size: 408,678,400
best of 5: [baseline_raw] 1.415 sec -> [patched_raw] 1.401 sec, 1.01x faster (-1%)

-------------- benchmark end --------------

🔵  Add .readall() function to _compression.DecompressReader class

The last commit adds a .readall() method to _compression.DecompressReader; it optimizes .read(-1) for BZ2File/LZMAFile/GzipFile objects.

The following description is a bit involved:

    1. A BZ2File/LZMAFile/GzipFile object has a `self._buffer` attribute [7]:

        raw = _compression.DecompressReader(fp, BZ2Decompressor/LZMADecompressor/zlib.decompressobj)
        self._buffer = io.BufferedReader(raw)

    [7] https://github.com/python/cpython/blob/v3.9.0b4/Lib/lzma.py#L130-L132

    2. Calling `.read(-1)` (read all data) on a BZ2File/LZMAFile/GzipFile
    object dispatches to `self._buffer`'s .read(-1) method [8].

    [8] https://github.com/python/cpython/blob/v3.9.0b4/Lib/lzma.py#L200

    3. `self._buffer` is an io.BufferedReader object; it dispatches to the
    underlying stream's readall() method [9].

    [9] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_io/bufferedio.c#L1538-L1542

    4. The underlying stream is a _compression.DecompressReader object, which
    is a subclass of io.RawIOBase [10].

    [10] https://github.com/python/cpython/blob/v3.9.0b4/Lib/_compression.py#L33

    5. So io.RawIOBase's readall() method is called, but it's very inefficient:
    it only reads DEFAULT_BUFFER_SIZE (8 KB) of data at a time [11].
    
    [11] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_io/iobase.c#L968-L969
    
    6. So each time, the decompressor tries to decompress an 8 KB input chunk
    into an 8 KB output buffer [12]:
    
        data = self._decompressor.decompress(8KB_input_buffer, 8KB_output_buffer)

    [12] https://github.com/python/cpython/blob/v3.9.0b4/Lib/_compression.py#L103
    
    In most cases, the input buffer will not be fully decompressed, which
    brings unnecessary overhead.
    
    After adding the .readall() method to _compression.DecompressReader, the
    limit on the output buffer size is removed.
    
    This change requires _BlocksOutputBuffer; otherwise, in extreme cases (a
    small input buffer decompressed to 100 MB of data), it would be slower due
    to buffer-resizing overhead.
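In pure-Python terms, the fix amounts to decompressing each input chunk with no size cap on the output, instead of the fixed 8 KB output buffer that io.RawIOBase.readall() effectively imposes. A minimal sketch of that shape, using zlib directly (the real patch lives in _compression.DecompressReader):

```python
import zlib

def readall_sketch(compressed, chunk_size=8 * 1024):
    # Feed the input in chunks, but place no size limit on each
    # decompress() call, so every input chunk is fully consumed in
    # one step instead of being revisited 8 KB at a time.
    d = zlib.decompressobj()
    parts = []
    for i in range(0, len(compressed), chunk_size):
        parts.append(d.decompress(compressed[i:i + chunk_size]))
    parts.append(d.flush())
    return b''.join(parts)
```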

Benchmark with real data, calling .read(-1) on the XXXXFile object:
(Benchmarked with the attached file benchmark_real.py)

----------------- BZ2File -----------------

linux-2.6.39.4.tar.bz2
input size: 76,097,195, output size: 449,638,400
best of 5: [baseline_file] 12.371 sec -> [patched_file] 12.035 sec, 1.03x faster (-3%)

firefox-79.0.linux-i686.tar.bz2
input size: 74,109,706, output size: 228,055,040
best of 5: [baseline_file] 8.233 sec -> [patched_file] 8.124 sec, 1.01x faster (-1%)

ffmpeg-4.3.1.tar.bz2
input size: 11,301,038, output size: 74,567,680
best of 5: [baseline_file] 1.747 sec -> [patched_file] 1.724 sec, 1.01x faster (-1%)

gimp-2.10.20.tar.bz2
input size: 33,108,938, output size: 214,179,840
best of 5: [baseline_file] 5.220 sec -> [patched_file] 5.162 sec, 1.01x faster (-1%)

sde-external-8.56.0-2020-07-05-lin.tar.bz2
input size: 26,746,086, output size: 92,129,280
best of 5: [baseline_file] 2.935 sec -> [patched_file] 2.899 sec, 1.01x faster (-1%)

----------------- LZMAFile -----------------

linux-5.7.10.tar.xz
input size: 112,722,840, output size: 966,062,080
best of 5: [baseline_file] 7.712 sec -> [patched_file] 7.670 sec, 1.01x faster (-1%)

linux-2.6.39.4.tar.xz
input size: 63,243,812, output size: 449,638,400
best of 5: [baseline_file] 4.342 sec -> [patched_file] 4.274 sec, 1.02x faster (-2%)

gcc-9.3.0.tar.xz
input size: 70,533,868, output size: 618,608,640
best of 5: [baseline_file] 5.077 sec -> [patched_file] 4.974 sec, 1.02x faster (-2%)

Python-3.8.5.tar.xz
input size: 18,019,640, output size: 87,531,520
best of 5: [baseline_file] 1.093 sec -> [patched_file] 1.087 sec, 1.01x faster (-1%)

firefox-79.0.source.tar.xz
input size: 333,220,776, output size: 2,240,573,440
best of 5: [baseline_file] 20.748 sec -> [patched_file] 20.276 sec, 1.02x faster (-2%)

----------------- GzipFile -----------------

linux-5.7.10.tar.gz
input size: 175,493,567, output size: 966,062,080
best of 5: [baseline_file] 4.422 sec -> [patched_file] 3.803 sec, 1.16x faster (-14%)

linux-2.6.39.4.tar.gz
input size: 96,011,469, output size: 449,638,400
best of 5: [baseline_file] 2.119 sec -> [patched_file] 1.863 sec, 1.14x faster (-12%)

gcc-9.3.0.tar.gz
input size: 124,140,238, output size: 618,608,640
best of 5: [baseline_file] 2.825 sec -> [patched_file] 2.409 sec, 1.17x faster (-15%)

Python-3.8.5.tgz
input size: 24,149,103, output size: 87,531,520
best of 5: [baseline_file] 0.410 sec -> [patched_file] 0.349 sec, 1.18x faster (-15%)

openjdk-14.0.2_linux-x64_bin.tar.gz
input size: 198,606,200, output size: 335,175,680
best of 5: [baseline_file] 1.885 sec -> [patched_file] 1.702 sec, 1.11x faster (-10%)

postgresql-10.12-1-linux-x64-binaries.tar.gz
input size: 159,090,047, output size: 408,678,400
best of 5: [baseline_file] 2.236 sec -> [patched_file] 2.002 sec, 1.12x faster (-10%)

-------------- benchmark end --------------
msg379682 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-26 19:11
Ma Lin proposed this approach (PR 21740) for _PyBytesWriter/_PyUnicodeWriter on python-dev:
https://mail.python.org/archives/list/python-dev@python.org/message/UMB52BEZCX424K5K2ZNPWV7ZTQAGYL53/
msg379683 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-26 19:45
It would be interesting to see if using _PyBytesWriter in bz2, lzma, zlib and _io.FileIO.readall() would speed up the code. I would prefer to centralize the buffer allocation logic in _PyBytesWriter rather than having custom code in each file. _PyBytesWriter is designed to reduce the number of realloc() calls by using overallocation; it uses a different overallocation ratio on Windows, and it uses a small buffer allocated on the stack for strings up to 512 bytes.
msg379817 - (view) Author: Ma Lin (malin) * Date: 2020-10-28 06:28
I modified the lzma module to use different growth factors; see the attached picture different_factors.png.

1.5x should be the growth factor of _PyBytesWriter under Windows.

So if _PyBytesWriter were changed to use memory blocks, there might be no performance improvement.

Overallocation factor of _PyBytesWriter:

    #ifdef MS_WINDOWS
    #  define OVERALLOCATE_FACTOR 2
    #else
    #  define OVERALLOCATE_FACTOR 4
    #endif

(I'm using Windows 10)
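For a rough sense of what the factor means, here is a hypothetical helper (not from the patch; names are illustrative) counting how many resizes an 8 KB buffer needs to reach 1 GB under different growth factors:

```python
# Hypothetical helper: number of resizes needed to grow an 8 KB buffer
# to 1 GB for a given growth factor (illustrative only).
def resizes(factor, start=8 * 1024, target=1 << 30):
    size, steps = start, 0
    while size < target:
        size = int(size * factor)
        steps += 1
    return steps

for f in (1.125, 1.25, 1.5, 2.0):
    print('factor %.3f -> %d resizes' % (f, resizes(f)))
```

Here 1.125 roughly corresponds to the bz2/lzma formula above (size + (size >> 3)), and 2.0 to zlib's doubling.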
History
Date User Action Args
2020-10-28 06:28:16malinsetfiles: + different_factors.png

messages: + msg379817
2020-10-26 19:45:11vstinnersetmessages: + msg379683
2020-10-26 19:11:27vstinnersetnosy: + vstinner
messages: + msg379682
2020-08-05 10:02:14malinsetkeywords: + patch
stage: patch review
pull_requests: + pull_request20886
2020-08-05 09:59:54malinsetfiles: + benchmark_real.py
2020-08-05 09:59:34malinsetfiles: + benchmark.py
2020-08-05 09:59:05malinsetfiles: + 0to20MB_step64KB.png
2020-08-05 09:58:49malinsetfiles: + 0to200MB_step2MB.png
2020-08-05 09:58:27malinsetfiles: + 0to2GB_step30MB.png
2020-08-05 09:55:44malincreate