🔵 bz2/lzma module's current growth algorithm
The bz2/lzma modules' initial output buffer size is 8 KB [1][2], and they use this output buffer growth algorithm [3][4]:
newsize = size + (size >> 3) + 6
[1] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_bz2module.c#L109
[2] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_lzmamodule.c#L124
[3] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_lzmamodule.c#L133
[4] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_bz2module.c#L121
In many cases, the output buffer is resized too many times.
You can paste this code into a REPL to see the growth steps:
size = 8*1024
for i in range(1, 120):
    print('Step %d ' % i, format(size, ','), 'bytes')
    size = size + (size >> 3) + 6
Step 1 8,192 bytes
Step 2 9,222 bytes
Step 3 10,380 bytes
Step 4 11,683 bytes
Step 5 13,149 bytes
Step 6 14,798 bytes
...
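A quick back-of-the-envelope check with the same formula shows roughly how many resize operations a large output costs:

```python
# Count how many resize operations the bz2/lzma growth algorithm
# needs before the output buffer reaches 1 GB, starting from 8 KB.
size = 8 * 1024
steps = 0
while size < 1024**3:
    size = size + (size >> 3) + 6
    steps += 1
print(steps)  # roughly 100 resize operations
```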
🔵 zlib module's current growth algorithm
The zlib module's initial output buffer size is 16 KB [5]; on each growth the buffer size doubles [6].
[5] https://github.com/python/cpython/blob/v3.9.0b4/Modules/zlibmodule.c#L32
[6] https://github.com/python/cpython/blob/v3.9.0b4/Modules/zlibmodule.c#L174
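The doubling steps can be reproduced with a snippet analogous to the one above:

```python
# Reproduce zlib's output buffer growth: 16 KB initial size,
# doubling on every resize.
size = 16 * 1024
for i in range(1, 23):
    size *= 2
    print('Step %d ' % i, format(size, ','), 'bytes')
```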
This algorithm has a higher risk of running out of memory:
...
Step 14 256 MB
Step 15 512 MB
Step 16 1 GB
Step 17 2 GB
Step 18 4 GB
Step 19 8 GB
Step 20 16 GB
Step 21 32 GB
Step 22 64 GB
...
🔵 Add _BlocksOutputBuffer for bz2/lzma/zlib module
The proposed PR uses a list of bytes objects to represent the output buffer.
This eliminates the resizing overhead (bz2/lzma) and prevents an excessive memory footprint (zlib).
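The idea can be sketched in pure Python (the function name and block-size policy here are illustrative only; the actual implementation manages the block list at the C level):

```python
import bz2

def decompress_with_blocks(data, block_size=8 * 1024):
    # Decompress a *complete* bz2 stream into a list of blocks, each
    # produced by one capped decompress() call; no block is ever
    # resized, and the blocks are concatenated exactly once at the end.
    decomp = bz2.BZ2Decompressor()
    blocks = []  # list of bytes objects
    while not decomp.eof:
        blocks.append(decomp.decompress(data, max_length=block_size))
        data = b''  # remaining input stays buffered inside the decompressor
    return b''.join(blocks)

payload = b'a' * 1_000_000
assert decompress_with_blocks(bz2.compress(payload)) == payload
```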
I only tested decompression, because the effect is more obvious than for compression.
For a synthetic-data benchmark (all data consists of b'a'), see the attached pictures; _BlocksOutputBuffer shows linear performance:
(Benchmark by attached file benchmark.py)
0to2GB_step30MB.png (Decompress from 0 to 2GB, 30MB step)
0to200MB_step2MB.png (Decompress from 0 to 200MB, 2MB step)
0to20MB_step64KB.png (Decompress from 0 to 20MB, 64KB step)
After switching to _BlocksOutputBuffer, the bz2/lzma code is more concise; the zlib code is basically translated statement by statement, so IMO it is safe and easy to review.
🔵 Real data benchmark
For real data, resizing the output buffer accounts for a smaller share of the total time, so the improvement is not as large as in the pictures above:
(Benchmark by attached file benchmark_real.py)
----------------- bz2 -----------------
linux-2.6.39.4.tar.bz2
input size: 76,097,195, output size: 449,638,400
best of 5: [baseline_raw] 12.954 sec -> [patched_raw] 11.600 sec, 1.12x faster (-10%)
firefox-79.0.linux-i686.tar.bz2
input size: 74,109,706, output size: 228,055,040
best of 5: [baseline_raw] 8.511 sec -> [patched_raw] 7.829 sec, 1.09x faster (-8%)
ffmpeg-4.3.1.tar.bz2
input size: 11,301,038, output size: 74,567,680
best of 5: [baseline_raw] 1.915 sec -> [patched_raw] 1.671 sec, 1.15x faster (-13%)
gimp-2.10.20.tar.bz2
input size: 33,108,938, output size: 214,179,840
best of 5: [baseline_raw] 5.794 sec -> [patched_raw] 4.964 sec, 1.17x faster (-14%)
sde-external-8.56.0-2020-07-05-lin.tar.bz2
input size: 26,746,086, output size: 92,129,280
best of 5: [baseline_raw] 3.153 sec -> [patched_raw] 2.835 sec, 1.11x faster (-10%)
----------------- lzma -----------------
linux-5.7.10.tar.xz
input size: 112,722,840, output size: 966,062,080
best of 5: [baseline_raw] 9.813 sec -> [patched_raw] 7.434 sec, 1.32x faster (-24%)
linux-2.6.39.4.tar.xz
input size: 63,243,812, output size: 449,638,400
best of 5: [baseline_raw] 5.256 sec -> [patched_raw] 4.200 sec, 1.25x faster (-20%)
gcc-9.3.0.tar.xz
input size: 70,533,868, output size: 618,608,640
best of 5: [baseline_raw] 6.398 sec -> [patched_raw] 4.878 sec, 1.31x faster (-24%)
Python-3.8.5.tar.xz
input size: 18,019,640, output size: 87,531,520
best of 5: [baseline_raw] 1.315 sec -> [patched_raw] 1.098 sec, 1.20x faster (-16%)
firefox-79.0.source.tar.xz
input size: 333,220,776, output size: 2,240,573,440
best of 5: [baseline_raw] 25.339 sec -> [patched_raw] 19.661 sec, 1.29x faster (-22%)
----------------- zlib -----------------
linux-5.7.10.tar.gz
input size: 175,493,557, output size: 966,062,080
best of 5: [baseline_raw] 2.360 sec -> [patched_raw] 2.401 sec, 1.02x slower (+2%)
linux-2.6.39.4.tar.gz
input size: 96,011,459, output size: 449,638,400
best of 5: [baseline_raw] 1.215 sec -> [patched_raw] 1.216 sec, 1.00x slower (+0%)
gcc-9.3.0.tar.gz
input size: 124,140,228, output size: 618,608,640
best of 5: [baseline_raw] 1.668 sec -> [patched_raw] 1.555 sec, 1.07x faster (-7%)
Python-3.8.5.tgz
input size: 24,149,093, output size: 87,531,520
best of 5: [baseline_raw] 0.263 sec -> [patched_raw] 0.253 sec, 1.04x faster (-4%)
openjdk-14.0.2_linux-x64_bin.tar.gz
input size: 198,606,190, output size: 335,175,680
best of 5: [baseline_raw] 1.273 sec -> [patched_raw] 1.221 sec, 1.04x faster (-4%)
postgresql-10.12-1-linux-x64-binaries.tar.gz
input size: 159,090,037, output size: 408,678,400
best of 5: [baseline_raw] 1.415 sec -> [patched_raw] 1.401 sec, 1.01x faster (-1%)
-------------- benchmark end --------------
🔵 Add .readall() function to _compression.DecompressReader class
The last commit adds a .readall() method to _compression.DecompressReader; it optimizes .read(-1) for BZ2File/LZMAFile/GzipFile objects.
The call chain is a bit complicated:
1. A BZ2File/LZMAFile/GzipFile object has a `self._buffer` attribute [7]:
raw = _compression.DecompressReader(fp, BZ2Decompressor/LZMADecompressor/zlib.decompressobj)
self._buffer = io.BufferedReader(raw)
[7] https://github.com/python/cpython/blob/v3.9.0b4/Lib/lzma.py#L130-L132
2. Calling `.read(-1)` (read all data) on a BZ2File/LZMAFile/GzipFile
object dispatches to `self._buffer`'s .read(-1) method [8].
[8] https://github.com/python/cpython/blob/v3.9.0b4/Lib/lzma.py#L200
3. `self._buffer` is an io.BufferedReader object; it dispatches to the
underlying stream's readall() method [9].
[9] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_io/bufferedio.c#L1538-L1542
4. The underlying stream is a _compression.DecompressReader object, which is a
subclass of io.RawIOBase [10].
[10] https://github.com/python/cpython/blob/v3.9.0b4/Lib/_compression.py#L33
5. io.RawIOBase's readall() method is then called, but it is very inefficient:
it reads only DEFAULT_BUFFER_SIZE (8 KB) of data at a time [11].
[11] https://github.com/python/cpython/blob/v3.9.0b4/Modules/_io/iobase.c#L968-L969
6. So each time, the decompressor tries to decompress 8 KB of input data into
an 8 KB output buffer [12]:
data = self._decompressor.decompress(input_8KB, max_length=8192)
[12] https://github.com/python/cpython/blob/v3.9.0b4/Lib/_compression.py#L103
In most cases, the input buffer is not fully decompressed, which brings
unnecessary overhead.
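The capped call can be illustrated directly (the 8 KB sizes are the assumed defaults):

```python
import bz2

# Highly compressible input: the whole compressed stream is tiny,
# yet the 8 KB output cap forces many decompress() calls.
compressed = bz2.compress(b'a' * 1_000_000)
d = bz2.BZ2Decompressor()
out = d.decompress(compressed, max_length=8 * 1024)
assert len(out) == 8 * 1024  # output capped at 8 KB
assert not d.needs_input     # undecompressed data remains buffered
assert not d.eof             # the stream is far from finished
```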
After adding a .readall() method to _compression.DecompressReader, the
limit on the output buffer size is removed.
This change requires _BlocksOutputBuffer; otherwise, in extreme cases (a small
input buffer decompressed to 100 MB of data), it would be slower due to the
overhead of buffer resizing.
Benchmark with real data, calling .read(-1) on the XXXXFile objects:
(Benchmark by attached file benchmark_real.py)
----------------- BZ2File -----------------
linux-2.6.39.4.tar.bz2
input size: 76,097,195, output size: 449,638,400
best of 5: [baseline_file] 12.371 sec -> [patched_file] 12.035 sec, 1.03x faster (-3%)
firefox-79.0.linux-i686.tar.bz2
input size: 74,109,706, output size: 228,055,040
best of 5: [baseline_file] 8.233 sec -> [patched_file] 8.124 sec, 1.01x faster (-1%)
ffmpeg-4.3.1.tar.bz2
input size: 11,301,038, output size: 74,567,680
best of 5: [baseline_file] 1.747 sec -> [patched_file] 1.724 sec, 1.01x faster (-1%)
gimp-2.10.20.tar.bz2
input size: 33,108,938, output size: 214,179,840
best of 5: [baseline_file] 5.220 sec -> [patched_file] 5.162 sec, 1.01x faster (-1%)
sde-external-8.56.0-2020-07-05-lin.tar.bz2
input size: 26,746,086, output size: 92,129,280
best of 5: [baseline_file] 2.935 sec -> [patched_file] 2.899 sec, 1.01x faster (-1%)
----------------- LZMAFile -----------------
linux-5.7.10.tar.xz
input size: 112,722,840, output size: 966,062,080
best of 5: [baseline_file] 7.712 sec -> [patched_file] 7.670 sec, 1.01x faster (-1%)
linux-2.6.39.4.tar.xz
input size: 63,243,812, output size: 449,638,400
best of 5: [baseline_file] 4.342 sec -> [patched_file] 4.274 sec, 1.02x faster (-2%)
gcc-9.3.0.tar.xz
input size: 70,533,868, output size: 618,608,640
best of 5: [baseline_file] 5.077 sec -> [patched_file] 4.974 sec, 1.02x faster (-2%)
Python-3.8.5.tar.xz
input size: 18,019,640, output size: 87,531,520
best of 5: [baseline_file] 1.093 sec -> [patched_file] 1.087 sec, 1.01x faster (-1%)
firefox-79.0.source.tar.xz
input size: 333,220,776, output size: 2,240,573,440
best of 5: [baseline_file] 20.748 sec -> [patched_file] 20.276 sec, 1.02x faster (-2%)
----------------- GzipFile -----------------
linux-5.7.10.tar.gz
input size: 175,493,567, output size: 966,062,080
best of 5: [baseline_file] 4.422 sec -> [patched_file] 3.803 sec, 1.16x faster (-14%)
linux-2.6.39.4.tar.gz
input size: 96,011,469, output size: 449,638,400
best of 5: [baseline_file] 2.119 sec -> [patched_file] 1.863 sec, 1.14x faster (-12%)
gcc-9.3.0.tar.gz
input size: 124,140,238, output size: 618,608,640
best of 5: [baseline_file] 2.825 sec -> [patched_file] 2.409 sec, 1.17x faster (-15%)
Python-3.8.5.tgz
input size: 24,149,103, output size: 87,531,520
best of 5: [baseline_file] 0.410 sec -> [patched_file] 0.349 sec, 1.18x faster (-15%)
openjdk-14.0.2_linux-x64_bin.tar.gz
input size: 198,606,200, output size: 335,175,680
best of 5: [baseline_file] 1.885 sec -> [patched_file] 1.702 sec, 1.11x faster (-10%)
postgresql-10.12-1-linux-x64-binaries.tar.gz
input size: 159,090,047, output size: 408,678,400
best of 5: [baseline_file] 2.236 sec -> [patched_file] 2.002 sec, 1.12x faster (-10%)
-------------- benchmark end --------------