
Author rhpvorderman
Recipients rhpvorderman
Date 2020-08-17.08:19:06
Message-id <1597652346.68.0.240792775725.issue41566@roundup.psfhosted.org>
Content
The gzip file format is quite ubiquitous, and so is its first (?) free/libre implementation, zlib, together with the gzip command-line tool. Both use the DEFLATE algorithm.

Lately some faster algorithms (most notably zstd) have popped up which offer better speed and compression ratios than zlib. Unfortunately, switching over to zstd will not be seamless: it is not compatible with zlib/gzip in any way.

Luckily some developers have implemented DEFLATE in a faster way. Most notable are libdeflate (https://github.com/ebiggers/libdeflate) and Intel's storage acceleration library (https://github.com/intel/isa-l).

These libraries provide the libdeflate-gzip and igzip utilities, respectively. Both can compress and decompress the same gzip files: an igzip-compressed file can be read with gzip, and vice versa.
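That interoperability can be demonstrated from Python itself: any RFC 1952 compliant gzip stream, no matter which tool produced it, is readable by the stdlib gzip module. A minimal sketch using only the stdlib (the payload is made-up sample data):

```python
import gzip
import zlib

data = b"interoperability test payload\n" * 100

# wbits=31 tells zlib to emit a gzip (RFC 1952) container rather than a
# raw zlib stream -- the same container format igzip and libdeflate-gzip write.
compressor = zlib.compressobj(level=1, wbits=31)
gz_stream = compressor.compress(data) + compressor.flush()

# The stdlib gzip module reads it back, just as it would read output
# from gzip, igzip, or libdeflate-gzip.
assert gzip.decompress(gz_stream) == data
```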

To give an idea of the speed improvements that can be obtained, here are some benchmarks. All benchmarks were done using hyperfine (https://github.com/sharkdp/hyperfine). The system was a Ryzen 5 3600 with 2x16GB DDR4-3200 memory, running Debian 10. All benchmarks were performed on a tmpfs (which lives in memory) to prevent I/O bottlenecks. The test file was a 5-million-read FASTQ file of 1.6 GB (https://en.wikipedia.org/wiki/FASTQ_format). These types of files are common in bioinformatics at 100+ GB sizes, so they make a good real-world benchmark.

I benchmarked pigz on one thread as well, since it implements zlib in a faster way than gzip does. Zstd was benchmarked for comparison.

Versions: 
gzip 1.9 (provided by debian)
pigz 2.4 (provided by debian)
igzip 2.25.0 (provided by debian)
libdeflate-gzip 1.6 (compiled by conda-build with the recipe here: https://github.com/conda-forge/libdeflate-feedstock/pull/4)
zstd 1.3.8 (provided by debian)

By default, level 1 was chosen for all compression benchmarks. Times are averages over 10 runs.

COMPRESSION
program            time           size   memory
gzip               23.5 seconds   657M   1.5M
pigz (one thread)  22.2 seconds   658M   2.4M
libdeflate-gzip    10.1 seconds   623M   1.6G (reads entire file in memory)
igzip              4.6 seconds    620M   3.5M
zstd (to .zst)     6.1 seconds    584M   12.1M
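The level-1 setting used above maps directly onto the stdlib's zlib compression levels, which trade speed against size. A quick sketch of that trade-off, using made-up repetitive sample data standing in for a FASTQ file:

```python
import zlib

# Hypothetical sample payload mimicking FASTQ records (not the benchmark file).
data = b"@read1\nGATTACAGATTACA\n+\nIIIIIIIIIIIIII\n" * 50_000

fast = zlib.compress(data, level=1)   # speed-oriented, what gzip -1 uses
small = zlib.compress(data, level=9)  # ratio-oriented, what gzip -9 uses

print(len(data), len(fast), len(small))

# Whatever the level, both streams decompress back to the identical input,
# which is why level choice only affects speed and size, never correctness.
assert zlib.decompress(fast) == data
assert zlib.decompress(small) == data
```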

Decompression: all programs decompressed the file created using gzip -1 (even zstd, which can also decompress gzip).

DECOMPRESSION
program            time           memory
gzip               10.5 seconds   744K
pigz (one-thread)  6.7 seconds    1.2M
libdeflate-gzip    3.6 seconds    2.2G (reads in mem before writing)
igzip              3.3 seconds    3.6M
zstd (from .gz)    6.4 seconds    2.2M
zstd (from .zst)   2.3 seconds    3.1M

As the benchmarks above show, using Intel's Storage Acceleration Library may improve performance quite substantially, offering very fast compression and decompression. This puts igzip in the same ballpark as zstd in terms of speed while still offering backwards compatibility with gzip.

Intel's Storage Acceleration Library (isa-l) comes with a BSD 3-Clause license, so there should be no licensing issues when using that code inside CPython.