This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.10
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: gregory.p.smith, jack1142, jfrances21, rhpvorderman, scoder
Priority: normal Keywords: patch

Created on 2020-08-17 08:19 by rhpvorderman, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Repositories containing patches
10
Messages (15)
msg375533 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-08-17 08:19
The gzip file format is quite ubiquitous, and so is its first (?) free/libre implementation, zlib, with the gzip command line tool. Both use the DEFLATE algorithm.

Lately some faster algorithms (most notably zstd) have popped up which offer better speed and compression ratios than zlib. Unfortunately, switching over to zstd will not be seamless: it is not compatible with zlib/gzip in any way.

Luckily some developers have tried to implement DEFLATE in a faster way. Most notably libdeflate (https://github.com/ebiggers/libdeflate) and Intel's storage acceleration library (https://github.com/intel/isa-l).

These libraries provide the libdeflate-gzip and igzip utilities respectively. These can compress and decompress the same gzip files. An igzip compressed file can be read with gzip and vice versa.
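As a stdlib-only illustration of why these tools interoperate: zlib with wbits=16+15 emits a standard gzip member, and any gzip-compatible reader (gzip, igzip, libdeflate-gzip) can decode it. Here the stdlib plays both roles:

```python
import gzip
import zlib

data = b"interoperability test " * 100

# wbits=16+15 tells zlib to wrap the DEFLATE stream in a gzip container;
# the resulting bytes are an ordinary .gz member.
compressor = zlib.compressobj(level=1, wbits=16 + zlib.MAX_WBITS)
gz_bytes = compressor.compress(data) + compressor.flush()

# The stdlib gzip module reads it back, regardless of which encoder made it.
assert gzip.decompress(gz_bytes) == data
```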

To give an idea of the speed improvements that can be obtained, here are some benchmarks. All benchmarks were done using hyperfine (https://github.com/sharkdp/hyperfine). The system was a Ryzen 5 3600 with 2x16GB DDR4-3200 memory, running Debian 10. All benchmarks were performed on a tmpfs, which lives in memory, to prevent IO bottlenecks. The test file was a 5 million read FASTQ file of 1.6 GB (https://en.wikipedia.org/wiki/FASTQ_format). These types of files are common in bioinformatics at 100+ GB sizes, so they are a good real-world benchmark.

I benchmarked pigz on one thread as well, as it implements zlib but in a faster way than gzip. Zstd was benchmarked as a comparison.

Versions: 
gzip 1.9 (provided by debian)
pigz 2.4 (provided by debian)
igzip 2.25.0 (provided by debian)
libdeflate-gzip 1.6 (compiled by conda-build with the recipe here: https://github.com/conda-forge/libdeflate-feedstock/pull/4)
zstd 1.3.8 (provided by debian)

By default level 1 is chosen for all compression benchmarks. Time is average over 10 runs.

COMPRESSION
program            time           size   memory
gzip               23.5 seconds   657M   1.5M
pigz (one thread)  22.2 seconds   658M   2.4M
libdeflate-gzip    10.1 seconds   623M   1.6G (reads entire file in memory)
igzip              4.6 seconds    620M   3.5M
zstd (to .zst)     6.1 seconds    584M   12.1M
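For a rough sense of how such timings are taken, here is a stdlib-only micro-benchmark sketch; the synthetic input and the resulting numbers are illustrative only, not the FASTQ benchmark above:

```python
import time
import zlib

# Synthetic, compressible input standing in for the FASTQ test file.
data = b"GATTACA" * 200_000

start = time.perf_counter()
compressed = zlib.compress(data, level=1)
elapsed = time.perf_counter() - start

print(f"level 1: {elapsed:.3f}s, {len(data)} -> {len(compressed)} bytes")
```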

Decompression: all programs decompressed the file created using gzip -1 (even zstd, which can also decompress gzip).

DECOMPRESSION
program            time           memory
gzip               10.5 seconds   744K
pigz (one-thread)  6.7 seconds    1.2M
libdeflate-gzip    3.6 seconds    2.2G (reads in mem before writing)
igzip              3.3 seconds    3.6M
zstd (from .gz)    6.4 seconds    2.2M
zstd (from .zst)   2.3 seconds    3.1M

As shown by the benchmarks above, using Intel's Storage Acceleration Libraries may improve performance quite substantially, offering very fast compression and decompression. This puts igzip in the zstd ballpark in terms of speed while still offering backwards compatibility with gzip.

Intel's Storage Acceleration Libraries (isa-l) come with a BSD 3-Clause license, so there should be no licensing issues with using that code inside CPython.
msg375536 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-08-17 08:57
This has to be in a PEP. I am sorry, I misplaced it on the bug tracker.
msg375545 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2020-08-17 14:02
> This has to be in a PEP

No, the bug tracker seems fine for this.
msg375546 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2020-08-17 14:05
What about building the library? The readme says it needs nasm? That's not a standard dependency for the CPython build currently, which relies solely on a C compiler.
msg375547 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-08-17 14:17
nasm or yasm will work. I only have experience building it with nasm.

But yes, that is indeed a dependency. Personally I do not see a problem with adding nasm as a build dependency, as it opens up possibilities for even more performance optimizations in Python by moving parts of the code to assembly. But I can imagine that there might be complications with updating the build system.

Libdeflate does not have this problem, as it is entirely in C, so it could be interesting to use that. I think the unfortunate memory use is due to how libdeflate-gzip is coded, and not necessarily because of the library. Samtools is a tool that uses the libdeflate library to great effect, and its memory usage is fine.

> No, the bug tracker seems fine for this.

Well, one thing is that libdeflate and isa-l use different compression ratios for their levels. isa-l does not support anything above level 3; libdeflate supports 1-12. So it is not a one-to-one mapping.
msg375549 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2020-08-17 14:29
> libdeflate and isa-l use different compression ratio's for the levels.

I don't see why these would need to translate 1:1. The zlib module is a Python API wrapper; it can do its own mapping here, or use the libraries selectively only for some compression levels. Python code also cannot rely on an exact bit pattern coming out of the zlib/gzip compressor, since that might change with the zlib version that is available. So I think we're fine with replacing the underlying implementation, as long as the API does not change and the output is strictly zlib/gzip compatible (and there are no visible performance/size regressions, as your numbers seem to suggest, but that would need some broader testing).
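A sketch of the kind of level mapping such a wrapper could do. The bucketing below is invented purely for illustration (isa-l exposes levels 0-3, zlib 0-9); the real mapping would be a policy choice for the wrapper:

```python
def map_zlib_level_to_isal(level: int) -> int:
    """Map a zlib-style level (0-9) onto isa-l's 0-3 range.

    The exact bucketing here is illustrative, not taken from any
    existing implementation.
    """
    if not 0 <= level <= 9:
        raise ValueError(f"invalid compression level: {level}")
    if level == 0:
        return 0   # store / least effort
    if level <= 3:
        return 1
    if level <= 6:
        return 2
    return 3       # the most compression isa-l offers

assert map_zlib_level_to_isal(1) == 1
assert map_zlib_level_to_isal(9) == 3
```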


You also wrote on python-ideas that

> It is packaged in linux distros already

That might be an option then. CPython could use the existing library if it is available. It doesn't have to ship the sources. Most Linux distributions already build some standard library modules against the system-wide installed libraries rather than whatever CPython ships in its sources. And those distributions could then make the library a fixed dependency of their CPython packages.
msg375593 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-08-18 06:16
> That might be an option then. CPython could use the existing library if it is available.

Dynamic linking indeed seems like a great option here! Users who care about this will probably have the 'isal' and 'libdeflate' packages installed on their machines anyway. I packaged isa-l on conda-forge, so it is available via the anaconda/miniconda installer.
msg375595 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-08-18 07:36
I just found out that libdeflate does not support streaming: https://github.com/ebiggers/libdeflate/issues/73 . I should have read the manual more carefully.

So that explains the memory usage. Because of that, I don't think it is suitable for use in CPython.
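For contrast, the streaming interface the stdlib relies on looks like this with zlib's decompressobj: input is consumed in bounded chunks rather than all at once, so memory use stays bounded by the chunk size regardless of file size:

```python
import zlib

data = b"streaming example " * 10_000
compressed = zlib.compress(data, level=1)

# Feed the compressed stream to the decompressor in small chunks,
# the way GzipFile does it; the whole file never needs to be in memory.
decomp = zlib.decompressobj()
out = bytearray()
for i in range(0, len(compressed), 1024):
    out += decomp.decompress(compressed[i:i + 1024])
out += decomp.flush()

assert bytes(out) == data
```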
msg375678 - (view) Author: Julia Frances (jfrances21) Date: 2020-08-19 22:58
welcome to python.org in this website there are many ways to do python coding. You need to follow the guide and the rules on how to do python coding.
msg375684 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-08-20 01:57
fwiw, no PEP is needed for things like this.  it'd just be an alternative library implementing the core of the zlib and/or gzip modules behind the scenes.

normally this kind of thing would be done using a check for the availability of the library in configure.ac, with conditional compilation via #ifdefs (driven by what the configure output sets in pyconfig.h) in the zlib and gzip modules themselves.
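A hypothetical sketch of such a configure.ac check (the HAVE_LIBISAL macro name is invented here for illustration; isal_deflate is an entry point in isa-l's igzip API):

```sh
# Hypothetical configure.ac fragment: use a system libisal when present.
AC_CHECK_LIB([isal], [isal_deflate],
    [AC_DEFINE([HAVE_LIBISAL], [1],
               [Define if isa-l is available to accelerate zlib/gzip.])
     LIBS="-lisal $LIBS"])
```

The zlib/gzip C modules would then wrap the accelerated code paths in #ifdef HAVE_LIBISAL.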

Within the stdlib, I'd focus only on using things that can be used in a 100% api compatible way with the existing modules.

Otherwise creating a new module and putting it up on PyPI to expose the functionality from the libraries you want makes sense and will be easier to make available to everyone on existing Python versions rather than waiting on getting something into CPython.

FWIW I tend to avoid software provided by Intel given any other choice.

Look instead at Chromium's zlib as discussed in https://github.com/madler/zlib/issues/346, or possibly at a project like https://github.com/zlib-ng/zlib-ng.  These are much more likely to be drop-in zlib replacements, making this simply a build-time thing, which is more a matter of making sure we get the relevant compiler flags correct when the faster libraries are detected by configure.ac as available.

There is a caveat to using any of these: how well maintained and security vetted are all of the code paths in the implementation?  zlib proper gets massive security attention.  Its low rate of change and staleness are a feature.  Chromium's zlib also gets proper security attention.  No idea about zlib-ng, and even less about Intel's self-serving arm-ignoring oddity.
msg375688 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-08-20 06:43
> Within the stdlib, I'd focus only on using things that can be used in a 100% api compatible way with the existing modules.

> Otherwise creating a new module and putting it up on PyPI to expose the functionality from the libraries you want makes sense and will be easier to make available to everyone on existing Python versions rather than waiting on getting something into CPython.

Agreed. 100% backwards compatibility is more important than speed. And getting it available to users is faster as a module than implementing it in CPython. 

I already have a PR open on the xopen module to implement use of the igzip program. Implementing a module like CPython's gzip module will be more work, but I will certainly consider it. Thanks for the suggestion! 
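A sketch of the compatibility property such a module has to preserve, with the stdlib standing in for both sides here: whatever backend produced the bytes, they must round-trip through the stdlib, and vice versa:

```python
import gzip
import zlib

data = b"compatibility corpus " * 5_000

# A drop-in gzip/zlib replacement must round-trip against the stdlib
# in both directions, at every supported level.
for level in (1, 6, 9):
    blob = gzip.compress(data, compresslevel=level)
    assert gzip.decompress(blob) == data
    # gzip members are also readable via zlib with wbits=16+15.
    assert zlib.decompress(blob, wbits=16 + zlib.MAX_WBITS) == data
```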

> There is a caveat to using any of these: how well maintained and security vetted are all of the code paths in the implementation?  zlib proper gets massive security attention.  Its low rate of change and staleness are a feature.

I didn't consider that. So I looked at the CVE page for zlib. The latest issues are from 2017; before that, 2005. This is the 2017 report: https://wiki.mozilla.org/images/0/09/Zlib-report.pdf. 
Note how it states that old compiler support etc. is a basis for vulnerabilities; zlib-ng got rid of precisely these parts. On the other hand, Mozilla notes that zlib is a great candidate for periodic security audits, for the same reasons you mention.

> FWIW I tend to avoid software provided by Intel given any other choice.

I know the feeling. They rightfully have a very bad reputation for things they did in the past. But this particular code is open source and compilable with open source compilers. Part of it is written in Assembly, to get the speed advantage. I benchmarked it on my AMD processor and I too get enormous speed advantages out of it.

>  even less about Intels self serving arm-ignoring oddity.

They *do* provide instructions to build for arm. Right on their README. https://github.com/intel/isa-l#building-isa-l. I think it is very unfair to be so dismissive just because Intel pays the developers. This is good work, which speeds up bioinformatics workloads, which in turn helps us to help more patients.

On the whole I think the arguments to make a module are very strong. So I think that is the appropriate way forward. I'd love everyone to switch to more efficient deflate algorithms, but CPython may not be the right project to drive this change. At least this discussion is now here as a reference for other people who are curious about improving this performance aspect.
msg375718 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2020-08-20 15:48
> On the whole I think the arguments to make a module are very strong. So I think that is the appropriate way forward.

If you take this route, please don't write it directly against the CPython C-API (as you would for a CPython stdlib module). It is much easier to get something working, correct, testable and feature-rich with Cython. Once you have that, put it on PyPI to give it more testing. If you then decide to come back and provide a patch to get it integrated into CPython's zlib module, it's relatively easy to manually translate the Cython code to equivalent C-API code. But by that time, your code will be known to work, will have a test suite, and we can run real benchmarks on it before investing the time into spelling it out in C.
msg375719 - (view) Author: Julia Frances (jfrances21) Date: 2020-08-20 16:03
those are strong and good vocabulary skills you have just keep on making
more comments or responses and get that to me as soon as possible

msg375836 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-08-24 06:17
> If you take this route, please don't write it directly against the CPython C-API (as you would for a CPython stdlib module).

Thanks for reminding me of this. I was planning to take the laziest route possible anyway, reusing as much code from CPython as possible.

I will probably start with a minimal implementation of these base classes https://github.com/python/cpython/blob/master/Lib/_compression.py using Cython, with the gzip implementation in CPython as guidance.
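A minimal stdlib-only sketch of that shape, with zlib's decompressobj standing in for the isa-l bindings the real project would use:

```python
import io
import zlib

class DecompressReader(io.RawIOBase):
    """Minimal streaming reader in the spirit of Lib/_compression.py;
    zlib stands in here for an isa-l backend."""

    def __init__(self, fp, chunk_size=8192):
        self._fp = fp                      # file object yielding compressed bytes
        self._decomp = zlib.decompressobj()
        self._chunk_size = chunk_size

    def readable(self):
        return True

    def readinto(self, b):
        # Produce at most len(b) decompressed bytes, pulling compressed
        # input lazily so memory use stays bounded.
        data = b""
        while not data:
            if self._decomp.eof:
                return 0
            compressed = (self._decomp.unconsumed_tail
                          or self._fp.read(self._chunk_size))
            if not compressed:
                return 0
            data = self._decomp.decompress(compressed, len(b))
        b[:len(data)] = data
        return len(data)

payload = b"hello base classes " * 2_000
stream = io.BytesIO(zlib.compress(payload, 1))
reader = io.BufferedReader(DecompressReader(stream))
assert reader.read() == payload
```

Wrapping the raw reader in io.BufferedReader is the same layering _compression.py uses: the raw class only needs readinto, and buffering, readline, etc. come for free.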
msg376934 - (view) Author: Ruben Vorderman (rhpvorderman) * Date: 2020-09-15 09:19
Hi, thanks all for the comments and the help.

I have created the bindings using Cython. The project is still a work in progress as of this moment. I leave the link here for future reference.

Special thanks to the Cython developers for enabling these bindings.

https://github.com/rhpvorderman/python-isal
History
Date User Action Args
2022-04-11 14:59:34adminsetgithub: 85738
2020-09-15 09:19:04rhpvordermansetstatus: open -> closed
resolution: third party
messages: + msg376934

stage: needs patch -> resolved
2020-08-24 06:17:46rhpvordermansetmessages: + msg375836
2020-08-20 16:03:21jfrances21setmessages: + msg375719
2020-08-20 15:48:27scodersetmessages: + msg375718
2020-08-20 06:43:18rhpvordermansetmessages: + msg375688
2020-08-20 01:57:41gregory.p.smithsetnosy: + gregory.p.smith

messages: + msg375684
stage: patch review -> needs patch
2020-08-20 01:39:15gregory.p.smithsetpull_requests: - pull_request21037
2020-08-19 22:58:25jfrances21sethgrepos: + hgrepo392

keywords: + patch
nosy: + jfrances21
stage: needs patch -> patch review
messages: + msg375678
pull_requests: + pull_request21037
2020-08-18 07:36:54rhpvordermansetmessages: + msg375595
2020-08-18 06:16:59rhpvordermansetmessages: + msg375593
2020-08-17 15:11:05jack1142setnosy: + jack1142
2020-08-17 14:29:19scodersetmessages: + msg375549
2020-08-17 14:17:38rhpvordermansetmessages: + msg375547
2020-08-17 14:05:14scodersetmessages: + msg375546
2020-08-17 14:02:49scodersetstatus: closed -> open

type: performance

nosy: + scoder
messages: + msg375545
resolution: not a bug -> (no value)
stage: resolved -> needs patch
2020-08-17 08:57:31rhpvordermansetstatus: open -> closed
resolution: not a bug
messages: + msg375536

stage: resolved
2020-08-17 08:19:06rhpvordermancreate