Title: add BLAKE3 to hashlib
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.10
Status: open Resolution:
Dependencies: Superseder:
Assigned To: christian.heimes Nosy List: Zooko.Wilcox-O'Hearn, christian.heimes, corona10, jstasiak, kmaork, larry, mgorny, oconnor663, xtreak
Priority: normal Keywords: patch

Created on 2020-01-11 04:27 by larry, last changed 2021-09-05 16:08 by oconnor663.

Messages (19)
msg359777 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-01-11 04:27
From 3/4 of the team that brought you BLAKE2, now comes... BLAKE3!

BLAKE3 is a brand new hashing function.  It's fast, it's parallelizable, and unlike BLAKE2 there's only one variant.

I've experimented with it a little.  On my laptop (2018 Intel i7 64-bit), the portable implementation is kind of middle-of-the-pack, but with AVX2 enabled it's second only to the "Haswell" build of KangarooTwelve.  On a 32-bit ARMv7 machine the results are more impressive--the portable implementation is neck-and-neck with MD4, and with NEON enabled it's definitely the fastest hash function I tested.  These tests are all single-threaded and eliminate I/O overhead.

The above Github repo has a reference implementation in C which includes Intel and ARM SIMD drivers.  Unsurprisingly, the interface looks roughly the same as the BLAKE2 interface(s), so if you took the existing BLAKE2 module and s/blake2b/blake3/ you'd be nearly done.  Not quite as close as blake2b and blake2s though ;-)
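Since the message above says the BLAKE3 interface looks roughly the same as the BLAKE2 interface, here's a sketch of the hashlib-style API a blake3 object would presumably mirror. blake3 isn't in hashlib, so the existing hashlib.blake2b stands in; the calls shown are the ones a blake3 object would copy.

```python
import hashlib

# blake2b supports a configurable digest length, like BLAKE3's
# extendable output; 32 bytes matches BLAKE3's default digest size.
h = hashlib.blake2b(digest_size=32)
h.update(b"hello ")
h.update(b"world")          # incremental hashing via repeated update()
digest = h.hexdigest()

# 32-byte digest -> 64 hex characters
print(len(digest))          # prints 64
```

One-shot construction also works: `hashlib.blake2b(b"hello world", digest_size=32)` produces the same digest as the incremental calls above.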
msg359794 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-01-11 13:37
I've been playing with the new algorithm, too. Pretty impressive!

Let's give the reference implementation a while to stabilize. The code has comments like: "This is only for benchmarking. The guy who wrote this file hasn't touched C since college. Please don't use this code in production."
msg359796 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-01-11 14:06
For what it's worth, I spent some time producing clean benchmarks.  All these were run on the same laptop, and all pre-load the same file (406668786 bytes) and run one update() on the whole thing to minimize overhead.  K12 and BLAKE3 are using a hand-written C driver, compiled with both gcc and clang; all the rest of the algorithms are from python3, configured with --enable-optimizations and compiled with gcc.  K12 and BLAKE3 support several SIMD extensions; this laptop only has AVX2 (no AVX512).  All these numbers are the best of 3.  All tests were run in a single thread.

   hash algorithm|elapsed s |mb/sec    |size|hash
      K12-Haswell 0.176949   2298224495  64  24693954fa0dfb059f99...
K12-Haswell-clang 0.181968   2234841926  64  24693954fa0dfb059f99...
BLAKE3-AVX2-clang 0.250482   1623547723  64  30149a073eab69f76583...
      BLAKE3-AVX2 0.256845   1583326242  64  30149a073eab69f76583...
              md4 0.37684668 1079135924  32  d8a66422a4f0ae430317...
             sha1 0.46739069  870083193  40  a7488d7045591450ded9...
        K12-clang 0.498058    816509323  64  24693954fa0dfb059f99...
           BLAKE3 0.561470    724292378  64  30149a073eab69f76583...
              K12 0.569490    714093306  64  24693954fa0dfb059f99...
     BLAKE3-clang 0.573743    708800001  64  30149a073eab69f76583...
          blake2b 0.58276098  697831191 128  809ca44337af39792f8f...
              md5 0.59936016  678504863  32  306d7de4d1622384b976...
           sha384 0.64208886  633352818  96  b107ce5d086e9757efa7...
       sha512_224 0.66094102  615287556  56  90931762b9e553bd07f3...
       sha512_256 0.66465768  611846969  64  27b03aacdfbde1c2628e...
           sha512 0.6776549   600111921 128  f0af29e2019a6094365b...
          blake2s 0.86828375  468359318  64  02bee0661cd88aa2be15...
           sha256 0.97720436  416155312  64  48b5243cfcd90d84cd3f...
           sha224 1.0255457   396538907  56  10fb56b87724d59761c6...
        shake_128 1.0895037   373260576  32  2ec12727ac9d59c2e842...
         md5-sha1 1.1171806   364013470  72  306d7de4d1622384b976...
         sha3_224 1.2059123   337229156  56  93eaf083ca3a9b348e14...
        shake_256 1.3039152   311882857  64  b92538fd701791db8c1b...
         sha3_256 1.3417314   303092540  64  69354bf585f21c567f1e...
        ripemd160 1.4846368   273918025  40  30f2fe48fec404990264...
         sha3_384 1.7710776   229616579  96  61af0469534633003d3b...
              sm3 1.8384831   221198006  64  1075d29c75b06cb0af3e...
         sha3_512 2.4839673   163717444 128  c7c250e79844d8dc856e...
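The methodology described above (pre-load the data, a single update() on the whole buffer, best of 3, single thread) can be sketched like this. This is an illustrative harness, not the actual script used; it hashes an in-memory buffer rather than the 406668786-byte test file, and only covers the algorithms hashlib ships with.

```python
import hashlib
import time

def bench(name, data, runs=3):
    """Best-of-N throughput for one full-buffer update(), in bytes/sec."""
    best = float("inf")
    for _ in range(runs):
        h = hashlib.new(name)
        start = time.perf_counter()
        h.update(data)  # one update() on the whole buffer, as in the table
        best = min(best, time.perf_counter() - start)
    return len(data) / best

payload = b"\x00" * (16 * 1024 * 1024)  # stand-in for the pre-loaded file
for algo in ("md5", "sha1", "sha256", "blake2b", "blake2s"):
    print(f"{algo:>10} {bench(algo, payload):>15,.0f} bytes/sec")
```

Keeping I/O out of the timed region and taking the best of several runs is what makes numbers like these comparable across algorithms.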

If I can't have BLAKE3, I'm definitely switching to BLAKE2 ;-)
msg359936 - (view) Author: Jack O'Connor (oconnor663) * Date: 2020-01-13 22:51
I'm in the middle of adding some Rust bindings to the C implementation, so that `cargo test` and `cargo bench` can cover both. Once that's done, I'll follow up with benchmark numbers from my laptop (Kaby Lake i5-8250U, also AVX2 with no AVX-512). For benchmark numbers with AVX-512 support, see the Performance section of the BLAKE3 paper. Larry, what processor did you run your benchmarks on?

Also, is there anything currently in CPython that does dispatch based on runtime CPU feature detection? Is this something that BLAKE3 should do for itself, or is there existing machinery that we'd want to integrate with?
msg359941 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-01-13 23:52
According to my order details it is a "8th Generation Intel Core i7-8650U".
msg360152 - (view) Author: Jack O'Connor (oconnor663) * Date: 2020-01-16 23:16
Ok, I've added Rust bindings to the BLAKE3 C implementation, so that I can benchmark it in a vaguely consistent way. My laptop is an i5-8250U, which should be very similar to yours. (Both are "Kaby Lake Refresh".) My end results look similar to yours with TurboBoost on, but pretty different with TurboBoost off:

with TurboBoost on
K12 GCC        | 2159 MB/s
BLAKE3 Rust    | 1787 MB/s
BLAKE3 C Clang | 1588 MB/s
BLAKE3 C GCC   | 1453 MB/s

with TurboBoost off
BLAKE3 Rust    | 1288 MB/s
K12 GCC        | 1060 MB/s
BLAKE3 C Clang | 1094 MB/s
BLAKE3 C GCC   |  943 MB/s

The difference seems to be that with TurboBoost on, the BLAKE3 benchmarks have my CPU sitting around 2.4 GHz, while for the K12 benchmarks it's more like 2.9 GHz. With TurboBoost off, both benchmarks run at 1.6 GHz, and BLAKE3 does better. I'm not sure what causes that frequency difference. Perhaps some high-power instruction that the BLAKE3 implementation is emitting?

To reproduce these numbers you can clone these two repos (the latter is where I happen to have a K12 benchmark):

Then in both cases checkout the "bench_406668786" branch, where I've put some benchmarks with the same input length you used.

For Rust BLAKE3, at the root of the BLAKE3 repo, run: cargo +nightly bench 406668786

For C BLAKE3, the command is the same, but run it in the "./c/blake3_c_rust_bindings" directory. The build defaults to GCC, and you can "export CC=clang" to switch it.

For my K12 benchmark, at the root of the blake2_simd repo, run: cargo +nightly bench --features=kangarootwelve 406668786
msg360215 - (view) Author: Jack O'Connor (oconnor663) * Date: 2020-01-17 22:55
I plan to bring the C code up to speed with the Rust code this week. As part of that, I'll probably remove comments like the one above :) Otherwise, is there anything else we can do on our end to help with this?
msg360535 - (view) Author: Jack O'Connor (oconnor663) * Date: 2020-01-23 03:52
Version 0.1.3 of the official BLAKE3 repo includes some significant performance improvements:

- The x86 implementations include explicit prefetch instructions, which helps with long inputs. (commit b8c33e1)
- The C implementation now uses the same parallel parent hashing strategy that the Rust implementation uses. (commit 163f522)

When I repeat the benchmarks above with TurboBoost on, here's what I see now:

BLAKE3 Rust          2578 MB/s
BLAKE3 C (clang -O3) 2502 MB/s
BLAKE3 C (gcc -O2)   2223 MB/s
K12 C (gcc -O2)      2175 MB/s

Larry, if you have time to repeat your benchmarks with the latest C code, I'd be curious to see if you get similar results.
msg360838 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-01-28 07:10
I gave it a go.  And yup, I see a definite improvement: it jumped from 1,583,326,242 bytes/sec to 2,376,741,703 bytes/sec on my Intel laptop using AVX2.  A 50% improvement!

I also *think* I'm seeing a 10% improvement in ARM using NEON.  On my DE10-Nano board, BLAKE3 portable gets about 50mb/sec, and now BLAKE3 using NEON gets about 55mb/sec.  (Roughly.)  I might have goofed up on the old benchmarks though, or just not written down the final correct numbers.

I observed no statistically significant performance change in the no-SIMD builds on Intel and ARM.

p.s. in my previous comment with that table of benchmarks I said "mb/sec".  I meant "bytes/sec".  Oops!
msg360840 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-01-28 07:26
I just tried it with clang, and uff-da!  2,737,446,868 bytes/sec!

p.s. I compiled with -O3 for both gcc and clang
msg361918 - (view) Author: Jack O'Connor (oconnor663) * Date: 2020-02-12 21:42
Version 0.2.0 of the BLAKE3 repo includes optimized assembly implementations. These are behind the "c" Cargo feature for the `blake3` Rust crate, but included by default for the internal bindings crate. So the easiest way to rerun our favorite benchmark is:

git clone
git fetch
# I rebased this branch on top of version 0.2.0 today.
git checkout origin/bench_406668786
cd c/blake3_c_rust_bindings
# Nightly is currently broken for unrelated reasons, so
# we use stable with this internal bootstrapping flag.
RUSTC_BOOTSTRAP=1 cargo bench 406668786

Running the above on my machine, I get 2888 MB/s, up another 12% from the 0.1.3 numbers. As a bonus, we don't need to worry about the difference between GCC and Clang.

These new assembly files are essentially drop-in replacements for the instruction-set-specific C files we had before, which are also still supported. The updated C README has more details:
msg361925 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-02-13 00:57
Personally I'm enjoying these BLAKE3 status updates, and I wouldn't mind at all being kept up-to-date during BLAKE3's development via messages on this issue.  But, given the tenor of the conversation so far, I'm guessing Python is gonna hold off until BLAKE3 reaches 1.0.
msg363397 - (view) Author: Jack O'Connor (oconnor663) * Date: 2020-03-04 22:31
I've just published some Python bindings for the Rust implementation on PyPI:

> I'm guessing Python is gonna hold off until BLAKE3 reaches 1.0.

That's very fair. The spec and test vectors are set in stone at this point, but the implementations are new, and I don't see any reason to rush things out. (Especially since early adopters can now use the library above.) That said, there aren't really any expected implementation changes that would be a natural moment for the implementations to tag 1.0. I'll probably end up tagging 1.0 as soon as a caller appears who needs it to be tagged to meet their own stability requirements.
msg391355 - (view) Author: Jack O'Connor (oconnor663) * Date: 2021-04-19 03:44
An update a year later: I have a proof-of-concept branch that adds BLAKE3 support to hashlib: That branch is API compatible with the current master branch of Both that module and the upstream BLAKE3 repo are ready to be tagged 1.0, just waiting to see whether any integrations like this one end up requesting changes.

Would anyone be interested in moving ahead with this? One of the open questions would be whether CPython would vendor the BLAKE3 optimized assembly files, or whether we'd prefer to stick to C intrinsics.
msg391356 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2021-04-19 03:56
I note that Python already ships with some #ifdefs around SSE and the like.  So, yes, we already do this sort of thing, although I think this usually uses compiler intrinsics rather than actual assembly.  A quick grep shows zero .s files and only one .asm file (./Modules/_decimal/libmpdec/vcdiv64.asm) in the Python tree.  Therefore it wouldn't be completely novel for Python but it's unusual.

I assume there's a completely generic platform-agnostic C implementation, for build environments where the assembly won't work, yes?

Disclaimer: I've been corresponding with Jack sporadically over the past year regarding the BLAKE3 Python API.  I also think BLAKE3 is super duper cool neat-o, and I have uses for it.  So I'd love to see it in Python 3.10.

One note, just to draw attention to it: the "blake3-py" module, also published by Jack, is written using the Rust implementation, which I understand is even more performant.  Obviously there's no chance Python would ship that implementation.  But by maintaining exact API compatibility between "blake3-py" and the "blake3" added to hashlib, this means code can use the fast one when it's available, and the built-in one when it isn't, a la CStringIO:

    try:
        from blake3 import blake3
    except ImportError:
        from hashlib import blake3
msg391360 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-04-19 04:22
3.10 feature freeze is in two weeks (May 3). I don't feel comfortable adding so much new C code shortly before beta 1. If I understand correctly, the code is new and hasn't been published on PyPI yet. I also don't have much time to properly review the code. OpenSSL 3.0.0 and PEP 644 are keeping me busy.

I would prefer to postpone the inclusion of blake3. Could you please publish the C version on PyPI first and let people test it?

Apropos OpenSSL, do you have plans to submit the algorithm to OpenSSL for inclusion in 3.1.0?
msg391418 - (view) Author: Jack O'Connor (oconnor663) * Date: 2021-04-20 00:48
Hey Christian, yes these are new bindings, and also incomplete. See comments in, but in short only x86-64 Unix is in working order. If 3.10 doesn't seem realistic, I'm happy to go the PyPI route. That said, this is my first time using the Python C API. (My code in that branch is going to make that pretty obvious.) Could you recommend any existing packages that I might be able to use as a model?

For OpenSSL, I'm very interested in the abstract but less familiar with their project and their schedules. Who might be a good person to get in touch with?

> I assume there's a completely generic platform-agnostic C implementation, for build environments where the assembly won't work, yes?

Yes, that's the vendored file blake3_portable.c. One TODO for my branch here is convincing the Python build system not to try to compile the x86-64-specific stuff on other platforms. The vendored file blake3_dispatch.c abstracts over all the different implementations and takes care of #ifdef'ing platform-specific function calls. (It also does runtime CPU feature detection on x86.)
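As an aside, the selection logic blake3_dispatch.c performs in C can be illustrated with a small Python sketch. Every name below is a hypothetical stand-in (the real code is C and keys off CPUID results), but the shape is the same: probe CPU features once, route all calls to the best available implementation, and fall back to the portable one.

```python
# Hypothetical sketch of the dispatch pattern, not the real BLAKE3 symbols.
def compress_avx2(block: bytes) -> bytes: ...
def compress_sse41(block: bytes) -> bytes: ...

def compress_portable(block: bytes) -> bytes:
    # Placeholder body standing in for the portable C implementation.
    return bytes(reversed(block))

# Ordered best-first, mirroring how the C code prefers wider SIMD.
_CANDIDATES = [("avx2", compress_avx2), ("sse41", compress_sse41)]

def detect_cpu_features() -> set:
    # The C code uses CPUID here; this stub reports no SIMD support.
    return set()

def select_compress():
    features = detect_cpu_features()
    for feature, fn in _CANDIDATES:
        if feature in features:
            return fn
    return compress_portable

compress = select_compress()  # resolved once, then used for every block
```

Resolving the function pointer once (rather than re-checking features per call) is the same design choice the C dispatcher makes, and it's why a portable-only build pays no dispatch cost.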

> written using the Rust implementation, which I understand is even more performant

A few details here: The upstream Rust and C implementations have been matched in single threaded performance for a while now. They share the same assembly files, and the rest is a direct port. The big difference is that Rust also includes multithreading support, using the Rayon work-stealing runtime. The blake3-py module based on the Rust crate exposes this with a simple boolean flag, though we've been thinking about ways to give the caller more control over the number of threads used.
msg401070 - (view) Author: Michał Górny (mgorny) * Date: 2021-09-05 06:17
Jack, are you still working on this?  I was considering allocating the time to write the bindings for the C library myself but I've stumbled upon this bug and I suppose there's no point in duplicating work.  I'd love to see it on pypi, so we could play with it a bit.
msg401093 - (view) Author: Jack O'Connor (oconnor663) * Date: 2021-09-05 16:08
Hi Michał, no I haven't done any more work on this since my comments back in April. If you wanted to get started on a PyPI implementation, I think that would be fantastic. I'd be happy to collaborate over email: The branches I linked are still up, but I'm not sure my code will be very useful to someone who actually knows what they're doing :) Larry also had several ideas about how multithreading could fit in (which would be API changes in the Rust case, and forward-looking design work in the C case), and if I get permission from Larry I'll forward those emails.
Date User Action Args
2021-09-05 16:08:23  oconnor663        set  messages: + msg401093
2021-09-05 06:17:54  mgorny            set  nosy: + mgorny; messages: + msg401070
2021-04-20 00:48:16  oconnor663        set  messages: + msg391418
2021-04-19 04:22:12  christian.heimes  set  messages: + msg391360
2021-04-19 03:56:17  larry             set  messages: + msg391356; versions: + Python 3.10, - Python 3.9
2021-04-19 03:44:15  oconnor663        set  messages: + msg391355
2020-03-04 22:31:46  oconnor663        set  messages: + msg363397
2020-02-19 23:21:10  kmaork            set  nosy: + kmaork
2020-02-13 00:57:10  larry             set  messages: + msg361925
2020-02-12 21:42:52  oconnor663        set  messages: + msg361918
2020-02-01 10:11:08  jstasiak          set  nosy: + jstasiak
2020-01-28 07:26:40  larry             set  messages: + msg360840
2020-01-28 07:10:48  larry             set  messages: + msg360838
2020-01-23 03:52:40  oconnor663        set  messages: + msg360535
2020-01-17 22:55:56  oconnor663        set  messages: + msg360215
2020-01-16 23:16:29  oconnor663        set  messages: + msg360152
2020-01-13 23:52:19  larry             set  messages: + msg359941
2020-01-13 22:51:57  oconnor663        set  nosy: + oconnor663; messages: + msg359936
2020-01-11 14:06:33  larry             set  messages: + msg359796
2020-01-11 13:37:00  christian.heimes  set  assignee: christian.heimes; messages: + msg359794
2020-01-11 06:19:09  xtreak            set  nosy: + xtreak
2020-01-11 06:17:15  corona10          set  nosy: + corona10
2020-01-11 04:27:40  larry             create