Version 0.2.0 of the BLAKE3 repo includes optimized assembly implementations. In the `blake3` Rust crate these are behind the "c" Cargo feature, but the internal bindings crate enables them by default. So the easiest way to rerun our favorite benchmark is:

    git clone
    git fetch
    # I rebased this branch on top of version 0.2.0 today.
    git checkout origin/bench_406668786
    cd c/blake3_c_rust_bindings
    # Nightly is currently broken for unrelated reasons, so
    # we use stable with this internal bootstrapping flag.
    RUSTC_BOOTSTRAP=1 cargo bench 406668786

Running the above on my machine, I get 2888 MB/s, up another 12% from the 0.1.3 numbers. As a bonus, because the assembly doesn't go through the C compiler's code generation, we no longer need to worry about performance differences between GCC and Clang.
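For anyone who wants to sanity-check that percentage, here is a minimal sketch of the arithmetic. It assumes "up 12%" means the new figure is 1.12 times the 0.1.3 figure on the same machine. The helper name `implied_baseline` is made up for illustration and is not part of any BLAKE3 crate.

```rust
/// Given a new throughput and the percentage gain over the old one,
/// return the old throughput that those two numbers imply.
/// (Hypothetical helper, for illustration only.)
fn implied_baseline(new_mb_per_s: f64, gain_percent: f64) -> f64 {
    new_mb_per_s / (1.0 + gain_percent / 100.0)
}

fn main() {
    let new = 2888.0; // MB/s, the figure from the run above
    let gain = 12.0; // percent, assumed to be relative to the 0.1.3 result
    let old = implied_baseline(new, gain);
    println!("implied 0.1.3 throughput: about {:.0} MB/s", old);
}
```

Under that assumption the earlier number works out to roughly 2580 MB/s, which is only a derived estimate and not a separately measured result.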

These new assembly files are essentially drop-in replacements for the instruction-set-specific C files we had before, which are also still supported. The updated C README has more details:
