I gave it a go.  And yup, I see a definite improvement: it jumped from 1,583,326,242 bytes/sec to 2,376,741,703 bytes/sec on my Intel laptop using AVX2.  A 50% improvement!

I also *think* I'm seeing a 10% improvement on ARM using NEON.  On my DE10-Nano board, BLAKE3 portable gets about 50 MB/sec, and now BLAKE3 using NEON gets about 55 MB/sec.  (Roughly.)  I might have goofed up the old benchmarks though, or just not written down the final correct numbers.
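
For anyone who wants to double-check the percentages, here's a quick sketch of the arithmetic. The helper name `percent_improvement` is just something I made up for this comment, and the ARM inputs are the rough 50 and 55 figures from above, so treat that result as approximate.

```python
def percent_improvement(old: float, new: float) -> float:
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Intel laptop with AVX2, in bytes/sec (figures quoted above).
intel_gain = percent_improvement(1_583_326_242, 2_376_741_703)

# DE10-Nano, rough figures quoted above: portable ~50 vs. NEON ~55.
# Only the ratio matters here, so the unit cancels out.
arm_gain = percent_improvement(50, 55)

print(f"Intel AVX2: {intel_gain:.1f}%")
print(f"ARM NEON:   {arm_gain:.1f}%")
```

The Intel number comes out just over 50%, and the ARM one is 10% only as far as those rounded inputs can be trusted.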

I observed no statistically significant performance change in the no-SIMD builds on Intel and ARM.

P.S. In my previous comment, the one with that table of benchmarks, I said "mb/sec".  I meant "bytes/sec".  Oops!
