[Tim]
> Perhaps worth noting that FNV-1a works great if used as
> _intended_: on a stream of unsigned bytes.
> ...
>
> Py_uhash_t t = (Py_uhash_t)y;
> for (int i = 0; i < sizeof(t); ++i) {
> x = (x ^ (t & 0xff)) * (Py_uhash_t)1099511628211ULL;
> t >>= 8;
> }
And that's all - no hacks for nested tuples, no hacks for mixed-sign ints, just 100% pure FNV-1a.
So, just for fun, how about 100% pure Bernstein 33A or 33X? Those replace the 3rd line above with, respectively,
x = (x * 33) + (t & 0xff); // 33A
or
x = (x * 33) ^ (t & 0xff); // 33X
And those _also_ work great: no collisions at all across my collection, except for the new tuple test, where they suffer 10 and 8 collisions respectively (note that the new test retains only the 32 least-significant hash bits).
Those are remarkable partly because multiplying by 33 is a weak permutation on its own: it's just a left shift (by 5) and an add. And while you can find odd multipliers with lower order (mod 2**N) than 33, you have to work to find one - it's exceptionally poor in this respect. For example, pow(33, 8, 256) == 1, but half of the odd multipliers in range(256) have order 64 (mod 256). Only 16 of 'em have order <= 8.
Then again, we're applying that weak permutation 8 times per 64-bit input here, and folding in a fresh byte each time. It seems that, overall, that's "stronger" than applying a stronger - but still easily spelled - permutation just one or two times to a 64-bit input.
The only thing I've tried using 64-bit ops that was comparably robust across these tests used the frozenset hash's permutation on the inputs:
Py_uhash_t t = (Py_uhash_t)y;
t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;
x = (x * mult) + t;
That was quite insensitive to the choice of `mult`. For example, in this instead:
Py_uhash_t t = (Py_uhash_t)y;
t ^= t << 1;
x = (x * mult) + t;
the original tuple test had 24 collision and the new test 6 using the current multiplier; changing to the FNV-1a 32-bit multiplier, those zoomed to 857 and 44; then to the FNV-1a 64-bit multiplier, zoomed again to 477 and 2027; then to the "random" 1484268081, way down to 0 and 25; and finally to the "random" 5517301966916670289, down to 0 and 4. Which didn't inspire confidence ;-) |