Author tim.peters
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-09-28.19:15:49
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1538162149.8.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
[Tim]
> Perhaps worth noting that FNV-1a works great if used as
> _intended_:  on a stream of unsigned bytes.
> ...
>
>    Py_uhash_t t = (Py_uhash_t)y;
>    for (int i = 0; i < sizeof(t); ++i) {
>        x = (x ^ (t & 0xff)) * (Py_uhash_t)1099511628211ULL;
>        t >>= 8;
>    }

And that's all - no hacks for nested tuples, no hacks for mixed-sign ints, just 100% pure FNV-1a.

So, just for fun, how about 100% pure Bernstein 33A or 33X?  Those replace the 3rd line above with, respectively,

        x = (x * 33) + (t & 0xff); // 33A

or

        x = (x * 33) ^ (t & 0xff); // 33X


And those _also_ work great:  no collisions at all across my collection, except for the new tuple test, where they suffer 10 and 8 collisions respectively (note that the new test retains only the 32 least-significant hash bits).

Those are remarkable partly because multiplying by 33 is a weak permutation on its own:  it's just a left shift (by 5) and an add.  And while you can find odd multipliers with lower order (mod 2**N) than 33, you have to work to find one - it's exceptionally poor in this respect.  For example, pow(33, 8, 256) == 1, but half of the odd multipliers in range(256) have order 64 (mod 256).  Only 16 of 'em have order <= 8.

Then again, we're applying that weak permutation 8 times per 64-bit input here, and folding in a fresh byte each time.  It seems that, overall, that's "stronger" than applying a stronger - but still easily spelled - permutation just one or two times to a 64-bit input.

The only thing I've tried using 64-bit ops that was comparably robust across these tests used the frozenset hash's permutation on the inputs:

    Py_uhash_t t = (Py_uhash_t)y;
    t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;
    x = (x * mult) + t;

That was quite insensitive to the choice of `mult`.  For example, in this instead:

    Py_uhash_t t = (Py_uhash_t)y;
    t ^= t << 1;
    x = (x * mult) + t;

the original tuple test had 24 collision and the new test 6 using the current multiplier; changing to the FNV-1a 32-bit multiplier, those zoomed to 857 and 44; then to the FNV-1a 64-bit multiplier, zoomed again to 477 and 2027; then to the "random" 1484268081, way down to 0 and 25; and finally to the "random" 5517301966916670289, down to 0 and 4.  Which didn't inspire confidence ;-)
History
Date User Action Args
2018-09-28 19:15:49tim.peterssetrecipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-28 19:15:49tim.peterssetmessageid: <1538162149.8.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-09-28 19:15:49tim.peterslinkissue34751 messages
2018-09-28 19:15:49tim.peterscreate