Author tim.peters
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-10-01.18:47:34
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1538419654.8.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
If SeaHash is interesting to us (it is to me), I suggest dropping our DJB/FNV combining part entirely and using a purer form, like so:

    Py_uhash_t t = (Py_uhash_t)y;
    x ^= t ^ (t << 1);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
    x ^= (x >> 32) >> (x >> 60);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

The only collisions with that were 14 in the new tuple test.  Rotate t right by 3 first, and that drops to 13.  Much better than utter catastrophe ;-)

100% pure SeaHash does

    x ^= t;
    
at the start first, instead of `t ^ (t << 1)` on the RHS.  That fails the second tuple test, with 98 collisions.  Not catastrophic, just "too many".  But there's only so much we can expect from a cheap non-crypto-strength hash.  `t ^ (t << 1)` is a pragmatic tweek to get rid of masses of uselessly repeated sign bits, which do occur in realistic tuples (mixing positive and negative ints).  Adding that tweak shouldn't harm any of SeaHash's provable global properties, since `t ^ (t << 1)` is just a permutation of the input space.

Different constants would be needed for a 32-bit version (best guesses, untried:  a 31-bit random prime; s/32/16/; s/60/29/).

I don't yet know about potential SeaHash licensing issues.  It's part of Rust:

https://docs.rs/seahash/3.0.5/seahash/
History
Date User Action Args
2018-10-01 18:47:34tim.peterssetrecipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-10-01 18:47:34tim.peterssetmessageid: <1538419654.8.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-01 18:47:34tim.peterslinkissue34751 messages
2018-10-01 18:47:34tim.peterscreate