Message 326820 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-10-01.18:47:34
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1538419654.8.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
If SeaHash is interesting to us (it is to me), I suggest dropping our DJB/FNV combining part entirely and using a purer form, like so: Py_uhash_t t = (Py_uhash_t)y; x ^= t ^ (t << 1); x = (Py_uhash_t)0x6eed0e9da4d94a4fULL; x ^= (x >> 32) >> (x >> 60); x = (Py_uhash_t)0x6eed0e9da4d94a4fULL; The only collisions with that were 14 in the new tuple test. Rotate t right by 3 first, and that drops to 13. Much better than utter catastrophe ;-) 100% pure SeaHash does x ^= t; at the start first, instead of `t ^ (t << 1)` on the RHS. That fails the second tuple test, with 98 collisions. Not catastrophic, just "too many". But there's only so much we can expect from a cheap non-crypto-strength hash. `t ^ (t << 1)` is a pragmatic tweek to get rid of masses of uselessly repeated sign bits, which do occur in realistic tuples (mixing positive and negative ints). Adding that tweak shouldn't harm any of SeaHash's provable global properties, since `t ^ (t << 1)` is just a permutation of the input space. Different constants would be needed for a 32-bit version (best guesses, untried: a 31-bit random prime; s/32/16/; s/60/29/). I don't yet know about potential SeaHash licensing issues. It's part of Rust: https://docs.rs/seahash/3.0.5/seahash/

If SeaHash is interesting to us (it is to me), I suggest dropping our DJB/FNV combining part entirely and using a purer form, like so:

    Py_uhash_t t = (Py_uhash_t)y;
    x ^= t ^ (t << 1);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
    x ^= (x >> 32) >> (x >> 60);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

The only collisions with that were 14 in the new tuple test.  Rotate t right by 3 first, and that drops to 13.  Much better than utter catastrophe ;-)

100% pure SeaHash does

    x ^= t;
    
at the start first, instead of `t ^ (t << 1)` on the RHS.  That fails the second tuple test, with 98 collisions.  Not catastrophic, just "too many".  But there's only so much we can expect from a cheap non-crypto-strength hash.  `t ^ (t << 1)` is a pragmatic tweek to get rid of masses of uselessly repeated sign bits, which do occur in realistic tuples (mixing positive and negative ints).  Adding that tweak shouldn't harm any of SeaHash's provable global properties, since `t ^ (t << 1)` is just a permutation of the input space.

Different constants would be needed for a 32-bit version (best guesses, untried:  a 31-bit random prime; s/32/16/; s/60/29/).

I don't yet know about potential SeaHash licensing issues.  It's part of Rust:

https://docs.rs/seahash/3.0.5/seahash/

History
Date	User	Action	Args
2018-10-01 18:47:34	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-10-01 18:47:34	tim.peters	set	messageid: <1538419654.8.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-01 18:47:34	tim.peters	link	issue34751 messages
2018-10-01 18:47:34	tim.peters	create