Message 326874 - Python tracker


Author	jdemeyer
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-10-02.10:41:53
Message-id	<1538476913.47.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content

This weekend I realized something important that I had missed before: some hash functions which I assumed to be good (i.e. with a small chance of collision between any two given tuples) turned out to fail the tests often. The reason is that you don't just want to minimize collisions, you also want to minimize *correlations* between collisions.

More precisely: for a given hash function (considering the multiplier as a parameter), it can happen that there are 4 tuples t, u, v, w such that whether or not hash(t) == hash(u) is correlated with whether or not hash(v) == hash(w). Such correlations greatly increase the standard deviation of the number of collisions in the tests (even if the average is unaffected), which leads to a significant chance of failing the tests.
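To illustrate the effect, here is a minimal simulation sketch. It is not the actual test suite: the function name `collision_counts`, the grouping model and all the numbers are assumptions chosen for illustration. Pairs of tuples are put into groups whose members either all collide or all don't, which is a crude stand-in for correlated collisions. The expected number of collisions is the same in both runs, but the spread is much larger in the correlated one.

```python
import random
import statistics

def collision_counts(n_pairs, p, group_size, n_runs, seed=0):
    """Simulate the total number of collisions among n_pairs tuple pairs.

    Hypothetical model: pairs are split into groups of group_size, and all
    pairs in a group collide together (with probability p) or not at all.
    group_size=1 means the pairs are fully independent.
    """
    rng = random.Random(seed)
    n_groups = n_pairs // group_size
    totals = []
    for _ in range(n_runs):
        colliding_groups = sum(1 for _ in range(n_groups) if rng.random() < p)
        totals.append(colliding_groups * group_size)
    return totals

# Same number of pairs and same per-pair collision probability in both cases.
independent = collision_counts(n_pairs=1000, p=0.01, group_size=1, n_runs=2000)
correlated = collision_counts(n_pairs=1000, p=0.01, group_size=50, n_runs=2000)

for name, totals in [("independent", independent), ("correlated", correlated)]:
    print(name, "mean:", statistics.mean(totals),
          "stdev:", statistics.stdev(totals))
```

In this toy model the mean is 10 collisions in both cases, while the theoretical standard deviation is about 3.1 for independent pairs and about 22 for the correlated groups. A test that fails above a fixed collision threshold would therefore fail far more often in the correlated case.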

With this in mind, I stopped testing pairs of tuples and ran the actual test suites instead. The metric I'm using now is the probability that the test suite passes for a randomly chosen multiplier (congruent to 3 mod 8). For example, the original tuple hash has a probability of around 97% of passing the original test suite.
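The metric can be sketched as a simple Monte Carlo estimate. This is only an outline under stated assumptions: `random_multiplier`, `pass_probability` and the `suite_passes` callback are hypothetical names, the 64-bit width is an assumption, and the real test suite would have to be plugged in as `suite_passes`.

```python
import random

def random_multiplier(rng, bits=64):
    """Return a random multiplier of at most `bits` bits, congruent to 3 mod 8.

    The 64-bit default is an assumption, not something stated in the message.
    """
    return (rng.getrandbits(bits) & ~7) | 3  # clear the low 3 bits, then set them to 011

def pass_probability(suite_passes, trials=1000, seed=0):
    """Estimate the fraction of random multipliers m for which suite_passes(m) is true.

    suite_passes is a hypothetical callback that runs the test suite against
    the hash function parametrized by multiplier m and returns True or False.
    """
    rng = random.Random(seed)
    passed = sum(bool(suite_passes(random_multiplier(rng))) for _ in range(trials))
    return passed / trials
```

With a callback that wraps the original test suite and the original tuple hash, a result near 0.97 would correspond to the figure quoted above.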

None of the hash functions that I tried (DJB or FNV with input mangling like t ^= t << 7) achieved such a high probability of passing the original test. The *only* variation I found that passes the original test, my new test, and a third "random" test (which I haven't mentioned before) with a high enough probability was FNV with input mangling using a second multiplier:

h = 1
for y in INPUT:
    t = hash(y)
    t ^= t * SOME_LARGE_EVEN_NUMBER   # instead of t ^= t << SHIFT
    h = (h ^ t) * MULTIPLIER
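For experimenting, the pseudocode above can be turned into a runnable sketch. Python integers are unbounded, so the version below masks every step to 64 bits to mimic C unsigned wraparound; that width is an assumption. Both constants are placeholders, not values from this message: `MULTIPLIER` is the standard 64-bit FNV prime (which happens to be 3 mod 8), and `MANGLE` is an arbitrary large even number.

```python
MASK = (1 << 64) - 1             # assumption: 64-bit unsigned arithmetic, as in C

# Placeholder constants for illustration only, not the values proposed in the issue.
MULTIPLIER = 0x100000001B3       # standard 64-bit FNV prime, congruent to 3 mod 8
MANGLE = 0x9E3779B97F4A7C14      # arbitrary large even number (hypothetical)

def tuple_hash_sketch(items, multiplier=MULTIPLIER, mangle=MANGLE):
    """FNV-style hash with multiplicative input mangling, per the pseudocode above."""
    h = 1
    for y in items:
        t = hash(y) & MASK                   # treat the item hash as unsigned
        t ^= (t * mangle) & MASK             # instead of t ^= t << SHIFT
        h = ((h ^ t) * multiplier) & MASK    # FNV-1a style combine step
    return h

print(hex(tuple_hash_sketch((1, 2, 3))))
print(tuple_hash_sketch((1, 2)) != tuple_hash_sketch((2, 1)))
```

Because `mangle` is even, `t * mangle` has a zero low bit, so the low bit of `t` survives the XOR unchanged. Since `multiplier` is odd, the combine step is invertible modulo 2**64.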

History
Date	User	Action	Args
2018-10-02 10:41:53	jdemeyer	set	recipients: + jdemeyer, tim.peters, rhettinger, mark.dickinson, eric.smith, sir-sigurd
2018-10-02 10:41:53	jdemeyer	set	messageid: <1538476913.47.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-02 10:41:53	jdemeyer	link	issue34751 messages
2018-10-02 10:41:53	jdemeyer	create