This weekend I realized something important which I didn't realize before: some hash functions which I assumed to be good (i.e. small chance of collisions between any given two tuples) turned out to often fail the tests. This is because you don't just want to minimize collisions, you also want to minimize *correlations* between collisions.
More precisely: for a given hash function (considering the multiplier as parameter), it can happen that there are 4 tuples t, u, v, w such that whether or not hash(t) == hash(u) is correlated to whether or not hash(v) == hash(w). Such correlations increase the standard deviation of the number of collisions in the tests a lot (even if the average is unaffected), which leads to significant chances of failing the tests.
So with this in mind I stopped testing pairs of tuples but I ran the actual testsuites. The metric I'm using is now the probability that the testsuite passes for randomly chosen multipliers (3 mod 8). For example, the original tuple hash has a probability of around 97% of passing the original testsuite.
None of the hash functions that I tried (DJB or FNV with input mangling like t ^= t << 7) achieved such a high probability of passing the original test. The *only* variation that I found which passes the original test and my new test (and a third "random" test which I haven't mentioned before) with a high enough probability was FNV with input mangling with a second multiplier:
h = 1
for y in INPUT:
t = hash(y)
t ^= t * SOME_LARGE_EVEN_NUMBER # instead of t ^= t << SHIFT
h = (h ^ t) * MULTIPLIER |