> advantage of my approach is that high-order bits become more
> important:
I don't much care about high-order bits, beyond that we don't systematically _lose_ them. The dict and set lookup routines have their own strategies for incorporating high-order bits into the probe sequence, and those routines are the primary reason hash() exists in Python.
So there's scant reason to try to let high bits influence low bits (but good reason _not_ to, because - as explained before - the low-order bits are already in great shape for the important tuple component types).
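For readers unfamiliar with that strategy: the dict/set lookup's way of folding high bits into the probe order can be sketched in pure Python. This mirrors the perturb scheme described in the comments of Objects/dictobject.c (PERTURB_SHIFT is 5 there); the helper below is only an illustration, not the actual C code, which works on the table's internal indices.

```python
PERTURB_SHIFT = 5

def probe_sequence(h, mask, n=16):
    """First n slots probed for hash h in a table of size mask + 1.

    The first slot uses only the hash's low bits, but `perturb`
    shifts the high bits into later probes, 5 bits at a time, so
    hashes differing only in high bits eventually diverge.
    """
    perturb = h & ((1 << 64) - 1)   # treat the hash as unsigned
    i = h & mask
    slots = [i]
    for _ in range(n - 1):
        perturb >>= PERTURB_SHIFT
        i = (5 * i + 1 + perturb) & mask
        slots.append(i)
    return slots
```

Two hashes that agree in their low bits start out probing the same slots, but once the high bits have been shifted down into `perturb` the sequences separate, which is why hash() itself doesn't need to mix high bits into low.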
Something else appears to be going on in this specific example, which has a very touchy set of hash codes:
> BEFORE:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
> 500000
Which is what I get too on (at least) a 64-bit build.
> AFTER:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
> 1000000
Ditto. However, I get exactly the same from the code I posted before, which has no right shifts at all.
More, if I take the code exactly as it exists today, and merely comment out the
mult += (Py_hash_t)(82520UL + len + len);
line, I also get a million distinct hash codes.
Half of the 100 item hash codes (those for odd n) have bit 2**60 set, and apart from those all the hash codes have no bits set higher than 2**6. Since the tuple hash xors each item hash into the accumulator before multiplying, any number of accidents could cause the 2**60 bits to wipe each other out, and merely changing the sequence of multipliers was demonstrably enough to stop that.
Indeed, merely changing 82520 to 82524 in the current code is enough to get a million unique hash codes in this case.
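These experiments can be rerun without rebuilding CPython with a small Python model of the pre-3.8 tuplehash() loop (x = (x ^ y) * mult, with mult bumped by 82520 + len + len each step). This is a sketch under two assumptions: int_hash models hash() for non-negative ints on a 64-bit build (reduction mod 2**61 - 1), and the `delta` parameter is a made-up knob standing in for the literal 82520 so the 82524 variant can be tried too.

```python
MASK = (1 << 64) - 1            # Py_uhash_t wraps at 2**64 on a 64-bit build
PYHASH_MODULUS = (1 << 61) - 1  # int-hash modulus on a 64-bit build

def int_hash(n):
    """Model of hash() for a non-negative int on a 64-bit build."""
    return n % PYHASH_MODULUS

def tuple3_hash(y1, y2, y3, delta=82520):
    """Model of the pre-3.8 tuplehash() for a 3-tuple of item hashes."""
    x = 0x345678
    mult = 1000003               # _PyHASH_MULTIPLIER
    for y, len_left in ((y1, 2), (y2, 1), (y3, 0)):
        x = ((x ^ y) * mult) & MASK
        mult = (mult + delta + len_left + len_left) & MASK
    return (x + 97531) & MASK

def distinct_count(delta):
    """len(set(hash(t) for t in T)) for the million 3-tuples above.

    Same arithmetic as tuple3_hash, unrolled so the partial hash of
    (a, b) is computed once per pair instead of once per tuple.
    """
    H = [int_hash(n << 60) for n in range(100)]
    m1 = (1000003 + delta + 4) & MASK
    m2 = (m1 + delta + 2) & MASK
    seen = set()
    for a in H:
        xa = ((0x345678 ^ a) * 1000003) & MASK
        for b in H:
            xb = ((xa ^ b) * m1) & MASK
            for c in H:
                seen.add(((((xb ^ c) * m2) & MASK) + 97531) & MASK)
    return len(seen)
```

If the model is faithful, distinct_count(82520) should reproduce the 500000 figure from the BEFORE run, and distinct_count(82524) the full million.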