Logged In: YES
user_id=31435
I can't make more time for this, so unassigned myself.
This seems to make a good set of test inputs:
N = whatever
base = range(N)
xp = [(i, j) for i in base for j in base]
inps = base + [(i, j) for i in base for j in xp] + \
[(i, j) for i in xp for j in base]
When N == 50, inps contains 250,050 distinct elements, and
66,966 of them generate a duplicated hash code. Most
simple changes to the current algorithm don't do significantly
better, and some do much worse. For example, just replacing
the xor with an add zooms it to 119,666 duplicated hash
codes across that set.
Here's a hash function that yields no collisions across that
set:
def hx(x):
if isinstance(x, tuple):
result = 0x345678L
mult = 1000003
for elt in x:
y = hx(elt)
result = (((result + y) & 0xffffffffL) * mult) &
0xffffffffL
mult = (mult * 69069) & 0xffffffffL
result ^= len(x)
if result == -1:
result = -2
else:
result = hash(x)
return result
In C, none of the "& 0xffffffffL" clauses are needed; in Python
code, they're needed to cut results back to 32 bits. 69069 is
a well-known multiplier for a better-than-most pure
multiplicative PRNG.
This does add a multiply per loop trip over what we have
today, but it can proceed in parallel with the existing
multiply. Unlike a canned table, it won't repeat multipliers,
(unless the tuple has more than a billion elements). |