Author tim.peters
Recipients
Date 2004-05-16.21:54:55
SpamBayes Score
Marked as misclassified
Message-id
In-reply-to
Content
Logged In: YES 
user_id=31435

I can't make more time for this, so unassigned myself.

This seems to make a good set of test inputs:

N = whatever
base = range(N)
xp = [(i, j) for i in base for j in base]
inps = base + [(i, j) for i in base for j in xp] + \
             [(i, j) for i in xp for j in base]

When N == 50, inps contains 250,050 distinct elements, and 
66,966 of them generate a duplicated hash code.  Most 
simple changes to the current algorithm don't do significantly 
better, and some do much worse.  For example, just replacing 
the xor with an add zooms it to 119,666 duplicated hash 
codes across that set.

Here's a hash function that yields no collisions across that 
set:

def hx(x):
    if isinstance(x, tuple):
        result = 0x345678L
        mult = 1000003
        for elt in x:
            y = hx(elt)
            result = (((result + y) & 0xffffffffL) * mult) & 
0xffffffffL
            mult = (mult * 69069) & 0xffffffffL
        result ^= len(x)
        if result == -1:
            result = -2
    else:
        result = hash(x)
    return result

In C, none of the "& 0xffffffffL" clauses are needed; in Python 
code, they're needed to cut results back to 32 bits.  69069 is 
a well-known multiplier for a better-than-most pure 
multiplicative PRNG.

This does add a multiply per loop trip over what we have 
today, but it can proceed in parallel with the existing 
multiply.  Unlike a canned table, it won't repeat multipliers, 
(unless the tuple has more than a billion elements).
History
Date User Action Args
2007-08-23 14:21:13adminlinkissue942952 messages
2007-08-23 14:21:13admincreate