Author smst
Date 2004-04-27.11:41:34
SpamBayes Score
Marked as misclassified
I've encountered some performance problems when
constructing dictionaries with keys of a particular
form, and have pinned the problem down to the hashing
function.  I've reproduced the effect in Python 1.5.2,
Python 2.2 and Python 2.3.3.  I came across this when
loading a marshalled dictionary with about 40000
entries.  Loading other dictionaries of this size has
always been fast, but in this case I killed the
interpreter after a few minutes of CPU thrashing.  The
performance problem was caused because every key in the
dictionary had the same hash value.

The problem is as follows: for hashable values X and Y,
hash( (X, (X, Y)) ) == hash(Y).  This is always true
(except in the corner case where hash((X, Y)) is
internally calculated to be -1 (the error value) and so
-2 is forced as the return value).  With data in this
form where X varies more than Y (Y is constant, or
chosen from relatively few values compared to X)
chances of collision become high as X is effectively

The hash algorithm for tuples starts with a seed value,
then generates a new value for each item in the tuple
by multiplying the iteration's starting value by a
constant (keeping the lowest 32 bits) and XORing with
the hash of the item.  The final value is then XORed
with the tuple's length.  In Python (ignoring the
careful business with -1):

    # assume 'my_mul' would multiply two numbers and
return the low 32 bits
    value = seed
    for item in tpl:
        value = my_mul(const, value) ^ hash(item)
    value = value ^ len(tpl)

The tuple (X, Y) therefore has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash(Y) ^ 2
...and the tuple (X, (X, Y)) has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash((X, Y)) ^ 2

The outer multiplication is repeated, and is XORed with
itself (cancelling both of them). The XORed 2s cancel
also, leaving just hash(Y).  Note that this
cancellation property also means that the same hash
value is shared by (X, (X, (X, (X, Y)))), and (X, (X,
(X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z,
Y)) and so on.

I realise that choosing a hash function is a difficult
task, so it may be that the behaviour seen here is a
tradeoff against other desireable properties -- I don't
have the expertise to tell.  My naive suggestion would
be that an extra multiplication is necessary, which
presumably has a performance impact (multiplication
being more expensive than XOR) but would reduce the
possibility of cancellation. On the other hand, perhaps
this particular case is rare enough that it's not worth
the effort.

For my own application I'm fortunate in that I can
probably rearrange the data structure to avoid this case.

Date User Action Args
2007-08-23 14:21:12adminlinkissue942952 messages
2007-08-23 14:21:12admincreate