Message 20593 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	smst
Recipients
Date	2004-04-27.11:41:34
SpamBayes Score
Marked as misclassified
Message-id
In-reply-to

Content
I've encountered some performance problems when constructing dictionaries with keys of a particular form, and have pinned the problem down to the hashing function. I've reproduced the effect in Python 1.5.2, Python 2.2 and Python 2.3.3. I came across this when loading a marshalled dictionary with about 40000 entries. Loading other dictionaries of this size has always been fast, but in this case I killed the interpreter after a few minutes of CPU thrashing. The performance problem was caused because every key in the dictionary had the same hash value. The problem is as follows: for hashable values X and Y, hash( (X, (X, Y)) ) == hash(Y). This is always true (except in the corner case where hash((X, Y)) is internally calculated to be -1 (the error value) and so -2 is forced as the return value). With data in this form where X varies more than Y (Y is constant, or chosen from relatively few values compared to X) chances of collision become high as X is effectively ignored. The hash algorithm for tuples starts with a seed value, then generates a new value for each item in the tuple by multiplying the iteration's starting value by a constant (keeping the lowest 32 bits) and XORing with the hash of the item. The final value is then XORed with the tuple's length. In Python (ignoring the careful business with -1): # assume 'my_mul' would multiply two numbers and return the low 32 bits value = seed for item in tpl: value = my_mul(const, value) ^ hash(item) value = value ^ len(tpl) The tuple (X, Y) therefore has hash value: my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash(Y) ^ 2 ...and the tuple (X, (X, Y)) has hash value: my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash((X, Y)) ^ 2 The outer multiplication is repeated, and is XORed with itself (cancelling both of them). The XORed 2s cancel also, leaving just hash(Y). Note that this cancellation property also means that the same hash value is shared by (X, (X, (X, (X, Y)))), and (X, (X, (X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z, Y)) and so on. I realise that choosing a hash function is a difficult task, so it may be that the behaviour seen here is a tradeoff against other desireable properties -- I don't have the expertise to tell. My naive suggestion would be that an extra multiplication is necessary, which presumably has a performance impact (multiplication being more expensive than XOR) but would reduce the possibility of cancellation. On the other hand, perhaps this particular case is rare enough that it's not worth the effort. For my own application I'm fortunate in that I can probably rearrange the data structure to avoid this case.

I've encountered some performance problems when
constructing dictionaries with keys of a particular
form, and have pinned the problem down to the hashing
function.  I've reproduced the effect in Python 1.5.2,
Python 2.2 and Python 2.3.3.  I came across this when
loading a marshalled dictionary with about 40000
entries.  Loading other dictionaries of this size has
always been fast, but in this case I killed the
interpreter after a few minutes of CPU thrashing.  The
performance problem was caused because every key in the
dictionary had the same hash value.

The problem is as follows: for hashable values X and Y,
hash( (X, (X, Y)) ) == hash(Y).  This is always true
(except in the corner case where hash((X, Y)) is
internally calculated to be -1 (the error value) and so
-2 is forced as the return value).  With data in this
form where X varies more than Y (Y is constant, or
chosen from relatively few values compared to X)
chances of collision become high as X is effectively
ignored.

The hash algorithm for tuples starts with a seed value,
then generates a new value for each item in the tuple
by multiplying the iteration's starting value by a
constant (keeping the lowest 32 bits) and XORing with
the hash of the item.  The final value is then XORed
with the tuple's length.  In Python (ignoring the
careful business with -1):

    # assume 'my_mul' would multiply two numbers and
return the low 32 bits
    value = seed
    for item in tpl:
        value = my_mul(const, value) ^ hash(item)
    value = value ^ len(tpl)

The tuple (X, Y) therefore has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash(Y) ^ 2
    
...and the tuple (X, (X, Y)) has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash((X, Y)) ^ 2

The outer multiplication is repeated, and is XORed with
itself (cancelling both of them). The XORed 2s cancel
also, leaving just hash(Y).  Note that this
cancellation property also means that the same hash
value is shared by (X, (X, (X, (X, Y)))), and (X, (X,
(X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z,
Y)) and so on.

I realise that choosing a hash function is a difficult
task, so it may be that the behaviour seen here is a
tradeoff against other desireable properties -- I don't
have the expertise to tell.  My naive suggestion would
be that an extra multiplication is necessary, which
presumably has a performance impact (multiplication
being more expensive than XOR) but would reduce the
possibility of cancellation. On the other hand, perhaps
this particular case is rare enough that it's not worth
the effort.

For my own application I'm fortunate in that I can
probably rearrange the data structure to avoid this case.

History
Date	User	Action	Args
2007-08-23 14:21:12	admin	link	issue942952 messages
2007-08-23 14:21:12	admin	create