classification
Title: Weakness in tuple hash
Type: Stage:
Components: Interpreter Core Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: arigo, jimjjewett, rhettinger, smst, tim.peters, ygale
Priority: high Keywords:

Created on 2004-04-27 11:41 by smst, last changed 2004-06-01 06:36 by rhettinger. This issue is now closed.

Messages (18)
msg20593 - (view) Author: Steve Tregidgo (smst) Date: 2004-04-27 11:41
I've encountered some performance problems when
constructing dictionaries with keys of a particular
form, and have pinned the problem down to the hashing
function.  I've reproduced the effect in Python 1.5.2,
Python 2.2 and Python 2.3.3.  I came across this when
loading a marshalled dictionary with about 40000
entries.  Loading other dictionaries of this size has
always been fast, but in this case I killed the
interpreter after a few minutes of CPU thrashing.  The
performance problem was caused because every key in the
dictionary had the same hash value.

The problem is as follows: for hashable values X and Y,
hash( (X, (X, Y)) ) == hash(Y).  This is always true
(except in the corner case where hash((X, Y)) is
internally calculated to be -1 (the error value) and so
-2 is forced as the return value).  With data in this
form where X varies more than Y (Y is constant, or
chosen from relatively few values compared to X)
chances of collision become high as X is effectively
ignored.

The hash algorithm for tuples starts with a seed value,
then generates a new value for each item in the tuple
by multiplying the iteration's starting value by a
constant (keeping the lowest 32 bits) and XORing with
the hash of the item.  The final value is then XORed
with the tuple's length.  In Python (ignoring the
careful business with -1):

    # assume 'my_mul' would multiply two numbers and
return the low 32 bits
    value = seed
    for item in tpl:
        value = my_mul(const, value) ^ hash(item)
    value = value ^ len(tpl)

The tuple (X, Y) therefore has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash(Y) ^ 2
    
...and the tuple (X, (X, Y)) has hash value:

    my_mul(const, my_mul(const, seed) ^ hash(X)) ^
hash((X, Y)) ^ 2

The outer multiplication is repeated, and is XORed with
itself (cancelling both of them). The XORed 2s cancel
also, leaving just hash(Y).  Note that this
cancellation property also means that the same hash
value is shared by (X, (X, (X, (X, Y)))), and (X, (X,
(X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z,
Y)) and so on.

I realise that choosing a hash function is a difficult
task, so it may be that the behaviour seen here is a
tradeoff against other desireable properties -- I don't
have the expertise to tell.  My naive suggestion would
be that an extra multiplication is necessary, which
presumably has a performance impact (multiplication
being more expensive than XOR) but would reduce the
possibility of cancellation. On the other hand, perhaps
this particular case is rare enough that it's not worth
the effort.

For my own application I'm fortunate in that I can
probably rearrange the data structure to avoid this case.

msg20594 - (view) Author: Steve Tregidgo (smst) Date: 2004-04-27 11:45
Logged In: YES 
user_id=42335

I'll repeat the tuple hashing algorithm in a fixed-width
font (with the first line following "for..." being the loop
body):

 # assume 'my_mul' would multiply two numbers and
return the low 32 bits
value = seed
for item in tpl:
    value = my_mul(const, value) ^ hash(item)
value = value ^ len(tpl)
msg20595 - (view) Author: Jim Jewett (jimjjewett) Date: 2004-04-28 14:50
Logged In: YES 
user_id=764593

Any hash will have some collisions.  If there is going to be 
predictably bad data, this is probably a good place to have it.

The obvious alternatives are a more complicated hash (slows 
everything down), a different hash for embedded tuples (bad, 
since hash can't be cached then) or ignoring some elements 
when determining the hash (bad in the normal case of different 
data).

I would also expect your workaround of data rearrangement to 
be sensible almost any time (X, (X, Y)) is really a common 
case.  (The intuitive meaning for me is "X - then map X to Y", 
which could be done as (X, Y) or at least (X, (None, Y)), or 
perhaps d[X]=(X,Y).)

msg20596 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2004-04-28 16:36
Logged In: YES 
user_id=80475

The OP was not referring to "some collisions"; his app
collapsed all entries to a single hash value.  Changing XOR
to + would partially eliminate the self cancelling property
of this hash function. 

Also, I am concerned about the tuple hash using the same
multiplier as the hash for other objects. In sets.py, a
naive combination of the component hash values caused many
distinct sets to collapse to a handful of possibilities --
while tuples do not have an identical issue, it does
highlight the risks involved.

msg20597 - (view) Author: Jim Jewett (jimjjewett) Date: 2004-04-28 17:45
Logged In: YES 
user_id=764593

Wouldn't that just shift the pathological case from (x, (x, y)) 
to (x, (-x, y))?  My point was that any hash function will act 
badly on *some* pattern of input, and if a pattern must be 
penalized, this might be a good pattern to choose.
msg20598 - (view) Author: Yitz Gale (ygale) Date: 2004-05-02 02:50
Logged In: YES 
user_id=1033539

I suggest leaving the function as is, but replacing the
constant with a value that varies as we step through the
tuple. Take the values from a fixed table. When we reach the
end of the table, start again from the beginning and mung
the values slightly (e.g., increment them).

True, there will always be the possibilities of collisions,
but this will make it very unlikely except in very weird
cases. Using a table of constants is a standard technique
for hashes.
msg20599 - (view) Author: Jim Jewett (jimjjewett) Date: 2004-05-03 13:40
Logged In: YES 
user_id=764593

Note that in this case, the problem was because of nested 
tuples.  X was the first element of both tuples, and both 
tuples had the same length -- so the X's would still have been 
multiplied by the same constant.  (Not the same constant as 
Y, and hash(X, Y), but the same constant as each other.)  A 
non-linear function might work, but would do bad things to 
other data.

msg20600 - (view) Author: Yitz Gale (ygale) Date: 2004-05-16 12:25
Logged In: YES 
user_id=1033539

Hmm, you are correct. This is appears to be
an off-by-one problem: the original seed always gets
multiplied by the constant (which is silly), and the last
item in the tuple does not get multiplied (which causes
the bug).

The correct solution is to change:

value = my_mul(const, value) ^ hash(item)

in Steve's pseudo-code to:

value = my_mul(const, value ^ hash(item))

Of course, you still get a lot more robustness
for almost no cost if you vary "const" across
the tuple via a table.
msg20601 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-05-16 21:09
Logged In: YES 
user_id=31435

Noting that

    (X, (Y, Z))

and

    (Y, (X, Z))

hash to the same thing today too (module -1 endcases).  For 
example,

>>> hash((3, (4, 8))) == hash((4, (3, 8)))
True
>>> 
msg20602 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-05-16 21:54
Logged In: YES 
user_id=31435

I can't make more time for this, so unassigned myself.

This seems to make a good set of test inputs:

N = whatever
base = range(N)
xp = [(i, j) for i in base for j in base]
inps = base + [(i, j) for i in base for j in xp] + \
             [(i, j) for i in xp for j in base]

When N == 50, inps contains 250,050 distinct elements, and 
66,966 of them generate a duplicated hash code.  Most 
simple changes to the current algorithm don't do significantly 
better, and some do much worse.  For example, just replacing 
the xor with an add zooms it to 119,666 duplicated hash 
codes across that set.

Here's a hash function that yields no collisions across that 
set:

def hx(x):
    if isinstance(x, tuple):
        result = 0x345678L
        mult = 1000003
        for elt in x:
            y = hx(elt)
            result = (((result + y) & 0xffffffffL) * mult) & 
0xffffffffL
            mult = (mult * 69069) & 0xffffffffL
        result ^= len(x)
        if result == -1:
            result = -2
    else:
        result = hash(x)
    return result

In C, none of the "& 0xffffffffL" clauses are needed; in Python 
code, they're needed to cut results back to 32 bits.  69069 is 
a well-known multiplier for a better-than-most pure 
multiplicative PRNG.

This does add a multiply per loop trip over what we have 
today, but it can proceed in parallel with the existing 
multiply.  Unlike a canned table, it won't repeat multipliers, 
(unless the tuple has more than a billion elements).
msg20603 - (view) Author: Yitz Gale (ygale) Date: 2004-05-17 13:09
Logged In: YES 
user_id=1033539

Tim, your test fixture is very nice.

I verified your result that there are no collisions
over the fixture set for the algorithm you specified.
Based on that result, I would recommend that we
NOT use your algorithm.

The probability of no collisions is near zero for
250000 random samples taken from a set of 2^32
elements. So the result shows that your algorithm
is far from a good hash function, though it is
constructed to be great for that specific fixture set.

I ran my algorithm on your fixture set using the
table of 64 constants taken from MD5. I got 13
collisions, a reasonable result.

It is not true that I repeat multipliers (very often) -
note that I increment the elements of the table each
time around, so the sequence repeats itself only
after 2^38 tuple elements. Also, my algorithm runs
at esentially the same speed as the current one:
no additional multiplies, only a few adds and increments
and such.
msg20604 - (view) Author: Yitz Gale (ygale) Date: 2004-05-17 13:39
Logged In: YES 
user_id=1033539

Here is the Python code for the hash function
that I am describing:

def hx(x):
    if isinstance(x, tuple):
        result = 0x345678L
        for n, elt in enumerate(x):
            y = (tbl[n & 0x3f] + (n >> 6)) & 0xffffffffL
            result = (y * (result ^ hx(elt))) & 0xffffffffL
        result ^= len(x)
    else:
        result = hash(x)
    return result

# 64 constants taken from MD5
# (see Modules/md5c.c in the Python sources)
tbl = (
    -0x28955b88, -0x173848aa,  0x242070db, -0x3e423112,
    -0xa83f051,   0x4787c62a, -0x57cfb9ed, -0x2b96aff,
     0x698098d8, -0x74bb0851, -0xa44f,     -0x76a32842,
     0x6b901122, -0x2678e6d,  -0x5986bc72,  0x49b40821,
    -0x9e1da9e,  -0x3fbf4cc0,  0x265e5a51, -0x16493856,
    -0x29d0efa3,  0x2441453,  -0x275e197f, -0x182c0438,
     0x21e1cde6, -0x3cc8f82a, -0xb2af279,   0x455a14ed,
    -0x561c16fb, -0x3105c08,   0x676f02d9, -0x72d5b376,
    -0x5c6be,    -0x788e097f,  0x6d9d6122, -0x21ac7f4,
    -0x5b4115bc,  0x4bdecfa9, -0x944b4a0,  -0x41404390,
     0x289b7ec6, -0x155ed806, -0x2b10cf7b,  0x4881d05,
    -0x262b2fc7, -0x1924661b,  0x1fa27cf8, -0x3b53a99b,
    -0xbd6ddbc,   0x432aff97, -0x546bdc59, -0x36c5fc7,
     0x655b59c3, -0x70f3336e, -0x100b83,   -0x7a7ba22f,
     0x6fa87e4f, -0x1d31920,  -0x5cfebcec,  0x4e0811a1,
    -0x8ac817e,  -0x42c50dcb,  0x2ad7d2bb, -0x14792c6f
    )
msg20605 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-05-18 02:14
Logged In: YES 
user_id=31435

Nothing wrong with being "better than random" -- it's not the 
purpose of Python's hash() to randomize, but to minimize 
hash collisions.  That's why, e.g., hash(int) returns int 
unchanged.  Then when indexing a dict by any contiguous 
range of ints (excepting -1), there are no collisions at all.  
That's a good thing.

There's nothing in the structure of my example function 
geared toward the specific test instances used.  It *should* 
do a good job of "spreading out" inputs that are "close 
together", and that's an important case, and the test inputs 
have a lot of that.  It's a good thing that it avoids all 
collisions across them.

About speed, the only answer is to time both, and across 
several major platforms.  An extra mult may or may not cost 
more than blowing extra cache lines to suck up a table of ints.

About the table of ints, it's always a dubious idea to multiply 
by an even number.  Because the high bits of the product are 
thrown away, multiplying by an even number throws away 
information needlessly (it throws away as many useful bits as 
there are trailing zero bits in the multiplier).  odd_number * 
69069**i is always odd, so the multiplier in the table-free 
way is never even (or, worse, divisible by 4, 8, 16, etc).
msg20606 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2004-05-19 15:10
Logged In: YES 
user_id=4771

Tim:

> Most simple changes to the current algorithm don't
> do significantly better, and some do much worse.

The simple change suggested by ygale (put the xor before the
multiply) does a quite reasonable job in your test suite. 
The number of collisions drop from 66966 to 1466.  It also
solves the original problem with (x, (x,y)), which should
not be ignored because it could naturally occur if someone
is using d.items() to compute a hash out of a dict built by
d[x] = x,y.

With this change it seems more difficult to find examples of
regular families of tuples that all turn out to have the
same hash value.  Such examples do exist: (x,
(1684537936,x)) yields 2000004 for any x.  But the value
1684537936 has been carefully constructed for this purpose,
it's the only number to misbehave like this :-)
msg20607 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-05-19 16:15
Logged In: YES 
user_id=31435

If 250,050 balls are tossed into 2**32 bins at random, the 
mean number of occupied bins is ~250042.7 (== expected 
number of collisions is ~7.3), with std deviation ~2.7.  While 
we're happy to get "better than random" 
distributions, "astronomically worse than random" distributions 
usually predict more of the same kind of problem reported 
later.

That said, I certainly agree that would be a major, easy, and 
cheap improvement over the status quo.
msg20608 - (view) Author: Yitz Gale (ygale) Date: 2004-05-19 23:24
Logged In: YES 
user_id=1033539

I uploaded my current diff and some unit tests
to patch #957122.

Please look over the unit tests carefully. I am
sure there is much more we can do.

Note that besides the addition of the table containing
the constants, only three lines of code are changed
from the original function.
By inspection, this code should run at essentially
the same speed as the old code on all platforms
(tests anyone?)

Based on the results of the unit tests, I had to
do some tweaking.

First of all, Tim is correct: constants that are divisible
by any
small prime (e.g., 2) are a problem. To be on the safe side,
I replaced each md5 constant with a nearby prime.

I ran into a problem with empty tuples nested in tuples.
The seed kept getting xored with itself. It turns out that
no matter how nice the md5 constants are, they all
produce the same value when multiplied by zero.
Find "++x" in the code to see the fix for this.

Finally, I moved the xor with the tuple length from
the end of the calculation to the beginning. That
amplifies the effect very nicely.
msg20609 - (view) Author: Yitz Gale (ygale) Date: 2004-05-20 07:16
Logged In: YES 
user_id=1033539

Re Tim's concern about cache hits:

The trade-off is a slight risk of slight performance
degradation when hashing very long tuples, versus
the risk of complete app failure in certain rare cases
like Steve's. I think it is worthwhile to use more
constants.

Anecdotally, I did not experience any qualitative
slowdown while exercising these things with the
unit tests. OTH, I didn't time anything, and anyway
I am using a creaky old Celeron.

Another alternative: use two constants:

 x = (len & 0x1 ? 0xd76aa471 : 0xe8c7b75f) * (++x ^ y);

That passes all of the current unit tests, including
Tim's set. However, with that I am less confident that this
bug won't come up again in the future.
msg20610 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2004-06-01 06:36
Logged In: YES 
user_id=80475

Applied a small, slightly cheaper (no second multiply)
variation on Tim's original.  It is not as scientfic
looking, but is does assure changing, odd multipliers for
each position.  When Tim moved the XOR before the multiply,
that fixed the OP's original problem.

The new variation survives Tim's torture test which I
augmented a bit and put in the test suite for the benefit of
future generations.


See:
   Objects\tupleobject.c 2.91
   Lib\test\test_tuple.py 1.3

History
Date User Action Args
2004-04-27 11:41:34smstcreate