classification
Title: Type: hash collision when hash(x) == -2 causes many calls to __eq__ performance resolved Python 3.8, Python 3.7
process
Status: Resolution: closed wont fix crusaderky, hongweipeng, rhettinger, serhiy.storchaka, steven.daprano, tim.peters normal

Created on 2019-09-11 10:29 by crusaderky, last changed 2019-09-14 03:20 by tim.peters. This issue is now closed.

Messages (11)
msg351798 - (view) Author: Guido Imperiale (crusaderky) * Date: 2019-09-11 10:29
```Take two objects where hash(a) == hash(b) == -2 exactly, but a != b,
When they are inserted in a dict or set, the __eq__ method is invoked a whopping 13 times.

Does this have something to do with the special case of hash(-1) = -2?

class C:
def __init__(self, x, h):
self.x = x
self.h = h

def __eq__(self, other):
print(f"{self}.__eq__({other})")
return self.x == other.x

def __hash__(self):
print(f"{self}.__hash__")
return self.h

def __repr__(self):
return f"C({self.x, self.h})"

>>> {C(1, -2), C(2, -2)}
C((1, -2)).__hash__
C((2, -2)).__hash__
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
C((1, -2)).__eq__(C((2, -2)))
{C((1, -2)), C((2, -2))}

>>> {C(1, -3), C(1, -3)}
C((1, -3)).__hash__
C((1, -3)).__hash__
C((1, -3)).__eq__(C((1, -3)))
{C((1, -3))}

>>> {C(1, -1), C(1, -1)}
C((1, -1)).__hash__
C((1, -1)).__hash__
C((1, -1)).__eq__(C((1, -1)))

>>> {C(1, -2), C(1, -2)}
C((1, -2)).__hash__
C((1, -2)).__hash__
C((1, -2)).__eq__(C((1, -2)))
{C((1, -2))}```
msg351803 - (view) Author: Guido Imperiale (crusaderky) * Date: 2019-09-11 10:31
```Forgot a counter-example:

{C(1, 0), C(2, 0)}
C((1, 0)).__hash__
C((2, 0)).__hash__
C((1, 0)).__eq__(C((2, 0)))
{C((1, 0)), C((2, 0))}```
msg351817 - (view) Author: Steven D'Aprano (steven.daprano) * Date: 2019-09-11 11:00
```Here's a possibly simpler version which nevertheless still shows multiple calls to __eq__ (in this case, only 6, not 13):

class C(object):
def __eq__(self, other):
print('eq')
return super().__eq__(other)
def __hash__(self):
return -2

d = {-2, C()}

which outputs:
eq
eq
eq
eq
eq
eq```
msg351822 - (view) Author: Guido Imperiale (crusaderky) * Date: 2019-09-11 11:07
```Of course, the chances that in real life a custom object will have hash == -2 are very small. Different story however is for the actual numbers -1 and -2:

In [2]: %timeit {-1, -3}
64.2 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: %timeit {-1, -2}
238 ns ± 5.64 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)```
msg352046 - (view) Author: hongweipeng (hongweipeng) * Date: 2019-09-12 03:50
```More than -2, -1 -4 -8 -16 and -32 will cause many calls to __eq__.In `set_add_entry` use

```
perturb >>= PERTURB_SHIFT;
i = (i * 5 + 1 + perturb) & mask;
```
get the next index.In the example,mask is 7,perturb is -2. If i = 6, after execution, the value of i has not changed.We can do one more verification like:
```
do {
perturb >>= PERTURB_SHIFT;
} while (i == ((i * 5 + 1 + perturb) & mask));
i = (i * 5 + 1 + perturb) & mask;
```
Of course this requires tests.```
msg352049 - (view) Author: Tim Peters (tim.peters) * Date: 2019-09-12 04:54
```Note that you can contrive similar cases with positive hash codes too.  For example, force both hash codes to 2**60 - 2.

The salient points are that 0 * 5 is congruent to 0 mod any power of 2, while -2 * 5 = -10 is congruent to -2 mod 8, so they're fixed points of the `i = 5*i + 1 + perturb` iteration _given that_ perturb is congruent to -1 (so cancels out the +1) until enough shifts have been done to get 0 bits into the lowermost bits retained by the mask (in which case perturb is no longer congruent to -1, and so no longer cancels out the +1).  Since PERTURB_SHIFT is 5, on a 64-bit box it can take 13 shifts before `perturb` is entirely cleared.  As internal comments note, it's only _then_ that the recurrence is guaranteed to visit every slot exactly once.

I don't care.  It would be nice if it could guarantee to visit every slot exactly once from the start - which it _would_ do if we didn't use `perturb` at all.  But using `perturb` is extremely valuable for avoiding quadratic-time behavior in large tables in bad real cases.  The drawback is that it can stick in a fixed point for 13 shifts on a 64-bit box (7 shifts on a 32-bit box).  But that's the worst it can do, and it's rare.

I don't believe it's worth a single cycle, overall, to avoid it.```
msg352051 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2019-09-12 05:26
```[Tim Peters]
> I don't believe it's worth a single cycle, overall, to avoid it.

That makes complete sense.  Marking this one as closed.

Guido, thank for the high quality, reproducible report.

hongweipeng, thank you for the casual analysis.```
msg352206 - (view) Author: Tim Peters (tim.peters) * Date: 2019-09-12 15:49
```Something that may be slightly worth pursuing:  in

j = (5*j + 1) % 2**power

to get a full-period sequence hitting every integer in range(2**power) exactly once, the multiplier (5) is a bit delicate (it has to be congruent to 1 mod 4), but the addend (1) only needs to be odd.  When the hash code has a long string of high-order 1 bits, the lower bits of `perturb` are all ones repeatedly across iterations, and a string of `power` one bits is congruent to -1 mod 2**power, cancelling out the +1.

Which people stumbled into here:  all negative ints with small absolute value _do_ have a long string of high-order 1 bits (as did also my contrived 2**60 - 2).

If we changed the addend to, say, +3 instead, it would still be possible to contrive cases as bad, but you'd be unlikely to stumble into them by accident, and they would be sensitive to the value of `power` (the table's current size).  For example, for a table size of 8 (the smallest we use), only ints ending with bits 101 are congruent to -3 mod 8.  So maximally "bad" hash codes would be of the form (bits, where "x" is "doesn't matter", and grouped into chunks of PERTURB_SHIFT (5) bits from the right):

... xx101 xx101 ... xx101 xx101 xxxxx

Provided changing +1 to +3 didn't slow anything down (you never know until you try!), that would be fine by me.  But I'm not bothered by the current behavior, so I won't be pursuing it.```
msg352229 - (view) Author: Tim Peters (tim.peters) * Date: 2019-09-12 22:23
```A more principled change would be to replace instances of this:

i = (i*5 + perturb + 1) & mask;

with this:

i = (i*5 + (perturb << 1) + 1) & mask;

The latter spelling has no fixed points.  That's easy to see:  `(perturb << 1) + 1` is necessarily odd, and then `i*5 + ODD` is even when `i` is odd, and vice versa.

I had hoped that the time for a new shift could overlap with the multiply latency, but no such luck.  At least Visual Studio didn't multiply to begin with:  it first computes `i*4 + perturb` cheaply with a single LEA instruction, then adds 1 (INC), then adds in `i` again.

I expect it would be able to fold in a "free" shifted add of perturb with another LEA, so the latter spelling isn't necessarily more expensive.  But I'm generally loathe to increase operation counts relying on clever compilers to exploit processor-specific tricks.```
msg352232 - (view) Author: Tim Peters (tim.peters) * Date: 2019-09-13 00:47
```Following up, at least under Visual Studio for x86 "it's free" to change the code to add in `perturb` shifted left.  The C source:

perturb >>= PERTURB_SHIFT;
i = (i*5 + (perturb << 1) + 1) & mask;

compiles to this, where I added comments relating the instructions to the source code:

lea         rax,[r14+r14*4]  # rax = i*5
shr         r12,5            # r12(perturb) >>= 5
lea         r14,[r12*2+1]    # r14(i) = (perturb << 1) + 1
add         r14,rax          # r14(i) += i*5
and         r14,rbx          # r14(i) &= mask

Good job!  There are actually fewer machine instructions than C operators.

As noted, this change would eliminate the possibility of fixed points from the start, so would eliminate the symptoms in this specific bug report.

Which I don't much care about ;-)  But I _do_ care about that, in general, the early stages of collision resolution for small tables can "frequently" visit slots more than once.  This is true even if the hashes aren't identical.  And it would remain true even with the change (cycles can still exist - that there are no fixed points just means there are no more cycles of the minimum possible length 1).

Improving that is worth some effort, but not much.  In general, hash codes aren't identical, and in small tables redundant slot visits just cost the time to read the hash code from cache and re-deduce that it's not equal to what we're looking for (__eq__ isn't typically called).

In larger tables redundant slot visits in the early stages are too rare to care about.

In small tables, it's worth something to guarantee that the first probe after a collision can't hit exactly the same slot again (in a random model, there's a 12.5% chance of that happening now in the smallest tables - 12.5% is 1 in 8 - and this bug report contrived cases where the chance is 100% across a dozen iterations).

Anyone up a for a world of tedious benchmarking? ;-)```
msg352402 - (view) Author: Tim Peters (tim.peters) * Date: 2019-09-14 03:20
```Some results of the "add perturb shifted left 1 instead" approach.  These come from using an old pile of Python code I have that allows for easy investigation of different collision probe strategies.

- As expected, because it has no fixed points, there are no bad cases in a dict/set with only two elements, _unless_ you're searching for a third distinct element.  Ironically, if they all, e.g., have hash code -2, no problem at all:  then 3 probes settle it.  But if they're "random" hash codes (hashes of strings are good enough for testing this), then every now and again it can take up to 10 probes.  It all depends on accidents of how the various bits get fed in across perturb shifts.  However, with random hashes, searching for an element that's there never takes more than 2 probes in the new way (again because there's never a fixed point).  Under the current way I've seen it take as many as 6.

- On average, with random hashes and 2 elements in an 8-slot table, the average number of probes with a successful search fall from 1.07 to 1.06, and on an unsuccessful search from 1.33 to 1.30.  Theoretically ideal ("uniform hashing" - each object visits the slots in its own random permutation of slot order) are 1.06/1.29.

- Random hashes in an 8-slot table:  average # probes for successful and failing searches

1 entry
current 1.00 1.14
new     1.00 1.12
ideal   1.00 1.12

2 entries
current 1.07 1.33
new     1.06 1.30
ideal   1.06 1.29

3 entries
current 1.16 1.60
new     1.14 1.56
ideal   1.14 1.50

4 entries
current 1.27 2.00
new     1.25 1.93
ideal   1.23 1.80

5 entries
current 1.42 2.66
new     1.38 2.56
ideal   1.34 2.25

Note:  those aren't typos ;-)  For example, if there's only 1 entry, how can it take more than one probe for a failing search?  Easy!  The first probe hits the only entry, it doesn't match, and so it takes a second probe to be sure the new element isn't present.  In a perfect world, that would happen 12.5% of the time (1 in 8).  But in the presence of fixed points ("current"), it can take _more_ than 2.  The most probes I saw in "current" was 6 (it hit the only entry 5 times in a row).

- That account ends at 5 entries because we don't let an 8-slot table contain more than 5 entries.

- Small as those differences are, they get smaller for larger tables.  By the time we get to 1024 slots, three digits aren't enough to distinguish "current" from "ideal".  Even at 256 slots, it's just 1-or-2 ULP wobble.

So that's the good side:

- Eliminates the "surprises" in this bug report.

- For more realistic ordinary cases ("random" hashes), for small tables it buys small but very consistent reductions in average probes needed for both successful and failing searches.  It also, in general (but not spelled out above), cuts the rare longest probe chains.

- For small tables we're talking about nanoseconds regardless.

- That adding a new shift is "free" relies on that x86's LEA instruction supports implicitly shifting an addend, and on compilers exploiting that.

- Some cases will get worse.  That's just pragmatic fact.  The intuition here is that `perturb` is _trying_ to get high-order hash bits into play, and "current" gets 5 more into play on each iteration.  But "new" shifts perturb left 1 before adding, preventing them from having any effect at all on the last new bit.

This can have seemingly chaotic effects.  For example, with entries of the form i*1024 in a table with 2**13 slots and 5,461 entries ("full" - the most we allow),

current 3.09 5.52
new     3.22 4.78
ideal   1.65 3.00

So the change hurts successful lookups but helps failing ones.  For a full table with 16 slots, the reverse.  For a full table with 2**10 slots, it hurts both.

Bottom line?  You tell me :-)  It's just not compelling to me either way.  If I were writing the code from scratch, I'd do it.  But because of seemingly chaotic effects talked about near the end above, there's always a danger _some_ code important to someone will take a hit.  It's just as true that their code will enjoy an unexpected speed boost, but nobody screams about those ;-)```
History
Date User Action Args
2019-09-12 05:26:40rhettingersetstatus: open -> closed

nosy: + rhettinger
messages: + msg352051

resolution: wont fix
stage: resolved