Title: dict: improve lookup function
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.7
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Dmitry Rubanovich, inada.naoki, rhettinger, serhiy.storchaka, tim.peters, xiang.zhang
Priority: normal Keywords:

Created on 2017-06-15 07:25 by Dmitry Rubanovich, last changed 2017-06-24 21:05 by tim.peters. This issue is now closed.

File name Uploaded Description Edit Dmitry Rubanovich, 2017-06-15 07:25 Dmitry Rubanovich, 2017-06-16 02:44 tim.peters, 2017-06-18 04:25 emulate the dict lookup code in Python Dmitry Rubanovich, 2017-06-19 03:27
Messages (18)
msg296072 - (view) Author: Dmitry Rubanovich (Dmitry Rubanovich) * Date: 2017-06-15 07:25
lookdict_index() (and the rest of the files in dictobject.c) are using unnecessarily complicated perturb mechanism.  And, in fact, it's slower than the simpler case.

Instead of this:

for (size_t perturb = hash;;) {
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);

it should do this:

for (size_t perturb = hash;;) {
    i = mask & ((i << 1) + perturb + 1);
    perturb >>= PERTURB_SHIFT;

This would not only save an instruction (a minor issue), but it would also reduce collisions.

I've attached a file which calculates frequencies of collisions for demonstration purposes.  It shows that the calculation, as it stands right now, does not create a 1-1 mapping even on the 1st iteration through the loop.  Moving PERTURB_SHIFT to the line before the calculation does reduce the density of the collision space.  But using the calculation, which I proposed, eliminates collisions on the 1st iteration completely and reduces it on most subsequent iterations.
msg296073 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-15 07:29
You may misunderstood current lookup loop.

In your, `yield i` must be added right before `while True:`
msg296138 - (view) Author: Dmitry Rubanovich (Dmitry Rubanovich) * Date: 2017-06-15 23:28
I am looking at the 3.6.1 code.  I am not trying to simulate the index lookup as the lookup function would do it.  

There is no point in that.  The point of the script is to detect collisions among perturbed secondary indexes rather than between primary hashes and secondary.  To that end, it applies the perturbed secondary hash function to **all** primary indexes and stores the results.  This allows to judge the quality of the secondary hash function.  The "old" one was bad because it automatically created collisions by multiplying by zero-divisors of the 2^k ring.  The one used in 3.6.1 is better because it  doesn't always do that, but it still multiplies by zero-divisors of the ring when h mod 2^k is equal to (h >> 5) mod 2^k.  This multiplication by zero-divisors of the 2^k ring is what produces collisions.  The script simply counts them.  

The later iterations of the loop are not very relevant.  They will almost certainly produce further collisions, but that's unavoidable.  

By the way, just as a test, I just changed the value of PERTURB_SHIFT to 1 and, in my proposed solution, it produced lower collision counts, after 3 runs of the loop, in virtually all rings (ie, mod all values of 2^k for k from 8 to 20).  Intuitively it makes sense because there is less information loss.  But I don't have a formal proof as to why that should be.
msg296140 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-16 00:39
Firstly, do you understand the lookup without perturb?
It's documented here:

>>> n = 0; L = []
>>> for i in range(16):
...   L.append(n)
...   n = (n*5+1)%8
>>> L
[0, 1, 6, 7, 4, 5, 2, 3, 0, 1, 6, 7, 4, 5, 2, 3]

>>> n = 0; L = []
>>> for i in range(16):
...   L.append(n)
...   n = (n*2+1)%8
>>> L
[0, 1, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]

So you can't use 2.
msg296147 - (view) Author: Dmitry Rubanovich (Dmitry Rubanovich) * Date: 2017-06-16 02:44
Yes, I do understand it.  

But the statement in lines 166, 167: "For any initial j in range(2**i), repeating that 2**i times generates each int in range(2**i) exactly once" does not hold when the perturb is added.  2**i times will not be enough to generate all elements of the ring when some of multipliers are zero-divisors.  

Specifically, if you use j=((5*j) + P + 1) mod 2**i, you are effectively multiplying by a zero divisors every time P = j mod 2.  And I don't mean "mod 2**i."  I do mean "mod 2."  Which means anytime P (which changes a few times and eventually becomes 0), has the same parity as j, you are multiplying by a zero-divisor.  Because P is eventually 0, you will go through all the values **eventually**.  But for all values of P for which there is a P' such that P=/=P' and ((5*j) + P + 1) = ((5*j) + P' + 1) mod 2**i, the number of times you'll need to apply the function will be higher than 2**i.  And that's in the worst case scenario.

In the best case scenario, the collision probability can be gathered from looking at the values of my script printed by print_collision_counts(py3_6_1_lookdict_perturb).  It can be as low as ~1/20 and as high as ~1/3 depending on the value of i.

The main speed up we are seeking is to avoid a collision early on.  And we are less concerned about collisions later on.  Anecdotally, if your dict has 256 buckets, then the chance of collision is 1 in ~3.5.  Which is an improvement over 1 in 2, but still pretty high.  

Ok, how about this: to avoid the edge cases, unroll the 1st secondary hash key to use j = 2*j + P + 1.  So try to test for it before the loop.  But leave 5*j + P + 1 in the loop as is.  

Although, to be honest if PERTURB_SHIFT is changed to 1 and we have a dict with a key that causes 64 collisions, this may be a good time to resize even if we are still sparse.  On the other hand, this might create an attack vector with some magic value which causes resizes of near empty dict's.  So maybe not...  Certainly not without further analysis.

BTW, and it's mostly stylistic, but except for the "*hashpos = i & mask;" line in "find_empty_slot()", applying of the mask can be moved to "dk_get_index()".  Again, I am looking at the released 3.6.1 code, so if this is already done, then never mind.

As another note, changing PERTURB_SHIFT to 1 causes a near-universal reduction in collisions (for all strategies).  So if nothing else, that's probably an improvement.
msg296156 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-16 03:21
Whatever the iteration, accept that it's necessary it reach every value in range(2**i) eventually.  The current scheme does that, for reasons already documented.  You seem to be overlooking the importance of this part:

Note that because perturb is unsigned, if the recurrence is executed often enough perturb eventually becomes and remains 0.  At that point (very rarely reached) the recurrence is on (just) 5*j+1 again, and that's certain to find an empty slot eventually (since it generates every int in range(2**i), and we make sure there's always at least one empty slot).

5 is the smallest multiplier (besides the degenerate 1) for which that's true for every power-of-2 modulus.  It doesn't matter how rare collisions are if you switch to a scheme for which it's _not_ always true:  then you risk an infinite loop failing to find an empty slot.

Also note that testing was not just done on "random" cases, but "on both normal and pathological cases".  For example, integer keys all of which have lots of trailing zero bits (there's a specific example of that too in the comments).  The smaller PERTURB_SHIFT, the longer it takes to get the high-order bits into play - and the longer it takes to fall back to a pure "5*j+1" iteration, which is the part that's necessary for _correctness_.
msg296161 - (view) Author: Dmitry Rubanovich (Dmitry Rubanovich) * Date: 2017-06-16 06:31

I am not testing randomly.  The scripts calculate secondary hash for **each** value in a ring to see how often this results in duplicate (or triplicate, etc.) values.  And they do it for each (mod 2**i) ring for i between 8 and 20.

Then (mostly as an afterthought) the scripts calculate how often each scheme results in a collision if more than 1 secondary hash index is calculated.  I used 3 secondary indexes as a demonstration of the "afterthought" part.

I do understand that the logic relies on being able to reach each value in 0..-1+2**i in the worst case.  Isn't that though accomplished by my latest proposal? It was this:

"...unroll the 1st secondary hash key to use j = 2*j + P + 1.  So try to test for it before the loop.  But leave 5*j + P + 1 in the loop as is."

In the code that would mean changing of the loop in, for example, lookdict_index() from 

for (size_t perturb = hash;;) {
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);
    ix = dk_get_index(k, i);
    if (ix == index) {
        return i;
    if (ix == DKIX_EMPTY) {
        return DKIX_EMPTY;

to this:

size_t perturb;

perturb = hash;
i = mask & ((i << 1) + perturb + 1); /* <---- key line */
ix = dk_get_index(k, i);
if (ix == index) {
    return i;
if (ix == DKIX_EMPTY) {
    return DKIX_EMPTY;

for (;;) {
    /* nothing changes in this loop */
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);
    ix = dk_get_index(k, i);
    if (ix == index) {
        return i;
    if (ix == DKIX_EMPTY) {
        return DKIX_EMPTY;

And, of course, it would mean adding the same precheck in front of all loops which go through this secondary index calculation.

This prevents preventable collisions for the hashes, h, such that

h mod 2**k is equal to (h >> 5) mod 2**k,

where 2**k is the dict size.  This frequency of such occurrence for each dict size is what's printed by 


in either of the attached scripts.  Given that, for instance, it's 91 for dict's of size 256, it seems rather ripe for improvement.
msg296162 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-16 06:57
It seems not "simplify".

Please try implement it and show benchmark result if you think
it's really worth enough.
msg296164 - (view) Author: Dmitry Rubanovich (Dmitry Rubanovich) * Date: 2017-06-16 07:02
Changing the title, to remove "simplify", since the discussion revealed that the potential improvement would have to be treated as a special case and, therefore, would not simplify the code.
msg296221 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-16 23:47
Yes, any scheme whatsoever that guarantees to visit every int in range(2**i) meets the "correctness" part.

But I concur with Inada:  "head arguments" aren't compelling here, not even if I understood what they were saying ;-)  If you implement it and demonstrate significant speedups in plausibly realistic settings, then it may be worth pursuing.  Anything short of that is at best informed speculation, and so better hashed out on, e.g., the python-ideas mailing list than on the bug tracker.

BTW, I realize you're not running tests against random data.  But you're not running tests against _any_ real data, which is worrisome.  Your code appears to be utterly uniform, thus approximating a uniform random distribution.  I have scant idea of what you believe your code shows.  For example, it only looks at "hash codes" in `range(1<<p)`, which largely misses the point of the "perturb" logic.  In real life, on most boxes these days hash codes are 64 bits (regardless of dict size), and the real perturb code eventually folds in every one of those 64 bits (if the loop goes around often enough to need all of them).  Pretending hash codes contain only `p` bits may or may not be theoretically interesting for some purpose(s), but has nothing obvious to do with how the dict code actually works.
msg296232 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-17 04:35
I totally concur with Tim.

Your script test all hash codes in range(1<<p), without duplications, but in real world low bits of the hashes often are equal (see the Birthdays problem). The perturbation allows high bits to play.
msg296259 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-18 04:25
The attached pretty much convinces me there's nothing left to be gained here.  It shows that the current scheme essentially achieves the performance of theoretical "uniform hashing" for string keys, for both successful and unsuccessful searches, as measured by the average number of probes needed.  Uniform hashing is the head scheme in which, for each key, the probe sequence is chosen uniformly at random from the set of all table slot permutations.  Its advantage is that it's feasible to analyze - its disadvantage is that it can't actually be sanely implemented ;-)

Back when the current scheme was implemented, string keys sometimes behaved "better than random", for reasons explained in the code comments.  The string hash now is more like a high-quality PRNG, so the performance of uniform hashing is the best that can be hoped for.  Dicts with int keys can still enjoy better-than-random performance.

Back then I was using a 32-bit box, and I'm pleased to see that PERTURB_SHIFT=5 still appears to work well on a 64-bit box (note:  to a first approximation you can emulate a 32-bit box on a 64-bit box by setting M64 in the code to a 32-bit mask).  But the choice appears far less touchy on a 64-bit box, and it may help a little to increase PERTURB_SHIFT.  Not so much in the average case, but for pathological cases, like int keys all of which have a lot of trailing zeroes.  Before the bits that differ get shifted down far enough to matter, they all follow the same probe sequence.  Increasing PERTURB_SHIFT would reduce the number of iterations needed before that happens.  But I'd wait until someone complains before churning the code just for that ;-)
msg296265 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-06-18 06:12
Can this be closed now?
msg296289 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-18 19:02
I don't see a reason to keep this open, but I haven't been able to follow the OP's line of argument.  My best _guess_ is that they're chasing illusions based on not understanding (or grossly undervaluing) that the primary point of the perturb logic is to incrementally fold in hash code bits that _didn't_ have any effect at all on the initial table index.  It's true that for a hash table with 2**p slots, the current scheme could be improved in several ways _if_ hash codes only had p bits.  To judge from the Python code they posted, they either (erroneously) assumed that was the case, or offhandedly dismissed the potential value of folding in hash code bits beyond the p'th most significant.

But, since I've been unable to follow the argument, I could be wrong about all that.  So if they don't clarify within a few days, I'll close this.
msg296302 - (view) Author: Dmitry Rubanovich (Dmitry Rubanovich) * Date: 2017-06-19 03:27
A few notes on my motivation and other comments I made(not necessarily in the order of significance):

1. I started looking at this because of the fix in Issue28201.  It effectively improves on the difference between 




because the before-28201 code was, in effect, multiplying by 6 instead of 5 and, therefore, always had a 1/2 chance of secondary hashes colliding.

2. What I demonstrated was that the fix, as it stands, didn't do as much magic as we would hope.  Because it produced a scheme in which **first** **secondary** hashes would collide at a rate as high as 1/3.

3. I somewhat regret extending the script to demonstrate what happens to 2nd, 3rd secondary hashes because it created the perception that I was trying to demonstrate a simulation.  I wasn't.  I was trying to demonstrate a calculation.  Well, calculations for each dictionary size up to 2**20.  

4. I detect apathy towards the fix, but I get the sense that it's mostly due to the fact that I initially didn't realize that I should have squarely limited the issue to the 1st secondary hash.  And after I made the initial error in the calculation (as pointed out by Inada Naoki) which would have produced an unstable calculation of the 2nd, 3rd, 4th, etc. secondary hashes (as pointed out by Tim), there is a lot of push back.  

But the fix I propose DOES improve on ***1st secondary hash*** because it eliminates any possibility of collision.  That is to say, no 2 1st secondary hashes will be the same for any 2 different primary hashes.  

And in the latest iteration of this proposal I did limit the scope of the enhancement to only 1st secondary hash.  As it happens, this collision actually does not occur for dictionaries of sizes 8,16,32.  They only begin to be possible for dictionaries of sizes 64 or greater.  The latest script demonstrates it.

5.  Sorry, Tim, but I should have reserved my statements to only the calculation I was certain about.  The further comments about what happens to later secondary hashes was a speculation on my part and this calculation was not revealing much about it.  If you want to consider what happens with later secondary indexes (as you do in, that's a different problem.  To reiterate, my intention was to further improve on the fix in 28201.

On the size of perturb:

6.  Since hashes of strings and other objects are pointers, and can safely be assumed to be pointers addressing the heap, they are aligned on the boundary 2 * sizeof(void*).  That's 16 in 64-bit builds.  So the last 4 bits are 0.  And (depending on how much heap is used) the higher bits of the pointers are also similar in applications which don't use a lot of memory.  

For example, in an application in which 16MiB of heap is used, only bits 35(=63-4-24) through 59(63-4) are truly random.  The rest of the bits can be assumed to not change, but they are not reached until 5th iteration of the loop, so this represents an already mildly-degenerate case in this use case (in which most dictionary keys are strings).

I haven't given much thought to which bits will have the most entropy in general, but that's going too far off topic.
msg296306 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-19 05:31
Dmitry, I suggest you spend more of the time you give to thinking to writing code instead ;-)  But, really, that's the easiest & surest way to discover for yourself where your analysis is going off in the weeds.

For example, issue 28201 was both simpler and worse than your thoughts about it:  x and y collide initially if and only if x = y mod 2**k.  If they do collide, then _regardless_ of the value of N it's necessarily the case that N*x + x + 1 = N*y + y + 1 mod 2**k too.  That is, the second probes _always_ collided (not just half the time) in the case of a first-probe collision, and this would be so regardless of whether we were multiplying by 5, 4, 1, 0, or 31459265358.  That was an unfortunate "off by 1" kind of mistake in the way the code was written.  It had nothing to do with, e.g., zero divisors.

After the obvious fix was applied, there's very little that can be said in reality.  Because, after the fix, higher-order bits of the hash codes - which had nothing at all to do with the initial x = y mod 2**k collision - are added in to the second probe values.  There's no good reason I can see to calculate what happens if those never-before-considered bits _happen_ to be all zeroes.  They might be - but so what?  They might be any pair of values in the cross product of range(2**5) with itself.  There's nothing especially interesting about the (0, 0) pair.

That's why you're getting pushback:  your analysis hasn't made sense to me, and the things you're calculating still don't appear to have much of anything to do with how collisions are actually resolved.  To the contrary, so long as you're ignoring the higher-order hash code bits (about which we can infer _nothing_ from that the first probe collided), you're ignoring the _heart_ of the collision resolution strategy.

Some other things writing code would make immediately obvious:

- Hashes of strings have nothing to do with pointers or memory addresses.  They have solely to do with the characters the string contains, and all values in range(256) show up as the last byte of string hashes.

- While hashes of pointers do, the internal `_Py_HashPointer()` rotates addresses right by 4 bits so that the predictable trailing zero bits have no effect on the first probe.  For example,

>>> a = object()
>>> id(a)    # the address
>>> hex(_)
>>> hash(a)  # the address is rotated so `0x3` is the last nibble
>>> hex(_)

Because of all that, I'm going to close this.  But if you have an idea that actually speeds things, please post a patch and reopen it!  While it's true that I don't expect to see an actual improvement, clocks usually beat arguments ;-)
msg296364 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-19 18:19
BTW, I should have spelled this out earlier:  recall that x and y collide on the first probe if and only if

    x = y (mod 2**k)   [1]

The second probe never happens unless they do collide on the first probe, so when looking at the second probe we can assume that [1] is true.  Since [1] is true,

    5*(x % 2**k) + 1 = 5*(y % 2**k) + 1

is also true (and there's nothing special here about "5" and "1" - it would hold for any multiplier and any addend).  The second probe just adds (x >> 5) % 2**k to the left side of that and (y >> 5) % 2**k to the right side of that.  Therefore the second probes collide if and only if (in addition to [1])

    x >> 5 = y >> 5 (mod 2**k)  [2]

The primary point is that any analysis that ignores the higher-order bits is missing the primary thing that matters ;-)

[2] does reveal some potential weaknesses in the scheme.  For example, if k < 5, some bits of the hash code have no effect on the probe sequence (e.g., if k==1, the first probe depends only on the hash code's last bit, and if there's collision then a second-probe collision depends only on bit 2**5 -- bits 2**1 through 2**4 are effectively ignored - but the table only has 2 slots in the k==1 case).  Decreasing PERTURB_SHIFT works against that.

OTOH, if k > 5, some of the bits that contributed to [1] being true also feed into whether [2] is true.  Increasing PERTURN_SHIFT works against that.

Setting PERTURB_SHIFT=k (i.e., make it depend on the number of slots) avoids both, but dosen't appear to actually perform better than always using 5.

The rub is that collisions just aren't a real problem under the current scheme - and shows it's doing about as well on that count as the theoretical gold standard ("uniform hashing").

If people have time to spend on dicts, I'd rather see them pursue entirely different methods, that guarantee to slash the _worst_ case number of probes. also shows that while long collision chains are rare, they do occur (and they also occur under "uniform hashing").  Then, e.g., we could drop the annoying "hash randomization" cruft.  For example, dig into "cuckoo hashing" and "Robin Hood hashing".  Just be warned that, regardless of scheme, the devil's in the details :-(
msg296789 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-24 21:05
Actually, there is something to be gained here, for smaller tables.  The simple formulas for the expected number of probes under uniform hashing are upper bounds, and are significantly overstated when the load factor is very high (not a concern for Python) or the table is small.  Using exact analysis gives smaller values in those cases, which a slow implementation of uniform hashing achieves.  The current method does not.  I'll post more about this to python-ideas.
Date User Action Args
2017-06-24 21:05:44tim.peterssetmessages: + msg296789
2017-06-19 18:19:14tim.peterssetmessages: + msg296364
2017-06-19 05:31:23tim.peterssetstatus: open -> closed
resolution: rejected
messages: + msg296306

stage: resolved
2017-06-19 03:27:42Dmitry Rubanovichsetfiles: +
type: enhancement -> performance
messages: + msg296302
2017-06-18 19:02:04tim.peterssetmessages: + msg296289
2017-06-18 06:12:04rhettingersetmessages: + msg296265
2017-06-18 04:25:54tim.peterssetfiles: +

messages: + msg296259
2017-06-17 04:35:22serhiy.storchakasetmessages: + msg296232
2017-06-16 23:47:08tim.peterssetmessages: + msg296221
2017-06-16 07:02:07Dmitry Rubanovichsetmessages: + msg296164
title: dict: simplify and improve lookup function -> dict: improve lookup function
2017-06-16 06:57:44inada.naokisetmessages: + msg296162
2017-06-16 06:31:05Dmitry Rubanovichsetmessages: + msg296161
2017-06-16 03:21:03tim.peterssetnosy: + tim.peters
messages: + msg296156
2017-06-16 02:44:21Dmitry Rubanovichsetfiles: +

messages: + msg296147
2017-06-16 00:39:25inada.naokisetmessages: + msg296140
2017-06-15 23:28:39Dmitry Rubanovichsetmessages: + msg296138
2017-06-15 07:29:58inada.naokisetmessages: + msg296073
2017-06-15 07:25:26Dmitry Rubanovichcreate