classification
Title: Speed-up set_lookkey()
Type: performance Stage: patch review
Components: Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: pitrou, python-dev, rhettinger, serhiy.storchaka
Priority: low Keywords: patch

Created on 2015-01-31 21:37 by rhettinger, last changed 2015-05-27 21:26 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
hoisted_and_specialized.diff rhettinger, 2015-01-31 21:37 Specialized lookkey variants with and-masking hoisted out the inner loop review
hoist_mask_only.diff rhettinger, 2015-02-02 10:23 Simpler entry address computation in the nowrap case review
set_simpler_linear_probes.patch serhiy.storchaka, 2015-02-02 19:18 review
nd_lookkey_insertkey3.diff rhettinger, 2015-05-26 08:29 Split insertion logic from lookup logic review
new_set_timings.txt rhettinger, 2015-05-26 21:12 Timings for nd_lookkey_insertkey3
Messages (14)
msg235131 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-31 21:37
This patch applies three techniques to tighten-up the generated code for the lookkey() for sets.

I'm not sure I want to do this because it expands the size of the code quite a bit (much like the previously existing lookkey specializations did).  As such, it offends my sense of beauty and neatness.

That said, it greatly tightens the inner search loop and in the handful of timings I've done so far, it provides a good speed-up for smaller sets (ones that fit in L1 and L2 cache) and minor improvements for bigger sets (where running time is dominated by cost of fetching a cache-line with a key/hash pair).

The three transformations are:

1.)  Make two variants of lookkey, one with an early-out for a dummy entry and one that treats a dummy entry as an active entry.   This eliminates the freeslot tracking in both cases.  

In the first case (early-out), it frees one register (the one for the freeslot), it simplifies the inner-loop test (no need to see if freeslot is NULL), it provides an early-out (return on dummy rather than continuing to search), and it simplifies the return code (just return entry, no need to check the freeslot).

In the second case (dummies treated as active), it frees two registers (one for freeslot and the other for &dummy) and it removes the dummy/freeslot inner-loop test entirely.

That eliminated inner inner-loop code used to look like this in gcc-4.9:

    cmpq    %r15, %rbp      entry->key != dummny
    jne L375                                   |
    testq   %r13, %r13      freelslot == NULL  |
    cmove   %rbx, %r13                         |
L375:                                      <---|

And the eliminated inner loop code was even worse with clang:

    leaq    __dummy_struct(%rip), %rax
    cmpq    %rax, %r14                entry->key==dummy
    sete    %al
    testq   %rbx, %rbx                freeslot==NULL
    sete    %cl
    testb   %cl, %al
    cmovneq %r13, %rbx                freeslot ?= entry


2.) Avoid the &mask step for computing indices when not needed. Using the same technique that is used in set_insert_clean, we test to see if there is a possible wrap-around during the linear probes.  If not, we can skip the lengthy address calculation for table[(i+j)&mask] and instead use entry++.

This saves the sequentially dependent steps of add j, the bitwise-and with the mask, a shift left by four to scale the index, and the add to the table start address.  It replaces those steps with a single entry++ which codes as an add $16.

3).  In the linear_probe inner-loop for the ignore dummy case, replace (entry->key==NULL) with (entry->hash==0 && entry->key==NULL).   While that looks like a step backward adding an extra test, that second test is essentially free (a register compare and branch which is nearly 100% predictable).   The benefit is that the inner-loop only needs to fetch the entry->hash value and doesn't need to fetch entry->key.   This doesn't sound like much but it cuts the number of memory accesses per loop in half.  

Individually, three transformations don't seem like much, but they take already tight code and cut the work more than in-half (for the L1 and L2 case).   Collectively, it frees a couple of registers for better register allocation, it reduces the number of compares inside the loop, in substantially reduces the time for the entry address calculation, and in the ignore-dummy case cuts the number of memory accesses per loop in half.

The resulting code is astonishingly tight compared to the original and looks almost as tight as the code in set_insert_clean().


--- inner-loop of set_lookkey_dummy_allowed() ---
L393:
    cmpq    %r8, (%rbx)        entry->key == dummy
    je  L453
    cmpq    %r12, %rbx         entry == limit
    movq    %rbx, %r14
    je  L506
L412:
    movq    16(%r14), %rbp
    leaq    16(%r14), %rbx
    testq   %rbp, %rbp         entry->key NULL
    je  L449
    cmpq    8(%rbx), %r15      entry->hash == hash
    jne L393


--- inner-loop of set_lookkey_dummy_ignored() ---
L846:
    cmpq    %r13, %rbx         entry == limit
    je  L942
    addq    $16, %rbx          entry++
    movq    8(%rbx), %rax      entry->hash == NULL
    testq   %rax, %rax
    jne L845
    cmpq    $0, (%rbx)         entry->key == NULL (not in the inner loop)
    je  L905
L845:
    cmpq    %r12, %rax         entry->hash == tgt
    jne L846


Before anything can go forward, more timings are needed and I need to come to terms with the code expansion back to size it was before the old lookkey specializations where removed.  For now, I just want to document the work that has been done.

Anyway, here it is if someone wants to experiment with it.  Expect benefits in the cases where the inner-loop matters (small to medium sized sets that have collisions -- giant sets benefit as well but their lookup speed is dominated by the cost to fetch a single random cache line).
msg235132 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-31 21:57
One further possible transformation is to inline set_lookkey_dummy_allowed() inside set_insert_key() which is the only place it is used.  That saves all the test and branch code inside the latter (all that code does is reconstruct the exit points for set_lookkey_dummy_allowed() which were already known).
msg235147 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-01 08:50
>>> s = set()
>>> s.add(1<<64)
>>> s.add(2<<64)
>>> s.add(3<<64)
>>> list(s)
[36893488147419103232, 18446744073709551616, 55340232221128654848]
>>> s.discard(1<<64)
>>> s.discard(2<<64)
>>> s.add(3<<64)
>>> list(s)
[55340232221128654848, 55340232221128654848]
msg235189 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-02-01 18:15
Thank Serhiy.  No early-out on insertion is possible.
For discard and contains, there is still no need for
testing dummies and tracking freeslots.
msg235192 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-01 18:38
Here are more ideas. I would probe to use set_lookkey_dummy_ignored() with following set_find_free_slot() (from set_faster_copy_2.patch in issue23290) if former hadn't find a key. If set hash == -1 for key == NULL, we can use only one comparison for testing key == NULL || key == dummy.
msg235238 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-02-02 10:23
Before doing more study on the other variants, I would like to get the second transformation done (avoiding the mask computation in the case where there is no wrap-around).  Attaching a patch for just that step.
msg235265 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-02 15:26
Agree, applying simple steps one by one would be more robust. How large the benefit, do you have any timing results?
msg235269 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-02-02 16:35
New changeset 0b3bc51341aa by Raymond Hettinger in branch 'default':
Issue 23359: Tighten inner search loop for sets (don't and-mask every entry lookup).
https://hg.python.org/cpython/rev/0b3bc51341aa
msg235284 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-02 19:18
May be the code can be simplified without affecting performance if remove slower "else" branch in set_lookkey and set_insert_clean. At least I didn't find a regression in microbenchmarks, but found small benefit.

For example:

$ ./python -m timeit -s "n = 10**6; s = set(range(n, n+10)); a = list(range(n, n+10))*10**4" -- "s.intersection(a)"

Before 0b3bc51341aa: 10 loops, best of 3: 26.6 msec per loop
After 0b3bc51341aa: 10 loops, best of 3: 25.4 msec per loop
With set_simpler_linear_probes.patch: 10 loops, best of 3: 23.9 msec per loop

$ ./python -m timeit -s "n = 10**6; s = set(range(n, n+100)); a = list(range(n, n+100))*10**3" -- "s.intersection(a)"

Before 0b3bc51341aa: 10 loops, best of 3: 26.3 msec per loop
After 0b3bc51341aa: 10 loops, best of 3: 25.3 msec per loop
With set_simpler_linear_probes.patch: 10 loops, best of 3: 23.3 msec per loop

$ ./python -m timeit -s "n = 10**6; s = set(range(n, n+10)); a = list(range(n, n+10**5))" -- "s.intersection(a)"

Before 0b3bc51341aa: 100 loops, best of 3: 12 msec per loop
After 0b3bc51341aa: 100 loops, best of 3: 11.3 msec per loop
With set_simpler_linear_probes.patch: 100 loops, best of 3: 11.4 msec per loop
msg235349 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-02-03 16:16
New changeset 17cda5a92b6a by Raymond Hettinger in branch 'default':
Issue 23359:  Reduce size of code in set_lookkey. Only do linear probes when there is no wrap-around.
https://hg.python.org/cpython/rev/17cda5a92b6a
msg235350 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-02-03 16:16
Thanks Serhiy, that was brilliant.
msg244205 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-05-27 17:37
New changeset cd1e1715becd by Raymond Hettinger in branch 'default':
Issue #23359: Specialize set_lookkey intoa lookup function and an insert function.
https://hg.python.org/cpython/rev/cd1e1715becd
msg244210 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-27 18:12
Why you have added entry->hash == 0?
msg244220 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-27 21:26
[Reply to pitrou that I didn't seem to be able to make on Rietveld]

This part of the code is the most time sensitive and warrants expansion much more than other proposals (set copying, subset tests, etc).  I long aspired to split the lookup and insertion logic.  The former doesn't need dummy tracking and its split relieves the callers of doing dummy checks.  The latter needed to be more tightly integrated with set_insert_entry.  Both sets of logic have different branch prediction statistics depending on the input data.  Working going forward will be made easier for me by having the lookup and insertion logic separated.

The speed-up is modest but this part of a long series of modest speed-ups collectively adding-up to a nice boost.  The next in line is possibly adding likely/unlikely macros to improve the quality of code generation across different platforms. Overall the code size is just about the same it was in Python 2.7.
History
Date User Action Args
2015-05-27 21:26:07rhettingersetmessages: + msg244220
2015-05-27 18:12:06serhiy.storchakasetmessages: + msg244210
2015-05-27 17:37:50rhettingersetstatus: open -> closed
resolution: fixed
2015-05-27 17:37:29python-devsetmessages: + msg244205
2015-05-26 21:12:27rhettingersetfiles: + new_set_timings.txt
2015-05-26 08:29:07rhettingersetfiles: + nd_lookkey_insertkey3.diff
versions: + Python 3.6, - Python 3.5
2015-02-03 16:16:42rhettingersetmessages: + msg235350
2015-02-03 16:16:01python-devsetmessages: + msg235349
2015-02-02 19:18:12serhiy.storchakasetfiles: + set_simpler_linear_probes.patch

messages: + msg235284
2015-02-02 16:35:12python-devsetnosy: + python-dev
messages: + msg235269
2015-02-02 15:26:03serhiy.storchakasetnosy: + pitrou
messages: + msg235265
2015-02-02 10:23:51rhettingersetfiles: + hoist_mask_only.diff

messages: + msg235238
2015-02-01 18:38:28serhiy.storchakasetmessages: + msg235192
2015-02-01 18:15:25rhettingersetmessages: + msg235189
2015-02-01 08:50:59serhiy.storchakasetmessages: + msg235147
2015-01-31 21:57:23rhettingersetnosy: + serhiy.storchaka
2015-01-31 21:57:09rhettingersetmessages: + msg235132
2015-01-31 21:37:58rhettingercreate