This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Tighten-up search loops in sets
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: pitrou, python-dev, rhettinger, serhiy.storchaka
Priority: low Keywords: patch

Created on 2015-01-19 09:15 by rhettinger, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
limit.diff rhettinger, 2015-01-19 09:15 entry++ style of looping review
timings.txt rhettinger, 2015-01-19 20:14 Initial timings
limit2.diff rhettinger, 2015-01-19 20:15 More readable limit calculation review
measure_build_set.py rhettinger, 2015-01-19 20:17 Timing suite
tight2.diff rhettinger, 2015-01-25 20:31 Move overflow decision out of the inner search loop.
tight2a.diff serhiy.storchaka, 2015-01-25 21:11 review
Messages (9)
msg234308 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-19 09:15
First draft of patch to switch from a table[(i+j)&mask] style of entry calculation to an entry++ style.  The entry computation simplifies from add/shl4/and/lea to a single add16.  To do this, the linear probes are limited to the length of table rather than wrapping-around.

I haven't timed this one yet (have just looked at the assembly) or worked through the implications.  The new approach is a little less attractive but may be easier to reason about than the mask wrap-around.  Also, it disadvantages small sets where the wrap-around property assured that the linear probes would always fine a hit without any entry being checked more than once.  There is a extra setup time before the linear probes to compute the limit.  Also, since the loop size is now variable instead of fixed, the inner loop for set_insert_clean() no longer gets unrolled by either GCC or CLANG.

With the change, the inner loop of set_insert_clean() is very tight:

L436:
        addq    $16, %rax       #, entry
entry < limit    
        cmpq    %rdx, %rax      # limit, entry
        ja      L399    #,
entry->key == NULL
        cmpq    $0, (%rax)      #,* entry
        jne     L436 

The code for lookkey() is also tight (though I can't tell why the entry->key gets loaded from memory twice):

L102:
        cmpq    %r15, %r13      # tmp170, D.12706      entry->key == dummy
        jne     L103    #,
        testq   %r12, %r12      # freeslot
        cmove   %r14, %r12      # entry -> freeslot
L103:
        addq    $16, %r14       #, entry            entry++ 
        cmpq    %r14, %rbx      # entry, limit
        jb      L146    #,
        movq    (%r14), %r13    # MEM[base: entry_65, offset: 0B], D.12706
        testq   %r13, %r13      # D.12706           entry->key == NULL
        je      L147    #,
        cmpq    %rbp, 8(%r14)   # hash, MEM[base: entry_104, offset: 8B]
        je      L148    #,                          entry->hash == hash
?       movq    (%r14), %r13    # MEM[base: entry_104, offset: 0B], D.12706
        jmp     L102    #
msg234327 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-19 20:14
Patch timings give inconsistent results.  GCC-4.9 generates faster code and CLang generates slower code :-(
msg234328 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-01-19 20:17
Either way the improvement doesn't look terrific, so I would suggest not to bother with this.
msg234685 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-25 20:31
Reopening this with a better patch that nicely tightens the inner-loop without an algorithmic change or any downside.

Currently, the entry computation adds a constant (i+j), masks it (&mask) to wrap on overflow, scales it to a table index (<< 4), and then does the key lookup (one memory access).   These steps are sequential (not parallizeable).

The patch moves the wrap-on-overflow decision out of the inner loop (adding a single, fast, predictable branch (i + LINEAR_PROBES > mask).   The inner-loop then simplifies to an add, fetch, and test (no masking and shifting).

The generated code on Clang is very tight:

LBB29_26:      
    cmpq    $0, (%rsi)       # entry->key == NULL
    je  LBB29_30
    incq    %rdx             # j++
    addq    $16, %rsi        # entry++  (done in parallel with the incq)
    cmpq    $9, %rdx         # j <= LINEAR_PROBES
    jbe LBB29_26

On gcc-4.9, the loop is unrolled and we get even better code:

    leaq    (%rbx,%rsi), %rdx   # entry = table[i]
    cmpq    $0, (%rdx)          # entry.key == NULL
    je  L223

    leaq    16(%rbx,%rsi), %rdx # entry = table[i+1];
    cmpq    $0, (%rdx)          # entry.key == NULL
    je  L223

    leaq    32(%rbx,%rsi), %rdx
    cmpq    $0, (%rdx)
    je  L223
msg234693 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-25 21:11
What timing results?

There is large chance that first tested entry is empty. May be worth to move this test outside of the loop as in the set_lookup function.

"i &= mask;" may be moved to the start of the loop.
msg234713 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-26 03:23
> May be worth to move this test outside of the loop as in the 
> set_lookup function.

The loop used to be this way but testing showed no benefit, so a while back I recombined it back to a j=0 start point and the code looked much nicer.  I don't really want to go back if I don't have to.

There may or may not be a microscopic benefit to moving the  leaq 9(%rcx), %rsi / cmpq %rsi, %rax / jae L237 sequence before the table[i+0]->key == NULL check.  I don't still have access to VTune to verify that this highly predictable branch is essentially free.   I do have a preference at this point for the simpler code -- that will make the future optimization work easier (perhaps inlining the lookkey code into set_insert_key, set_discard, and set_contains).  Also, looking at the disassembly of your patch, there is an additional cost for setting up the table[i+1] entry that wasn't there before (two new extra instructions:      leaq    1(%rcx), %rsi; salq $4, %rsi; occur before the normal lookup and test sequence: leaq 0(%rbp,%rsi), %rdx; cmpq $0, (%rdx); je  L237)

That said, if you think it is important to move the table[i+0] test outside the (i + LINEAR_PROBES <= mask) test, just say so and I'll apply your version of the patch instead of mine.  Otherwise, I prefer the cleaner code (both C and generated assembly) of my patch.
msg234715 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-26 07:21
I have no strong preferences. This code is used very rarely and I have no any example which exposes performance differences. But how large the benefit of your patch? It doesn't make the C code more clean.
msg234807 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-01-27 05:33
New changeset 0b2a3d764e63 by Raymond Hettinger in branch 'default':
Issue #23269:  Tighten search_loop in set_insert_clean()
https://hg.python.org/cpython/rev/0b2a3d764e63
msg234808 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-27 05:35
Applied Sirhiy's version of the patch but may switch to the j=0 version later.
History
Date User Action Args
2022-04-11 14:58:12adminsetgithub: 67458
2015-01-27 05:35:28rhettingersetstatus: open -> closed
resolution: fixed
messages: + msg234808
2015-01-27 05:33:58python-devsetnosy: + python-dev
messages: + msg234807
2015-01-26 07:21:37serhiy.storchakasetmessages: + msg234715
2015-01-26 03:23:16rhettingersetmessages: + msg234713
2015-01-25 21:11:20serhiy.storchakasetfiles: + tight2a.diff

messages: + msg234693
2015-01-25 20:31:31rhettingersetstatus: closed -> open
files: + tight2.diff
resolution: rejected -> (no value)
messages: + msg234685
2015-01-20 08:08:17rhettingersetstatus: open -> closed
resolution: rejected
2015-01-19 20:17:35pitrousetmessages: + msg234328
2015-01-19 20:17:07rhettingersetfiles: + measure_build_set.py
2015-01-19 20:15:18rhettingersetfiles: + limit2.diff
2015-01-19 20:14:44rhettingersetfiles: + timings.txt

messages: + msg234327
2015-01-19 09:57:10serhiy.storchakasetnosy: + pitrou, serhiy.storchaka
2015-01-19 09:15:42rhettingercreate