This patch applies three techniques to tighten up the generated code for lookkey() for sets.
I'm not sure I want to do this because it expands the size of the code quite a bit (much like the previously existing lookkey specializations did). As such, it offends my sense of beauty and neatness.
That said, it greatly tightens the inner search loop, and in the handful of timings I've done so far, it provides a good speed-up for smaller sets (ones that fit in L1 and L2 cache) and minor improvements for bigger sets (where running time is dominated by the cost of fetching a cache line with a key/hash pair).
The three transformations are:
1.) Make two variants of lookkey, one with an early-out for a dummy entry and one that treats a dummy entry as an active entry. This eliminates the freeslot tracking in both cases.
In the first case (early-out), it frees one register (the one for the freeslot), it simplifies the inner-loop test (no need to see if freeslot is NULL), it provides an early-out (return on dummy rather than continuing to search), and it simplifies the return code (just return entry, no need to check the freeslot).
In the second case (dummies treated as active), it frees two registers (one for freeslot and the other for &dummy) and it removes the dummy/freeslot inner-loop test entirely.
The eliminated inner-loop code used to look like this with gcc-4.9:
cmpq %r15, %rbp entry->key != dummy
jne L375 |
testq %r13, %r13 freeslot == NULL |
cmove %rbx, %r13 |
L375: <---|
And the eliminated inner-loop code was even worse with clang:
leaq __dummy_struct(%rip), %rax
cmpq %rax, %r14 entry->key==dummy
sete %al
testq %rbx, %rbx freeslot==NULL
sete %cl
testb %cl, %al
cmovneq %r13, %rbx freeslot ?= entry
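For concreteness, here is a minimal C sketch of the two variants. The stand-in types, the pointer-identity key comparison, and the simple linear probe are illustrative assumptions for this sketch, not the actual patch:

--- C sketch of the two lookkey variants (illustrative only) ---

#include <stddef.h>

typedef long Py_hash_t;                               /* stand-in type */
typedef struct { void *key; Py_hash_t hash; } setentry;

static void *dummy = &dummy;          /* stand-in for the dummy sentinel */

/* Variant 1: a dummy slot is an acceptable answer (e.g. for inserts),
   so the probe returns the moment it sees one.  No freeslot tracking. */
static setentry *
lookkey_dummy_allowed(setentry *table, size_t mask,
                      void *key, Py_hash_t hash)
{
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];
    while (entry->key != NULL) {
        if (entry->hash == hash && entry->key == key)
            return entry;                  /* active match */
        if (entry->key == dummy)
            return entry;                  /* early-out: reuse dummy slot */
        i = (i + 1) & mask;                /* simple probe for the sketch */
        entry = &table[i];
    }
    return entry;                          /* empty slot */
}

/* Variant 2: dummies are just non-matches, so the loop needs neither
   freeslot nor the &dummy constant -- two registers freed. */
static setentry *
lookkey_dummy_ignored(setentry *table, size_t mask,
                      void *key, Py_hash_t hash)
{
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];
    while (entry->key != NULL) {
        if (entry->hash == hash && entry->key == key)
            return entry;                  /* active match */
        i = (i + 1) & mask;                /* dummies fall through here */
        entry = &table[i];
    }
    return entry;                          /* not found */
}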
2.) Avoid the &mask step for computing indices when not needed. Using the same technique that is used in set_insert_clean, we test to see if there is a possible wrap-around during the linear probes. If not, we can skip the lengthy address calculation for table[(i+j)&mask] and instead use entry++.
This saves the sequentially dependent steps of adding j, bitwise-anding with the mask, shifting left by four to scale the index, and adding the table start address. It replaces all of those steps with a single entry++, which codes as an add of $16.
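Here is a minimal sketch of that wrap-around test, reusing the stand-in setentry from the sketch above; the LINEAR_PROBES value and the bare hash-match test are assumptions for illustration:

--- C sketch of the wrap-around test (illustrative only) ---

#include <stddef.h>

typedef long Py_hash_t;                               /* stand-in, as above */
typedef struct { void *key; Py_hash_t hash; } setentry;

#define LINEAR_PROBES 9       /* assumed window size, as in set_insert_clean */

static setentry *
probe_window(setentry *table, size_t mask, size_t i, Py_hash_t hash)
{
    setentry *entry;
    if (i + LINEAR_PROBES <= mask) {
        /* No wrap-around possible: the whole window stays in range, so
           each step is a single entry++ (one add of 16) instead of the
           full table[(i + j) & mask] address computation. */
        entry = &table[i];
        for (int j = 0; j < LINEAR_PROBES; j++) {
            entry++;
            if (entry->hash == hash)       /* illustrative match test */
                return entry;
        }
    }
    else {
        /* The window may wrap past the end of the table: fall back to
           the masked address computation for each probe. */
        for (int j = 1; j <= LINEAR_PROBES; j++) {
            entry = &table[(i + j) & mask];
            if (entry->hash == hash)
                return entry;
        }
    }
    return NULL;     /* caller perturbs i and probes the next window */
}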
3.) In the linear-probe inner loop for the ignore-dummy case, replace (entry->key==NULL) with (entry->hash==0 && entry->key==NULL). While that looks like a step backward (an extra test), the second test is essentially free (a register compare and branch that is nearly 100% predictable). The benefit is that the inner loop only needs to fetch entry->hash and doesn't need to fetch entry->key. That doesn't sound like much, but it cuts the number of memory accesses per loop in half.
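A minimal sketch of that hash-first test, again with the stand-in types; it assumes deleted slots carry a sentinel hash such as -1 that never equals a live hash (an assumption of this sketch), so dummies are skipped without ever touching entry->key:

--- C sketch of the hash-first empty test (illustrative only) ---

#include <stddef.h>

typedef long Py_hash_t;                               /* stand-in, as above */
typedef struct { void *key; Py_hash_t hash; } setentry;

static setentry *
lookkey_hash_first(setentry *table, size_t mask, void *key, Py_hash_t hash)
{
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];
    for (;;) {
        /* Hot path touches only entry->hash: one memory access per loop. */
        if (entry->hash == hash && entry->key == key)
            return entry;                   /* match (identity-only test
                                               in this sketch) */
        /* entry->key is fetched only in the rare hash == 0 case, to tell
           a truly empty slot from an active key whose hash happens to
           be 0.  That extra branch is nearly 100% predictable. */
        if (entry->hash == 0 && entry->key == NULL)
            return NULL;                    /* empty slot: not present */
        i = (i + 1) & mask;                 /* simple probe for the sketch */
        entry = &table[i];
    }
}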
Individually, the three transformations don't seem like much, but they take already tight code and cut the work by more than half (for the L1 and L2 cases). Collectively, they free a couple of registers for better register allocation, reduce the number of compares inside the loop, substantially reduce the time for the entry address calculation, and in the ignore-dummy case cut the number of memory accesses per loop in half.
The resulting code is astonishingly tight compared to the original and looks almost as tight as the code in set_insert_clean().
--- inner-loop of set_lookkey_dummy_allowed() ---
L393:
cmpq %r8, (%rbx) entry->key == dummy
je L453
cmpq %r12, %rbx entry == limit
movq %rbx, %r14
je L506
L412:
movq 16(%r14), %rbp
leaq 16(%r14), %rbx
testq %rbp, %rbp entry->key == NULL
je L449
cmpq 8(%rbx), %r15 entry->hash == hash
jne L393
--- inner-loop of set_lookkey_dummy_ignored() ---
L846:
cmpq %r13, %rbx entry == limit
je L942
addq $16, %rbx entry++
movq 8(%rbx), %rax entry->hash == 0
testq %rax, %rax
jne L845
cmpq $0, (%rbx) entry->key == NULL (not in the inner loop)
je L905
L845:
cmpq %r12, %rax entry->hash == tgt
jne L846
Before anything can go forward, more timings are needed, and I need to come to terms with the code expansion back to the size it was before the old lookkey specializations were removed. For now, I just want to document the work that has been done.
Anyway, here it is if someone wants to experiment with it. Expect benefits in the cases where the inner loop matters (small to medium sized sets that have collisions -- giant sets benefit as well, but their lookup speed is dominated by the cost of fetching a single random cache line).