This patch applies three techniques to tighten up the generated code for lookkey() for sets.
I'm not sure I want to do this because it expands the size of the code quite a bit (much like the previously existing lookkey specializations did). As such, it offends my sense of beauty and neatness.
That said, it greatly tightens the inner search loop, and in the handful of timings I've done so far, it provides a good speed-up for smaller sets (ones that fit in L1 and L2 cache) and minor improvements for bigger sets (where running time is dominated by the cost of fetching a cache line with a key/hash pair).
The three transformations are:
1.) Make two variants of lookkey, one with an early-out for a dummy entry and one that treats a dummy entry as an active entry. This eliminates the freeslot tracking in both cases.
In the first case (early-out), it frees one register (the one for the freeslot), it simplifies the inner-loop test (no need to see if freeslot is NULL), it provides an early-out (return on dummy rather than continuing to search), and it simplifies the return code (just return entry, no need to check the freeslot).
In the second case (dummies treated as active), it frees two registers (one for freeslot and the other for &dummy) and it removes the dummy/freeslot inner-loop test entirely.
The eliminated inner-loop code used to look like this with gcc-4.9:
cmpq %r15, %rbp entry->key != dummy
jne L375 |
testq %r13, %r13 freeslot == NULL |
cmove %rbx, %r13 |
L375: <---|
And the eliminated inner-loop code was even worse with clang:
leaq __dummy_struct(%rip), %rax
cmpq %rax, %r14 entry->key==dummy
sete %al
testq %rbx, %rbx freeslot==NULL
sete %cl
testb %cl, %al
cmovneq %r13, %rbx freeslot ?= entry
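For concreteness, here is a minimal C sketch of the two variants. The stand-in types, the pointer-identity key comparison, and the simple linear probe are illustrative assumptions for this sketch, not the actual patch:

--- C sketch of the two lookkey variants (illustrative only) ---

#include <stddef.h>

typedef long Py_hash_t;                               /* stand-in type */
typedef struct { void *key; Py_hash_t hash; } setentry;

static void *dummy = &dummy;          /* stand-in for the dummy sentinel */

/* Variant 1: a dummy slot is an acceptable answer (e.g. for inserts),
   so the probe returns the moment it sees one.  No freeslot tracking. */
static setentry *
lookkey_dummy_allowed(setentry *table, size_t mask,
                      void *key, Py_hash_t hash)
{
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];
    while (entry->key != NULL) {
        if (entry->hash == hash && entry->key == key)
            return entry;                  /* active match */
        if (entry->key == dummy)
            return entry;                  /* early-out: reuse dummy slot */
        i = (i + 1) & mask;                /* simple probe for the sketch */
        entry = &table[i];
    }
    return entry;                          /* empty slot */
}

/* Variant 2: dummies are just non-matches, so the loop needs neither
   freeslot nor the &dummy constant -- two registers freed. */
static setentry *
lookkey_dummy_ignored(setentry *table, size_t mask,
                      void *key, Py_hash_t hash)
{
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];
    while (entry->key != NULL) {
        if (entry->hash == hash && entry->key == key)
            return entry;                  /* active match */
        i = (i + 1) & mask;                /* dummies fall through here */
        entry = &table[i];
    }
    return entry;                          /* not found */
}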
2.) Avoid the &mask step for computing indices when not needed. Using the same technique that is used in set_insert_clean, we test to see if there is a possible wrap-around during the linear probes. If not, we can skip the lengthy address calculation for table[(i+j)&mask] and instead use entry++.
This saves the sequentially dependent steps of adding j, bitwise-anding with the mask, shifting left by four to scale the index, and adding the table start address. It replaces all of those steps with a single entry++, which codes as an add of $16.
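Here is a minimal sketch of that wrap-around test, reusing the stand-in setentry from the sketch above; the LINEAR_PROBES value and the bare hash-match test are assumptions for illustration:

--- C sketch of the wrap-around test (illustrative only) ---

#include <stddef.h>

typedef long Py_hash_t;                               /* stand-in, as above */
typedef struct { void *key; Py_hash_t hash; } setentry;

#define LINEAR_PROBES 9       /* assumed window size, as in set_insert_clean */

static setentry *
probe_window(setentry *table, size_t mask, size_t i, Py_hash_t hash)
{
    setentry *entry;
    if (i + LINEAR_PROBES <= mask) {
        /* No wrap-around possible: the whole window stays in range, so
           each step is a single entry++ (one add of 16) instead of the
           full table[(i + j) & mask] address computation. */
        entry = &table[i];
        for (int j = 0; j < LINEAR_PROBES; j++) {
            entry++;
            if (entry->hash == hash)       /* illustrative match test */
                return entry;
        }
    }
    else {
        /* The window may wrap past the end of the table: fall back to
           the masked address computation for each probe. */
        for (int j = 1; j <= LINEAR_PROBES; j++) {
            entry = &table[(i + j) & mask];
            if (entry->hash == hash)
                return entry;
        }
    }
    return NULL;     /* caller perturbs i and probes the next window */
}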
3.) In the linear-probe inner loop for the ignore-dummy case, replace (entry->key==NULL) with (entry->hash==0 && entry->key==NULL). While that looks like a step backward (an extra test), the second test is essentially free (a register compare and branch that is nearly 100% predictable). The benefit is that the inner loop only needs to fetch entry->hash and doesn't need to fetch entry->key. That doesn't sound like much, but it cuts the number of memory accesses per loop in half.
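A minimal sketch of that hash-first test, again with the stand-in types; it assumes deleted slots carry a sentinel hash such as -1 that never equals a live hash (an assumption of this sketch), so dummies are skipped without ever touching entry->key:

--- C sketch of the hash-first empty test (illustrative only) ---

#include <stddef.h>

typedef long Py_hash_t;                               /* stand-in, as above */
typedef struct { void *key; Py_hash_t hash; } setentry;

static setentry *
lookkey_hash_first(setentry *table, size_t mask, void *key, Py_hash_t hash)
{
    size_t i = (size_t)hash & mask;
    setentry *entry = &table[i];
    for (;;) {
        /* Hot path touches only entry->hash: one memory access per loop. */
        if (entry->hash == hash && entry->key == key)
            return entry;                   /* match (identity-only test
                                               in this sketch) */
        /* entry->key is fetched only in the rare hash == 0 case, to tell
           a truly empty slot from an active key whose hash happens to
           be 0.  That extra branch is nearly 100% predictable. */
        if (entry->hash == 0 && entry->key == NULL)
            return NULL;                    /* empty slot: not present */
        i = (i + 1) & mask;                 /* simple probe for the sketch */
        entry = &table[i];
    }
}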
Individually, the three transformations don't seem like much, but they take already tight code and cut the work by more than half (for the L1 and L2 cases). Collectively, they free a couple of registers for better register allocation, reduce the number of compares inside the loop, substantially reduce the time for the entry address calculation, and in the ignore-dummy case cut the number of memory accesses per loop in half.
The resulting code is astonishingly tight compared to the original and looks almost as tight as the code in set_insert_clean().
--- inner-loop of set_lookkey_dummy_allowed() ---
L393:
cmpq %r8, (%rbx) entry->key == dummy
je L453
cmpq %r12, %rbx entry == limit
movq %rbx, %r14
je L506
L412:
movq 16(%r14), %rbp
leaq 16(%r14), %rbx
testq %rbp, %rbp entry->key == NULL
je L449
cmpq 8(%rbx), %r15 entry->hash == hash
jne L393
--- inner-loop of set_lookkey_dummy_ignored() ---
L846:
cmpq %r13, %rbx entry == limit
je L942
addq $16, %rbx entry++
movq 8(%rbx), %rax entry->hash == 0
testq %rax, %rax
jne L845
cmpq $0, (%rbx) entry->key == NULL (not in the inner loop)
je L905
L845:
cmpq %r12, %rax entry->hash == tgt
jne L846
Before anything can go forward, more timings are needed, and I need to come to terms with the code expansion back to the size it was before the old lookkey specializations were removed. For now, I just want to document the work that has been done.
Anyway, here it is if someone wants to experiment with it. Expect benefits in the cases where the inner loop matters (small to medium sized sets that have collisions -- giant sets benefit as well, but their lookup speed is dominated by the cost of fetching a single random cache line).