Issue 23359: Speed-up set_lookkey()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67548

classification

Title:	Speed-up set_lookkey()
Type:	performance	Stage:	patch review
Components:		Versions:	Python 3.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	pitrou, python-dev, rhettinger, serhiy.storchaka
Priority:	low	Keywords:	patch

Created on 2015-01-31 21:37 by rhettinger, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
hoisted_and_specialized.diff	rhettinger, 2015-01-31 21:37	Specialized lookkey variants with and-masking hoisted out the inner loop	review
hoist_mask_only.diff	rhettinger, 2015-02-02 10:23	Simpler entry address computation in the nowrap case	review
set_simpler_linear_probes.patch	serhiy.storchaka, 2015-02-02 19:18		review
nd_lookkey_insertkey3.diff	rhettinger, 2015-05-26 08:29	Split insertion logic from lookup logic	review
new_set_timings.txt	rhettinger, 2015-05-26 21:12	Timings for nd_lookkey_insertkey3

Messages (14)
msg235131 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-31 21:37
This patch applies three techniques to tighten-up the generated code for the lookkey() for sets. I'm not sure I want to do this because it expands the size of the code quite a bit (much like the previously existing lookkey specializations did). As such, it offends my sense of beauty and neatness. That said, it greatly tightens the inner search loop and in the handful of timings I've done so far, it provides a good speed-up for smaller sets (ones that fit in L1 and L2 cache) and minor improvements for bigger sets (where running time is dominated by cost of fetching a cache-line with a key/hash pair). The three transformations are: 1.) Make two variants of lookkey, one with an early-out for a dummy entry and one that treats a dummy entry as an active entry. This eliminates the freeslot tracking in both cases. In the first case (early-out), it frees one register (the one for the freeslot), it simplifies the inner-loop test (no need to see if freeslot is NULL), it provides an early-out (return on dummy rather than continuing to search), and it simplifies the return code (just return entry, no need to check the freeslot). In the second case (dummies treated as active), it frees two registers (one for freeslot and the other for &dummy) and it removes the dummy/freeslot inner-loop test entirely. That eliminated inner inner-loop code used to look like this in gcc-4.9: cmpq %r15, %rbp entry->key != dummny jne L375 \| testq %r13, %r13 freelslot == NULL \| cmove %rbx, %r13 \| L375: <---\| And the eliminated inner loop code was even worse with clang: leaq __dummy_struct(%rip), %rax cmpq %rax, %r14 entry->key==dummy sete %al testq %rbx, %rbx freeslot==NULL sete %cl testb %cl, %al cmovneq %r13, %rbx freeslot ?= entry 2.) Avoid the &mask step for computing indices when not needed. Using the same technique that is used in set_insert_clean, we test to see if there is a possible wrap-around during the linear probes. If not, we can skip the lengthy address calculation for table[(i+j)&mask] and instead use entry++. This saves the sequentially dependent steps of add j, the bitwise-and with the mask, a shift left by four to scale the index, and the add to the table start address. It replaces those steps with a single entry++ which codes as an add $16. 3). In the linear_probe inner-loop for the ignore dummy case, replace (entry->key==NULL) with (entry->hash==0 && entry->key==NULL). While that looks like a step backward adding an extra test, that second test is essentially free (a register compare and branch which is nearly 100% predictable). The benefit is that the inner-loop only needs to fetch the entry->hash value and doesn't need to fetch entry->key. This doesn't sound like much but it cuts the number of memory accesses per loop in half. Individually, three transformations don't seem like much, but they take already tight code and cut the work more than in-half (for the L1 and L2 case). Collectively, it frees a couple of registers for better register allocation, it reduces the number of compares inside the loop, in substantially reduces the time for the entry address calculation, and in the ignore-dummy case cuts the number of memory accesses per loop in half. The resulting code is astonishingly tight compared to the original and looks almost as tight as the code in set_insert_clean(). --- inner-loop of set_lookkey_dummy_allowed() --- L393: cmpq %r8, (%rbx) entry->key == dummy je L453 cmpq %r12, %rbx entry == limit movq %rbx, %r14 je L506 L412: movq 16(%r14), %rbp leaq 16(%r14), %rbx testq %rbp, %rbp entry->key NULL je L449 cmpq 8(%rbx), %r15 entry->hash == hash jne L393 --- inner-loop of set_lookkey_dummy_ignored() --- L846: cmpq %r13, %rbx entry == limit je L942 addq $16, %rbx entry++ movq 8(%rbx), %rax entry->hash == NULL testq %rax, %rax jne L845 cmpq $0, (%rbx) entry->key == NULL (not in the inner loop) je L905 L845: cmpq %r12, %rax entry->hash == tgt jne L846 Before anything can go forward, more timings are needed and I need to come to terms with the code expansion back to size it was before the old lookkey specializations where removed. For now, I just want to document the work that has been done. Anyway, here it is if someone wants to experiment with it. Expect benefits in the cases where the inner-loop matters (small to medium sized sets that have collisions -- giant sets benefit as well but their lookup speed is dominated by the cost to fetch a single random cache line).
msg235132 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-31 21:57
One further possible transformation is to inline set_lookkey_dummy_allowed() inside set_insert_key() which is the only place it is used. That saves all the test and branch code inside the latter (all that code does is reconstruct the exit points for set_lookkey_dummy_allowed() which were already known).
msg235147 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-02-01 08:50
>>> s = set() >>> s.add(1<<64) >>> s.add(2<<64) >>> s.add(3<<64) >>> list(s) [36893488147419103232, 18446744073709551616, 55340232221128654848] >>> s.discard(1<<64) >>> s.discard(2<<64) >>> s.add(3<<64) >>> list(s) [55340232221128654848, 55340232221128654848]
msg235189 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-02-01 18:15
Thank Serhiy. No early-out on insertion is possible. For discard and contains, there is still no need for testing dummies and tracking freeslots.
msg235192 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-02-01 18:38
Here are more ideas. I would probe to use set_lookkey_dummy_ignored() with following set_find_free_slot() (from set_faster_copy_2.patch in issue23290) if former hadn't find a key. If set hash == -1 for key == NULL, we can use only one comparison for testing key == NULL \|\| key == dummy.
msg235238 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-02-02 10:23
Before doing more study on the other variants, I would like to get the second transformation done (avoiding the mask computation in the case where there is no wrap-around). Attaching a patch for just that step.
msg235265 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-02-02 15:26
Agree, applying simple steps one by one would be more robust. How large the benefit, do you have any timing results?
msg235269 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-02-02 16:35
New changeset 0b3bc51341aa by Raymond Hettinger in branch 'default': Issue 23359: Tighten inner search loop for sets (don't and-mask every entry lookup). https://hg.python.org/cpython/rev/0b3bc51341aa
msg235284 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-02-02 19:18
May be the code can be simplified without affecting performance if remove slower "else" branch in set_lookkey and set_insert_clean. At least I didn't find a regression in microbenchmarks, but found small benefit. For example: $ ./python -m timeit -s "n = 10*6; s = set(range(n, n+10)); a = list(range(n, n+10))104" -- "s.intersection(a)" Before 0b3bc51341aa: 10 loops, best of 3: 26.6 msec per loop After 0b3bc51341aa: 10 loops, best of 3: 25.4 msec per loop With set_simpler_linear_probes.patch: 10 loops, best of 3: 23.9 msec per loop $ ./python -m timeit -s "n = 106; s = set(range(n, n+100)); a = list(range(n, n+100))103" -- "s.intersection(a)" Before 0b3bc51341aa: 10 loops, best of 3: 26.3 msec per loop After 0b3bc51341aa: 10 loops, best of 3: 25.3 msec per loop With set_simpler_linear_probes.patch: 10 loops, best of 3: 23.3 msec per loop $ ./python -m timeit -s "n = 106; s = set(range(n, n+10)); a = list(range(n, n+10*5))" -- "s.intersection(a)" Before 0b3bc51341aa: 100 loops, best of 3: 12 msec per loop After 0b3bc51341aa: 100 loops, best of 3: 11.3 msec per loop With set_simpler_linear_probes.patch: 100 loops, best of 3: 11.4 msec per loop
msg235349 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-02-03 16:16
New changeset 17cda5a92b6a by Raymond Hettinger in branch 'default': Issue 23359: Reduce size of code in set_lookkey. Only do linear probes when there is no wrap-around. https://hg.python.org/cpython/rev/17cda5a92b6a
msg235350 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-02-03 16:16
Thanks Serhiy, that was brilliant.
msg244205 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-05-27 17:37
New changeset cd1e1715becd by Raymond Hettinger in branch 'default': Issue #23359: Specialize set_lookkey intoa lookup function and an insert function. https://hg.python.org/cpython/rev/cd1e1715becd
msg244210 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-05-27 18:12
Why you have added entry->hash == 0?
msg244220 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-05-27 21:26
[Reply to pitrou that I didn't seem to be able to make on Rietveld] This part of the code is the most time sensitive and warrants expansion much more than other proposals (set copying, subset tests, etc). I long aspired to split the lookup and insertion logic. The former doesn't need dummy tracking and its split relieves the callers of doing dummy checks. The latter needed to be more tightly integrated with set_insert_entry. Both sets of logic have different branch prediction statistics depending on the input data. Working going forward will be made easier for me by having the lookup and insertion logic separated. The speed-up is modest but this part of a long series of modest speed-ups collectively adding-up to a nice boost. The next in line is possibly adding likely/unlikely macros to improve the quality of code generation across different platforms. Overall the code size is just about the same it was in Python 2.7.

History
Date	User	Action	Args
2022-04-11 14:58:12	admin	set	github: 67548
2015-05-27 21:26:07	rhettinger	set	messages: + msg244220
2015-05-27 18:12:06	serhiy.storchaka	set	messages: + msg244210
2015-05-27 17:37:50	rhettinger	set	status: open -> closed resolution: fixed
2015-05-27 17:37:29	python-dev	set	messages: + msg244205
2015-05-26 21:12:27	rhettinger	set	files: + new_set_timings.txt
2015-05-26 08:29:07	rhettinger	set	files: + nd_lookkey_insertkey3.diff versions: + Python 3.6, - Python 3.5
2015-02-03 16:16:42	rhettinger	set	messages: + msg235350
2015-02-03 16:16:01	python-dev	set	messages: + msg235349
2015-02-02 19:18:12	serhiy.storchaka	set	files: + set_simpler_linear_probes.patch messages: + msg235284
2015-02-02 16:35:12	python-dev	set	nosy: + python-dev messages: + msg235269
2015-02-02 15:26:03	serhiy.storchaka	set	nosy: + pitrou messages: + msg235265
2015-02-02 10:23:51	rhettinger	set	files: + hoist_mask_only.diff messages: + msg235238
2015-02-01 18:38:28	serhiy.storchaka	set	messages: + msg235192
2015-02-01 18:15:25	rhettinger	set	messages: + msg235189
2015-02-01 08:50:59	serhiy.storchaka	set	messages: + msg235147
2015-01-31 21:57:23	rhettinger	set	nosy: + serhiy.storchaka
2015-01-31 21:57:09	rhettinger	set	messages: + msg235132
2015-01-31 21:37:58	rhettinger	create