Issue 23269: Tighten-up search loops in sets

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67458

classification

Title:	Tighten-up search loops in sets
Type:	performance	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	pitrou, python-dev, rhettinger, serhiy.storchaka
Priority:	low	Keywords:	patch

Created on 2015-01-19 09:15 by rhettinger, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
limit.diff	rhettinger, 2015-01-19 09:15	entry++ style of looping	review
timings.txt	rhettinger, 2015-01-19 20:14	Initial timings
limit2.diff	rhettinger, 2015-01-19 20:15	More readable limit calculation	review
measure_build_set.py	rhettinger, 2015-01-19 20:17	Timing suite
tight2.diff	rhettinger, 2015-01-25 20:31	Move overflow decision out of the inner search loop.
tight2a.diff	serhiy.storchaka, 2015-01-25 21:11		review

Messages (9)
msg234308 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-19 09:15
First draft of patch to switch from a table[(i+j)&mask] style of entry calculation to an entry++ style. The entry computation simplifies from add/shl4/and/lea to a single add16. To do this, the linear probes are limited to the length of table rather than wrapping-around. I haven't timed this one yet (have just looked at the assembly) or worked through the implications. The new approach is a little less attractive but may be easier to reason about than the mask wrap-around. Also, it disadvantages small sets where the wrap-around property assured that the linear probes would always fine a hit without any entry being checked more than once. There is a extra setup time before the linear probes to compute the limit. Also, since the loop size is now variable instead of fixed, the inner loop for set_insert_clean() no longer gets unrolled by either GCC or CLANG. With the change, the inner loop of set_insert_clean() is very tight: L436: addq $16, %rax #, entry entry < limit cmpq %rdx, %rax # limit, entry ja L399 #, entry->key == NULL cmpq $0, (%rax) #,* entry jne L436 The code for lookkey() is also tight (though I can't tell why the entry->key gets loaded from memory twice): L102: cmpq %r15, %r13 # tmp170, D.12706 entry->key == dummy jne L103 #, testq %r12, %r12 # freeslot cmove %r14, %r12 # entry -> freeslot L103: addq $16, %r14 #, entry entry++ cmpq %r14, %rbx # entry, limit jb L146 #, movq (%r14), %r13 # MEM[base: entry_65, offset: 0B], D.12706 testq %r13, %r13 # D.12706 entry->key == NULL je L147 #, cmpq %rbp, 8(%r14) # hash, MEM[base: entry_104, offset: 8B] je L148 #, entry->hash == hash ? movq (%r14), %r13 # MEM[base: entry_104, offset: 0B], D.12706 jmp L102 #
msg234327 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-19 20:14
Patch timings give inconsistent results. GCC-4.9 generates faster code and CLang generates slower code :-(
msg234328 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2015-01-19 20:17
Either way the improvement doesn't look terrific, so I would suggest not to bother with this.
msg234685 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-25 20:31
Reopening this with a better patch that nicely tightens the inner-loop without an algorithmic change or any downside. Currently, the entry computation adds a constant (i+j), masks it (&mask) to wrap on overflow, scales it to a table index (<< 4), and then does the key lookup (one memory access). These steps are sequential (not parallizeable). The patch moves the wrap-on-overflow decision out of the inner loop (adding a single, fast, predictable branch (i + LINEAR_PROBES > mask). The inner-loop then simplifies to an add, fetch, and test (no masking and shifting). The generated code on Clang is very tight: LBB29_26: cmpq $0, (%rsi) # entry->key == NULL je LBB29_30 incq %rdx # j++ addq $16, %rsi # entry++ (done in parallel with the incq) cmpq $9, %rdx # j <= LINEAR_PROBES jbe LBB29_26 On gcc-4.9, the loop is unrolled and we get even better code: leaq (%rbx,%rsi), %rdx # entry = table[i] cmpq $0, (%rdx) # entry.key == NULL je L223 leaq 16(%rbx,%rsi), %rdx # entry = table[i+1]; cmpq $0, (%rdx) # entry.key == NULL je L223 leaq 32(%rbx,%rsi), %rdx cmpq $0, (%rdx) je L223
msg234693 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-01-25 21:11
What timing results? There is large chance that first tested entry is empty. May be worth to move this test outside of the loop as in the set_lookup function. "i &= mask;" may be moved to the start of the loop.
msg234713 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-26 03:23
> May be worth to move this test outside of the loop as in the > set_lookup function. The loop used to be this way but testing showed no benefit, so a while back I recombined it back to a j=0 start point and the code looked much nicer. I don't really want to go back if I don't have to. There may or may not be a microscopic benefit to moving the leaq 9(%rcx), %rsi / cmpq %rsi, %rax / jae L237 sequence before the table[i+0]->key == NULL check. I don't still have access to VTune to verify that this highly predictable branch is essentially free. I do have a preference at this point for the simpler code -- that will make the future optimization work easier (perhaps inlining the lookkey code into set_insert_key, set_discard, and set_contains). Also, looking at the disassembly of your patch, there is an additional cost for setting up the table[i+1] entry that wasn't there before (two new extra instructions: leaq 1(%rcx), %rsi; salq $4, %rsi; occur before the normal lookup and test sequence: leaq 0(%rbp,%rsi), %rdx; cmpq $0, (%rdx); je L237) That said, if you think it is important to move the table[i+0] test outside the (i + LINEAR_PROBES <= mask) test, just say so and I'll apply your version of the patch instead of mine. Otherwise, I prefer the cleaner code (both C and generated assembly) of my patch.
msg234715 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-01-26 07:21
I have no strong preferences. This code is used very rarely and I have no any example which exposes performance differences. But how large the benefit of your patch? It doesn't make the C code more clean.
msg234807 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-01-27 05:33
New changeset 0b2a3d764e63 by Raymond Hettinger in branch 'default': Issue #23269: Tighten search_loop in set_insert_clean() https://hg.python.org/cpython/rev/0b2a3d764e63
msg234808 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-27 05:35
Applied Sirhiy's version of the patch but may switch to the j=0 version later.

History
Date	User	Action	Args
2022-04-11 14:58:12	admin	set	github: 67458
2015-01-27 05:35:28	rhettinger	set	status: open -> closed resolution: fixed messages: + msg234808
2015-01-27 05:33:58	python-dev	set	nosy: + python-dev messages: + msg234807
2015-01-26 07:21:37	serhiy.storchaka	set	messages: + msg234715
2015-01-26 03:23:16	rhettinger	set	messages: + msg234713
2015-01-25 21:11:20	serhiy.storchaka	set	files: + tight2a.diff messages: + msg234693
2015-01-25 20:31:31	rhettinger	set	status: closed -> open files: + tight2.diff resolution: rejected -> (no value) messages: + msg234685
2015-01-20 08:08:17	rhettinger	set	status: open -> closed resolution: rejected
2015-01-19 20:17:35	pitrou	set	messages: + msg234328
2015-01-19 20:17:07	rhettinger	set	files: + measure_build_set.py
2015-01-19 20:15:18	rhettinger	set	files: + limit2.diff
2015-01-19 20:14:44	rhettinger	set	files: + timings.txt messages: + msg234327
2015-01-19 09:57:10	serhiy.storchaka	set	nosy: + pitrou, serhiy.storchaka
2015-01-19 09:15:42	rhettinger	create