Issue 23282: Slightly faster set lookup

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67471

classification

Title:	Slightly faster set lookup
Type:	performance	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.5

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	josh.r, pitrou, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2015-01-20 10:39 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
set_faster_lookup.patch	serhiy.storchaka, 2015-01-20 10:39		review

Messages (5)
msg234367 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-01-20 10:39
Currently set_lookkey() first tests entry->key == NULL, then entry->hash == hash and entry->key != dummy, and only after that entry->key == key. Proposed patch optimizes the order of comparisons. entry->key == key is tested first as for dicts. And no need to test entry->key != dummy after entry->hash == hash if entry->hash of dummy key is set to -1. Microbenchmark which demonstrates the best case (a lot of lookups of keys existing in the set). $ ./python -m timeit -s "a = list(range(10**6)); s1 = set(a); s2 = set(a)" -- "s1 <= s2" Unpatched: 10 loops, best of 3: 39.4 msec per loop Patched: 10 loops, best of 3: 35.3 msec per loop
msg234553 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2015-01-23 11:57
Does it slow down other cases? Seems to me like dictionaries would have more "existing key lookups" than sets to justify the optimization, since they're used for attribute lookup and the like, and because you usually want the value associated with existing keys, where a set would usually be used in a "might exist, might not" scenario of membership testing.
msg234555 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-01-23 14:00
Good point Josh. Yes, it may slow down other cases, and there is a difference between sets and dicts. 1. If the key is not contained in the set, then if the first tested entry table[hash & mask] is empty, then the lookup is slowed down by one pointer comparison (2 comparisons instead of 1). If the entry is used then the lookup is not slow downed or even speed up (if need to test more than one additional entries). 2. If the same case is in the set, then if the first tested entry is the set then the lookup will use one comparison instead of 4. In other cases (the key is placed in other entry) the lookup will use at least one comparison less. 3. If the set contains key equal but not identical to tested key, then the lookup will use at least one comparison less. Only first case can cause slowdown. If we test a lot of keys not existing in the set with small ratio between used and allocated entries numbers. May be the worst case is creating a copy of the set. For other operations I did not find significant difference. $ ./python -m timeit -s "s = set(range(10**4))" -- "frozenset(s)" Unpatched: 1000 loops, best of 3: 658 usec per loop Patched: 1000 loops, best of 3: 668 usec per loop But issue23290 prevents this regression.
msg234679 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-01-25 18:33
I had intentionally moved the entry->key == key test to be after the hash test. Unlike dicts where exact key matches are norm, sets are used for deduping (cased where keys are equal but not dups). By moving the test inside the entry->key == hash test, we reduce the number of comparisons per loop for collisions and increase the predictability of the entry->key == key branch (once the hash matches, the chance of an equality match is very high. The one part I'm looking at changing is moving the dummy test after the identity key test. However, the dummy test inside the hash test is perfectly predictable (it basicly never matches) and may be cost free.
msg234683 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-01-25 19:55
What about just removing the dummy test inside the hash test? The benefit is smaller but statistically sygnificant and always positive. $ ./python -m timeit -s "a = list(range(10**6)); s1 = set(a); s2 = set(a)" -- "s1 <= s2" Unpatched: 10 loops, best of 3: 39.3 msec per loop Patched: 10 loops, best of 3: 38.7 msec per loop

History
Date	User	Action	Args
2022-04-11 14:58:12	admin	set	github: 67471
2015-01-25 19:55:50	serhiy.storchaka	set	messages: + msg234683
2015-01-25 18:33:17	rhettinger	set	status: open -> closed resolution: rejected messages: + msg234679
2015-01-23 14:00:59	serhiy.storchaka	set	messages: + msg234555
2015-01-23 11:57:16	josh.r	set	nosy: + josh.r messages: + msg234553
2015-01-23 01:42:24	rhettinger	set	assignee: rhettinger
2015-01-20 10:39:16	serhiy.storchaka	create