Issue 43464: set intersections should short-circuit

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87630

classification

Title:	set intersections should short-circuit
Type:		Stage:	resolved
Components:	C API	Versions:	Python 3.11

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	josh.r, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:	easy (C)

Created on 2021-03-10 16:39 by josh.r, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 31316	merged	serhiy.storchaka, 2022-02-13 19:55

Messages (3)
msg388442 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2021-03-10 16:39
At present, set_intersection (the C name for set.intersection) optimizes for pairs of sets by iterating the smallest set and only adding entries found in the larger, meaning work is proportionate to the smallest input. But when the other input isn't a set, it goes with a naive solution, iterating the entire non-set, and adding entries found in the set. This is fine when the intersection will end up smaller than the original set (there's no way to avoid exhausting the non-set when that's the case), but when the intersection ends up being the same size as the original, we could add a cheap length check and short-circuit at that point. As is, {4}.intersection(range(10000)) takes close to 1000 times longer than {4}.intersection(range(10)) despite both of them reaching the point where the outcome will be {4} at the same time. Since the length check for short-circuiting only needs to be performed when input set actually contains the value, the cost should be fairly low. I figure this would be the worst case for the change: {3, 4}.intersection((4,) * 10000) where it performs the length check every time, and doesn't benefit from short-circuiting. But cases like: {4}.intersection((4,) * 10000) or {4}.intersection(range(10000)) would finish much faster. A similar optimization to set_intersection_multi (to stop when the intermediate result is empty) would make cases like: {4000}.intersection([1], range(10000), range(100000, 200000)) also finish dramatically quicker in the (I'd assume not uncommon case) where the intersection of many iterables is empty, and this could be known quite early on (the cost of this check would be even lower, since it would only be performed once per iterable, not per-value). Only behavioral change this would cause is that errors resulting from processing items in an iterable that is no longer run to exhaustion due to short-circuiting wouldn't happen ({4}.intersection([4, []]) currently dies, but would succeed with short-circuiting; same foes for {4}.intersection([5], [[]]) if set_intersection_multi is optimized), and input iterators might be left only partially consumed. If that's acceptable, the required code changes are trivial.
msg413193 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-02-13 19:59
See also issue46721. set.issuperset() can get similar optimization. And set.issubset() will benefit from this optimization if use set.intersection() in issue46705.
msg416886 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-04-06 16:56
New changeset 31cd25f4e17cd68487dc76c1b2ec162a646818c2 by Serhiy Storchaka in branch 'main': bpo-43464: Optimize set.intersection() for non-set arguments (GH-31316) https://github.com/python/cpython/commit/31cd25f4e17cd68487dc76c1b2ec162a646818c2

History
Date	User	Action	Args
2022-04-11 14:59:42	admin	set	github: 87630
2022-04-06 17:03:17	serhiy.storchaka	set	status: open -> closed stage: resolved resolution: fixed versions: + Python 3.11, - Python 3.10
2022-04-06 16:56:41	serhiy.storchaka	set	messages: + msg416886
2022-02-13 19:59:21	serhiy.storchaka	set	keywords: - patch messages: + msg413193 stage: patch review -> (no value)
2022-02-13 19:55:37	serhiy.storchaka	set	keywords: + patch nosy: + serhiy.storchaka pull_requests: + pull_request29477 stage: patch review
2021-03-10 16:44:44	xtreak	set	nosy: + rhettinger
2021-03-10 16:39:19	josh.r	create