Issue 46705: Memory optimization for set.issubset

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90861

classification

Title:	Memory optimization for set.issubset
Type:	performance	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	panda1200, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2022-02-10 01:24 by panda1200, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description	Edit
instrument_issubset.py	rhettinger, 2022-02-10 06:05	Instrument the baseline and proposed issubset() method
instrument_issubset.py	rhettinger, 2022-02-11 05:18	Added Serhiy's difference variant

Pull Requests
URL	Status	Linked	Edit
PR 31267	open	panda1200, 2022-02-11 00:05

Messages (8)
msg412966 - (view)	Author: Jack Nguyen (panda1200) *	Date: 2022-02-10 01:24
I noticed that the set.issubset cpython implementation casts its iterable argument to a set. In some cases, casting the whole iterable to a set is unnecessary (see https://bugs.python.org/issue18032). Although the latter suggestion is to perform early termination, my suggestion is to use the intersection instead. # PyAnySet_Check coming from the cpython source code. def issubset(self, other): # Intersection suggestion: if not PyAnySet_Check(other): return len(self.intersection(other)) == len(self) # Usual implementation for sets. else: return ... The main advantage that this implementation has is its memory performance, using only O(min(len(self), len(other))) memory, since it never stores elements it does not need. I'm assuming that set construction costs O(n) set.__contains__ calls. This implementation uses len(other) calls to self.__contains__ and tmp.__contains__, where tmp = set(other). The current implementation uses len(self) + len(other) calls to tmp.__contains__. Thus, I suspect the current implementation only has a chance at running noticeably faster when len(self) << len(other), where it performs fewer calls to set.__contains__. This is, however, also where the proposed implementation has significantly superior memory performance.
msg412977 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2022-02-10 05:19
We care more about the running speed than the memory usage. Since sets only store pointers to data, they are typically smaller than the data they refer to. I've attached some instrumentation code for running experiments to verify the workload under various scenarios.
msg412978 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2022-02-10 06:03
I've run a few more experiments and this looks like a net win more often than not. Go ahead and submit a PR so we can evaluate it further.
msg412982 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-02-10 08:35
Would not testing len(self.difference(other)) == 0 be more efficient? Making a copy of a set and removing elements one by one may be faster than add elements one by one, because we only need to allocate a single chunk of memory for a set. It depends on relative values of len(self), len(other) and len(set(other)). For example, the following code may be optimal in some cases: tmp = set() it = iter(other) for item in it: # if item in self: ? tmp.add(item) if len(tmp) >= len(self): self = self.difference(tmp, it) if not self: return True self.difference_update(other) return not self else: return False # len(self) > len(set(other)) The disadvantage of such optimizations is that they make the code much more bigger. The current code is small and simple, and good enough in most cases.
msg413030 - (view)	Author: Jack Nguyen (panda1200) *	Date: 2022-02-10 23:26
As you say, which implementation performs better likely depends on the nature of the sets. I would suspect that using set.difference won't be substantially faster than using set.intersection in the best case, but it would be much slower if len(self) is much larger than len(self.intersection(other)) e.g. set(range(1_000_000)).issubset(range(10)). Overall I think that using set.intersection will have more well-rounded performance.
msg413037 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2022-02-11 05:18
> Would not testing len(self.difference(other)) == 0 > be more efficient? Yes, I think so.
msg413074 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-02-11 15:01
See also issue18032.
msg413194 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-02-13 20:01
It will be even more efficient after applying a set.intersection() optimization in issue43464.

History
Date	User	Action	Args
2022-04-11 14:59:56	admin	set	github: 90861
2022-02-13 20:01:30	serhiy.storchaka	set	messages: + msg413194
2022-02-11 15:01:10	serhiy.storchaka	set	messages: + msg413074
2022-02-11 05:18:56	rhettinger	set	files: + instrument_issubset.py messages: + msg413037
2022-02-11 00:05:17	panda1200	set	keywords: + patch stage: patch review pull_requests: + pull_request29432
2022-02-10 23:26:38	panda1200	set	messages: + msg413030
2022-02-10 08:35:24	serhiy.storchaka	set	messages: + msg412982
2022-02-10 06:05:22	rhettinger	set	files: + instrument_issubset.py
2022-02-10 06:04:49	rhettinger	set	files: - instrument_issubset.py
2022-02-10 06:03:35	rhettinger	set	files: - instrument_issubset.py
2022-02-10 06:03:26	rhettinger	set	files: + instrument_issubset.py messages: + msg412978
2022-02-10 05:19:06	rhettinger	set	files: + instrument_issubset.py messages: + msg412977
2022-02-10 03:40:27	rhettinger	set	versions: - Python 3.7, Python 3.8, Python 3.9, Python 3.10
2022-02-10 03:39:56	rhettinger	set	assignee: rhettinger
2022-02-10 01:33:49	Dennis Sweeney	set	nosy: + rhettinger, serhiy.storchaka
2022-02-10 01:24:46	panda1200	create