Message 412966 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	panda1200
Recipients	panda1200
Date	2022-02-10.01:24:46
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1644456286.75.0.570469463639.issue46705@roundup.psfhosted.org>
In-reply-to

Content
I noticed that the set.issubset cpython implementation casts its iterable argument to a set. In some cases, casting the whole iterable to a set is unnecessary (see https://bugs.python.org/issue18032). Although the latter suggestion is to perform early termination, my suggestion is to use the intersection instead. # PyAnySet_Check coming from the cpython source code. def issubset(self, other): # Intersection suggestion: if not PyAnySet_Check(other): return len(self.intersection(other)) == len(self) # Usual implementation for sets. else: return ... The main advantage that this implementation has is its memory performance, using only O(min(len(self), len(other))) memory, since it never stores elements it does not need. I'm assuming that set construction costs O(n) set.__contains__ calls. This implementation uses len(other) calls to self.__contains__ and tmp.__contains__, where tmp = set(other). The current implementation uses len(self) + len(other) calls to tmp.__contains__. Thus, I suspect the current implementation only has a chance at running noticeably faster when len(self) << len(other), where it performs fewer calls to set.__contains__. This is, however, also where the proposed implementation has significantly superior memory performance.

I noticed that the set.issubset cpython implementation casts its iterable argument to a set. In some cases, casting the whole iterable to a set is unnecessary (see https://bugs.python.org/issue18032). Although the latter suggestion is to perform early termination, my suggestion is to use the intersection instead.

    # PyAnySet_Check coming from the cpython source code.
    def issubset(self, other):
        # Intersection suggestion:
        if not PyAnySet_Check(other):
            return len(self.intersection(other)) == len(self)
        # Usual implementation for sets.
        else:
            return ...

The main advantage that this implementation has is its memory performance, using only O(min(len(self), len(other))) memory, since it never stores elements it does not need.

I'm assuming that set construction costs O(n) set.__contains__ calls. This implementation uses len(other) calls to self.__contains__ and tmp.__contains__, where tmp = set(other). The current implementation uses len(self) + len(other) calls to tmp.__contains__.

Thus, I suspect the current implementation only has a chance at running noticeably faster when len(self) << len(other), where it performs fewer calls to set.__contains__. This is, however, also where the proposed implementation has significantly superior memory performance.

History
Date	User	Action	Args
2022-02-10 01:24:46	panda1200	set	recipients: + panda1200
2022-02-10 01:24:46	panda1200	set	messageid: <1644456286.75.0.570469463639.issue46705@roundup.psfhosted.org>
2022-02-10 01:24:46	panda1200	link	issue46705 messages
2022-02-10 01:24:46	panda1200	create