Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory optimization for set.issubset #90861

Closed
SimpleArt mannequin opened this issue Feb 10, 2022 · 8 comments · Fixed by #92799
Closed

Memory optimization for set.issubset #90861

SimpleArt mannequin opened this issue Feb 10, 2022 · 8 comments · Fixed by #92799
Assignees
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@SimpleArt
Copy link
Mannequin

SimpleArt mannequin commented Feb 10, 2022

BPO 46705
Nosy @rhettinger, @serhiy-storchaka, @SimpleArt
PRs
  • bpo-46705: Memory optimization for set.issubset #31267
  • Files
  • instrument_issubset.py: Instrument the baseline and proposed issubset() method
  • instrument_issubset.py: Added Serhiy's difference variant
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = None
    created_at = <Date 2022-02-10.01:24:46.738>
    labels = ['interpreter-core', '3.11', 'performance']
    title = 'Memory optimization for set.issubset'
    updated_at = <Date 2022-02-13.20:01:30.345>
    user = 'https://github.com/SimpleArt'

    bugs.python.org fields:

    activity = <Date 2022-02-13.20:01:30.345>
    actor = 'serhiy.storchaka'
    assignee = 'rhettinger'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2022-02-10.01:24:46.738>
    creator = 'panda1200'
    dependencies = []
    files = ['50615', '50620']
    hgrepos = []
    issue_num = 46705
    keywords = ['patch']
    message_count = 8.0
    messages = ['412966', '412977', '412978', '412982', '413030', '413037', '413074', '413194']
    nosy_count = 3.0
    nosy_names = ['rhettinger', 'serhiy.storchaka', 'panda1200']
    pr_nums = ['31267']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46705'
    versions = ['Python 3.11']

    @SimpleArt
    Copy link
    Mannequin Author

    SimpleArt mannequin commented Feb 10, 2022

    I noticed that the set.issubset cpython implementation casts its iterable argument to a set. In some cases, casting the whole iterable to a set is unnecessary (see https://bugs.python.org/issue18032). Although the latter suggestion is to perform early termination, my suggestion is to use the intersection instead.

        # PyAnySet_Check coming from the cpython source code.
        def issubset(self, other):
            # Intersection suggestion:
            if not PyAnySet_Check(other):
                return len(self.intersection(other)) == len(self)
            # Usual implementation for sets.
            else:
                return ...

    The main advantage that this implementation has is its memory performance, using only O(min(len(self), len(other))) memory, since it never stores elements it does not need.

    I'm assuming that set construction costs O(n) set.__contains__ calls. This implementation uses len(other) calls to self.__contains__ and tmp.__contains__, where tmp = set(other). The current implementation uses len(self) + len(other) calls to tmp.__contains__.

    Thus, I suspect the current implementation only has a chance at running noticeably faster when len(self) << len(other), where it performs fewer calls to set.__contains__. This is, however, also where the proposed implementation has significantly superior memory performance.

    @SimpleArt SimpleArt mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 10, 2022
    @rhettinger rhettinger self-assigned this Feb 10, 2022
    @rhettinger rhettinger removed 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels Feb 10, 2022
    @rhettinger
    Copy link
    Contributor

    We care more about the running speed than the memory usage. Since sets only store pointers to data, they are typically smaller than the data they refer to.

    I've attached some instrumentation code for running experiments to verify the workload under various scenarios.

    @rhettinger
    Copy link
    Contributor

    I've run a few more experiments and this looks like a net win more often than not. Go ahead and submit a PR so we can evaluate it further.

    @serhiy-storchaka
    Copy link
    Member

    Would not testing len(self.difference(other)) == 0 be more efficient? Making a copy of a set and removing elements one by one may be faster than add elements one by one, because we only need to allocate a single chunk of memory for a set.

    It depends on relative values of len(self), len(other) and len(set(other)). For example, the following code may be optimal in some cases:

        tmp = set()
        it = iter(other)
        for item in it:
            # if item in self: ?
            tmp.add(item)
            if len(tmp) >= len(self):
                self = self.difference(tmp, it)
                if not self:
                    return True
                self.difference_update(other)
                return not self
        else:
            return False  # len(self) > len(set(other))

    The disadvantage of such optimizations is that they make the code much more bigger. The current code is small and simple, and good enough in most cases.

    @SimpleArt
    Copy link
    Mannequin Author

    SimpleArt mannequin commented Feb 10, 2022

    As you say, which implementation performs better likely depends on the nature of the sets. I would suspect that using set.difference won't be substantially faster than using set.intersection in the best case, but it would be much slower if len(self) is much larger than len(self.intersection(other)) e.g. set(range(1_000_000)).issubset(range(10)). Overall I think that using set.intersection will have more well-rounded performance.

    @rhettinger
    Copy link
    Contributor

    Would not testing len(self.difference(other)) == 0
    be more efficient?

    Yes, I think so.

    @serhiy-storchaka
    Copy link
    Member

    See also bpo-18032.

    @serhiy-storchaka
    Copy link
    Member

    It will be even more efficient after applying a set.intersection() optimization in bpo-43464.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    Successfully merging a pull request may close this issue.

    2 participants