Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set intersections should short-circuit #87630

Closed
MojoVampire mannequin opened this issue Mar 10, 2021 · 3 comments
Closed

set intersections should short-circuit #87630

MojoVampire mannequin opened this issue Mar 10, 2021 · 3 comments
Labels
3.11 only security fixes easy topic-C-API

Comments

@MojoVampire
Copy link
Mannequin

MojoVampire mannequin commented Mar 10, 2021

BPO 43464
Nosy @rhettinger, @serhiy-storchaka, @MojoVampire
PRs
  • bpo-43464: Optimize set.intersection() for non-set arguments #31316
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-04-06.17:03:17.031>
    created_at = <Date 2021-03-10.16:39:19.681>
    labels = ['easy', 'expert-C-API', '3.11']
    title = 'set intersections should short-circuit'
    updated_at = <Date 2022-04-06.17:03:17.030>
    user = 'https://github.com/MojoVampire'

    bugs.python.org fields:

    activity = <Date 2022-04-06.17:03:17.030>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-04-06.17:03:17.031>
    closer = 'serhiy.storchaka'
    components = ['C API']
    creation = <Date 2021-03-10.16:39:19.681>
    creator = 'josh.r'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43464
    keywords = ['easy (C)']
    message_count = 3.0
    messages = ['388442', '413193', '416886']
    nosy_count = 3.0
    nosy_names = ['rhettinger', 'serhiy.storchaka', 'josh.r']
    pr_nums = ['31316']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue43464'
    versions = ['Python 3.11']

    @MojoVampire
    Copy link
    Mannequin Author

    MojoVampire mannequin commented Mar 10, 2021

    At present, set_intersection (the C name for set.intersection) optimizes for pairs of sets by iterating the smallest set and only adding entries found in the larger, meaning work is proportionate to the smallest input.

    But when the other input isn't a set, it goes with a naive solution, iterating the entire non-set, and adding entries found in the set. This is fine when the intersection will end up smaller than the original set (there's no way to avoid exhausting the non-set when that's the case), but when the intersection ends up being the same size as the original, we could add a cheap length check and short-circuit at that point.

    As is, {4}.intersection(range(10000)) takes close to 1000 times longer than {4}.intersection(range(10)) despite both of them reaching the point where the outcome will be {4} at the same time.

    Since the length check for short-circuiting only needs to be performed when input set actually contains the value, the cost should be fairly low.

    I figure this would be the worst case for the change:

    {3, 4}.intersection((4,) * 10000)

    where it performs the length check every time, and doesn't benefit from short-circuiting. But cases like:

    {4}.intersection((4,) * 10000)

    or

    {4}.intersection(range(10000))

    would finish much faster. A similar optimization to set_intersection_multi (to stop when the intermediate result is empty) would make cases like:

    {4000}.intersection([1], range(10000), range(100000, 200000))

    also finish dramatically quicker in the (I'd assume not uncommon case) where the intersection of many iterables is empty, and this could be known quite early on (the cost of this check would be even lower, since it would only be performed once per iterable, not per-value).

    Only behavioral change this would cause is that errors resulting from processing items in an iterable that is no longer run to exhaustion due to short-circuiting wouldn't happen ({4}.intersection([4, []]) currently dies, but would succeed with short-circuiting; same foes for {4}.intersection([5], [[]]) if set_intersection_multi is optimized), and input iterators might be left only partially consumed. If that's acceptable, the required code changes are trivial.

    @MojoVampire MojoVampire mannequin added 3.10 only security fixes topic-C-API easy labels Mar 10, 2021
    @serhiy-storchaka
    Copy link
    Member

    See also bpo-46721. set.issuperset() can get similar optimization.

    And set.issubset() will benefit from this optimization if use set.intersection() in bpo-46705.

    @serhiy-storchaka
    Copy link
    Member

    New changeset 31cd25f by Serhiy Storchaka in branch 'main':
    bpo-43464: Optimize set.intersection() for non-set arguments (GH-31316)
    31cd25f

    @serhiy-storchaka serhiy-storchaka added 3.11 only security fixes and removed 3.10 only security fixes labels Apr 6, 2022
    @serhiy-storchaka serhiy-storchaka added 3.11 only security fixes and removed 3.10 only security fixes labels Apr 6, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes easy topic-C-API
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant