
set(range(100000)).difference(set()) is slow #52931

Closed
spiv mannequin opened this issue May 11, 2010 · 19 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

spiv mannequin commented May 11, 2010

BPO 8685
Nosy @rhettinger, @abalkin, @pitrou, @pjenvey, @florentx
Files
  • set-difference-speedup.diff: Small patch to improve the performance of large_set.difference(small_set)
  • set-difference-speedup-2.diff: Corrected patch.
  • set-mem.py: Quick script to show memory consumption of set.difference on linux
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2010-11-30.22:24:53.628>
    created_at = <Date 2010-05-11.09:04:17.799>
    labels = ['interpreter-core', 'performance']
    title = 'set(range(100000)).difference(set()) is slow'
    updated_at = <Date 2010-11-30.22:24:53.626>
    user = 'https://bugs.python.org/spiv'

    bugs.python.org fields:

    activity = <Date 2010-11-30.22:24:53.626>
    actor = 'pitrou'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2010-11-30.22:24:53.628>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2010-05-11.09:04:17.799>
    creator = 'spiv'
    dependencies = []
    files = ['17292', '17293', '17306']
    hgrepos = []
    issue_num = 8685
    keywords = ['patch']
    message_count = 19.0
    messages = ['105489', '105490', '105494', '105524', '105585', '105586', '105823', '105885', '105909', '113405', '113418', '113430', '113498', '115295', '115315', '115319', '122936', '122937', '122942']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'spiv', 'belopolsky', 'pitrou', 'pjenvey', 'stutzbach', 'flox']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue8685'
    versions = ['Python 3.2']


    spiv mannequin commented May 11, 2010

    set.difference(other), when other is also a set, basically does::

        res = set()
        for elem in self:
            if elem not in other:
                res.add(elem)

    This is wasteful when len(self) is much greater than len(other):

    $ python -m timeit -s "s = set(range(100000)); sd = s.difference; empty = set()" "sd(empty)"
    100 loops, best of 3: 12.8 msec per loop
    $ python -m timeit -s "s = set(range(10)); sd = s.difference; empty = set()" "sd(empty)"
    1000000 loops, best of 3: 1.18 usec per loop

    Here's a patch that compares the lengths of self and other before that loop, and if len(self) is greater, swaps them. The new timeit results are:

    $ python -m timeit -s "s = set(range(100000)); sd = s.difference; empty = set()" "sd(empty)"
    1000000 loops, best of 3: 0.289 usec per loop
    $ python -m timeit -s "s = set(range(10)); sd = s.difference; empty = set()" "sd(empty)"
    1000000 loops, best of 3: 0.294 usec per loop

    @spiv spiv mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels May 11, 2010

    spiv mannequin commented May 11, 2010

    Oops, obvious bug in this patch. set('abc') - set('bcd') != set('bcd') - set('abc'). I'll see if I can make a more sensible improvement.

    See also <http://bugs.python.org/issue8425>. Thanks dickinsm on #python-dev.


    spiv mannequin commented May 11, 2010

    Ok, this time test_set* passes :)

    Currently, if you have a large set and a small set, the code will do len(large) lookups in the small set. When large is much bigger than small, it is cheaper to copy large and do len(small) lookups in large. On my laptop a size difference of 4 times is a clear winner for copy+difference_update over the status quo, even for sets of millions of entries. For more similarly sized sets (even with only a factor-of-2 size difference), the cost of allocating a large set that is likely to be shrunk significantly is greater than the benefit. So my patch only switches behaviour when len(x)/4 > len(y).

    This patch is complementary to the patch in bpo-8425, I think.
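    A rough Python-level sketch of the heuristic described above (the actual patch is C code in Objects/setobject.c; the function name and the hard-coded 4x cutoff here are illustrative only):

```python
def set_difference(large, small, ratio=4):
    """Sketch of the patched strategy, not CPython's implementation."""
    if len(large) > ratio * len(small):
        # large is much bigger: copy it once (presized), then do only
        # len(small) lookups to discard the overlapping elements.
        result = set(large)
        result.difference_update(small)
        return result
    # Similarly sized sets: probe `small` once per element of `large`,
    # as the status-quo code does.
    return {elem for elem in large if elem not in small}
```

    Either branch returns the same value as `large - small`; only the amount of work done differs.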


    abalkin commented May 11, 2010

    I have two problems with this proposal:

    1. In constrained memory environments, creating a temporary internal copy of a large set may cause the difference operation to fail that would otherwise succeed.

    2. The break-even point between extra lookups and a copy is likely to be different on different systems or even on the same system under different loads.

    Programs that suffer from poor large_set.difference(small_set) performance can be rewritten as large_set_copy = large_set.copy(); large_set_copy.difference_update(small_set), or even simply as large_set.difference_update(small_set) if program logic allows it.
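    The workarounds above, spelled out (the set contents here are arbitrary placeholders):

```python
large_set = set(range(100_000))
small_set = {1, 2, 3}

# Variant 1: keep large_set intact -- copy first, then subtract in place.
diff = large_set.copy()
diff.difference_update(small_set)

# Variant 2: if program logic allows mutating large_set, subtract in place.
large_set.difference_update(small_set)
```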


    spiv mannequin commented May 12, 2010

    Regarding memory, good question... but this patch turns out to be an improvement there too.

    This optimisation only applies when len(x) > len(y) * 4. So the minimum size of the result is a set with 3/4 of the elems of x (and possibly would be a full copy of x anyway).

    So, if you like, this optimisation simply takes advantage of the fact that we're going to be copying almost all of these elements anyway. We could make it less aggressive, but large sets are tuned to be between 1/2 and 1/3 empty internally anyway, so 1/4 overhead seems reasonable.

    Also, because this code immediately makes the result set be about the right size, rather than growing it one element at a time, the memory consumption is actually *better*. I'll attach a script that demonstrates this; for me it shows that large_set.difference(small_set) [where large_set has 4M elems, small_set has 100] peaks at 50MB memory consumption without my patch, but only 18MB with. (after discounting the memory required for large_set itself, etc.)
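    The original set-mem.py is attached to the issue rather than quoted here; a minimal stand-in using tracemalloc (stdlib since Python 3.4, so not what the 2010 script could have used) can demonstrate the same effect:

```python
import tracemalloc

large = set(range(1_000_000))   # stand-in for the 4M-element set
small = set(range(100))

tracemalloc.start()
result = large.difference(small)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `peak` covers only allocations made during the call, i.e. it already
# discounts the memory held by `large` and `small` themselves.
print(f"peak allocation during difference(): {peak / 1e6:.1f} MB")
```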


    pitrou commented May 12, 2010

    1. In constrained memory environments, creating a temporary internal
      copy of a large set may cause the difference operation to fail that
      would otherwise succeed.

    It's a space/time tradeoff. There's nothing wrong about that.
    (do note that hash tables themselves take much more space than the "equivalent" list)

    2. The break-even point between extra lookups and a copy is likely to
      be different on different systems or even on the same system under
      different loads.

    So what? It's just a matter of choosing reasonable settings. There are other optimization heuristics in the interpreter.

    The optimization here looks ok to me.


    pitrou commented May 15, 2010

    The current patch gives much smaller benefits than the originally posted benchmarks, although they are still substantial:

    $ ./python -m timeit -s "a = set(range(100000)); sd = a.difference; b = set(range(1000))" "sd(b)"
    - before: 5.56 msec per loop
    - after: 3.18 msec per loop
    
    $ ./python -m timeit -s "a = set(range(1000000)); sd = a.difference; b = set(range(10))" "sd(b)"
    - before: 67.9 msec per loop
    - after: 41.8 msec per loop


    spiv mannequin commented May 16, 2010

    Antoine: Thanks for the updated benchmark results! I should have done that myself earlier.

    rhettinger commented:

    Will look at this when I get back to the U.S.


    spiv mannequin commented Aug 9, 2010

    On 2010-05-17 rhettinger wrote:

    Will look at this when I get back to the U.S.

    Ping! This patch (set-difference-speedup-2.diff) has been sitting around for a fair few weeks now. It's a small patch, so it should be relatively easy to review. It makes a significant improvement to speed and memory in one case (which we have encountered and worked around in bzr), and has no significant downside to any other cases.

    Thanks :)


    abalkin commented Aug 9, 2010

    Andrew,

    This issue is somewhat similar to bpo-8425. I may be reading too much into the "priority" field, but it looks like Raymond would like to review bpo-8425 first. You can help by commenting on how the two issues relate to each other. I believe the two are complementary, but I did not attempt to apply both patches.

    (The patch still applies with little fuzz.)

    rhettinger commented:

    I'll be looking at it shortly. Py3.2 is still a ways from release so there is no hurry.


    spiv mannequin commented Aug 9, 2010

    Alexander: yes, they are complementary. My patch improves set.difference, which always creates a new set. bpo-8425 on the other hand improves in-place difference (via the -= operator or set.difference_update). Looking at the two patches, my patch will not improve in-place difference, and bpo-8425's patch will not improve set.difference. So they are complementary.
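    A small illustration of the two distinct code paths the complementary patches touch:

```python
s = set(range(10))
t = {1, 2, 3}

# This issue's patch: set.difference always builds and returns a new set.
new = s.difference(t)
assert new == {0, 4, 5, 6, 7, 8, 9}
assert s == set(range(10))          # s is untouched

# bpo-8425's patch: difference_update (and the -= operator) mutates s in place.
s.difference_update(t)
assert s == {0, 4, 5, 6, 7, 8, 9}
```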


    exarkun mannequin commented Sep 1, 2010

    I'll be looking at it shortly. Py3.2 is still a ways from release so there is no hurry.

    I would consider reviewing and possibly apply this change, but I don't want to invade anyone's territory.


    pitrou commented Sep 1, 2010

    I would consider reviewing and possibly apply this change, but I don't
    want to invade anyone's territory.

    I don't think there would be any invasion. I think the patch is simple enough, and seems to provide a nice benefit.

    rhettinger commented:

    Please leave this for me.
    Thank you.


    pitrou commented Nov 30, 2010

    Raymond, unless you object, I'd like to commit this before beta1.

    rhettinger commented:

    Thx.

    @rhettinger rhettinger assigned pitrou and unassigned rhettinger Nov 30, 2010

    pitrou commented Nov 30, 2010

    Modified patch committed in r86905. Thanks!

    @pitrou pitrou closed this as completed Nov 30, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022