classification
Title: reorder set.intersection parameters for better performance
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.4
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: dalke, jcea, rhettinger, terry.reedy
Priority: low Keywords:

Created on 2011-12-23 00:54 by dalke, last changed 2019-08-26 04:50 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
set_intersection_benchmark.py dalke, 2011-12-23 00:54 benchmark showing how the set intersection order affects performance
Messages (5)
msg150124 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2011-12-23 00:54
In Issue3069, Arnaud Delobelle proposed support for multiple values to set.intersection() and set.union(), writing "Intersection is optimized by sorting all sets/frozensets/dicts in increasing order of size and only iterating over elements in the smallest."

Raymond Hettinger commented therein that he had just added support for multiple parameters. However, he did not pick up the proposed change in the attached patch which attempts to improve the intersection performance.

Consider the attached benchmark, which constructs an inverted index mapping each letter to the set of words containing that letter (strictly, to each word's index). Here's the output:

## Example output:
# a has 144900 words
# j has 3035 words
# m has 62626 words
# amj takes 5.902/1000 (verify: 289)
# ajm takes 0.292/1000 (verify: 289)
# jma takes 0.132/1000 (verify: 289)
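(Editor's sketch, not part of the original message.) A miniature version of such a benchmark; the attached set_intersection_benchmark.py presumably loads a real dictionary file and times each query order:

```python
# Hypothetical miniature word list; the real benchmark presumably reads a
# dictionary file and times each query order with timeit.
words = ["jam", "madam", "major", "ajar", "mama", "jab", "arm"]

# Inverted index: letter -> set of indices of words containing that letter.
inverted_index = {}
for i, word in enumerate(words):
    for letter in set(word):
        inverted_index.setdefault(letter, set()).add(i)

def query(order):
    # Intersect the index sets in the given left-to-right letter order.
    return set.intersection(*(inverted_index[c] for c in order))

print(query("amj") == query("jma"))  # True; only the timing differs
```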


Searching set.intersection(inverted_index["j"], inverted_index["m"], inverted_index["a"]) is fully 44 times faster than searching "a", "m", "j"!

Of course, set.intersection() supports any iterable, so this would only be an optimization for the case where all of the inputs are set types.
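For example, non-set iterables are consumed element by element, with no size to inspect up front, so no reordering is possible for these arguments:

```python
a = {1, 2, 3, 4}
# A list and a generator: neither exposes a meaningful len() ahead of
# consumption, so only the all-sets case could be reordered by size.
print(a.intersection([2, 3, 9], (x for x in range(3))))  # {2}
```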

BTW, my own experiments suggest that sorting isn't critical. It's more important to find the most anti-correlated set to the smallest set, and the following does that dynamically by preferentially choosing sets which are likely to not match elements of the smallest set:

def set_intersection(*input_sets):
    N = len(input_sets)
    min_index = min(range(len(input_sets)), key=lambda x: len(input_sets[x]))
    best_mismatch = (min_index+1)%N

    new_set = set()
    for element in input_sets[min_index]:
        # This failed to match last time; perhaps it's a mismatch this time?
        if element not in input_sets[best_mismatch]:
            continue

        # Scan through the other sets
        for i in range(best_mismatch+1, best_mismatch+N):
            j = i % N
            if j == min_index:
                continue
            # If the element isn't in the set then perhaps this
            # set is a better rejection test for the next input element
            if element not in input_sets[j]:
                best_mismatch = j
                break
        else:
            # The element is in all of the other sets
            new_set.add(element)
    return new_set


Using this in the benchmark gives

amj takes 0.972/1000 (verify: 289)
ajm takes 0.972/1000 (verify: 289)
jma takes 0.892/1000 (verify: 289)

which clearly shows that this Python algorithm is still 6 times faster (for the worst case) than the CPython code.

However, the simple sort solution:


def set_intersection_sorted(*input_sets):
    input_sets = sorted(input_sets, key=len)
    new_set = set()
    for element in input_sets[0]:
        if element in input_sets[1]:
            if element in input_sets[2]:
                new_set.add(element)
    return new_set
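(Editor's note, not part of the original message:) the sorted approach as written is hardwired to exactly three sets; it generalizes with the same structure:

```python
def set_intersection_sorted_n(*input_sets):
    # Sort by size so the smallest set drives the loop and the cheapest
    # membership rejections happen first.
    smallest, *rest = sorted(input_sets, key=len)
    return {e for e in smallest if all(e in s for s in rest)}

print(set_intersection_sorted_n({1, 2, 3}, {2, 3, 4}, {3, 4, 5}))  # {3}
```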

gives times of 

amj takes 0.492/1000 (verify: 289)
ajm takes 0.492/1000 (verify: 289)
jma takes 0.422/1000 (verify: 289)

no doubt because it has much less Python overhead than my experimental algorithm.
msg150135 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-12-23 07:09
Thanks guys.  I'll look at this in detail when I get a chance.  Offhand, it seems like a good idea though it may rarely be of benefit.  The only downsides I see are that it overrides the user's ability to specify the application order and that it would be at odds with proposals to implement ordered sets (including a guaranteed order of application in intersection, union, difference, etc).
msg150159 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2011-12-23 12:58
My belief is that the people who use set.intersection with more than two terms will 1) pass in a list of sets, and 2) not care about the specific order.

To check the validity of my belief, I did a Google Code Search to find cases of people using set intersection in Python. I searched for "set\.intersection\(\*" and "\.intersection\(.*\, lang:^python$", among others.

I am sad to report that the most common way to compute set.intersection(*list) is by using reduce, like:

    possible = (set(index[c]) for c in set(otp))
    possible = reduce(lambda a, b: a.intersection(b), possible)


That comes from:
  git://github.com/Kami/python-yubico-client.git /yubico/modhex.py
and similar uses are in:
  git://github.com/sburns/PyCap.git /redcap/rc.py
  http://hltdi-l3.googlecode.com/hg//xdg/languages/morpho/fst.py
  http://dsniff.googlecode.com/svn/trunk/dsniff/lib/fcap.py


As well as in the Rosetta Code example for a simple inverted index, at:
  http://rosettacode.org/wiki/Inverted_index#Python
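All of these pairwise reductions are equivalent to one direct call to the built-in. Using hypothetical stand-ins for the first snippet's index and otp:

```python
# "index" and "otp" are hypothetical stand-ins for the names used in
# python-yubico-client; only the call pattern matters here.
index = {"a": [2, 4], "b": [2, 3], "c": [1, 2, 3]}
otp = "abc"

# One call to the built-in replaces the reduce-based pairwise reduction.
possible = set.intersection(*(set(index[c]) for c in set(otp)))
print(possible)  # {2}
```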

This was also implemented more verbosely in:

http://eats.googlecode.com/svn/trunk/server/eats/views/main.py
    intersected_set = sets[0]
    for i in range(1, len(sets)):
        intersected_set = intersected_set.intersection(sets[i])

and 

http://iocbio.googlecode.com/svn/trunk/iocbio/microscope/cluster_tools.py
        s = set (range (len (data[0])))
        for d in zip(*data):
            s = s.intersection(set(find_outliers(d, zoffset=zoffset)))
        return sorted(s)

In other words, 7 codebases use manual pairwise reduction rather than the equivalent single call to the built-in. (I have not checked which are due to backwards-compatibility requirements.)

On the other hand, if someone really wants to have a specific intersection order, this shows that it's very easy to write.
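For instance, a left-to-right pairwise reduction that honors the caller's argument order exactly is only a couple of lines:

```python
from functools import reduce

def ordered_intersection(*sets):
    # reduce applies set.intersection strictly left to right, so the
    # caller's chosen evaluation order is preserved.
    return reduce(set.intersection, sets)

print(ordered_intersection({1, 2, 3}, {2, 3}, {3, 4}))  # {3}
```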


I found 4 other codebases where set intersection was used with more than two sets, and which did use the built-in intersection().



git://github.com/valda/wryebash.git/experimental/bait/bait/presenter/impl/filters.py
    def get_visible_node_ids(self, filterId):
        if filterId in self.idMask:
            visibleNodeIdSets = [f.get_visible_node_ids(filterId) for f in self._filters]
            return set.intersection(*[v for v in visibleNodeIdSets if v is not None])
        return None



http://wallproxy.googlecode.com/svn/trunk/local/proxy.py
                if threads[ct].intersection(*threads.itervalues()):
                    raise ValueError('All threads failed')
(here, threads' values contain sets)



git://github.com/argriffing/xgcode.git/20100623a.py
    header_sets = [set(x) for x in header_list]
    header_intersection = set.intersection(*header_sets)




http://pyvenn.googlecode.com/hg//venn.py
            to_exclude = set()
            for ii in xrange(0, len(self.sets)):
                if (i & (2**ii)):
                    sets_to_intersect.append(sets_by_power_of_two[i & (2**ii)])
                else:
                    to_exclude = to_exclude.union(sets_by_power_of_two[(2**ii)])
            final = set.intersection(*sets_to_intersect) - to_exclude



These all find the intersection of sets (not iterators), and the order of evaluation does not appear to affect the result.

I do not know though if there will be a performance advantage in these cases to reordering. I do know that in my code, and any inverted index, there is an advantage.

And I do know that the current CPython implementation has bad worst-case performance.
msg150206 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-12-24 03:48
Given that equality is not identity, order does matter, although in 3.2.2 the results are the opposite of what one might expect.

a = set((1,2,3))
b = set((1.0, 3.0, 5.0))
print(a&b, b&a)
print(a.intersection(b), b.intersection(a))
a &= b
print(a)
>>> 
{1.0, 3.0} {1, 3}
{1.0, 3.0} {1, 3}
{1.0, 3.0}

In my view, a &= b should remove the members of a that are not in b, rather than deleting all and replacing some with equal members of b.
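(Editor's aside:) the behavior described here, keeping a's own members, can already be obtained explicitly with the same sets:

```python
a = set((1, 2, 3))
b = set((1.0, 3.0, 5.0))
# Drop members of a that are not in b, keeping a's own (int) objects
# instead of replacing them with b's equal float members.
a -= (a - b)
print(a)  # {1, 3}
```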

That glitch aside, the binary & operator remains available for exact control of order. The doc should just say that intersection() may reorder the operands for efficiency.
msg350484 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-26 04:50
The anti-correlation algorithm seems like a plausible heuristic but it can't really know more than the user does about the semantic content of the sets.  Also as Terry pointed out, this will have user visible effects.

This likely should be published as a blog post or recipe.  A user can already control the pairing order and do this or something like it themselves.  It is more of a technique for using sets than it is a core set algorithm (it reminds me of using associativity to optimize chained matrix multiplications, though that can be done precisely rather than heuristically).
History
Date User Action Args
2019-08-26 04:50:12  rhettinger         set     status: open -> closed; resolution: rejected; messages: + msg350484; stage: resolved
2012-01-02 22:30:40  jcea               set     nosy: + jcea
2011-12-24 03:48:21  terry.reedy        set     nosy: + terry.reedy; messages: + msg150206
2011-12-23 12:58:13  dalke              set     messages: + msg150159
2011-12-23 07:09:01  rhettinger         set     priority: normal -> low; messages: + msg150135
2011-12-23 00:56:01  benjamin.peterson  set     assignee: rhettinger; nosy: + rhettinger
2011-12-23 00:54:13  dalke              create