Author dalke
Recipients dalke, rhettinger
Date 2011-12-23.12:58:12
SpamBayes Score 5.49499e-12
Marked as misclassified No
Message-id <1324645094.27.0.39891919407.issue13653@psf.upfronthosting.co.za>
In-reply-to
Content
My belief is that the people who use set.intersection with more than two terms are 1) going to pass in a list of sets, and 2) don't care about the specific order.

To check the validity of my belief, I did a Google Code Search to find cases of people using set intersection in Python. I searched for "set\.intersection\(\*" and "\.intersection\(.*\, lang:^python$", among others.

I am sad to report that the most common way to compute set.intersection(*list) is by using reduce, like:

    possible = (set(index[c]) for c in set(otp))
    possible = reduce(lambda a, b: a.intersection(b), possible)


That comes from:
  git://github.com/Kami/python-yubico-client.git /yubico/modhex.py
and similar uses are in:
  git://github.com/sburns/PyCap.git /redcap/rc.py
  http://hltdi-l3.googlecode.com/hg//xdg/languages/morpho/fst.py
  http://dsniff.googlecode.com/svn/trunk/dsniff/lib/fcap.py


As well as in the Rosetta Code example for a simple inverted index, at:
  http://rosettacode.org/wiki/Inverted_index#Python

This was also implemented more verbosely in:

http://eats.googlecode.com/svn/trunk/server/eats/views/main.py
    intersected_set = sets[0]
    for i in range(1, len(sets)):
        intersected_set = intersected_set.intersection(sets[i])

and 

http://iocbio.googlecode.com/svn/trunk/iocbio/microscope/cluster_tools.py
        s = set (range (len (data[0])))
        for d in zip(*data):
            s = s.intersection(set(find_outliers(d, zoffset=zoffset)))
        return sorted(s)

In other words, 7 codebases use manual pairwise reduction rather than use the equivalent code in Python. (I have not checked for which are due to backwards compatibility requirements.)

On the other hand, if someone really wants to have a specific intersection order, this shows that it's very easy to write.


I found 4 other code bases where set intersection was used for something other than binary intersection, and used the built-in intersection().



git://github.com/valda/wryebash.git/experimental/bait/bait/presenter/impl/filters.py
    def get_visible_node_ids(self, filterId):
        if filterId in self.idMask:
            visibleNodeIdSets = [f.get_visible_node_ids(filterId) for f in self._filters]
            return set.intersection(*[v for v in visibleNodeIdSets if v is not None])
        return None



http://wallproxy.googlecode.com/svn/trunk/local/proxy.py
                if threads[ct].intersection(*threads.itervalues()):
                    raise ValueError('All threads failed')
(here, threads' values contain sets)



git://github.com/argriffing/xgcode.git/20100623a.py
    header_sets = [set(x) for x in header_list]
    header_intersection = set.intersection(*header_sets)




http://pyvenn.googlecode.com/hg//venn.py
            to_exclude = set()
            for ii in xrange(0, len(self.sets)):
                if (i & (2**ii)):
                    sets_to_intersect.append(sets_by_power_of_two[i & (2**ii)])
                else:
                    to_exclude = to_exclude.union(sets_by_power_of_two[(2**ii)])
            final = set.intersection(*sets_to_intersect) - to_exclude



These all find the intersection of sets (not iterators), and the order of evaluation does not appear like it will affect the result.

I do not know though if there will be a performance advantage in these cases to reordering. I do know that in my code, and any inverted index, there is an advantage.

And I do know that the current CPython implementation has bad worst-case performance.
History
Date User Action Args
2011-12-23 12:58:14dalkesetrecipients: + dalke, rhettinger
2011-12-23 12:58:14dalkesetmessageid: <1324645094.27.0.39891919407.issue13653@psf.upfronthosting.co.za>
2011-12-23 12:58:13dalkelinkissue13653 messages
2011-12-23 12:58:12dalkecreate