Message 150159 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dalke
Recipients	dalke, rhettinger
Date	2011-12-23.12:58:12
SpamBayes Score	5.4949933e-12
Marked as misclassified	No
Message-id	<1324645094.27.0.39891919407.issue13653@psf.upfronthosting.co.za>
In-reply-to

Content
My belief is that the people who use set.intersection with more than two terms are 1) going to pass in a list of sets, and 2) don't care about the specific order. To check the validity of my belief, I did a Google Code Search to find cases of people using set intersection in Python. I searched for "set\.intersection\(\" and "\.intersection\(.\, lang:^python$", among others. I am sad to report that the most common way to compute set.intersection(list) is by using reduce, like: possible = (set(index[c]) for c in set(otp)) possible = reduce(lambda a, b: a.intersection(b), possible) That comes from: git://github.com/Kami/python-yubico-client.git /yubico/modhex.py and similar uses are in: git://github.com/sburns/PyCap.git /redcap/rc.py http://hltdi-l3.googlecode.com/hg//xdg/languages/morpho/fst.py http://dsniff.googlecode.com/svn/trunk/dsniff/lib/fcap.py As well as in the Rosetta Code example for a simple inverted index, at: http://rosettacode.org/wiki/Inverted_index#Python This was also implemented more verbosely in: http://eats.googlecode.com/svn/trunk/server/eats/views/main.py intersected_set = sets[0] for i in range(1, len(sets)): intersected_set = intersected_set.intersection(sets[i]) and http://iocbio.googlecode.com/svn/trunk/iocbio/microscope/cluster_tools.py s = set (range (len (data[0]))) for d in zip(data): s = s.intersection(set(find_outliers(d, zoffset=zoffset))) return sorted(s) In other words, 7 codebases use manual pairwise reduction rather than use the equivalent code in Python. (I have not checked for which are due to backwards compatibility requirements.) On the other hand, if someone really wants to have a specific intersection order, this shows that it's very easy to write. I found 4 other code bases where set intersection was used for something other than binary intersection, and used the built-in intersection(). git://github.com/valda/wryebash.git/experimental/bait/bait/presenter/impl/filters.py def get_visible_node_ids(self, filterId): if filterId in self.idMask: visibleNodeIdSets = [f.get_visible_node_ids(filterId) for f in self._filters] return set.intersection([v for v in visibleNodeIdSets if v is not None]) return None http://wallproxy.googlecode.com/svn/trunk/local/proxy.py if threads[ct].intersection(threads.itervalues()): raise ValueError('All threads failed') (here, threads' values contain sets) git://github.com/argriffing/xgcode.git/20100623a.py header_sets = [set(x) for x in header_list] header_intersection = set.intersection(header_sets) http://pyvenn.googlecode.com/hg//venn.py to_exclude = set() for ii in xrange(0, len(self.sets)): if (i & (2ii)): sets_to_intersect.append(sets_by_power_of_two[i & (2ii)]) else: to_exclude = to_exclude.union(sets_by_power_of_two[(2ii)]) final = set.intersection(sets_to_intersect) - to_exclude These all find the intersection of sets (not iterators), and the order of evaluation does not appear like it will affect the result. I do not know though if there will be a performance advantage in these cases to reordering. I do know that in my code, and any inverted index, there is an advantage. And I do know that the current CPython implementation has bad worst-case performance.

My belief is that the people who use set.intersection with more than two terms are 1) going to pass in a list of sets, and 2) don't care about the specific order.

To check the validity of my belief, I did a Google Code Search to find cases of people using set intersection in Python. I searched for "set\.intersection\(\*" and "\.intersection\(.*\, lang:^python$", among others.

I am sad to report that the most common way to compute set.intersection(*list) is by using reduce, like:

    possible = (set(index[c]) for c in set(otp))
    possible = reduce(lambda a, b: a.intersection(b), possible)


That comes from:
  git://github.com/Kami/python-yubico-client.git /yubico/modhex.py
and similar uses are in:
  git://github.com/sburns/PyCap.git /redcap/rc.py
  http://hltdi-l3.googlecode.com/hg//xdg/languages/morpho/fst.py
  http://dsniff.googlecode.com/svn/trunk/dsniff/lib/fcap.py


As well as in the Rosetta Code example for a simple inverted index, at:
  http://rosettacode.org/wiki/Inverted_index#Python

This was also implemented more verbosely in:

http://eats.googlecode.com/svn/trunk/server/eats/views/main.py
    intersected_set = sets[0]
    for i in range(1, len(sets)):
        intersected_set = intersected_set.intersection(sets[i])

and 

http://iocbio.googlecode.com/svn/trunk/iocbio/microscope/cluster_tools.py
        s = set (range (len (data[0])))
        for d in zip(*data):
            s = s.intersection(set(find_outliers(d, zoffset=zoffset)))
        return sorted(s)

In other words, 7 codebases use manual pairwise reduction rather than use the equivalent code in Python. (I have not checked for which are due to backwards compatibility requirements.)

On the other hand, if someone really wants to have a specific intersection order, this shows that it's very easy to write.


I found 4 other code bases where set intersection was used for something other than binary intersection, and used the built-in intersection().



git://github.com/valda/wryebash.git/experimental/bait/bait/presenter/impl/filters.py
    def get_visible_node_ids(self, filterId):
        if filterId in self.idMask:
            visibleNodeIdSets = [f.get_visible_node_ids(filterId) for f in self._filters]
            return set.intersection(*[v for v in visibleNodeIdSets if v is not None])
        return None



http://wallproxy.googlecode.com/svn/trunk/local/proxy.py
                if threads[ct].intersection(*threads.itervalues()):
                    raise ValueError('All threads failed')
(here, threads' values contain sets)



git://github.com/argriffing/xgcode.git/20100623a.py
    header_sets = [set(x) for x in header_list]
    header_intersection = set.intersection(*header_sets)




http://pyvenn.googlecode.com/hg//venn.py
            to_exclude = set()
            for ii in xrange(0, len(self.sets)):
                if (i & (2**ii)):
                    sets_to_intersect.append(sets_by_power_of_two[i & (2**ii)])
                else:
                    to_exclude = to_exclude.union(sets_by_power_of_two[(2**ii)])
            final = set.intersection(*sets_to_intersect) - to_exclude



These all find the intersection of sets (not iterators), and the order of evaluation does not appear like it will affect the result.

I do not know though if there will be a performance advantage in these cases to reordering. I do know that in my code, and any inverted index, there is an advantage.

And I do know that the current CPython implementation has bad worst-case performance.

History
Date	User	Action	Args
2011-12-23 12:58:14	dalke	set	recipients: + dalke, rhettinger
2011-12-23 12:58:14	dalke	set	messageid: <1324645094.27.0.39891919407.issue13653@psf.upfronthosting.co.za>
2011-12-23 12:58:13	dalke	link	issue13653 messages
2011-12-23 12:58:12	dalke	create