Message 150124 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dalke
Recipients	dalke
Date	2011-12-23.00:54:12
SpamBayes Score	1.6753805e-09
Marked as misclassified	No
Message-id	<1324601653.94.0.248406097539.issue13653@psf.upfronthosting.co.za>
In-reply-to

Content
In Issue3069, Arnaud Delobelle proposed support for multiple values to set.intersection() and set.union(), writing "Intersection is optimized by sorting all sets/frozensets/dicts in increasing order of size and only iterating over elements in the smallest." Raymond Hettinger commented therein that he had just added support for multiple parameters. However, he did not pick up the proposed change in the attached patch which attempts to improve the intersection performance. Consider the attached benchmark, which constructs an inverted index mapping a letter to the set of words which contain that letter. (Rather, to word index.) Here's the output: ## Example output: # a has 144900 words # j has 3035 words # m has 62626 words # amj takes 5.902/1000 (verify: 289) # ajm takes 0.292/1000 (verify: 289) # jma takes 0.132/1000 (verify: 289) Searching set.intersection(inverted_index["j"], inverted_index["m"], inverted_index["a"]) is fully 44 times faster than searching "a", "m", "j"! Of course, the set.intersection() supports any iterable, so would only be an optimization for when all of the inputs are set types. BTW, my own experiments suggest that sorting isn't critical. It's more important to find the most anti-correlated set to the smallest set, and the following does that dynamically by preferentially choosing sets which are likely to not match elements of the smallest set: def set_intersection(input_sets): N = len(input_sets) min_index = min(range(len(input_sets)), key=lambda x: len(input_sets[x])) best_mismatch = (min_index+1)%N new_set = set() for element in input_sets[min_index]: # This failed to match last time; perhaps it's a mismatch this time? if element not in input_sets[best_mismatch]: continue # Scan through the other sets for i in range(best_mismatch+1, best_mismatch+N): j = i % N if j == min_index: continue # If the element isn't in the set then perhaps this # set is a better rejection test for the next input element if element not in input_sets[j]: best_mismatch = j break else: # The element is in all of the other sets new_set.add(element) return new_set Using this in the benchmark gives amj takes 0.972/1000 (verify: 289) ajm takes 0.972/1000 (verify: 289) jma takes 0.892/1000 (verify: 289) which clearly shows that this Python algorithm is still 6 times faster (for the worst case) than the CPython code. However, the simple sort solution: def set_intersection_sorted(input_sets): input_sets = sorted(input_sets, key=len) new_set = set() for element in input_sets[0]: if element in input_sets[1]: if element in input_sets[2]: new_set.add(element) return new_set gives times of amj takes 0.492/1000 (verify: 289) ajm takes 0.492/1000 (verify: 289) jma takes 0.422/1000 (verify: 289) no doubt because there's much less Python overhead than my experimental algorithm.

In Issue3069, Arnaud Delobelle proposed support for multiple values to set.intersection() and set.union(), writing "Intersection is optimized by sorting all sets/frozensets/dicts in increasing order of size and only iterating over elements in the smallest."

Raymond Hettinger commented therein that he had just added support for multiple parameters. However, he did not pick up the proposed change in the attached patch which attempts to improve the intersection performance.

Consider the attached benchmark, which constructs an inverted index mapping a letter to the set of words which contain that letter. (Rather, to word index.) Here's the output:

## Example output:
# a has 144900 words
# j has 3035 words
# m has 62626 words
# amj takes 5.902/1000 (verify: 289)
# ajm takes 0.292/1000 (verify: 289)
# jma takes 0.132/1000 (verify: 289)


Searching set.intersection(inverted_index["j"], inverted_index["m"], inverted_index["a"]) is fully 44 times faster than searching "a", "m", "j"!

Of course, the set.intersection() supports any iterable, so would only be an optimization for when all of the inputs are set types.

BTW, my own experiments suggest that sorting isn't critical. It's more important to find the most anti-correlated set to the smallest set, and the following does that dynamically by preferentially choosing sets which are likely to not match elements of the smallest set:

def set_intersection(*input_sets):
    N = len(input_sets)
    min_index = min(range(len(input_sets)), key=lambda x: len(input_sets[x]))
    best_mismatch = (min_index+1)%N

    new_set = set()
    for element in input_sets[min_index]:
        # This failed to match last time; perhaps it's a mismatch this time?
        if element not in input_sets[best_mismatch]:
            continue

        # Scan through the other sets
        for i in range(best_mismatch+1, best_mismatch+N):
            j = i % N
            if j == min_index:
                continue
            # If the element isn't in the set then perhaps this
            # set is a better rejection test for the next input element
            if element not in input_sets[j]:
                best_mismatch = j
                break
        else:
            # The element is in all of the other sets
            new_set.add(element)
    return new_set


Using this in the benchmark gives

amj takes 0.972/1000 (verify: 289)
ajm takes 0.972/1000 (verify: 289)
jma takes 0.892/1000 (verify: 289)

which clearly shows that this Python algorithm is still 6 times faster (for the worst case) than the CPython code.

However, the simple sort solution:


def set_intersection_sorted(*input_sets):
    input_sets = sorted(input_sets, key=len)
    new_set = set()
    for element in input_sets[0]:
        if element in input_sets[1]:
            if element in input_sets[2]:
                new_set.add(element)
    return new_set

gives times of 

amj takes 0.492/1000 (verify: 289)
ajm takes 0.492/1000 (verify: 289)
jma takes 0.422/1000 (verify: 289)

no doubt because there's much less Python overhead than my experimental algorithm.

History
Date	User	Action	Args
2011-12-23 00:54:14	dalke	set	recipients: + dalke
2011-12-23 00:54:13	dalke	set	messageid: <1324601653.94.0.248406097539.issue13653@psf.upfronthosting.co.za>
2011-12-23 00:54:13	dalke	link	issue13653 messages
2011-12-23 00:54:12	dalke	create