Message 270906 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	David Su2
Recipients	David Su2, rhettinger
Date	2016-07-21.04:28:36
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1469075317.32.0.367056288254.issue27575@psf.upfronthosting.co.za>
In-reply-to

Content
Here are some tests that I ran: Using current naive viewkeys intersection: $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)" "d.viewkeys() & {0}" 100 loops, best of 3: 447 msec per loop Nearly half a second per iteration is unacceptably slow. Solution 1 from my previous comment: $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0); s = set(d)" "s & {0}" 100 loops, best of 3: 0.391 usec per loop This solution caches the keys of the dict in a set, then performs all intersections on that set. The initial creation of the set is slow, and it wastes a lot of memory, but each iteration is much faster. The workaround that I ended up using: $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)" "{item for item in {0} if item in d}" 100 loops, best of 3: 0.629 usec per loop This solution just looks up each value in the set to see if it is also in the dict using a set comprehension. This avoids wasting time and space on creating a set. For my use case, memory was a bigger issue as my dictionary can approach tens of GBs in size, and almost doubling the memory consumption from creating a cache set was unacceptable. Of course, in the set comprehension, the smaller of the dict/set should be iterated over to satisfy the O(min(len(d), len(s))) condition as specified in https://wiki.python.org/moin/TimeComplexity. A implementation would look something like this: def efficient_intersection(smaller_dict_or_set, bigger_dict_or_set): if len(bigger_dict_or_set) < len(smaller_dict_or_set): bigger_dict_or_set, smaller_dict_or_set = smaller_dict_or_set, bigger_dict_or_set return {item for item in smaller_dict_or_set if item in bigger_dict_or_set} In conclusion, porting over the set intersection logic to viewkeys would be the ideal solution. It avoids wasting memory like in the set cache solution, and I am sure the C implementation will be much faster than my workaround of manually performing an intersection using set comprehensions. My view is that dict.viewkeys() & set(...) should be as performant as the syntax suggests.

Here are some tests that I ran:


Using current naive viewkeys intersection:

$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)" "d.viewkeys() & {0}"
100 loops, best of 3: 447 msec per loop

Nearly half a second per iteration is unacceptably slow.



Solution 1 from my previous comment:

$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0); s = set(d)" "s & {0}"
100 loops, best of 3: 0.391 usec per loop

This solution caches the keys of the dict in a set, then performs all intersections on that set. 
The initial creation of the set is slow, and it wastes a lot of memory, but each iteration is much faster.



The workaround that I ended up using:

$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)" "{item for item in {0} if item in d}"
100 loops, best of 3: 0.629 usec per loop

This solution just looks up each value in the set to see if it is also in the dict using a set comprehension.
This avoids wasting time and space on creating a set. For my use case, memory was a bigger issue as my dictionary can approach tens of GBs in size, and almost doubling the memory consumption from creating a cache set was unacceptable. 

Of course, in the set comprehension, the smaller of the dict/set should be iterated over to satisfy the O(min(len(d), len(s))) condition as specified in https://wiki.python.org/moin/TimeComplexity.
A implementation would look something like this:

def efficient_intersection(smaller_dict_or_set, bigger_dict_or_set):
    if len(bigger_dict_or_set) < len(smaller_dict_or_set):
        bigger_dict_or_set, smaller_dict_or_set = smaller_dict_or_set, bigger_dict_or_set
    
    return {item for item in smaller_dict_or_set if item in bigger_dict_or_set}




In conclusion, porting over the set intersection logic to viewkeys would be the ideal solution. It avoids wasting memory like in the set cache solution, and I am sure the C implementation will be much faster than my workaround of manually performing an intersection using set comprehensions.

My view is that dict.viewkeys() & set(...) should be as performant as the syntax suggests.

History
Date	User	Action	Args
2016-07-21 04:28:37	David Su2	set	recipients: + David Su2, rhettinger
2016-07-21 04:28:37	David Su2	set	messageid: <1469075317.32.0.367056288254.issue27575@psf.upfronthosting.co.za>
2016-07-21 04:28:37	David Su2	link	issue27575 messages
2016-07-21 04:28:36	David Su2	create