classification
Title: a -= b should be fast if a is a small set and b is a large set
Type: resource usage Stage: needs patch
Components: Interpreter Core Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: abacabadabacaba, belopolsky, eric.smith, ezio.melotti, mark.dickinson, r.david.murray, rhettinger, stutzbach
Priority: low Keywords: easy, patch

Created on 2010-04-16 17:19 by abacabadabacaba, last changed 2010-11-30 23:48 by rhettinger.

Files
File name Uploaded Description Edit
issue8425.diff belopolsky, 2010-05-10 17:44 review
issue8425a.diff belopolsky, 2010-05-10 18:13 review
issue8425-with-tests.diff belopolsky, 2010-05-10 18:27 Patch against py3k revision 81048 review
Messages (9)
msg103343 - (view) Author: Evgeny Kapun (abacabadabacaba) Date: 2010-04-16 17:19
>>> small_set = set(range(2000))
>>> large_set = set(range(20000000))
>>> large_set -= small_set # Fast
>>> small_set -= large_set # Slow, should be fast
>>> small_set = small_set - large_set # Fast
msg103552 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-19 04:20
Raymond, have you forgotten to attach the patch or have you just set the stage to 'patch review' by mistake?
msg105450 - (view) Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-05-10 17:44
The problem is apparently due to the fact that small_set -= large_set
iterates over the large set removing entries from small_set while more
efficient small_set - large_set builds a new set iterating over a
small_set and adding items that are not in the large set.  Since
lookups are O(1), it is more efficient to iterate over a small set
while looking up a large set than the other way around.  I am
attaching a minimalist patch which gives up in-place update for s1 -=
s2 if len(s1) < len(s2) and effectively replaces it with s1 = s1 - s2.
 The code can be further optimized by replicating set_difference logic
in set_isub instead of relying on fall back behavior, but the big
question here  with whether it is acceptable to give up preserving set
identity in in-place subtract to gain performance.
msg105451 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-10 18:07
The answer is almost certainly "no".
msg105452 - (view) Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-05-10 18:13
On the second thought, it is possible to preserve set identity and avoid iteration over a large set.   See issue8425a.diff attached.

It looks like the unit tests don't check identity preservation.  I will submit additional tests shortly.
msg105455 - (view) Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-05-10 18:51
Note that issue8425a.diff leaves small_set.difference_update(large_dict) unoptimized:

$ ./python.exe -m timeit -s "s = {1}; l = dict.fromkeys(range(1000));" "s.difference_update(l)"
1000 loops, best of 3: 842 usec per loop
$ ./python.exe -m timeit -s "s = {1}; l = dict.fromkeys(range(1000));" "s.difference(l)"
100000 loops, best of 3: 13.2 usec per loop

It would be easy to add an extra type check, but I noticed that small_set.difference_update(large_dict.viewkeys()) is not optimized either:

$ ./python.exe -m timeit -s "s = {1}; l = dict.fromkeys(range(1000));" "s.difference(l.viewkeys())"
1000 loops, best of 3: 842 usec per loop

It would be nice if there was a uniform C-API that would allow to uniformly check that an object supports O(1) lookup and invoke such lookup efficiently.  I don't think such C-API exists at the moment.

I would like to hear what others have to say before adding optimizations  to the patch that go beyond the scope of this issue.
msg105492 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-11 09:31
See also issue 8685.
msg122954 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-30 23:24
Deferring to 3.3.
msg122957 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-30 23:48
Any new logic should make maximum use of existing tools:

def __isub__(self, other)
    if len(other) > len(self)*8:
        other = self & other
    . . .
    # rest of isub unchanged
History
Date User Action Args
2010-11-30 23:48:30rhettingersetmessages: + msg122957
stage: patch review -> needs patch
2010-11-30 23:24:48rhettingersetpriority: normal -> low

messages: + msg122954
versions: + Python 3.3, - Python 2.7, Python 3.2
2010-09-01 21:30:18stutzbachsetnosy: + stutzbach
2010-05-11 09:31:01mark.dickinsonsetnosy: + mark.dickinson
messages: + msg105492
2010-05-10 18:51:22belopolskysetmessages: + msg105455
2010-05-10 18:27:55belopolskysetfiles: + issue8425-with-tests.diff
2010-05-10 18:13:06belopolskysetfiles: + issue8425a.diff

messages: + msg105452
2010-05-10 18:07:41r.david.murraysetnosy: + r.david.murray
messages: + msg105451
2010-05-10 17:46:28belopolskysetstage: needs patch -> patch review
2010-05-10 17:44:36belopolskysetfiles: + issue8425.diff
keywords: + patch
messages: + msg105450
2010-05-09 19:15:23belopolskysetnosy: + belopolsky
2010-04-19 05:14:13rhettingersetstage: test needed -> needs patch
2010-04-19 04:20:04ezio.melottisetpriority: normal

nosy: + ezio.melotti
messages: + msg103552

stage: patch review -> test needed
2010-04-18 05:45:09rhettingersetkeywords: + easy
stage: patch review
versions: + Python 2.7, Python 3.2, - Python 3.3
2010-04-16 21:45:39eric.smithsetnosy: + eric.smith
2010-04-16 19:07:45rhettingersetassignee: rhettinger

nosy: + rhettinger
versions: + Python 3.3, - Python 3.1
2010-04-16 17:19:41abacabadabacabacreate