classification
Title: Faster set copying
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: haypo, pitrou, python-dev, rhettinger, serhiy.storchaka
Priority: low Keywords: patch

Created on 2015-01-21 13:50 by serhiy.storchaka, last changed 2015-05-13 08:46 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
set_faster_copy.patch serhiy.storchaka, 2015-01-21 13:51 review
set_faster_copy_2.patch serhiy.storchaka, 2015-01-21 14:31 review
set_faster_copy_3.patch serhiy.storchaka, 2015-01-27 08:06 review
set_faster_copy_4.patch serhiy.storchaka, 2015-03-27 07:37 review
set_faster_copy_5.patch serhiy.storchaka, 2015-05-12 19:57 review
set_faster_copy_6.diff rhettinger, 2015-05-12 22:59 review
Messages (20)
msg234437 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-21 13:50
Proposed patch makes faster creating new set from other set. The benefit is only few percents when there are no hash conflicts, but can be significant if there are hash duplicates.

$ ./python -m timeit -s "s = set(range(10**4))" -- "frozenset(s)"
Unpatched: 1000 loops, best of 3: 658 usec per loop
Patched: 1000 loops, best of 3: 620 usec per loop

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**3) for j in range(10)}" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 6.72 msec per loop
Patched: 100 loops, best of 3: 2.05 msec per loop

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**2) for j in range(10**2)}" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 14 msec per loop
Patched: 100 loops, best of 3: 2.67 msec per loop

It should also affect set.copy and operations which makes implicit copy (most set operators). The effect should be larger for objects with slow equality operator.

set_find_free_slot() can be used to prevent a catastrophic linear pile-up in issue23259.
msg234438 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-21 14:31
Here is even faster patch. When there are no dummies in source set we can just dump a table, placing entries at the same indices.

$ ./python -m timeit -s "s = set(range(10**4))" -- "frozenset(s)"
Unpatched: 1000 loops, best of 3: 658 usec per loop
Patched: 1000 loops, best of 3: 631 usec per loop

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**3) for j in range(10)}" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 6.72 msec per loop
Patched: 1000 loops, best of 3: 930 usec per loop

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**2) for j in range(10**2)}" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 14 msec per loop
Patched: 1000 loops, best of 3: 1.12 msec per loop

To test other branch we should add dummy entry: s.add(-1); s.discard(-1).

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**3) for j in range(10)}; s.add(-1); s.discard(-1)" -- "frozenset(s)"
Unpatched: 1000 loops, best of 3: 661 usec per loop
Patched: 1000 loops, best of 3: 643 usec per loop

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**3) for j in range(10)}; s.add(-1); s.discard(-1)" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 6.8 msec per loop
Patched: 100 loops, best of 3: 2.1 msec per loop

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**2) for j in range(10**2)}; s.add(-1); s.discard(-1)" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 14 msec per loop
Patched: 100 loops, best of 3: 2.71 msec per loop
msg234680 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-01-25 18:38
For the time being (while I working on other parts of sets), I'm holding-off an anything that adds more lookkey variants.   When the neatening-up and tweaking of search routines comes to a close, I'll revisit this one because it looks like there is a some low hanging fruit here.
msg234811 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-27 08:06
Actually set_find_free_slot() is not needed because there is set_insert_clean(). Results with using set_insert_clean():

$ ./python -m timeit -s "s = set(range(10**4))" -- "frozenset(s)"
Unpatched: 1000 loops, best of 3: 700 usec per loop
Patched:   1000 loops, best of 3: 570 usec per loop
Speed up: 23%

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**4//2) for j in range(2)}" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 2.2 msec per loop
Patched:   1000 loops, best of 3: 765 usec per loop
Speed up: 188%

$ ./python -m timeit -s "s = set(range(10**4)); s.add(-1); s.discard(-1)" -- "frozenset(s)"
Unpatched: 1000 loops, best of 3: 700 usec per loop
Patched:   1000 loops, best of 3: 605 usec per loop
Speed up: 16%

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**4//2) for j in range(2)}; s.add(-1); s.discard(-1)" -- "frozenset(s)"
Unpatched: 100 loops, best of 3: 2.2 msec per loop
Patched:   1000 loops, best of 3: 1.06 msec per loop
Speed up: 108%

Note that unpatched version is 6% slower now than before, perhaps due to issue23119.
msg236943 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-01 08:33
Ping. The last patch doesn't add more lookkey variants, it uses existing function.
msg236947 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-03-01 08:47
I still intend to apply some variant of your patch.  Am still looking at the options.
msg239367 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2015-03-27 00:15
Is msg234811 the result of set_faster_copy_3.patch? If yes, I see no reason to not apply the patch: the patch is short and looks always fast, and sometimes *much* faster.

I just leaved a small comment on the review.
msg239376 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-27 07:37
Updated patch addresses Victor's comment. Thank you for your review Victor.

> Is msg234811 the result of set_faster_copy_3.patch?

Yes, it is. There are tests for sets with small and large number of hash conflicts, with and without dummy objects. The patch affects many operations that implicitly makes a copy, for example set1 | set2, set1 & set2, set1 - set2.
msg239378 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2015-03-27 07:48
New patch looks good to me. Great job on speedup. I didn't expect that it's
still possible to enhance set.

By the way, is it possible to report latest enhancement to the dict type? I
don't remember if tou tried that richard?
msg239381 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-27 08:41
> I didn't expect that it's still possible to enhance set.

Raymond made a lot of set optimizations recently.

> By the way, is it possible to report latest enhancement to the dict type? I
> don't remember if tou tried that richard?

May be when Raymond port his optimizations to dict (issue18898).
msg242955 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-12 10:30
Could you please pay a little attention to this issue Raymond?
msg242983 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-12 18:02
I don't like the patch as-is (comment change is wrong, test in the inner-loop making the common case pay for a short-cut for the uncommon case).  As I said before, the basic idea is good and I will very likely include some variant of this for Python3.5.

I marked this as low priority so I can work on other things (such as deque slicing) before the feature freeze and will come back to this after beta 1.  Please remain calm.  My development time is limited but this is something I do want to get done.
msg242985 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-12 18:37
Doesn't the feature freeze start from beta 1?
msg242988 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-12 18:55
Optimizations aren't new features.  We can still tweak the implementation through-out the betas.
msg242993 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-12 19:13
I looked at the patch again and it is in pretty good shape.

Please hoist the conditionals out of the loop (for intelligibility and to let the compiler in-line more effectively).  Also, let's remove the "dump" and "clear" variable names in favor of comments that explain the conditionals (to introducing new terminology to the module). 

If you want, I'll take a crack at it in the next couple of days.
msg242995 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-12 19:18
Le 12/05/2015 20:55, Raymond Hettinger a écrit :
> 
> Optimizations aren't new features.  We can still tweak the implementation through-out the betas.

Not really. The period after the first beta is for fixing bugs, not for
risking introducing new bugs.
msg243001 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-12 19:57
Here is the patch with hoisted the conditionals out of the loop.

New microbenchmarking results:

$ ./python -m timeit -s "s = set(range(10**4))" -- "frozenset(s)"
Unpatched:     1000 loops, best of 3: 407 usec per loop
With patch #4: 1000 loops, best of 3: 325 usec per loop  (speed up 25%)
With patch #5: 1000 loops, best of 3: 272 usec per loop  (speed up 50%)

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**4//2) for j in range(2)}" -- "frozenset(s)"
Unpatched:     1000 loops, best of 3: 995 usec per loop
With patch #4: 1000 loops, best of 3: 447 usec per loop  (speed up 123%)
With patch #5: 1000 loops, best of 3: 417 usec per loop  (speed up 139%)

$ ./python -m timeit -s "s = set(range(10**4)); s.add(-1); s.discard(-1)" -- "frozenset(s)"
Unpatched:     1000 loops, best of 3: 411 usec per loop
With patch #4: 1000 loops, best of 3: 355 usec per loop  (speed up 16%)
With patch #5: 1000 loops, best of 3: 406 usec per loop  (equal)

$ ./python -m timeit -s "s = {i+(j<<64) for i in range(10**4//2) for j in range(2)}; s.add(-1); s.discard(-1)" -- "frozenset(s)"
Unpatched:     1000 loops, best of 3: 1.01 msec per loop
With patch #4: 1000 loops, best of 3: 572 usec per loop  (speed up 77%)
With patch #5: 1000 loops, best of 3: 609 usec per loop  (speed up 66%)
msg243028 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-12 22:55
Attaching a variant with several fix-ups (mostly minor):

* Changed the order of the three sections to go from most-restricted-most-optimized to the general-fall-through case.  The downside is that we test so->fill==0 twice.  The upside is that it corresponds to my way of thinking about the logic.

* Put the fill/used increments in the same order as the rest of the file.

* Loop over other_entry++ instead of using indexing.  This corresponds to my way of thinking about the entries and gives the compiler a stronger hint that it can avoid the indexing overhead.

* Removed the unnecessary dummy check from the direct_pointer_copy case.

* Switch the order of the same size and no dummies tests in the insert_clean case.

* Revise the comments to be clearer about the requirements for each case.

* Move the sometimes unnecessary hash variable assignments inside the conditional.
msg243049 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-13 07:07
Except one error (the declaration after the code), the patch LGTM. It is shame to me that I left so much non-optimized code in specialized loops.

The result of my microbenchmarks is the speed up 51%, 153%, 23% and 96% respectively.
msg243056 - (view) Author: Roundup Robot (python-dev) Date: 2015-05-13 08:26
New changeset 79c884cc9625 by Raymond Hettinger in branch 'default':
Issue #23290:  Optimize set_merge() for cases where the target is empty.
https://hg.python.org/cpython/rev/79c884cc9625
History
Date User Action Args
2015-05-13 08:46:16rhettingersetstatus: open -> closed
resolution: later -> fixed
stage: patch review -> resolved
2015-05-13 08:26:22python-devsetnosy: + python-dev
messages: + msg243056
2015-05-13 07:07:07serhiy.storchakasetmessages: + msg243049
2015-05-12 22:59:24rhettingersetfiles: + set_faster_copy_6.diff
2015-05-12 22:58:15rhettingersetfiles: - set_faster_copy_6.diff
2015-05-12 22:55:52rhettingersetfiles: + set_faster_copy_6.diff

messages: + msg243028
2015-05-12 19:57:53serhiy.storchakasetfiles: + set_faster_copy_5.patch

messages: + msg243001
2015-05-12 19:18:15pitrousetmessages: + msg242995
2015-05-12 19:13:53rhettingersetmessages: + msg242993
2015-05-12 18:55:02rhettingersetmessages: + msg242988
2015-05-12 18:37:04serhiy.storchakasetmessages: + msg242985
2015-05-12 18:02:07rhettingersetmessages: + msg242983
2015-05-12 10:30:46serhiy.storchakasetmessages: + msg242955
2015-03-27 08:41:20serhiy.storchakasetmessages: + msg239381
2015-03-27 07:48:47hayposetmessages: + msg239378
2015-03-27 07:37:46serhiy.storchakasetfiles: + set_faster_copy_4.patch

messages: + msg239376
2015-03-27 00:15:37hayposetnosy: + haypo
messages: + msg239367
2015-03-01 08:47:47rhettingersetpriority: normal -> low

messages: + msg236947
2015-03-01 08:33:44serhiy.storchakasetmessages: + msg236943
2015-01-27 08:06:16serhiy.storchakasetfiles: + set_faster_copy_3.patch

messages: + msg234811
2015-01-25 18:38:19rhettingersetresolution: later
messages: + msg234680
2015-01-23 01:41:41rhettingersetassignee: rhettinger
2015-01-21 14:31:09serhiy.storchakasetfiles: + set_faster_copy_2.patch

messages: + msg234438
2015-01-21 13:51:16serhiy.storchakasetfiles: + set_faster_copy.patch
nosy: + rhettinger, pitrou
keywords: + patch
2015-01-21 13:50:08serhiy.storchakacreate