Title: Simplify set_add_entry()
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.7
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: methane, rhettinger, serhiy.storchaka, tim.peters, vstinner
Priority: low Keywords: patch

Created on 2017-02-08 03:24 by rhettinger, last changed 2018-01-14 18:20 by rhettinger. This issue is now closed.

File name Uploaded Description
set_no_dummy_reuse.diff rhettinger, 2017-02-08 03:24 1st draft -- Simplification of set_add_entry()
inner_loop_analysis.s rhettinger, 2018-01-13 20:59 Analysis of the code generated by CLang
Pull Requests
URL Status Linked
PR 5175 merged rhettinger, 2018-01-13 20:15
Messages (13)
msg287272 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-08 03:24
Dummy entries created by deletions are currently handled in two places.  The code in set_add_entry() tracks a freeslot to allow dummy slots to be reused when new elements are inserted. The dummies are also eliminated during resizing.

The freeslot checks are made in the inner-loop of set_add_entry() and will only eliminate a few of the dummies (ones that happen to be in a slot that we want to use again).  In contrast, the resize routine eliminates 100% of the dummy entries -- this is where most dummies get reclaimed.

This is a preliminary draft of an idea to only use resizing for dummy reclamation.  It proposes eliminating the freeslot reuse step in set_add_entry().  One advantage is that the patch nicely simplifies the code in set_add_entry and makes its inner-loop tighter (eliminating one memory access out of the two on each iteration).  With the advent of linear probing and the 60% set load factor, set resizing has become very cheap.

One issue with the current design is that while a few dummies get reclaimed early, others are left alive longer, causing more collisions during lookups.  Reusing a few dummies also defers the resizing step that would actually clear out all of the dummies.

Another thought is that the proposal gives a better separation of concerns.  Insertion focuses on adding entries without making inner loop checks for an uncommon case (the relevant note in dictobject.c says this is uncommon by a factor of hundreds).  The patch leaves the dummy cleanup to the fast, efficient resizing step.
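To make the two strategies concrete, here is a minimal toy model in Python of the insert-slot search (the names DUMMY, probe_with_freeslot, and probe_simplified are illustrative only -- CPython's actual set_add_entry() is C code with perturbed probing plus a LINEAR_PROBES scan, and it also checks for the key being already present):

```python
DUMMY = "<dummy>"  # sentinel for a deleted slot (illustrative, not CPython's)

def probe_with_freeslot(table, mask, h):
    """Current strategy: remember the first dummy seen so that it can
    be reused, at the cost of an extra check on every iteration."""
    i = h & mask
    freeslot = None
    while table[i] is not None:
        if table[i] is DUMMY and freeslot is None:
            freeslot = i                  # extra inner-loop bookkeeping
        i = (i + 1) & mask                # simplified linear probing
    return i if freeslot is None else freeslot

def probe_simplified(table, mask, h):
    """Proposed strategy: skip dummies without tracking them; they are
    reclaimed all at once by the next resize."""
    i = h & mask
    while table[i] is not None:           # dummies are just skipped
        i = (i + 1) & mask
    return i
```

With a dummy sitting in the natural slot, the first version reuses it immediately while the second probes past it to the next truly empty slot, leaving reclamation to the resize step.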

Some areas for exploration:

* We know that the benefit to set_add_entry() will reduce the cost of building a set (which is the common case).  It simply does less work by using less code overall and by having less code in the inner loop (see the disassembly below (1) to see that we're now down to one memory access per iteration).  That said, it would be nice to quantify whether this is a small improvement or a big one (I expect the answer will be different for CLang vs GCC vs MSC, for 32-bit builds versus 64-bit builds, and for ARM versus x86, etc).

* It would also be nice to see whether there is an observable small benefit or small detriment in the uncommon case of building a set and then cycling back and forth between removing elements and adding new ones.

In any case, the simplification of the code in set_add_entry() will be nice and it would be good to have a separation of concerns with only set_resize being responsible for clearing out dummies in a single high-speed pass.

(1) Inner-loop for the new set_add_entry() featuring only one memory access per iteration:

        cmpq    %r9, %rbx       ; entry < start + LINEAR_PROBES
        je  L406                
        addq    $16, %rbx       ; entry++ 
        movq    8(%rbx), %rax   ; load entry->hash from memory
        testq   %rax, %rax      ; entry->hash == 0
        jne L382                ; until zero found skip next test  
        cmpq    $0, (%rbx)      ; entry->key == NULL
        je  L374                ; return entry
        cmpq    %rax, %r15      ; entry->hash == hash
        jne L383                ; loop back
    <... process to equality test >
msg287278 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-02-08 05:12
I like this idea.
msg287290 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-08 09:14
Sets are often used in the following pattern:

    def recurse(obj):
        for subobj in links(obj):
            if subobj not in proceeding:
                proceeding.add(subobj)
                recurse(subobj)
                proceeding.discard(subobj)

In this case items are added and removed in LIFO order. How does this change affect this case?
msg287333 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-08 14:52
> How does this change affect this case?

I would expect to see a tiny net benefit.  The set.add() would be microscopically faster (because of less work finding and going back to a free slot).  The set.discard() would do the same work (locating a value and marking it as a dummy).  The resize step would run at the same speed (it just skips over dummies and reinserts only active values).  The resize would run a little earlier (because we're not reusing a small proportion of the dummy entries) but it would clean out 100% of the dummies, making the table more sparse sooner.

> Sets are often used in the following pattern

Not really.  This is one use pattern out of many and is one of the least common.  The two dominant use cases for sets are uniquification (deduping) and fast membership testing.  The third most common case is data analysis using fast set-to-set operations (union, intersection, and difference).  Then come cases with individual element membership tests followed by an individual element set.add (i.e. the "if x not in seen: {seen.add(x); work_on(x);}" case).  Dead last are the affected cases that bounce around with a mix of set.add() and set.discard() or set.remove().

The comment in dictobject.c suggests that the latter case is uncommon by a factor of hundreds.  Hence, I think work should be taken out of the inner loop for the common cases and instead deferred to the high-speed resize step which cleans out 100% of the dummy entries all at once.

The common cases all benefit (none of those have dummies, so there is no reason at all to check for dummies in set_add_entry).   The uncommon case (the mix of individual adds and discards) is about neutral or slightly positive (as explained above).  

I've tried to think of a case that would be negatively affected and all I can think of is repeatedly adding and removing exactly the *same* element or small group of elements.  In that case, the freeslot would be reused 100% of the time and the table would never need a resize.  I've not seen such a case, and if I had, I would still care more about the common cases.

Also, I like the simplification of set_add_entry() and the separation of concerns (dummy reclamation occurs in exactly one place and that one place eliminates 100% of the dummies in a single pass).

FWIW, there is a loose analogy between this idea and the trade-off between reference counting and GC.  Reference counting reuses memory quicker than waiting for GC to reclaim memory in one pass later, but it entails encumbering all of the setting and deleting code.  GC-only systems make the rest of the code much cleaner and faster, but they have to wait to reclaim memory all at once.  Where the analogy fails, though, is that reuse of dummies in sets is by far the least common case, that early freeslot checks only recapture a small fraction of the deletions (only the ones that happen to land on exactly the same hash slot), and that early freeslot checks are completely wasted in all of the common cases (which typically have no dummies at all).
msg287345 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-08 16:03
For the worst case the drawback is significant:

$ ./python -m perf timeit -s "s = set('a%s' % i for i in range(100))" -- "s.add('test'); s.discard('test')"
Unpatched:  Median +- std dev: 861 ns +- 82 ns
Patched:    Median +- std dev: 2.81 us +- 0.18 us

How large is the benefit in the best case? I can't get any significant difference.

$ ./python -m perf timeit -s "a = ['a%s' % i for i in range(1000)]" -- "set(a)"
Unpatched:  Median +- std dev: 130 us +- 6 us
Patched:    Median +- std dev: 127 us +- 8 us
msg287346 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-08 16:04
Serhiy, can you post your recurse() code with some sample data so that we can run some timings and get a sense of the effect on the extreme case?
msg287361 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-08 17:57
I don't have good real-world example. Actually dict is used more often than set in such cases.

I agree with your arguments Raymond, in particular with your ranking of cases.  But even if the bad case is hundreds of times rarer than the good case, the total effect can be negative if the negative effect of the bad case is hundreds of times larger than the positive effect of the good case.  I checked the timing just to be sure and was unpleasantly surprised that the negative effect is so large.
msg287364 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-02-08 21:06
I think it violates O(1).

A "s.add(x); s.discard(x)" loop creates a dummy chain.
Average chain length is determined by the size of the set, so the cost is O(len(s)) rather than amortized O(1).
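A toy simulation makes this concrete (a fixed-size table with simple linear probing; the names add_no_reuse, DUMMY, and SIZE are made up for illustration and the resize that would normally cap the growth is deliberately omitted):

```python
DUMMY = "<dummy>"   # sentinel for a deleted slot
SIZE = 64           # fixed table size -- no resize, to isolate the effect

def add_no_reuse(table, h):
    """Insert at the first truly empty slot, skipping dummies;
    returns (number of slots probed, index used)."""
    i, probes = h % SIZE, 1
    while table[i] is not None:
        i, probes = (i + 1) % SIZE, probes + 1
    table[i] = h
    return probes, i

table = [None] * SIZE
costs = []
for cycle in range(10):
    probes, i = add_no_reuse(table, 7)   # s.add(x)
    costs.append(probes)
    table[i] = DUMMY                     # s.discard(x) leaves a dummy
print(costs)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] -- linear growth, not O(1)
```

Each cycle leaves one more dummy in x's probe chain, so the next add probes one step further.  In CPython the chain would eventually be capped by a resize, which clears every dummy at once, so the real question is whether that amortization is acceptable.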
msg287400 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-09 10:38
I vote -1 on set_no_dummy_reuse.diff since it doesn't show any significant speedup (3 ns on 130 ns with a std dev of 8 ns is not something significant), and makes the set type 2x slower on some corner cases.
msg287434 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-09 16:22
The problem with posting an idea here on the tracker while it is still in the research phase is that it gets prematurely downvoted without having been fully thought through.

What I'm working on now is the opposite question.  Was it ever worthwhile to add this optimization in the first place?  In particular, does it strongly favor a single case over normal cases for no real benefit?

Suppose we found a line of code in the calendar module that tested for one special case: is it March 3, 1978?  If so, it used a precomputed result, but for all other dates it computed normally.  The right thing to do would be to remove that code, but then one commenter would say, "it makes that one case many times slower" when in fact it brings that case into line with the other normal cases.  And another respondent would say "removing the almost never used special case provides too small an improvement to the rest of the cases".  The group would then arrive at the collective incorrect conclusion that yes, it is a great idea to keep a special case for March 3, 1978.

The test bench I'm working on now is examining whether a repeated add/discard of the same element in an otherwise mostly full table is like the calendar module scenario described above.  In particular, I want to know whether applying the patch makes the new performance for that case about the same as other typical cases of add/discard.

Here are some thoughts on timings.  How you time, what you time, and where you time it matter a great deal when evaluating whether to keep code that is deep in the set insertion logic.  In a complex function that already has register spills, different compilers may produce very different results (i.e. eliminating the freeslot lookup may allow another important variable to use a register rather than spilling and reloading on every iteration).  We can also expect different results on 32-bit vs 64-bit builds (the former having many fewer registers to work with) and on ARM vs x86 (the latter having greater instruction-level parallelism that lets you get away with blocks of useless code running in parallel with the important code).

The test dataset matters as well.  If set insertion timings are dominated by hashing or equality tests, then all improvements to the probing logic will look insignificant.  Likewise, if the dataset has few hash collisions, then the affected code doesn't get run at all, leading to the false conclusion that simplifying the code doesn't have any benefit.
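One way to build a dataset that actually exercises the probing logic is to force hash collisions with a custom __hash__ (the class names Collider and Spread below are made up for illustration; a real benchmark would use pyperf rather than timeit):

```python
import timeit

class Collider:
    """Every instance hashes to the same value, so inserting n of them
    forces the probe/compare loop on nearly every insertion."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42          # all keys land in the same probe chain
    def __eq__(self, other):
        return self.n == other.n

class Spread:
    """Distinct hashes: few collisions, probe loop rarely taken."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return self.n
    def __eq__(self, other):
        return self.n == other.n

colliders = [Collider(i) for i in range(500)]
spread = [Spread(i) for i in range(500)]

# Building the colliding set is quadratic in comparisons; contrasting
# the two builds isolates the probe-loop cost from hashing itself.
t_collide = timeit.timeit(lambda: set(colliders), number=50)
t_spread = timeit.timeit(lambda: set(spread), number=50)
print(t_collide, t_spread)
```

Timing only the Spread-style build (as in the "set(a)" benchmark above) mostly measures hashing, which is why an inner-loop change can disappear into the noise there.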

For a timing that properly isolates and exercises the affected code, we should expect some measurable (and likely non-trivial) benefit.  The GCC-6 x86_64 disassembly gives us reason to expect a favorable result because the new inner loop has a third fewer instructions, two fewer compare/jumps, and half the number of memory accesses (because the freeslot is being reloaded from the stack on every iteration).

Also, there hasn't been any exploration of the potential indirect benefits in cases that I expect to be far more common.  In those cases where we're going to have to resize anyway, is it better to do work for partial early reclamation of dummies and defer resizing, or is it better to avoid the work for early reclamation so we can resize sooner and eliminate all of the dummies?  I don't know the answer to this question yet, but it is far more important and common than a contrived case of repeatedly adding and discarding the exact same element while never triggering a resize.

On the other hand, perhaps the early dummy reclamation is worth it.  I don't know the answer yet and was hoping that you all would help me find out.  Downvoting on the basis of two quick-and-dirty timings on a single machine, single compiler, and single dataset isn't helpful.

I've already put a lot of time into looking at these issues and still don't know the right thing to do.  So, if other commenters have only put a few minutes into thinking about it and have already drawn a conclusion, then their conclusions are suspect.

IMO, there are still open research questions:  Is early freeslot reclamation worth it, or does it have a net negative effect in all but the most exotic and unlikely cases?  Without the existing optimization, would the exotic case perform about the same as other cases, or is it catastrophically worse, so that it warrants special handling even to the detriment of the other cases?
msg287457 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-09 21:16
I don't downvote, no.  I am just unsure.  I don't have enough information to say that the net benefit is positive, nor that it is negative.  In the face of hesitation, the status quo wins.

But it looks to me that in the dominant cases set_add_entry() is used with a set that doesn't contain dummies.  What do you think about specializing set_add_entry() for this case?  set_add_entry() could be optimized for the most common cases (creation and set-to-set operations) while still not degrading in the worst cases.
msg309897 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-13 20:14
What I've decided for now is to keep the freeslot reuse logic, but in a simpler form.  Currently there are two optimizations for rare cases.  The first is reusing a dummy entry -- we'll keep that.  The second is detecting multiple dummies in a lookup chain and preferring the first in the chain -- I'm going to get rid of this; I haven't found any plausible case where it demonstrates value in excess of its cost.

For it to trigger, the set has to have had a growth phase, then a number of subsequent deletions, and then later additions that collide with multiple dummy entries, all without any intervening resizes.  Modern sets have a lower fill factor, making this situation even less likely.  Also, modern sets use a series of linear probes, making collisions much cheaper.  The benefit is so small and so rare that the cost of the inner-loop test may now exceed its potential benefit.  Given that the optimization is of questionable worth, I chose to remove it, preferring the simpler inner-loop code.
msg309932 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-14 18:20
New changeset 3329992e31bd0494a7d7312853f7ffd054737e27 by Raymond Hettinger in branch 'master':
bpo-29476: Simplify set_add_entry() (#5175)
Date User Action Args
2018-01-14 18:20:36  rhettinger  set  status: open -> closed; resolution: fixed; stage: patch review -> resolved
2018-01-14 18:20:15  rhettinger  set  messages: + msg309932
2018-01-13 20:59:26  rhettinger  set  files: + inner_loop_analysis.s
2018-01-13 20:15:54  rhettinger  set  stage: patch review; pull_requests: + pull_request5027
2018-01-13 20:14:43  rhettinger  set  messages: + msg309897
2017-02-09 21:16:50  serhiy.storchaka  set  messages: + msg287457
2017-02-09 16:22:07  rhettinger  set  priority: normal -> low; messages: + msg287434
2017-02-09 10:38:14  vstinner  set  messages: + msg287400
2017-02-09 10:30:22  vstinner  set  nosy: + vstinner
2017-02-08 21:06:22  methane  set  messages: + msg287364
2017-02-08 17:57:26  serhiy.storchaka  set  messages: + msg287361
2017-02-08 16:04:57  rhettinger  set  messages: + msg287346
2017-02-08 16:03:26  serhiy.storchaka  set  messages: + msg287345
2017-02-08 14:52:36  rhettinger  set  nosy: + tim.peters; messages: + msg287333
2017-02-08 09:14:00  serhiy.storchaka  set  messages: + msg287290
2017-02-08 05:12:44  methane  set  nosy: + methane; messages: + msg287278
2017-02-08 03:24:34  rhettinger  create