
Author rhettinger
Recipients Dennis Sweeney, eamartin, rhettinger, zach.ware
Date 2021-02-12.03:04:30
Content
Addenda [reposted to fix a premature submission].

Top use cases for sets (roughly in order of commonality; each pattern is sketched in code after the list):

1. Eliminate duplicates from an iterable input and loop over the result.
2. Store elements in a set just once but do many membership tests.
3. Perform data analysis on multiple sets using union, intersection, difference, and isdisjoint.
4. Remove elements from a set, one at a time, until it is empty.
5. Algorithms that alternately add and remove different elements over time (toposort, NFA, DFA, etc.).
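
For concreteness, here are minimal Python sketches of those five patterns (the data and names are illustrative, not from the original report):

    # 1. Eliminate duplicates from an iterable and loop over the result.
    for word in set(["red", "blue", "red", "green"]):
        print(word)

    # 2. Store elements once, then do many membership tests.
    stopwords = {"a", "an", "the"}
    content = [t for t in ["the", "cat", "sat"] if t not in stopwords]

    # 3. Data analysis on multiple sets.
    a, b = {1, 2, 3}, {2, 3, 4}
    union, intersection, difference = a | b, a & b, a - b
    separate = a.isdisjoint({9, 10})

    # 4. Remove elements one at a time until the set is empty.
    pending = {1, 2, 3}
    while pending:
        task = pending.pop()

    # 5. Alternately add and remove different elements (worklist style).
    frontier, seen = {0}, set()
    while frontier:
        node = frontier.pop()
        seen.add(node)
        for nbr in (node + 1, node + 2):
            if nbr < 10 and nbr not in seen:
                frontier.add(nbr)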

The first three are all penalized by an extra inner-loop test and by the loss of a register to track the freeslot.  Those use cases get no offsetting benefit.

Case 4 doesn't exercise the dummy reuse at all.  So, it is unaffected.

The last is approximately a net wash.  It pays the inner-loop price, suffers the loss of a register, and incurs an extra test outside the loop, but it occasionally gets lucky and reuses a freeslot.  The benefit of reuse is that it delays the inevitable resize which would have reclaimed the dummy entries earlier.  (The comparable use case for dicts is LRU/LFU caches, where new entries are added at the same rate as old entries are removed.  That use case also showed a net wash when freeslot tracking was removed from dicts.)
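
A minimal sketch of that steady-state churn (a hypothetical sliding-window workload, analogous to the LRU/LFU case mentioned above):

    # Additions and removals balance, so dummy entries accumulate
    # until a resize reclaims them; freeslot reuse only postpones
    # that resize slightly.
    window = set(range(1000))        # fixed-size working set
    for i in range(1000, 100_000):
        window.add(i)                # may land in a reused freeslot
        window.discard(i - 1000)     # removal leaves a dummy behind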

Not on the list at all is the case of a large set where exactly the same element is added and removed in a tight loop.  The payoff for this case is that the O(n) resize step never occurs; however, this case is so rare that there is no reason to prefer it over the common cases.
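
A sketch of that rare pattern, assuming a workload like the following:

    # One element churned in a tight loop on a large set.  With
    # freeslot reuse, the slot vacated by discard() is immediately
    # reused by the next add(), so dummies never accumulate and the
    # O(n) resize step never occurs.
    s = set(range(100_000))
    for _ in range(1_000_000):
        s.add(-1)
        s.discard(-1)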

If the addition and removal of the same element happens only a few times, with no other set updates, the performance is about the same as case 5.

However, if there are many such updates and the set is large (as in the OP's code sample), the add() operation becomes quadratic because the chain of dummy or used entries grows longer with each reinsertion.  (FWIW, dicts face the same issue, plus additional ones related to maintaining insertion order.)  The question is whether guarding against this unhappy case is worth burdening all of the normal use cases.  If this turns out to be a real issue in practice, we can reopen this; otherwise, let's stick with what we have.
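
A rough way to observe the behavior described above (this is the same churn pattern sketched earlier; absolute timings are hypothetical and depend on the CPython version and on whether freeslots are reused):

    import time

    def churn(n, reps):
        # Build a large set, then add/discard the same element repeatedly.
        s = set(range(n))
        start = time.perf_counter()
        for _ in range(reps):
            s.add(-1)
            s.discard(-1)
        return time.perf_counter() - start

    # Without freeslot reuse, each cycle leaves another dummy in the
    # probe chain for -1, so later add() calls walk an ever-longer
    # chain (until a resize reclaims the dummies).
    print(f"{churn(1_000_000, 200_000):.3f}s")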