Author rhettinger
Recipients methane, rhettinger, serhiy.storchaka, tim.peters
Date 2017-02-08.14:52:35
Content
> How this change affects this case?

I would expect to see a tiny net benefit.  The set.add() would be microscopically faster (because it skips the work of finding a free slot and looping back to it).  The set.discard() would do the same work (locating a value and marking it as a dummy).  The resize step would run at the same speed (it just skips over dummies and reinserts only the active values).  The resize would run a little earlier (because we're no longer reusing a small proportion of the dummy entries), but it would clean out 100% of the dummies, making the table sparser sooner.
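
To make that concrete, here is a tiny pure-Python model of an open-addressed table with dummy markers that mirrors the behavior described above.  The names (ModelSet, EMPTY, DUMMY) and the linear probing are illustrative only, not the real setobject.c internals.

    EMPTY = object()   # never-used slot
    DUMMY = object()   # slot left behind by a discard

    class ModelSet:
        def __init__(self, size=8):
            self.table = [EMPTY] * size
            self.used = 0          # active entries
            self.fill = 0          # active entries + dummies

        def _slots(self, value):
            # Linear probing keeps the model simple; CPython perturbs the hash.
            i = hash(value) % len(self.table)
            while True:
                yield i
                i = (i + 1) % len(self.table)

        def add(self, value):
            for i in self._slots(value):
                slot = self.table[i]
                if slot is EMPTY:             # no freeslot bookkeeping:
                    self.table[i] = value     # the first empty slot wins
                    self.used += 1
                    self.fill += 1
                    if self.fill * 3 >= len(self.table) * 2:
                        self._resize()        # dummies get purged here instead
                    return
                if slot is not DUMMY and slot == value:
                    return                    # already present

        def discard(self, value):
            for i in self._slots(value):
                slot = self.table[i]
                if slot is EMPTY:
                    return
                if slot is not DUMMY and slot == value:
                    self.table[i] = DUMMY     # same work under either scheme
                    self.used -= 1
                    return

        def _resize(self):
            old, self.table = self.table, [EMPTY] * (len(self.table) * 4)
            self.used = self.fill = 0
            for slot in old:                  # skips 100% of the dummies and
                if slot is not EMPTY and slot is not DUMMY:
                    self.add(slot)            # reinserts only the active values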

> Sets often are used in following pattern

Not really.  This is one use pattern out of many and is one of the least common.  The two dominant use cases for sets are uniquification (deduping) and fast membership testing.  The third most common case is data analysis using fast set-to-set operations (union, intersection, and difference).  Then come cases with an individual element membership test followed by an individual set.add() (i.e. the "if x not in seen: {seen.add(x); work_on(x);}" case).  Dead last are the affected cases that bounce around with a mix of set.add() and set.discard() or set.remove().
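
To spell those patterns out (illustrative snippets only, with a stand-in for the real work):

    data = ['a', 'b', 'a', 'c', 'b']

    unique = set(data)                 # 1. uniquification / deduping
    found = 'a' in unique              # 2. fast membership testing

    evens, odds = {2, 4, 6}, {1, 3, 5}
    both = evens | odds                # 3. set-to-set operations

    seen, results = set(), []
    for x in data:
        if x not in seen:              # 4. membership test, then add
            seen.add(x)
            results.append(x)          #    (stand-in for work_on(x))

    pending = set()
    pending.add('job')                 # 5. the affected mix of adds and
    pending.discard('job')             #    discards -- the least common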

The comment in dictobject.c suggests that the latter case is uncommon by a factor of hundreds.  Hence, I think work should be taken out of the inner loop for the common cases and instead deferred to the high-speed resize step which cleans out 100% of the dummy entries all at once.

The common cases all benefit (none of those have dummies, so there is no reason at all to check for dummies in set_add_entry).   The uncommon case (the mix of individual adds and discards) is about neutral or slightly positive (as explained above).  

I've tried to think of a case that would be negatively affected, and all I can think of is repeatedly adding and removing exactly the *same* element or small group of elements.  In that case, the freeslot would be reused 100% of the time and the table would never need a resize.  I've not seen such a case, and if I had, I would still care more about the common cases.
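
Roughly, that worst case looks like the following (built-in set shown; the comments note what each scheme would do underneath):

    s = set(range(1000))          # arbitrary fill so the table is non-trivial
    for _ in range(10_000):
        s.add('sentinel')         # freeslot reuse: lands back in the same slot
        s.discard('sentinel')     # no reuse: leaves one more dummy behind,
                                  # nudging fill toward an earlier resize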

Also, I like the simplification of set_add_entry() and the separation of concerns (dummy reclamation occurs in exactly one place and that one place eliminates 100% of the dummies in a single pass).
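
To make the simplification concrete, here is a rough Python-level sketch of the probe loop with and without the freeslot bookkeeping.  The EMPTY/DUMMY sentinels and linear probing are simplifications of the actual C code, and both versions assume the table always keeps some empty slots (which the resize policy guarantees).

    EMPTY, DUMMY = object(), object()

    def probe_with_freeslot(table, value):
        # Current approach: remember the first dummy seen and reuse it.
        freeslot = None
        i = hash(value) % len(table)
        while True:
            slot = table[i]
            if slot is EMPTY:
                return i if freeslot is None else freeslot  # extra branch + state
            if slot is DUMMY:
                if freeslot is None:
                    freeslot = i                             # extra bookkeeping
            elif slot == value:
                return i
            i = (i + 1) % len(table)

    def probe_without_freeslot(table, value):
        # Proposed approach: dummies are simply skipped; resize reclaims them.
        i = hash(value) % len(table)
        while True:
            slot = table[i]
            if slot is EMPTY:
                return i                                     # leaner hot loop
            if slot is not DUMMY and slot == value:
                return i
            i = (i + 1) % len(table)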

FWIW, there is a loose analogy between this idea and the trade-off between reference counting and GC.  Reference counting reuses memory quicker than waiting for GC to reclaim memory in one pass later, but it entails encumbering all of the setting and deleting code.  GC-only systems make the rest of the code much cleaner and faster, but they have to wait to reclaim memory all at once.  Where the analogy fails, though, is that reuse of dummies in sets is by far the least common case, that early freeslot checks only recapture a small fraction of the deletions (only the ones that happen to land on exactly the same hash slot), and that early freeslot checks are completely wasted in all of the common cases (which typically have no dummies at all).