Message 155948 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	rhettinger
Recipients	anacrolix, eric.araujo, gvanrossum, petri.lehtinen, rhettinger
Date	2012-03-15.20:05:27
SpamBayes Score	3.330669e-16
Marked as misclassified	No
Message-id	<1331841928.94.0.642956058392.issue14320@psf.upfronthosting.co.za>
In-reply-to

Content
Here's the results of the research so far: Guido gave it a -1: http://mail.python.org/pipermail/python-ideas/2012-March/014510.html Smalltalk does not return a boolean: http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#Set http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#HashedCollection add: newObject Add newObject to the set, if and only if the set doesn't already contain an occurrence of it. Don't fail if a duplicate is found. Answer anObject Java does return a boolean for Interface Set<E>>: http://docs.oracle.com/javase/6/docs/api/ boolean add(E e) Adds the specified element to this set if it is not already present (optional operation) The SETL programming language does not return a boolean for the "set plus one element operation". Instead, it returns the new set. http://setl.org/setl/doc/setl-lib.html#with Real-world examples of code using set.add in a way that benefits from the boolean result do not look promising. They match the pattern in the minimal example shown in the previous post, "if e in previously_seen.add(e): ..." This is problematic because it can easily be read as "take this branch if e has been previously seen". I consulted with the PyPy developers and found that PyPy is already eliminating double lookups in common cases (it recognizes that the two lookups refer to the same entry and constant folds them). It cannot do this in the general case because of a potential semantic change (i.e. __eq__ can change the set during the initial lookup). Looking at Jython, it appears that both sets and dicts are based on Java's concurrent mapping object which doesn't support a boolean result directly but allows the JIT to optimize away the double lookup depending on the JIT and on whether the keys are of all the same type. One other note, in general the Python language makes no guarantees about atomicity (different implementations are free to make different implementation choices). For example, people relying on dict.setdefault() to be atomic were making an incorrect assumption (until recently). Our standing recommendation is to use locks unless your willing to rely on an implementation detail. In the case of set.add(), the race condition is harmless since set.add() is idempotent. I've shown sample code to some other developers and they had mixed guesses about the polarity of s.add(e), whether it would return True or False if the e was already in the set. A mixed result is not good and implies that the proposal is error prone. Also, when shown the dedup() code example, the devs didn't feel that the additional conciseness was worth it (one said, "it just looks funny and causes me to do a double-take, but the original code could be read effortlessly"). My timings on the dedup() code showed a <2% speedup on an iterable of strings with 80% duplicates and no improvement on an iterable of strings with 5% duplicates. The improvement was mostly erased if a "seen_add = seen.add" bound method was used inside the loop. This indicates that the double lookup is cheap compared to the cost of the dotted lookup in "seen.add(e)". That being said, objects with a custom __hash__ would benefit from calling __hash__ only once. Given that the results above were uniformly negative (except for the Java library), I'll skip trying this out on my students and am rejecting the suggested change to the pure Python API. We appreciate your suggestion but it isn't appropriate for inclusion in the language. Some of those concerns may also apply to the suggested change to the C API

Here's the results of the research so far:

Guido gave it a -1:   http://mail.python.org/pipermail/python-ideas/2012-March/014510.html

Smalltalk does not return a boolean:    http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#Set
http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#HashedCollection
    add: newObject
    Add newObject to the set, if and only if the set doesn't already contain an occurrence of it. Don't fail if a duplicate is found. Answer anObject

Java does return a boolean for Interface Set<E>>: http://docs.oracle.com/javase/6/docs/api/
    boolean	add(E e)  Adds the specified element to this set if it is not already present (optional operation)

The SETL programming language does not return a boolean for the "set plus one element operation".  Instead, it returns the new set.
http://setl.org/setl/doc/setl-lib.html#with

Real-world examples of code using set.add in a way that benefits from the boolean result do not look promising.  They match the pattern in the minimal example shown in the previous post, "if e in previously_seen.add(e): ..."   This is problematic because it can easily be read as "take this branch if e has been previously seen".

I consulted with the PyPy developers and found that PyPy is already eliminating double lookups in common cases (it recognizes that the two lookups refer to the same entry and constant folds them).  It cannot do this in the general case because of a potential semantic change (i.e. __eq__ can change the set during the initial lookup).

Looking at Jython, it appears that both sets and dicts are based on Java's concurrent mapping object which doesn't support a boolean result directly but allows the JIT to optimize away the double lookup depending on the JIT and on whether the keys are of all the same type.

One other note, in general the Python language makes no guarantees about atomicity (different implementations are free to make different implementation choices).  For example, people relying on dict.setdefault() to be atomic were making an incorrect assumption (until recently).  Our standing recommendation is to use locks unless your willing to rely on an implementation detail.  In the case of set.add(), the race condition is harmless since set.add() is idempotent.

I've shown sample code to some other developers and they had mixed guesses about the polarity of s.add(e), whether it would return True or False if the e was already in the set.  A mixed result is not good and implies that the proposal is error prone.  

Also, when shown the dedup() code example, the devs didn't feel that the additional conciseness was worth it (one said, "it just looks funny and causes me to do a double-take, but the original code could be read effortlessly").

My timings on the dedup() code showed a <2% speedup on an iterable of strings with 80% duplicates and no improvement on an iterable of strings with 5% duplicates.  The improvement was mostly erased if a "seen_add = seen.add" bound method was used inside the loop.  This indicates that the double lookup is cheap compared to the cost of the dotted lookup in "seen.add(e)".  That being said, objects with a custom __hash__ would benefit from calling __hash__ only once.

Given that the results above were uniformly negative (except for the Java library), I'll skip trying this out on my students and am rejecting the suggested change to the pure Python API.  We appreciate your suggestion but it isn't appropriate for inclusion in the language.

Some of those concerns may also apply to the suggested change to the C API

History
Date	User	Action	Args
2012-03-15 20:05:29	rhettinger	set	recipients: + rhettinger, gvanrossum, eric.araujo, anacrolix, petri.lehtinen
2012-03-15 20:05:28	rhettinger	set	messageid: <1331841928.94.0.642956058392.issue14320@psf.upfronthosting.co.za>
2012-03-15 20:05:28	rhettinger	link	issue14320 messages
2012-03-15 20:05:27	rhettinger	create