This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: setdefault speedup
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Make dict.setdefault() atomic
View: 13521
Assigned To: loewis Nosy List: dschult, loewis, pitrou, rhettinger
Priority: low Keywords: patch

Created on 2009-04-09 19:35 by dschult, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
dict_setdefault.patch dschult, 2009-04-09 19:36
setdefault.diff rhettinger, 2009-04-09 21:23 _PyDict_SetItemWithHash
Messages (10)
msg85824 - (view) Author: Dan Schult (dschult) Date: 2009-04-09 19:35
In the depths of dictobject.c one can see that dict_setdefault uses two 
identical calls to PyObject_Hash and ma_lookup.  The first to see if the 
item is in the dict, the second (only if key is not present) to add the 
item to the dict.

This second lookup (and hash) are not needed and can be avoided by 
inlining a portion of PyDict_SetItem instead of calling the entire 
subroutine.
msg85826 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-09 21:23
Inlining code saves work but breaks the effort to minimize the number of
functions that have direct access to the underlying data structure.

The performance of setdefault() is hard to improve in real apps because
the cost of instantiating the default object is outside of the method. 
Also, the method often gets used in a loop where the first call adds an
entry and subsequent calls just do a get; so, any efforts to optimize
the second look-up only help on the first call and are completely wasted
on subsequent calls.

All that being said, there's no reason we can't save the double call to
PyObject_Hash().  See attached patch.

Martin, any thoughts?
msg85832 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-09 22:27
> Martin, any thoughts?

The same as always: Dan, do you have any benchmarks
that demonstrate the speedup?

Looking at the usage of setdefault in the standard library,
it seems that it's most of the time either an interned string,
or an object hashed by memory address that gets passed as a
key. Hashing these is very cheap, so I would be skeptical
that any real benefits can be demonstrated.

I also agree with Raymond that, by nature, setdefault only
calls hash() twice on the very first call, and not in subsequent
calls, so even if there was an improvement, it shouldn't matter
for the total runtime. I also agree that setdefault is somewhat
inappropriate for performance critical code, since it will create
the default value anyway (often a dict or a list), even if a
value is already stored. So users who care about performance
should write something like

try:
  l = d[k]
except KeyError:
  l = d[k] = []

to avoid the list creation in the common case, or use a
defaultdict in the first place.
msg85839 - (view) Author: Dan Schult (dschult) Date: 2009-04-10 04:54
Benchmarks:
Upon trying cooked up examples, I do not notice any speedup
beyond 5-10%.  Seems like function calling time swamps everything
for small examples with fast hashes.  I don't have a handy dandy
example with long hash times or long lookup times.  That's what
it would take to show a large performance boost with this patch.

I also agree with Martin that there are many reasons not to use
setdefault...   But it is part of the API.
We might as well make it worth using.  (Which probably
means changing the default value to a factory function which gets
called when the key is not found.  But that's a much bigger change...)
I'm just suggesting that the code should not do extra work.

By the way, defaultdict is NOT like setdefault--it is like get().
Missing entries do no get set.  Perhaps there should be a
collections entry that does setdefault instead of get......

Next, the second hash gets executed "only the first time" for EACH key.
So, e.g. if you have a lot of entries that get called up 2 or 3 times,
using the second hash does make a difference1/2 to 1/3 of the time.
And we don't need a second hash or lookup so why do it.

I understand Raymond's concern about more code using the
data structure directly.  There are three basic routines to deal with
the data structure:  ma_lookup/lookdict,  insertdict and resizedict.
The comments for lookdict encourage you to use the "ep" entry
to check if it is empty and add the key/value pair if desired.
But as currently implemented, insertdict calls lookdict, so
they aren't atomistic in that sense.  If atomism is a design goal
(even if it isn't a word :) then insertdict would take "ep" as an input
instead of doing a lookup.  I'm not sure if atomism is part of the  
design
in Python though...

 From my perspective creating an internal SetItem adds another function
handling the data structure just as setdefault would--that's why I  
inlined
in setdefault.  But, it does keep the name similar and thus its  
easier to
identify it as writing to the data structure.  If this style is  
promoted here
then I think there ought to be an internal insertdict that doesn't do  
the
lookup.   Also shouldn't the parent functions call these internal  
versions
instead of duplicating code?   Ack---that means more changes.  Not
difficult ones though...

What do you think?
Dan
msg85854 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-11 12:15
> By the way, defaultdict is NOT like setdefault--it is like get().
> Missing entries do no get set.

Why do you say that?

__missing__(...)
    __missing__(key) # Called by __getitem__ for missing key; pseudo-code:
    if self.default_factory is None: raise KeyError((key,))
    self[key] = value = self.default_factory()
    return value

In all cases of setdefault that I know of, replacing this with
a defaultdict would be appropriate. The only case where it wouldn't
work is if the default value depends on the key.

> What do you think?

If no speedup can be demonstrated, we should not change it.
msg85884 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-12 00:58
I would support fixing the double call to PyObject_Hash().  For user
defined classes with their own __hash__ methods, this is by far the
slowest part of the operation.

> from my perspective creating an internal SetItem adds another
> function handling the data structure just as setdefault would

Incorrect comparison.  Your in-lining manipulated the ep structure
directly (not a good thing).  In contrast, adding an alternative
_PyDict_SetItemWithHash uses the insertdict() function, fully isolating
itself of from the table implementation.  The whole purpose of having
insertdict() and lookdict() is to isolate the data structure internals
from all externally visible functions.

>>> setup = '''
class A(object):
	def __hash__(self):
		return 10
class B(object):
   pass
a = A()
b = B()
'''
>>> min(Timer('{}.setdefault(a, 0)', setup).repeat(7, 100000))
0.12840011789208106
>>> min(Timer('{}.setdefault(b, 0)', setup).repeat(7, 100000))
0.053155359130840907

The double call to a very simple user defined __hash__ adds .07 to call
that takes on .05 with the double call to builtin object hash.  So, we
could save half of the the .07 just by elimating the double call to
__hash__.  With more complex hash functions the overall speedup is huge
(essentially cutting the total work almost in half).
msg85933 - (view) Author: Dan Schult (dschult) Date: 2009-04-13 02:33
On Apr 11, 2009, at 8:15 AM, Martin v. Löwis  
<report@bugs.python.org>@psf.upfronthosting.co.za  
@psf.upfronthosting.co.za> wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
>> By the way, defaultdict is NOT like setdefault--it is like get().
>> Missing entries do no get set.
>
> Why do you say that?
>
> __missing__(...)
>     __missing__(key) # Called by __getitem__ for missing key;  
> pseudo-code:
>     if self.default_factory is None: raise KeyError((key,))
>     self[key] = value = self.default_factory()
>     return value
>
> In all cases of setdefault that I know of, replacing this with
> a defaultdict would be appropriate. The only case where it wouldn't
> work is if the default value depends on the key.
The missing key reports the default object, but doesn't set that key  
or value
object in the dict.  So you cannot then call up that object and do  
something
that depends on the key.  So the default value may not depend on the key
but still need to be different for each key due to later changes.

For example, to represent a graph structure, it is helpful to use a  
dict-of-dicts.
The first time a node is looked up you want to add it as a key in the  
graph dict with
an empty "nbr dict" as the value.  the "nbr dict" is keyed by  
neighbor to the edge
weight/object.  So the default object is always a fresh nbr dict...  
but what gets
added to that nbr dict depends on which node it is.  I can come up  
with examples
for lists too...   But as you pointed out before, using a list or  
dict in setdefault
requires creating the object before the call anyway...  I'm beginning to
question whether setdefault should ever be used...  Still comes back  
to--why
have it there inefficiently.   Besides I'm still interested in being
able to do this sort of thing at least through the C-API...

Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:

>> from my perspective creating an internal SetItem adds another
>> function handling the data structure just as setdefault would
> Incorrect comparison.  Your in-lining manipulated the ep structure
> directly (not a good thing).  In contrast, adding an alternative
> _PyDict_SetItemWithHash uses the insertdict() function, fully  
> isolating
> itself of from the table implementation.  The whole purpose of having
> insertdict() and lookdict() is to isolate the data structure internals
> from all externally visible functions.

Good... I am understanding better what you mean by "handling the data  
structure".
I agree that lookdict, insertdict and resizedict should be the three  
that change the
data structure.  But the way those 3 are currently written, you can't  
change an entry
without looking it up.  insertdict_clean almost does it, but it  
assumes there aren't
any dummy entries.

> The double call to a very simple user defined __hash__ adds .07 to  
> call
> that takes on .05 with the double call to builtin object hash.  So, we
> could save half of the the .07 just by elimating the double call to
> __hash__.  With more complex hash functions the overall speedup is  
> huge
> (essentially cutting the total work almost in half).

This is a good example because it shows how the hash can matter.

I think there are similar examples where the lookup is most of the  
time of dict access.
I don't know enough about what causes collisions and how dicts are  
optimized
to quickly come up with such an example.  But such an example must  
exist or
Tim Peters wouldn't have spent so much effort optimizing lookup....
besides his comments at the top of lookup suggest that we do exactly  
what
I am suggesting to do.  Get the entry from lookup and then
"the caller can (if it wishes) add the <key, value> pair to the  
returned PyDictEntry*.

Presently there is no way to do this with C-API except to write my  
own data structure
manipulation.  Wouldn't it be better to encapsulate this in the C-API  
in a standard way
instead of having everybody writing their own C-extension doing  
setdefault the right way?
(ok..ok.. "everybody" here refers to probably one person (me)...  but  
others have thought
of it.  I've even seen the double lookup as a justification for never  
using setdefault)

I'll look for an example that has lots of collisions.  Maybe that  
would help justify a
lookup-less insertdict.
Dan
msg85938 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-13 08:33
> The missing key reports the default object, but doesn't set that key  
> or value object in the dict.

That's just not true. Re-read the line in the doc string where it
puts the default value into the dict, under the key being accessed.

> For example, to represent a graph structure, it is helpful to use a  
> dict-of-dicts.

No problem:

py> from collections import defaultdict
py> weights=defaultdict(dict)
py> weights['a']['b']=17
py> weights['a']['c']=5
py> weights['a']
{'c': 5, 'b': 17}

See? It remembered the default value, and the value for the 'a'
key got changed and can now be modified. The whole point of defaultdict
was to replace setdefault, so that setdefault can eventually be
eliminated, see

http://mail.python.org/pipermail/python-dev/2006-February/061169.html
http://mail.python.org/pipermail/python-dev/2006-February/061261.html

Apparently, the plan for removing setdefault is now forgotten. Perhaps
this would be a good time to revive it, with a proper deprecation
procedure. If it is to be removed, I'm -1 for "fixing" anything about
it.
msg85958 - (view) Author: Dan Schult (dschult) Date: 2009-04-14 02:51
Thank you... I stand corrected....
That's actually very helpful!

Of course using defdict makes the default assignment take two hashes,  
two lookups, an attribute/function lookup (__missing__) and the extra  
function call, plus doesn't allow arguments to the object factory...  
but it means that there is a hook into the dict API that allows us to  
create something when GetItem/subscript gives a KeyError...  I  
understand much more what I should be doing and where to look for it.

Thanks very much!

I still would prefer that the C-API allow PyDict_SetItem to be called  
without a hash or lookup.  Maybe a method called PyDict_SetEntry().   
But I haven't really looked at it, don't have a patch and am not sure  
if it is worth it.  I am willing to spend time doing looking into  
that kind of thing if you think that might be helpful.  Otherwise I  
will put it on my list of fun-projects-for-hazy-summer-weekends and  
get to it when I get to it.

Dan
msg160888 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-16 16:49
Out of date, fixed in #13521.
History
Date User Action Args
2022-04-11 14:56:47adminsetgithub: 49980
2012-05-16 16:49:58pitrousetstatus: open -> closed

nosy: + pitrou
messages: + msg160888

superseder: Make dict.setdefault() atomic
resolution: duplicate
2009-04-14 02:51:29dschultsetmessages: + msg85958
2009-04-13 08:33:46loewissetmessages: + msg85938
2009-04-13 02:33:44dschultsetmessages: + msg85933
2009-04-12 00:58:53rhettingersetmessages: + msg85884
2009-04-11 12:15:15loewissetmessages: + msg85854
2009-04-10 04:54:52dschultsetmessages: + msg85839
2009-04-09 22:27:20loewissetmessages: + msg85832
2009-04-09 21:23:17rhettingersetfiles: + setdefault.diff
priority: low

nosy: + loewis
messages: + msg85826

assignee: rhettinger -> loewis
2009-04-09 19:42:18rhettingersetassignee: rhettinger

nosy: + rhettinger
versions: - Python 2.6, Python 3.0
2009-04-09 19:36:34dschultsetfiles: + dict_setdefault.patch
keywords: + patch
2009-04-09 19:35:29dschultcreate