Issue 5730: setdefault speedup

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49980

classification

Title:	setdefault speedup
Type:	performance	Stage:
Components:	Interpreter Core	Versions:	Python 3.1, Python 2.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Make dict.setdefault() atomic View: 13521
Assigned To:	loewis	Nosy List:	dschult, loewis, pitrou, rhettinger
Priority:	low	Keywords:	patch

Created on 2009-04-09 19:35 by dschult, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
dict_setdefault.patch	dschult, 2009-04-09 19:36
setdefault.diff	rhettinger, 2009-04-09 21:23	_PyDict_SetItemWithHash

Messages (10)
msg85824 - (view)	Author: Dan Schult (dschult)	Date: 2009-04-09 19:35
In the depths of dictobject.c one can see that dict_setdefault uses two identical calls to PyObject_Hash and ma_lookup. The first to see if the item is in the dict, the second (only if key is not present) to add the item to the dict. This second lookup (and hash) are not needed and can be avoided by inlining a portion of PyDict_SetItem instead of calling the entire subroutine.
msg85826 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-04-09 21:23
Inlining code saves work but breaks the effort to minimize the number of functions that have direct access to the underlying data structure. The performance of setdefault() is hard to improve in real apps because the cost of instantiating the default object is outside of the method. Also, the method often gets used in a loop where the first call adds an entry and subsequent calls just do a get; so, any efforts to optimize the second look-up only help on the first call and are completely wasted on subsequent calls. All that being said, there's no reason we can't save the double call to PyObject_Hash(). See attached patch. Martin, any thoughts?
msg85832 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-04-09 22:27
> Martin, any thoughts? The same as always: Dan, do you have any benchmarks that demonstrate the speedup? Looking at the usage of setdefault in the standard library, it seems that it's most of the time either an interned string, or an object hashed by memory address that gets passed as a key. Hashing these is very cheap, so I would be skeptical that any real benefits can be demonstrated. I also agree with Raymond that, by nature, setdefault only calls hash() twice on the very first call, and not in subsequent calls, so even if there was an improvement, it shouldn't matter for the total runtime. I also agree that setdefault is somewhat inappropriate for performance critical code, since it will create the default value anyway (often a dict or a list), even if a value is already stored. So users who care about performance should write something like try: l = d[k] except KeyError: l = d[k] = [] to avoid the list creation in the common case, or use a defaultdict in the first place.
msg85839 - (view)	Author: Dan Schult (dschult)	Date: 2009-04-10 04:54
Benchmarks: Upon trying cooked up examples, I do not notice any speedup beyond 5-10%. Seems like function calling time swamps everything for small examples with fast hashes. I don't have a handy dandy example with long hash times or long lookup times. That's what it would take to show a large performance boost with this patch. I also agree with Martin that there are many reasons not to use setdefault... But it is part of the API. We might as well make it worth using. (Which probably means changing the default value to a factory function which gets called when the key is not found. But that's a much bigger change...) I'm just suggesting that the code should not do extra work. By the way, defaultdict is NOT like setdefault--it is like get(). Missing entries do no get set. Perhaps there should be a collections entry that does setdefault instead of get...... Next, the second hash gets executed "only the first time" for EACH key. So, e.g. if you have a lot of entries that get called up 2 or 3 times, using the second hash does make a difference1/2 to 1/3 of the time. And we don't need a second hash or lookup so why do it. I understand Raymond's concern about more code using the data structure directly. There are three basic routines to deal with the data structure: ma_lookup/lookdict, insertdict and resizedict. The comments for lookdict encourage you to use the "ep" entry to check if it is empty and add the key/value pair if desired. But as currently implemented, insertdict calls lookdict, so they aren't atomistic in that sense. If atomism is a design goal (even if it isn't a word :) then insertdict would take "ep" as an input instead of doing a lookup. I'm not sure if atomism is part of the design in Python though... From my perspective creating an internal SetItem adds another function handling the data structure just as setdefault would--that's why I inlined in setdefault. But, it does keep the name similar and thus its easier to identify it as writing to the data structure. If this style is promoted here then I think there ought to be an internal insertdict that doesn't do the lookup. Also shouldn't the parent functions call these internal versions instead of duplicating code? Ack---that means more changes. Not difficult ones though... What do you think? Dan
msg85854 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-04-11 12:15
> By the way, defaultdict is NOT like setdefault--it is like get(). > Missing entries do no get set. Why do you say that? __missing__(...) __missing__(key) # Called by __getitem__ for missing key; pseudo-code: if self.default_factory is None: raise KeyError((key,)) self[key] = value = self.default_factory() return value In all cases of setdefault that I know of, replacing this with a defaultdict would be appropriate. The only case where it wouldn't work is if the default value depends on the key. > What do you think? If no speedup can be demonstrated, we should not change it.
msg85884 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-04-12 00:58
I would support fixing the double call to PyObject_Hash(). For user defined classes with their own __hash__ methods, this is by far the slowest part of the operation. > from my perspective creating an internal SetItem adds another > function handling the data structure just as setdefault would Incorrect comparison. Your in-lining manipulated the ep structure directly (not a good thing). In contrast, adding an alternative _PyDict_SetItemWithHash uses the insertdict() function, fully isolating itself of from the table implementation. The whole purpose of having insertdict() and lookdict() is to isolate the data structure internals from all externally visible functions. >>> setup = ''' class A(object): def __hash__(self): return 10 class B(object): pass a = A() b = B() ''' >>> min(Timer('{}.setdefault(a, 0)', setup).repeat(7, 100000)) 0.12840011789208106 >>> min(Timer('{}.setdefault(b, 0)', setup).repeat(7, 100000)) 0.053155359130840907 The double call to a very simple user defined __hash__ adds .07 to call that takes on .05 with the double call to builtin object hash. So, we could save half of the the .07 just by elimating the double call to __hash__. With more complex hash functions the overall speedup is huge (essentially cutting the total work almost in half).
msg85933 - (view)	Author: Dan Schult (dschult)	Date: 2009-04-13 02:33
On Apr 11, 2009, at 8:15 AM, Martin v. Löwis <report@bugs.python.org>@psf.upfronthosting.co.za @psf.upfronthosting.co.za> wrote: > Martin v. Löwis <martin@v.loewis.de> added the comment: > >> By the way, defaultdict is NOT like setdefault--it is like get(). >> Missing entries do no get set. > > Why do you say that? > > __missing__(...) > __missing__(key) # Called by __getitem__ for missing key; > pseudo-code: > if self.default_factory is None: raise KeyError((key,)) > self[key] = value = self.default_factory() > return value > > In all cases of setdefault that I know of, replacing this with > a defaultdict would be appropriate. The only case where it wouldn't > work is if the default value depends on the key. The missing key reports the default object, but doesn't set that key or value object in the dict. So you cannot then call up that object and do something that depends on the key. So the default value may not depend on the key but still need to be different for each key due to later changes. For example, to represent a graph structure, it is helpful to use a dict-of-dicts. The first time a node is looked up you want to add it as a key in the graph dict with an empty "nbr dict" as the value. the "nbr dict" is keyed by neighbor to the edge weight/object. So the default object is always a fresh nbr dict... but what gets added to that nbr dict depends on which node it is. I can come up with examples for lists too... But as you pointed out before, using a list or dict in setdefault requires creating the object before the call anyway... I'm beginning to question whether setdefault should ever be used... Still comes back to--why have it there inefficiently. Besides I'm still interested in being able to do this sort of thing at least through the C-API... Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment: >> from my perspective creating an internal SetItem adds another >> function handling the data structure just as setdefault would > Incorrect comparison. Your in-lining manipulated the ep structure > directly (not a good thing). In contrast, adding an alternative > _PyDict_SetItemWithHash uses the insertdict() function, fully > isolating > itself of from the table implementation. The whole purpose of having > insertdict() and lookdict() is to isolate the data structure internals > from all externally visible functions. Good... I am understanding better what you mean by "handling the data structure". I agree that lookdict, insertdict and resizedict should be the three that change the data structure. But the way those 3 are currently written, you can't change an entry without looking it up. insertdict_clean almost does it, but it assumes there aren't any dummy entries. > The double call to a very simple user defined __hash__ adds .07 to > call > that takes on .05 with the double call to builtin object hash. So, we > could save half of the the .07 just by elimating the double call to > __hash__. With more complex hash functions the overall speedup is > huge > (essentially cutting the total work almost in half). This is a good example because it shows how the hash can matter. I think there are similar examples where the lookup is most of the time of dict access. I don't know enough about what causes collisions and how dicts are optimized to quickly come up with such an example. But such an example must exist or Tim Peters wouldn't have spent so much effort optimizing lookup.... besides his comments at the top of lookup suggest that we do exactly what I am suggesting to do. Get the entry from lookup and then "the caller can (if it wishes) add the <key, value> pair to the returned PyDictEntry*. Presently there is no way to do this with C-API except to write my own data structure manipulation. Wouldn't it be better to encapsulate this in the C-API in a standard way instead of having everybody writing their own C-extension doing setdefault the right way? (ok..ok.. "everybody" here refers to probably one person (me)... but others have thought of it. I've even seen the double lookup as a justification for never using setdefault) I'll look for an example that has lots of collisions. Maybe that would help justify a lookup-less insertdict. Dan
msg85938 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-04-13 08:33
> The missing key reports the default object, but doesn't set that key > or value object in the dict. That's just not true. Re-read the line in the doc string where it puts the default value into the dict, under the key being accessed. > For example, to represent a graph structure, it is helpful to use a > dict-of-dicts. No problem: py> from collections import defaultdict py> weights=defaultdict(dict) py> weights['a']['b']=17 py> weights['a']['c']=5 py> weights['a'] {'c': 5, 'b': 17} See? It remembered the default value, and the value for the 'a' key got changed and can now be modified. The whole point of defaultdict was to replace setdefault, so that setdefault can eventually be eliminated, see http://mail.python.org/pipermail/python-dev/2006-February/061169.html http://mail.python.org/pipermail/python-dev/2006-February/061261.html Apparently, the plan for removing setdefault is now forgotten. Perhaps this would be a good time to revive it, with a proper deprecation procedure. If it is to be removed, I'm -1 for "fixing" anything about it.
msg85958 - (view)	Author: Dan Schult (dschult)	Date: 2009-04-14 02:51
Thank you... I stand corrected.... That's actually very helpful! Of course using defdict makes the default assignment take two hashes, two lookups, an attribute/function lookup (__missing__) and the extra function call, plus doesn't allow arguments to the object factory... but it means that there is a hook into the dict API that allows us to create something when GetItem/subscript gives a KeyError... I understand much more what I should be doing and where to look for it. Thanks very much! I still would prefer that the C-API allow PyDict_SetItem to be called without a hash or lookup. Maybe a method called PyDict_SetEntry(). But I haven't really looked at it, don't have a patch and am not sure if it is worth it. I am willing to spend time doing looking into that kind of thing if you think that might be helpful. Otherwise I will put it on my list of fun-projects-for-hazy-summer-weekends and get to it when I get to it. Dan
msg160888 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-16 16:49
Out of date, fixed in #13521.

History
Date	User	Action	Args
2022-04-11 14:56:47	admin	set	github: 49980
2012-05-16 16:49:58	pitrou	set	status: open -> closed nosy: + pitrou messages: + msg160888 superseder: Make dict.setdefault() atomic resolution: duplicate
2009-04-14 02:51:29	dschult	set	messages: + msg85958
2009-04-13 08:33:46	loewis	set	messages: + msg85938
2009-04-13 02:33:44	dschult	set	messages: + msg85933
2009-04-12 00:58:53	rhettinger	set	messages: + msg85884
2009-04-11 12:15:15	loewis	set	messages: + msg85854
2009-04-10 04:54:52	dschult	set	messages: + msg85839
2009-04-09 22:27:20	loewis	set	messages: + msg85832
2009-04-09 21:23:17	rhettinger	set	files: + setdefault.diff priority: low nosy: + loewis messages: + msg85826 assignee: rhettinger -> loewis
2009-04-09 19:42:18	rhettinger	set	assignee: rhettinger nosy: + rhettinger versions: - Python 2.6, Python 3.0
2009-04-09 19:36:34	dschult	set	files: + dict_setdefault.patch keywords: + patch
2009-04-09 19:35:29	dschult	create