Issue 34396: Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/78577

classification

Title:	Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty
Type:	performance	Stage:	patch review
Components:	C API	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Marco Sulla, benjamin.peterson, bup, jdemeyer, methane, rhettinger, scoder, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2018-08-13 19:36 by bup, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 14863	closed	jdemeyer, 2019-07-19 15:54

Messages (8)
msg323490 - (view)	Author: Dan Snider (bup) *	Date: 2018-08-13 19:36
The ones I know of are list.__getitem__, dict __contains__ & __getitem__, and (frozen)set.__contains__ but from what I can tell so far it seems like dict.__getitem__ takes the worst hit. For dicts, I've spent a few hours trying to figure out what's getting put into the heap type's mp_subscript slot to slow it down so drastically but I just can't figure it out. So I just forced `dict_subscript` into the heap type's mp_subscript slot to confirm the rather obvious impossibility of it breaking anything. Here's a "little" timeit script to demonstrate how horribly inefficient `dict.__getitem__` is when called on anything other than an extension type: if __name__ == '__main__': import timeit from collections import OrderedDict as FastDictSub from ctypes import * class mappingmethods(Structure): _fields_ = [ ('mp_length', c_void_p), ('mp_subscript', PYFUNCTYPE(py_object,py_object,py_object)), ('mp_ass_subscript', c_void_p)] class _type_object(Structure): _fields_ = [('junk', (c_uint8*56)), # <- brevity ('tp_as_mapping', POINTER(mappingmethods))] py_type = POINTER(_type_object) class SlowDictSub(dict): pass assert SlowDictSub.__base__ is dict and FastDictSub.__base__ is dict assert SlowDictSub.__getitem__ is dict.__getitem__ assert FastDictSub.__getitem__ is dict.__getitem__ mp = dict.fromkeys(range(15)) setup = 'from __main__ import mp, %s as Dict; d=Dict(mp)' print(f'Comparing execution time of heap allocated dict subtype ' f'versus PyODict_Type (who also inherits dict.__getitem__)...\n') slown, slowt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange() print(f'avg. exec {slown/slowt:_.0f} SlowDictSub[x] statements per second') fastn, fastt = timeit.Timer('d[10]', setup%'FastDictSub').autorange() print(f'avg. exec {fastn/fastt:_.0f} OrderedDict[x] statements per second') print() print(f'SlowDictSub was {1/(fastt/slowt):.2f}x slower than OrderedDict... ' f"Let's see what happens when we fix SlowDictSub's 'broken' " "mp_subscript slot:\n") slow_tp = cast(id(SlowDictSub), py_type).contents fast_tp = cast(id(FastDictSub), py_type).contents slow_tp.tp_as_mapping.contents.mp_subscript = ( fast_tp.tp_as_mapping.contents.mp_subscript) postn, postt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange() print(f'avg. exec {postn/postt:_.0f} SlowDictSub[x] after slot change') print(f'which is now {1/(fastt/postt):.2}x the speed of dict_item') Comparing execution time of heap allocated dict subtype versus PyODict_Type (who also inherits dict.__getitem__)... avg. exec 6_220_709 SlowDictSub[x] statements per second avg. exec 21_894_971 OrderedDict[x] statements per second SlowDictSub was 3.52x slower than OrderedDict... Let's see what happens when we fix SlowDictSub's 'broken' mp_subscript slot: avg. exec 21_665_595 SlowDictSub[x] after slot change which is now 1.0x the speed of dict_item
msg323498 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2018-08-14 04:40
Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast: https://github.com/python/cpython/commit/8f5cdaa784f555149adf5e94fd2e989f99d6b1db Reverting the dictobject.c changes of the commit, i.e., the following patch, fixes your example. diff --git a/Objects/dictobject.c b/Objects/dictobject.c index 413557d667..d85834977d 100644 --- a/Objects/dictobject.c +++ b/Objects/dictobject.c @@ -3032,8 +3032,6 @@ dict_sizeof(PyDictObject mp, PyObject Py_UNUSED(ignored)) return PyLong_FromSsize_t(_PyDict_SizeOf(mp)); } -PyDoc_STRVAR(getitem__doc__, "x.__getitem__(y) <==> x[y]"); - PyDoc_STRVAR(sizeof__doc__, "D.__sizeof__() -> size of D in memory, in bytes"); @@ -3071,8 +3069,6 @@ PyDoc_STRVAR(values__doc__, static PyMethodDef mapp_methods[] = { DICT___CONTAINS___METHODDEF - {"__getitem__", (PyCFunction)dict_subscript, METH_O \| METH_COEXIST, - getitem__doc__}, {"__sizeof__", (PyCFunction)dict_sizeof, METH_NOARGS, sizeof__doc__}, DICT_GET_METHODDEF
msg328187 - (view)	Author: Dan Snider (bup) *	Date: 2018-10-21 04:09
Well, I found another mystery. Calling tuple.__getitem__ directly, on an actual tuple instance, is 50-80% slower than calling list.__getitem__ even though in theory, they should be virtually identical with maybe tuple access being infinitesimally be faster. The only difference I found here between the two __getitem__s surprisingly is that tuple doesn't have __getitem__ in its method list, while list does. >>> [].__getitem__ <built-in method __getitem__ of list object at 0x012EE2D8> >>> ().__getitem__ <method-wrapper '__getitem__' of tuple object at 0x00130030> Since list is using the wrapper in tp_methods and tuple isn't, it look like METH_COEXIST is helping but since we know that in both the dict case and now the tuple/list case it is unnecessary, there's clearly a bug somewhere causing multiple arg tuple allocs and or arg parsing. Unfortunately, it takes me several hours to "catch up" from CALL_FUNCTION in ceval.c to something like call_method in typeobject.c and I don't have that kind of time at the moment but hopefully someone notices this again.
msg339449 - (view)	Author: Jeroen Demeyer (jdemeyer) *	Date: 2019-04-04 15:45
> Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast: https://github.com/python/cpython/commit/8f5cdaa784f555149adf5e94fd2e989f99d6b1db But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead. If PEP 580 or PEP 590 is accepted, it should be easy to simply make wrapper descriptors such as somedict.__getitem__ faster, removing the need for METH_COEXIST.
msg339453 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-04-04 16:13
> But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead. Using them in itertools functions. somedict.__getitem__ is faster that lambda: somedict[x]. As for map() and filter(), comprehensions are almost as fast as map() and filter() with somedict.__getitem__ and are more readable.
msg339477 - (view)	Author: Jeroen Demeyer (jdemeyer) *	Date: 2019-04-05 08:52
OK, makes sense. Also super() calls I guess: you can write super().__getitem__(x) but not super()[x] (although the latter could be implemented if we wanted to). I see two ways of fixing this: 1. Make wrapper descriptors faster, removing the need for the METH_COEXIST hack in the first place. 2. Deal better with METH_COEXIST methods when inheriting. Since both of these solutions are closely related to the topic of PEP 580/PEP 590, I suggest to try to fix this after either is implemented.
msg348185 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-07-19 17:00
Some thoughs: * Lib/sets.py benefitted significantly from itertools and METH_COEXIST. * The dunder methods are part of the public API. Serhiy's persona tastes aside, there is no reason not to use them like any other method. * Given a performance trade-off between classes and subclasses, almost universally we give preference to the parent class. * The folks who use itertools with __getitem__ and __contains__ almost always do so because they care about speed and have a preference for a functional style. Removing the current optimization would directly hit those folks and their style of coding, resulting in code that used to be fast and elegant becoming slow.
msg362715 - (view)	Author: Marco Sulla (Marco Sulla) *	Date: 2020-02-26 19:11
I asked why on StackOverflow, and an user seemed to find the reason. The problem for him/her is in `update_one_slot()`. `dict` implements directly `__contains__()` and `__getitem__()`. Usually, `sq_contains` and `mp_subscript` are wrapped to implement `__contains__()` and `__getitem__()`, but this way `dict` is a little faster. The problem is that `update_one_slot()` searches for the wrappers. If it does not find them, it does not inherit the `__contains__()` and `__getitem__()` of the class, but create a `__contains__()` and `__getitem__()` functions that do an MRO search and call the superclass method. This is why `__contains__()` and `__getitem__()` of `dict` subclasses are slower. Is it possible to modify `update_one_slot()` so that, if no wrapper is found, the explicit implementation is inherited? SO answer: https://stackoverflow.com/a/59914459/1763602

History
Date	User	Action	Args
2022-04-11 14:59:04	admin	set	github: 78577
2020-02-28 08:02:38	methane	set	nosy: + methane
2020-02-26 19:11:37	Marco Sulla	set	versions: + Python 3.9, - Python 3.8 nosy: + Marco Sulla messages: + msg362715 components: + C API, - Interpreter Core
2019-07-19 17:00:35	rhettinger	set	messages: + msg348185
2019-07-19 15:54:49	jdemeyer	set	keywords: + patch stage: patch review pull_requests: + pull_request14652
2019-04-05 08:52:58	jdemeyer	set	messages: + msg339477
2019-04-04 16:13:40	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg339453
2019-04-04 15:54:05	scoder	set	nosy: + scoder
2019-04-04 15:45:14	jdemeyer	set	nosy: + jdemeyer messages: + msg339449
2018-10-21 04:32:52	rhettinger	set	nosy: + rhettinger versions: - Python 3.7
2018-10-21 04:09:47	bup	set	messages: + msg328187
2018-08-14 04:40:20	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg323498
2018-08-13 19:36:41	bup	create