Title: Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, bup, jdemeyer, rhettinger, scoder, serhiy.storchaka
Priority: normal Keywords:

Created on 2018-08-13 19:36 by bup, last changed 2019-04-05 08:52 by jdemeyer.

Messages (6)
msg323490 - (view) Author: Dan Snider (bup) * Date: 2018-08-13 19:36
The ones I know of are list.__getitem__, dict __contains__ & __getitem__, and (frozen)set.__contains__ but from what I can tell so far it seems like dict.__getitem__ takes the worst hit. For dicts, I've spent a few hours trying to figure out what's getting put into the heap type's mp_subscript slot to slow it down so drastically but I just can't figure it out. 

So I just forced `dict_subscript` into the heap type's mp_subscript slot to confirm the rather obvious impossibility of it breaking anything. Here's a "little" timeit script to demonstrate how horribly inefficient `dict.__getitem__` is when called on anything other than an extension type:

if __name__ == '__main__':

    import timeit
    from collections import OrderedDict as FastDictSub
    from ctypes import *
    class mappingmethods(Structure):
        _fields_ = [
            ('mp_length', c_void_p),
            ('mp_subscript', PYFUNCTYPE(py_object,py_object,py_object)),
            ('mp_ass_subscript', c_void_p)]
    class _type_object(Structure):
        _fields_ = [('junk', (c_uint8*56)), # <- brevity
                    ('tp_as_mapping', POINTER(mappingmethods))]
    py_type = POINTER(_type_object)
    class SlowDictSub(dict):

    assert SlowDictSub.__base__ is dict and FastDictSub.__base__ is dict
    assert SlowDictSub.__getitem__ is dict.__getitem__
    assert FastDictSub.__getitem__ is dict.__getitem__

    mp = dict.fromkeys(range(15))
    setup = 'from __main__ import mp, %s as Dict; d=Dict(mp)'
    print(f'Comparing execution time of heap allocated dict subtype '
          f'versus PyODict_Type (who also inherits dict.__getitem__)...\n')
    slown, slowt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
    print(f'avg. exec {slown/slowt:_.0f} SlowDictSub[x] statements per second')
    fastn, fastt = timeit.Timer('d[10]', setup%'FastDictSub').autorange()
    print(f'avg. exec {fastn/fastt:_.0f} OrderedDict[x] statements per second')
    print(f'SlowDictSub was {1/(fastt/slowt):.2f}x slower than OrderedDict... '
          f"Let's see what happens when we fix SlowDictSub's 'broken' "
          "mp_subscript slot:\n")
    slow_tp = cast(id(SlowDictSub), py_type).contents
    fast_tp = cast(id(FastDictSub), py_type).contents
    slow_tp.tp_as_mapping.contents.mp_subscript = (
    postn, postt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
    print(f'avg. exec {postn/postt:_.0f} SlowDictSub[x] after slot change')
    print(f'which is now {1/(fastt/postt):.2}x the speed of dict_item')

Comparing execution time of heap allocated dict subtype versus PyODict_Type (who also inherits dict.__getitem__)...

avg. exec 6_220_709 SlowDictSub[x] statements per second
avg. exec 21_894_971 OrderedDict[x] statements per second

SlowDictSub was 3.52x slower than OrderedDict... Let's see what happens when we fix SlowDictSub's 'broken' mp_subscript slot:

avg. exec 21_665_595 SlowDictSub[x] after slot change
which is now 1.0x the speed of dict_item
msg323498 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-08-14 04:40
Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast:

Reverting the dictobject.c changes of the commit, i.e., the following patch, fixes your example.

diff --git a/Objects/dictobject.c b/Objects/dictobject.c
index 413557d667..d85834977d 100644
--- a/Objects/dictobject.c
+++ b/Objects/dictobject.c
@@ -3032,8 +3032,6 @@ dict_sizeof(PyDictObject *mp, PyObject *Py_UNUSED(ignored))
     return PyLong_FromSsize_t(_PyDict_SizeOf(mp));
-PyDoc_STRVAR(getitem__doc__, "x.__getitem__(y) <==> x[y]");
 "D.__sizeof__() -> size of D in memory, in bytes");
@@ -3071,8 +3069,6 @@ PyDoc_STRVAR(values__doc__,
 static PyMethodDef mapp_methods[] = {
-    {"__getitem__", (PyCFunction)dict_subscript,        METH_O | METH_COEXIST,
-     getitem__doc__},
     {"__sizeof__",      (PyCFunction)dict_sizeof,       METH_NOARGS,
msg328187 - (view) Author: Dan Snider (bup) * Date: 2018-10-21 04:09
Well, I found another mystery. Calling tuple.__getitem__ directly, on an actual tuple instance, is 50-80% slower than calling list.__getitem__ even though in theory, they should be virtually identical with maybe tuple access being infinitesimally be faster.

The only difference I found here between the two __getitem__s surprisingly is that tuple *doesn't* have __getitem__ in its method list, while list does.

>>> [].__getitem__
<built-in method __getitem__ of list object at 0x012EE2D8>
>>> ().__getitem__
<method-wrapper '__getitem__' of tuple object at 0x00130030>

Since list is using the wrapper in tp_methods and tuple isn't,  it *look* like METH_COEXIST *is* helping but since we know that in both the dict case and now the tuple/list case it is unnecessary, there's clearly a bug somewhere causing multiple arg tuple allocs and or arg parsing. Unfortunately, it takes me several hours to "catch up" from CALL_FUNCTION in ceval.c to something like call_method in typeobject.c and I don't have that kind of time at the moment but hopefully someone notices this again.
msg339449 - (view) Author: Jeroen Demeyer (jdemeyer) * Date: 2019-04-04 15:45
> Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast:

But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead.

If PEP 580 or PEP 590 is accepted, it should be easy to simply make wrapper descriptors such as somedict.__getitem__ faster, removing the need for METH_COEXIST.
msg339453 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-04-04 16:13
> But what's the use case of making somedict.__getitem__(x) fast? One should just write somedict[x] instead.

Using them in itertools functions. somedict.__getitem__ is faster that lambda: somedict[x].

As for map() and filter(), comprehensions are almost as fast as map() and filter() with somedict.__getitem__ and are more readable.
msg339477 - (view) Author: Jeroen Demeyer (jdemeyer) * Date: 2019-04-05 08:52
OK, makes sense. Also super() calls I guess: you can write super().__getitem__(x) but not super()[x] (although the latter *could* be implemented if we wanted to).

I see two ways of fixing this:

1. Make wrapper descriptors faster, removing the need for the METH_COEXIST hack in the first place.

2. Deal better with METH_COEXIST methods when inheriting.

Since both of these solutions are closely related to the topic of PEP 580/PEP 590, I suggest to try to fix this after either is implemented.
Date User Action Args
2019-04-05 08:52:58jdemeyersetmessages: + msg339477
2019-04-04 16:13:40serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg339453
2019-04-04 15:54:05scodersetnosy: + scoder
2019-04-04 15:45:14jdemeyersetnosy: + jdemeyer
messages: + msg339449
2018-10-21 04:32:52rhettingersetnosy: + rhettinger

versions: - Python 3.7
2018-10-21 04:09:47bupsetmessages: + msg328187
2018-08-14 04:40:20benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg323498
2018-08-13 19:36:41bupcreate