Title: Certain methods that heap allocated subtypes inherit suffer a 50-80% performance penalty
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, bup, rhettinger
Priority: normal Keywords:

Created on 2018-08-13 19:36 by bup, last changed 2018-10-21 04:32 by rhettinger.

Messages (3)
msg323490 - (view) Author: Dan Snider (bup) * Date: 2018-08-13 19:36
The ones I know of are list.__getitem__, dict __contains__ & __getitem__, and (frozen)set.__contains__ but from what I can tell so far it seems like dict.__getitem__ takes the worst hit. For dicts, I've spent a few hours trying to figure out what's getting put into the heap type's mp_subscript slot to slow it down so drastically but I just can't figure it out. 

So I just forced `dict_subscript` into the heap type's mp_subscript slot to confirm the rather obvious impossibility of it breaking anything. Here's a "little" timeit script to demonstrate how horribly inefficient `dict.__getitem__` is when called on anything other than an extension type:

if __name__ == '__main__':

    import timeit
    from collections import OrderedDict as FastDictSub
    from ctypes import *
    class mappingmethods(Structure):
        _fields_ = [
            ('mp_length', c_void_p),
            ('mp_subscript', PYFUNCTYPE(py_object,py_object,py_object)),
            ('mp_ass_subscript', c_void_p)]
    class _type_object(Structure):
        _fields_ = [('junk', (c_uint8*56)), # <- brevity
                    ('tp_as_mapping', POINTER(mappingmethods))]
    py_type = POINTER(_type_object)
    class SlowDictSub(dict):

    assert SlowDictSub.__base__ is dict and FastDictSub.__base__ is dict
    assert SlowDictSub.__getitem__ is dict.__getitem__
    assert FastDictSub.__getitem__ is dict.__getitem__

    mp = dict.fromkeys(range(15))
    setup = 'from __main__ import mp, %s as Dict; d=Dict(mp)'
    print(f'Comparing execution time of heap allocated dict subtype '
          f'versus PyODict_Type (who also inherits dict.__getitem__)...\n')
    slown, slowt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
    print(f'avg. exec {slown/slowt:_.0f} SlowDictSub[x] statements per second')
    fastn, fastt = timeit.Timer('d[10]', setup%'FastDictSub').autorange()
    print(f'avg. exec {fastn/fastt:_.0f} OrderedDict[x] statements per second')
    print(f'SlowDictSub was {1/(fastt/slowt):.2f}x slower than OrderedDict... '
          f"Let's see what happens when we fix SlowDictSub's 'broken' "
          "mp_subscript slot:\n")
    slow_tp = cast(id(SlowDictSub), py_type).contents
    fast_tp = cast(id(FastDictSub), py_type).contents
    slow_tp.tp_as_mapping.contents.mp_subscript = (
    postn, postt = timeit.Timer('d[10]', setup%'SlowDictSub').autorange()
    print(f'avg. exec {postn/postt:_.0f} SlowDictSub[x] after slot change')
    print(f'which is now {1/(fastt/postt):.2}x the speed of dict_item')

Comparing execution time of heap allocated dict subtype versus PyODict_Type (who also inherits dict.__getitem__)...

avg. exec 6_220_709 SlowDictSub[x] statements per second
avg. exec 21_894_971 OrderedDict[x] statements per second

SlowDictSub was 3.52x slower than OrderedDict... Let's see what happens when we fix SlowDictSub's 'broken' mp_subscript slot:

avg. exec 21_665_595 SlowDictSub[x] after slot change
which is now 1.0x the speed of dict_item
msg323498 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-08-14 04:40
Amusingly, this is because of an old hack to make directly calling somedict.__getitem__ fast:

Reverting the dictobject.c changes of the commit, i.e., the following patch, fixes your example.

diff --git a/Objects/dictobject.c b/Objects/dictobject.c
index 413557d667..d85834977d 100644
--- a/Objects/dictobject.c
+++ b/Objects/dictobject.c
@@ -3032,8 +3032,6 @@ dict_sizeof(PyDictObject *mp, PyObject *Py_UNUSED(ignored))
     return PyLong_FromSsize_t(_PyDict_SizeOf(mp));
-PyDoc_STRVAR(getitem__doc__, "x.__getitem__(y) <==> x[y]");
 "D.__sizeof__() -> size of D in memory, in bytes");
@@ -3071,8 +3069,6 @@ PyDoc_STRVAR(values__doc__,
 static PyMethodDef mapp_methods[] = {
-    {"__getitem__", (PyCFunction)dict_subscript,        METH_O | METH_COEXIST,
-     getitem__doc__},
     {"__sizeof__",      (PyCFunction)dict_sizeof,       METH_NOARGS,
msg328187 - (view) Author: Dan Snider (bup) * Date: 2018-10-21 04:09
Well, I found another mystery. Calling tuple.__getitem__ directly, on an actual tuple instance, is 50-80% slower than calling list.__getitem__ even though in theory, they should be virtually identical with maybe tuple access being infinitesimally be faster.

The only difference I found here between the two __getitem__s surprisingly is that tuple *doesn't* have __getitem__ in its method list, while list does.

>>> [].__getitem__
<built-in method __getitem__ of list object at 0x012EE2D8>
>>> ().__getitem__
<method-wrapper '__getitem__' of tuple object at 0x00130030>

Since list is using the wrapper in tp_methods and tuple isn't,  it *look* like METH_COEXIST *is* helping but since we know that in both the dict case and now the tuple/list case it is unnecessary, there's clearly a bug somewhere causing multiple arg tuple allocs and or arg parsing. Unfortunately, it takes me several hours to "catch up" from CALL_FUNCTION in ceval.c to something like call_method in typeobject.c and I don't have that kind of time at the moment but hopefully someone notices this again.
Date User Action Args
2018-10-21 04:32:52rhettingersetnosy: + rhettinger

versions: - Python 3.7
2018-10-21 04:09:47bupsetmessages: + msg328187
2018-08-14 04:40:20benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg323498
2018-08-13 19:36:41bupcreate