Issue 44921: dict subclassing is slow

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/89084

classification

Title:	dict subclassing is slow
Type:	performance	Stage:
Components:	C API	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Kevin Shweh, Marco Sulla, gvanrossum, jack__d, willingc
Priority:	normal	Keywords:

Created on 2021-08-15 20:10 by Marco Sulla, last changed 2022-04-11 14:59 by admin.

Messages (6)
msg399625 - (view)	Author: Marco Sulla (Marco Sulla) *	Date: 2021-08-15 20:10
I asked on SO why subclassing dict makes the subclass much slower in some operations. This is the answer by Monica (https://stackoverflow.com/a/59914459/1763602): Indexing and in are slower in dict subclasses because of a bad interaction between a dict optimization and the logic subclasses use to inherit C slots. This should be fixable, though not from your end. The CPython implementation has two sets of hooks for operator overloads. There are Python-level methods like __contains__ and __getitem__, but there's also a separate set of slots for C function pointers in the memory layout of a type object. Usually, either the Python method will be a wrapper around the C implementation, or the C slot will contain a function that searches for and calls the Python method. It's more efficient for the C slot to implement the operation directly, as the C slot is what Python actually accesses. Mappings written in C implement the C slots sq_contains and mp_subscript to provide in and indexing. Ordinarily, the Python-level __contains__ and __getitem__ methods would be automatically generated as wrappers around the C functions, but the dict class has explicit implementations of __contains__ and __getitem__, because the explicit implementations (https://github.com/python/cpython/blob/v3.8.1/Objects/dictobject.c) are a bit faster than the generated wrappers: static PyMethodDef mapp_methods[] = { DICT___CONTAINS___METHODDEF {"__getitem__", (PyCFunction)(void(*)(void))dict_subscript, METH_O \| METH_COEXIST, getitem__doc__}, ... (Actually, the explicit __getitem__ implementation is the same function as the mp_subscript implementation, just with a different kind of wrapper.) Ordinarily, a subclass would inherit its parent's implementations of C-level hooks like sq_contains and mp_subscript, and the subclass would be just as fast as the superclass. However, the logic in update_one_slot (https://github.com/python/cpython/blob/v3.8.1/Objects/typeobject.c#L7202) looks for the parent implementation by trying to find the generated wrapper methods through an MRO search. dict doesn't have generated wrappers for sq_contains and mp_subscript, because it provides explicit __contains__ and __getitem__ implementations. Instead of inheriting sq_contains and mp_subscript, update_one_slot ends up giving the subclass sq_contains and mp_subscript implementations that perform an MRO search for __contains__ and __getitem__ and call those. This is much less efficient than inheriting the C slots directly. Fixing this will require changes to the update_one_slot implementation. Aside from what I described above, dict_subscript also looks up __missing__ for dict subclasses, so fixing the slot inheritance issue won't make subclasses completely on par with dict itself for lookup speed, but it should get them a lot closer. As for pickling, on the dumps side, the pickle implementation has a dedicated fast path (https://github.com/python/cpython/blob/v3.8.1/Modules/_pickle.c#L4291) for dicts, while the dict subclass takes a more roundabout path through object.__reduce_ex__ and save_reduce. On the loads side, the time difference is mostly just from the extra opcodes and lookups to retrieve and instantiate the __main__.A class, while dicts have a dedicated pickle opcode for making a new dict. If we compare the disassembly for the pickles: In [26]: pickletools.dis(pickle.dumps({0: 0, 1: 1, 2: 2, 3: 3, 4: 4})) 0: \x80 PROTO 4 2: \x95 FRAME 25 11: } EMPTY_DICT 12: \x94 MEMOIZE (as 0) 13: ( MARK 14: K BININT1 0 16: K BININT1 0 18: K BININT1 1 20: K BININT1 1 22: K BININT1 2 24: K BININT1 2 26: K BININT1 3 28: K BININT1 3 30: K BININT1 4 32: K BININT1 4 34: u SETITEMS (MARK at 13) 35: . STOP highest protocol among opcodes = 4 In [27]: pickletools.dis(pickle.dumps(A({0: 0, 1: 1, 2: 2, 3: 3, 4: 4}))) 0: \x80 PROTO 4 2: \x95 FRAME 43 11: \x8c SHORT_BINUNICODE '__main__' 21: \x94 MEMOIZE (as 0) 22: \x8c SHORT_BINUNICODE 'A' 25: \x94 MEMOIZE (as 1) 26: \x93 STACK_GLOBAL 27: \x94 MEMOIZE (as 2) 28: ) EMPTY_TUPLE 29: \x81 NEWOBJ 30: \x94 MEMOIZE (as 3) 31: ( MARK 32: K BININT1 0 34: K BININT1 0 36: K BININT1 1 38: K BININT1 1 40: K BININT1 2 42: K BININT1 2 44: K BININT1 3 46: K BININT1 3 48: K BININT1 4 50: K BININT1 4 52: u SETITEMS (MARK at 31) 53: . STOP highest protocol among opcodes = 4 we see that the difference between the two is that the second pickle needs a whole bunch of opcodes to look up __main__.A and instantiate it, while the first pickle just does EMPTY_DICT to get an empty dict. After that, both pickles push the same keys and values onto the pickle operand stack and run SETITEMS
msg399632 - (view)	Author: Jack DeVries (jack__d) *	Date: 2021-08-16 01:36
There was a thorough discussion about the concerns associated with supporting dict subclasses in general here: bpo-32615 If I understand correctly, allowing dict subclasses to inherit __contains__ and __getitem__ will be a step towards supporting dict subclasses in a more broad context, like being able to use them as namespaces passed to exec. Ultimately, dictionaries are at the very heart of Cpython's implementation. I think it is reasonable to say that it is not possible or reasonable to support users in creating custom dict implantation with their own behavior. There is no good real world use case for it (please tell me if anyone has one), and it opens the door to enormous unnecessary complexity. Also related to dict subclassing: bpo-15099 bpo-27561 Marco, I hope that the discussions I've linked will at least make it clear why cpython behaves this way, and what the concerns are especially around supporting user subclasses of dict. I noticed you tagged this bpo as performance, but there are significant negative performance implications around dict subclasses that you'll see in past discussions. With this additional context in mind, what changes do you propose?
msg399790 - (view)	Author: Marco Sulla (Marco Sulla) *	Date: 2021-08-17 20:51
Since my knowledge of this is very poor, I informed Monica about the issue. I'm quite sure that if there's a way to turn lemons into lemonade :)
msg399791 - (view)	Author: Marco Sulla (Marco Sulla) *	Date: 2021-08-17 20:53
.... I not finished my phrase. I'm sure that if there's a way to turn lemons into lemonade, she is MUCH more skilled than me to find one.
msg399797 - (view)	Author: Kevin Shweh (Kevin Shweh)	Date: 2021-08-17 21:21
Of course it's reasonable to support dict subclasses. We already have a bunch of dict subclasses in the standard library, like collections.defaultdict and collections.Counter, and collections.Counter is significantly slower than it could be because of this issue. (collections.defaultdict seems to be unaffected due to differences between classes implemented in C and Python.) dict.__getitem__ even has dedicated support for a __missing__ hook, which is only useful for subclasses.
msg399922 - (view)	Author: Marco Sulla (Marco Sulla) *	Date: 2021-08-19 16:21
Since probably Monica are taking her holidays, I try to decipher her answer. Probably, the more problematic function spotted by Monica is update_one_slot. I re-quote her sentence: update_one_slot looks for the parent implementation by trying to find the generated wrapper methods through an MRO search. dict doesn't have generated wrappers for sq_contains and mp_subscript, because it provides explicit __contains__ and __getitem__ implementations. Instead of inheriting sq_contains and mp_subscript, update_one_slot ends up giving the subclass sq_contains and mp_subscript implementations that perform an MRO search for __contains__ and __getitem__ and call those. This is much less efficient than inheriting the C slots directly. The solution for Monica is to change the behaviour of update_one_slot for these cases (no wrappers, C slots directly). I don't know the implications of this change...

History
Date	User	Action	Args
2022-04-11 14:59:48	admin	set	github: 89084
2021-08-19 16:21:29	Marco Sulla	set	messages: + msg399922
2021-08-19 04:26:31	gvanrossum	set	nosy: + gvanrossum
2021-08-17 21:21:42	Kevin Shweh	set	nosy: + Kevin Shweh messages: + msg399797
2021-08-17 20:53:16	Marco Sulla	set	messages: + msg399791
2021-08-17 20:51:15	Marco Sulla	set	messages: + msg399790
2021-08-16 17:14:25	willingc	set	nosy: + willingc
2021-08-16 01:36:09	jack__d	set	nosy: + jack__d messages: + msg399632
2021-08-15 20:10:26	Marco Sulla	create