classification
Title: Speed up _PyType_Lookup() for class creation
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: haypo Nosy List: christian.heimes, haypo, inada.naoki, pitrou, rhettinger, scoder, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-09-04 11:39 by scoder, last changed 2017-09-14 08:09 by scoder.

Pull Requests
URL Status Linked Edit
PR 3279 open scoder, 2017-09-04 11:39
Messages (26)
msg301213 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 11:39
The method lookup fast path in _PyType_Lookup() does not apply during type creation, which is highly dominated by the performance of the dict lookups along the mro chain. Pre-calculating the name hash speeds up the creation of an empty class (i.e. "class Test: pass") by around 20%.

Will send a pull request shortly.
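The slow path being discussed is essentially a walk over the MRO that probes each class dict for the same name, re-entering the dict machinery on every step. A minimal Python sketch of that lookup logic (illustrative only, not CPython's C implementation):

```python
def type_lookup(cls, name):
    # Walk the MRO and return the first match, like _PyType_Lookup():
    # every iteration is a separate dict probe for the same name,
    # which is where the repeated per-lookup overhead comes from.
    for base in cls.__mro__:
        if name in base.__dict__:
            return base.__dict__[name]
    return None

class Base:
    def ping(self):
        return "pong"

class Child(Base):
    pass

assert type_lookup(Child, "ping") is Base.__dict__["ping"]
assert type_lookup(Child, "missing") is None
```

Precomputing the hash once and passing it to a known-hash lookup saves re-entering the generic probe for each base, which is the core of the patch.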
msg301214 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-04 11:49
Could you please provide benchmarks that you used?
msg301215 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 11:54
I literally just ran timeit on "class Test: pass", but I'll see if I can provide proper numbers.
msg301216 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 12:09
Comparing against CPython master as of 122e88a8354e3f75aeaf6211232dac88ac296d54

I rebuilt my CPython to get clean results, and that still gave me almost 15% overall speedup.

Original:
$ ./python -m timeit 'class Test: pass'
20000 loops, best of 5: 9.55 usec per loop

Patched:
$ ./python -m timeit 'class Test: pass'
50000 loops, best of 5: 8.27 usec per loop

I came across this when benchmarking the class creation in Cython, which seemed slow, and after a couple of tweaks, I ended up with 97% of the runtime inside of CPython, so I took a look over the fence.

According to callgrind, the original version spent over 125M instructions inside of _PyType_Lookup() for the timeit run, whereas the patched version only counts about 92M, that's about 25% less.
msg301220 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-04 12:53
What names are looked up when you create an empty class?

I'm surprised that this change has a measurable performance effect. Aren't the names interned strings with precalculated hashes?
msg301221 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 13:00
It's the slot names in "slotdefs". See "update_one_slot()".

The time that is saved is mostly the overhead of calling PyDict_GetItem(). I actually tried PyDict_GetItemWithError() first, which is faster due to the lower error handling overhead, before I noticed that the real problem is the repeated lookups of the same name, which can be optimised further by pre-calculating the hash and calling the low-level lookup function.
msg301227 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-04 17:18
I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().
msg301232 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-04 17:45
After adding PyErr_Clear() the benefit of this optimization is smaller: only 8% on my 64-bit computer, and no effect at all on 32-bit. I wonder if it is worth complicating the code for such a small benefit. For non-empty classes the relative benefit is even smaller. Maybe there are other opportunities for optimization?
msg301240 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 18:47
> I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().

Do you mean for the table of slot descriptions? I'm not sure that the effect would be comparable.

> Maybe there are other opportunities for optimization?

I would guess so. According to callgrind, for the timeit run, the original implementation has the following call fanout for the type creation:

- 1755 calls to type_call()
- 3850 calls to type_new()
- 224000 calls to update_one_slot()
- 317000 calls to _PyType_Lookup()
- 566000 calls to PyDict_GetItem()

That's 147 calls to PyDict_GetItem() per type creation, just inside of _PyType_Lookup().

About 20% of the time in PyDict_GetItem() is spent in PyErr_Clear(), and 23% and 26% respectively in lookdict_unicode_nodummy() (349000 calls) and lookdict_unicode() (278000 calls).

There is probably some space for algorithmic improvements here, in order to reduce the overall number of calls. For example, I wonder if bypassing the method cache while building the type might help. The cache maintenance seems to account for something like 30% of the time spent in _PyType_Lookup(). Or reversing the inspection order, i.e. running over the type attributes and looking up the slots, instead of running over the slots and looking up the attributes. Or using a faster set intersection algorithm for that purpose. Something like that.

OTOH, quick gains might be worth it, but since applications that find their bottleneck in type creation are probably rare, I doubt that it's worth putting weeks into this. Even the application startup time is unlikely to be dominated by *type* creations, rather than *object* instantiations.
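The "reversed inspection order" idea above can be sketched in Python: instead of probing the class dict once per slot name, walk the (usually small) class body namespace once and intersect it with the slot-name set. The slot names below are an illustrative subset, not the real slotdefs table:

```python
# Illustrative subset of slot names; the real table is slotdefs in
# CPython's Objects/typeobject.c.
SLOT_NAMES = frozenset({"__init__", "__repr__", "__len__", "__getitem__"})

def slots_to_update(namespace):
    # One pass over the class body instead of one dict probe per slot:
    # the set intersection yields exactly the slots this class overrides.
    return SLOT_NAMES & set(namespace)

assert slots_to_update({"__init__": None, "helper": None}) == {"__init__"}
assert slots_to_update({"x": 1}) == frozenset()
```

For an empty class body the intersection is computed in one pass, instead of ~150 per-slot dict probes per type creation as counted above.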
msg301300 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-09-05 06:20
There is special internal API in dictobject.c
_PyDict_LoadGlobal(PyDictObject *globals, PyDictObject *builtins, PyObject *key)

Maybe, we can have special API like that
_PyDict_LookupMro(PyObject *mro, PyObject *key);

BTW, the method cache in _PyType_Lookup() is not good for initializing a type.
It always misses, and every specialized method gets cached even if it isn't used anymore.

Stefan, would you benchmark these ideas?
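The "always misses" point can be illustrated with a toy model: a method cache keyed on a per-type version tag. Every slot update during type initialization bumps the tag, so each lookup lands on a fresh key and misses. This is a hypothetical sketch, not CPython's actual cache layout:

```python
class ToyMethodCache:
    # Hypothetical version-tagged lookup cache; CPython's real method
    # cache in typeobject.c differs in layout and invalidation details.
    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, type_version, name, slow_lookup):
        key = (type_version, name)
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = slow_lookup(name)
        self.entries[key] = value  # stale as soon as the version bumps
        return value

cache = ToyMethodCache()
attrs = {"__init__": "slot", "__repr__": "slot"}
version = 0
# During type initialization every slot update bumps the version tag,
# so each lookup sees a fresh key and never hits the cache:
for name in attrs:
    cache.lookup(version, name, attrs.get)
    version += 1
assert cache.misses == 2 and cache.hits == 0
```

This is why bypassing the cache during slot updates (as the updated PR does) avoids pure overhead: the entries it would write are invalidated before they can ever be read.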
msg301347 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-05 18:10
I updated the pull request with a split version of _PyType_Lookup() that bypasses the method cache during slot updates. I also ran the benchmarks with PGO enabled now to get realistic results. The overall gain is around 15%.

Original:
$ ./python -m timeit 'class Test: pass'
20000 loops, best of 5: 7.29 usec per loop

Patched:
$ ./python -m timeit 'class Test: pass'
50000 loops, best of 5: 6.15 usec per loop

Patched with non-trivial bases:
$ ./python -m timeit 'class Test(object): pass'
50000 loops, best of 5: 6.05 usec per loop
$ ./python -m timeit 'class Test(type): pass'
50000 loops, best of 5: 6.08 usec per loop
$ ./python -m timeit 'class Test(int): pass'
50000 loops, best of 5: 9.08 usec per loop

I do not consider the optimisations a code degradation.

There is one semantic change: the new function _PyType_LookupUncached() returns on the first error and might set an exception. I considered that better behaviour than the old function, but it means that if one base class name lookup fails and the next one previously succeeded, it will no longer succeed now. I don't have a strong feeling about this and would change it back if compatibility is considered more valuable. It generally feels wrong to have errors pass silently here, however unlikely they may be in practice.
msg301350 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-05 18:18
Please use perf.timeit not timeit for microbenchmarks: https://perf.readthedocs.io/en/latest/
msg301364 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-05 19:24
Since I'm getting highly reproducible results on re-runs, I tend to trust these numbers.
msg301446 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-06 06:39
BTW, it seems that Yury's dict copy optimisation would also help here. When I use a benchmark scenario with a simple non-empty method/attribute dict (from Cython this time), almost 10% of the creation time is spent copying that dict, which should essentially just be a memcpy() since it doesn't need any resizing at that point.
msg301452 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-09-06 07:18
Confirmed:
$ ./python-patched -m perf timeit --compare-to `pwd`/python -- 'class C: pass'
python: ..................... 11.9 us +- 0.1 us
python-patched: ..................... 10.3 us +- 0.1 us

Mean +- std dev: [python] 11.9 us +- 0.1 us -> [python-patched] 10.3 us +- 0.1 us: 1.15x faster (-13%)
msg301823 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-10 18:01
Any more comments on the proposed implementation? 13-15% seem worth it to me.

@Victor, or are you saying "PyId, or no change at all"?
msg301824 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-09-10 18:16
Good work, Stefan! It's an impressive speedup of class creation.

It looks like you have not yet addressed Serhiy's comment: https://github.com/python/cpython/pull/3279/files#r136815648
msg301825 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-10 18:40
My comment was addressed.

As for using the _Py_IDENTIFIER API, I don't think it is related. The speedup was caused by avoiding a few simple checks and function calls. The _Py_IDENTIFIER API is great, but it has a small overhead which is comparable to the small difference caused by Stefan's patch.
msg301827 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-10 18:42
No, that one was addressed. I think only Victor's comment is still open, that's why I asked back.
msg301830 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-09-10 19:18
Victor is currently travelling and recovering from jetlag. I'm sure he'll reply within a day or two.
msg301878 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-11 15:26
I ran a microbenchmark on the current PR 3279 using:

./python -m perf timeit --inherit=PYTHONPATH 'class C: pass'

Result:

haypo@selma$ ./python -m perf compare_to ref.json patch.json 
Mean +- std dev: [ref] 9.71 us +- 0.38 us -> [patch] 8.74 us +- 0.22 us: 1.11x faster (-10%)

I compiled Python using "./configure && make", no LTO nor PGO.
msg301887 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-11 16:53
Serhiy on the PR: "This is overgeneralization. Can tp_dict be not exact dict at all? I don't think this is possible. In many places concrete dict API is used with tp_dict. If you want to allow tp_dict be not exact dict, please open a separate issue for this."

Using the following code, A.__dict__ type is dict even if the metaclass creates a different type, probably because type_new() calls PyDict_Copy(orig_dict):
---
class mydict(dict):
    def __setitem__(self, name, value):
        if name == "__module__":
            value = "<mock module>"
        super().__setitem__(name, value)

class MetaClass(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        return mydict()

class A(metaclass=MetaClass):
    pass

print(A.__module__)
---
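A self-contained restatement of the snippet above with assertions, assuming that STORE_NAME in the class body goes through the subclass's __setitem__ while type_new() then copies the namespace into an exact dict:

```python
import types

class mydict(dict):
    def __setitem__(self, name, value):
        if name == "__module__":
            value = "<mock module>"
        super().__setitem__(name, value)

class MetaClass(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        return mydict()

class A(metaclass=MetaClass):
    pass

# The overridden __setitem__ did run during class-body execution
# (STORE_NAME uses PyObject_SetItem for non-exact dict namespaces):
assert A.__module__ == "<mock module>"
# ...but tp_dict ends up as an exact dict (exposed via a mappingproxy),
# because type_new() copies the prepared namespace with PyDict_Copy().
assert type(A.__dict__) is types.MappingProxyType
```

So a metaclass can customize how the class body is recorded, but it cannot make tp_dict itself a non-exact dict, which is why using the concrete dict API on tp_dict is safe.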
msg302052 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-13 10:13
On my computer the speedup is 13% without LTO and PGO and around 20% with LTO and PGO. Building with LTO and PGO adds another 20%.
msg302104 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-13 18:42
> Since _PyDict_GetItem_KnownHash() may or may not set an exception, we have to check for a live exception after calling it, and that finds the old exception of the last attribute lookup and decides that its own lookup failed.

Hmm, with PyDict_GetItem() we don't falsely detect a failing lookup when a live exception is set. Is it correct to call _PyType_Lookup() with an exception set? Perhaps we should save the current exception before calling find_name_in_mro() and restore it after. Or raise SystemError if an exception is set. Or just add assert(!PyErr_Occurred()) at the beginning of find_name_in_mro(). I don't know what is more correct.
msg302108 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-13 19:49
> Is it correct to call _PyType_Lookup() with an exception set?

The general rule of thumb is that it's not safe to call any user code with a live exception set, and lookups can call into user code.

I quickly looked through all occurrences (there aren't that many actually), and they all seem to be called from a context where a live exception would be wrong.

> Perhaps we should save a current exception before calling find_name_in_mro() and restore it after.

I thought about that, too, but it feels wrong. This just shouldn't be necessary. It's ok to save away exceptions in exception related code, but general purpose code shouldn't have to assume that it gets called with a live exception.

> Or just add assert(!PyErr_Occurred()) at the beginning of find_name_in_mro().

That would catch the error at hand. The only concern is backwards compatibility. But it's easy to argue that it's the foreign code that is broken in this case (as showcased by cElementTree) and clearly needs fixing if this occurs. An assert() seems like a reasonable way to catch this kind of bug, at least in a debug build.
msg302157 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-14 08:09
One more thing: the fact that the lookup does not propagate exceptions leaves some space for ambiguity. If, in a chain of MRO lookups, one would fail and a later one would succeed, is it correct to keep trying? What if the first failure actually failed to see an otherwise existing attribute in that class? We simply can't know because the lookup failed for *some* reason.

I think, the only way to really avoid that ambiguity would be to allow _PyType_Lookup() to raise exceptions, and to make it fail on the first error. I've created issue 31465 for this.
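The ambiguity can be demonstrated with a toy MRO walk over hypothetical per-class dicts: once a lookup error is swallowed, it is indistinguishable from a plain miss, and a later class silently wins even though an earlier class actually defined the attribute:

```python
class FailingDict(dict):
    # Hypothetical stand-in for a base class dict whose lookup fails
    # for *some* reason (e.g. user code raising during comparison).
    def __getitem__(self, key):
        raise RuntimeError("lookup failed for some reason")

def lookup_swallowing(mro_dicts, name):
    # Old-style error-swallowing walk: any failure is treated as a miss.
    for d in mro_dicts:
        try:
            return d[name]
        except Exception:
            continue  # swallowed: error and absence look identical
    return None

# The first class *does* define "attr", but its failing lookup is
# swallowed, so the later class's value is returned instead.
mro = [FailingDict(attr="hidden"), {"attr": "later"}]
assert lookup_swallowing(mro, "attr") == "later"
```

Letting _PyType_Lookup() propagate the first error instead removes this ambiguity, at the cost of the compatibility question discussed above.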