classification
Title: Speed up slot lookup for class creation
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: gvanrossum, inada.naoki, pitrou, scoder, serhiy.storchaka
Priority: low Keywords: patch

Created on 2017-12-16 12:21 by pitrou, last changed 2018-01-15 18:34 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
benchgcclasses.py pitrou, 2018-01-12 11:23
benchgcclasses2.py inada.naoki, 2018-01-12 11:43
Pull Requests
URL Status Linked Edit
PR 4902 closed pitrou, 2017-12-16 12:24
Messages (23)
msg308474 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-16 12:21
As mentioned in https://mail.python.org/pipermail/python-dev/2017-December/151298.html, slot lookup can take 75 to 80% of the runtime when creating a new class.  This was slightly improved in https://bugs.python.org/issue31336 but we're still doing one lookup per possible slot name *and* per class along the MRO.

I propose to speed up this step by caching the known descriptors for each class.  This way, when instantiating a new class, you can simply iterate over the known descriptors of the MRO without wasting time on hundreds of dict lookups.

(and it is reasonably easy to find all slot descriptors in a dict: first select only __dunder__ names, then look them up in a dedicated mapping)
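[Editor's note] The filtering step described above can be sketched in pure Python. This is only an illustrative analogy of the C-level logic in typeobject.c; the names `KNOWN_SLOT_NAMES` and `slot_descriptors` are hypothetical, and the real table covers every tp_/nb_/sq_/mp_ slot name:

```python
# Hypothetical pure-Python sketch: filter a class namespace down to
# __dunder__ names, then keep only those present in a dedicated mapping
# of known slot names.
KNOWN_SLOT_NAMES = frozenset({
    "__init__", "__new__", "__repr__", "__hash__",
    "__call__", "__len__", "__getitem__", "__add__",
})  # illustrative subset only

def slot_descriptors(namespace):
    """Return the subset of a class namespace that maps to C-level slots."""
    return {
        name: value
        for name, value in namespace.items()
        # cheap prefilter: dunder names only
        if name.startswith("__") and name.endswith("__")
        and name in KNOWN_SLOT_NAMES
    }

class Demo:
    def __len__(self):
        return 0
    def helper(self):
        pass

found = slot_descriptors(vars(Demo))
# only __len__ survives; __module__, __qualname__, helper etc. are filtered out
```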
msg308475 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-16 12:26
I posted https://github.com/python/cpython/pull/4902 for this.

This approach has two drawbacks:
- uses an additional tp_ slot (and a corresponding TPFLAGS)
- adds a bit of code complexity (but quite localized)
msg308476 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-16 12:32
Some micro-benchmarks:

$ ./python -m timeit "class Test: pass"
- before: 8.84 usec per loop
- after: 7.03 usec per loop

$ ./python -m timeit "class Test(tuple): pass"
- before: 10.1 usec per loop
- after: 8.4 usec per loop

$ ./python -m timeit -s "from logging import Logger" "class Test(Logger): pass"
- before: 12 usec per loop
- after: 6.25 usec per loop

$ ./python -m timeit -s "from logging.handlers import DatagramHandler" "class Test(DatagramHandler): pass"
- before: 15 usec per loop
- after: 6.68 usec per loop

$ ./python -m timeit -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
- before: 18.2 usec per loop
- after: 6.56 usec per loop

$ ./python -m timeit -s "from shelve import Shelf" "class Test(Shelf): pass"
- before: 28.6 usec per loop
- after: 18.4 usec per loop
msg308738 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 15:29
Updated benchmarks against git master (and using pyperf):

- before:

$ ./env-orig/bin/pyperf timeit -q "class Test: pass"
Mean +- std dev: 8.89 us +- 0.09 us
$ ./env-orig/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
Mean +- std dev: 12.0 us +- 0.2 us
$ ./env-orig/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
Mean +- std dev: 18.6 us +- 0.2 us

- after:

$ ./env/bin/pyperf timeit -q "class Test: pass"
Mean +- std dev: 6.90 us +- 0.11 us
$ ./env/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
Mean +- std dev: 5.86 us +- 0.08 us
$ ./env/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
Mean +- std dev: 6.13 us +- 0.08 us
msg308744 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-12-20 16:09
Due to its complexity, it will take some time to review this patch.
msg308747 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 16:14
Yes... The patch itself is not very complex, but you have to dive into the intricacies of typeobject.c to understand it :-/
msg308866 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-12-21 11:23
Is a new slot really required?

My idea is:

* Copy the slots from the MRO
* Iterate over the namespace dict and override a slot whenever its dunder is defined there

Is this impossible, or harder than I think?
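[Editor's note] One possible reading of the two-step idea above, modeled in pure Python (this is an assumed interpretation for illustration, not the actual C implementation; `_slots` and `resolve_slots` are hypothetical names):

```python
# Step 1: inherit the already-resolved slots from the bases.
# Step 2: override any slot whose dunder is defined directly in the
# new class's namespace.
SLOT_NAMES = ("__repr__", "__len__", "__getitem__")  # illustrative subset

def resolve_slots(bases, namespace):
    slots = {}
    for base in bases:
        # reuse the parent's resolved slots instead of re-searching the MRO
        slots.update(getattr(base, "_slots", {}))
    for name in SLOT_NAMES:
        if name in namespace:
            slots[name] = namespace[name]
    return slots

class Base:
    _slots = {"__repr__": "Base.__repr__"}

resolved = resolve_slots((Base,), {"__len__": "Sub.__len__"})
# inherits __repr__ from Base, overrides __len__ from the new namespace
```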
msg308867 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-21 11:37
I don't really understand your proposal, so it's hard to answer :-) Perhaps you can try to implement it so that we can compare?
msg309384 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-02 16:24
./python -m timeit -- "class A: pass"
- before: 6.63 usec per loop
- after:  5.41 usec per loop

./python -m timeit -s "class A: pass" -- "class B(A): pass"
- before: 7.04 usec per loop
- after:  4.91 usec per loop

./python -m timeit -s "class A: pass" -s "class B(A): pass" -- "class C(B): pass"
- before: 8.24 usec per loop
- after:  5.09 usec per loop

./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -- "class D(C): pass"
- before: 9.59 usec per loop
- after:  5.29 usec per loop

./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -s "class D(C): pass" -- "class E(D): pass"
- before: 10.9 usec per loop
- after:  5.45 usec per loop
msg309411 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-03 11:53
How about reusing tp_cache instead of adding a new slot?
msg309420 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-03 17:41
What exactly is tp_cache? Apparently it was added by Guido in https://github.com/python/cpython/commit/687ae00460da9cac04eb1ba8f6f5ab4db25fbfc2 but never used (at least officially).

commit 687ae00460da9cac04eb1ba8f6f5ab4db25fbfc2
Author: Guido van Rossum <guido@python.org>
Date:   Mon Oct 15 22:03:32 2001 +0000

    Get rid of __defined__ and tp_defined -- there's no need to
    distinguish __dict__ and __defined__ any more.  In the C structure,
    tp_cache takes its place -- but this hasn't been implemented yet.
msg309444 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-01-03 22:17
I don't recall what I meant to use tp_cache for, but everything I can find about it says it's unused, so let's kill it if we can. I guess the ABI requires that we keep the slot around until we find a new use for it?
msg309465 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-04 13:26
I see, so I should be able to reuse tp_cache for this PR.
msg309474 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-04 17:30
In my environment, python -X importtime -c 'import asyncio' speeds up from 53.2 ms to 51 ms.
While I like the faster startup time, I'm not sure a 4% speed-up is worth this complexity.
msg309476 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-04 18:00
> While I like faster startup time, I don't know 4% speed up is worth enough for this complexity.

There's no magic bullet that I know of to reduce startup time by a large amount.  So we're left with optimizing progressively.  For comparison, https://github.com/python/cpython/pull/3279 had an even smaller effect on startup time.

This change has the nice property that it's a genuine speed-up on a generic operation, and it reduces the cost of adding slots as well as the cost of having long inheritance chains.
msg309824 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-11 19:25
The relative speed-up looks nice, but it is just a few microseconds per class. You would have to create many thousands of classes to gain a significant fraction of a second. And the complexity added by this patch is not tiny.

I'm not sure how this caching works when the parent class is changed after the child class has been created.

Without caching, the benefit is 20-50% smaller. Perhaps this result can be improved. Actually, we don't need to search the dictionaries of all classes in the MRO. We can check the corresponding slot in the parent class (the offsets are constants) and look the name up in the class dict of the first class with a non-zero slot value.

I tried to simplify this change (even at the cost of a smaller speedup), but the result still looks too complex.

Since the benefit will be small in any case, and there seem to be no other issues that depend on this change, I suggest deferring it to 3.8. There are higher-priority changes that should be made before the feature freeze in 3.7.
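[Editor's note] The parent-slot shortcut described above can be modeled in pure Python. This is an assumed reading of the proposal, for illustration only (`resolve_slot` and `parent_slots` are hypothetical names; in C the check would be on the parent's tp_* field being NULL):

```python
# If the dunder is defined directly on the new class, it wins.
# Otherwise, reuse the parent's already-resolved slot: if the parent's
# slot is empty (None here, NULL in C), no class in the MRO defines the
# dunder, so the per-class dict lookups can be skipped entirely.
def resolve_slot(namespace, parent_slots, dunder):
    if dunder in namespace:
        return namespace[dunder]
    return parent_slots.get(dunder)

parent_slots = {"__len__": "A.__len__"}  # the parent resolved this already

resolve_slot({}, parent_slots, "__len__")                    # inherited
resolve_slot({}, parent_slots, "__call__")                   # slot empty: skip
resolve_slot({"__len__": "B.__len__"}, parent_slots, "__len__")  # overridden
```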
msg309828 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-11 19:46
On 11/01/2018 at 20:25, Serhiy Storchaka wrote:
> 
> I'm not sure how this caching works when the parent class is changed after the child class has been created.

The caching is invalidated at the same place the method cache is
invalidated.

> Without caching, the benefit is 20-50% smaller. Perhaps this result can be improved. Actually, we don't need to search the dictionaries of all classes in the MRO. We can check the corresponding slot in the parent class (the offsets are constants) and look the name up in the class dict of the first class with a non-zero slot value.

I'm not sure that's always true.  The relationship between class-dict
attributes and tp_ slots is not one-to-one; there's more complicated
stuff going on for some of the descriptors.  My patch keeps the current
algorithm (ensuring no descriptor gets broken) but optimizes the lookups.

> There are more priority changes that should be made before the feature freeze in 3.7.

I am not working on any of those changes, so deferring this PR will not
have any effect on those changes.
msg309845 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-12 10:06
As I understand it, this patch creates a cache for every class,
not only for parent classes. The caches may hold many tuples,
which are GC-tracked because they contain function descriptors,
and they actually create reference cycles.

Am I correct?

If so, I would like an estimate of the GC overhead and memory overhead
of the cache for a large project. Is it really negligible?
msg309848 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-12 11:23
Here is a simple script creating 10000 classes (a large number, but perhaps not out of sight for a large application importing a lot of libraries (*)).

(*) see the experiment I did in https://mail.python.org/pipermail/python-dev/2017-December/151260.html

Before:
$ ./python-orig -I benchgcclasses.py 
GC time: 6.8 ms
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
antoine  11248  0.0  0.3  41296 24576 pts/1    S+   12:18   0:00 ./python-orig -I benchgcclasses.py

After:
$ ./python -I benchgcclasses.py 
GC time: 6.9 ms
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
antoine  11097  0.0  0.3  41300 24740 pts/1    S+   12:18   0:00 ./python -I benchgcclasses.py


RSS is a bit unstable from run to run, but roughly the patch seems to add 100 to 200KB in this case.

As for full GC time, it is quite stable and there's a 0.1ms increase with the patch.

Note this is really a worst-case benchmark: lots of classes, no methods, no user data beside the classes.
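[Editor's note] The uploaded benchgcclasses.py is not reproduced in the tracker. A minimal script in the same spirit (many empty classes, then timing a full collection) might look like this; the details below are a guess, not the actual uploaded file:

```python
import gc
import time

# create many empty classes, as a worst case: no methods, no user data
classes = [type(f"C{i}", (), {}) for i in range(10000)]

gc.collect()                     # settle the younger generations first
t0 = time.perf_counter()
gc.collect()                     # time one full (generation 2) pass
elapsed = time.perf_counter() - t0
print(f"GC time: {elapsed * 1e3:.1f} ms")
```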
msg309849 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-12 11:28
Serhiy:

> The relative speed-up looks nice, but it is just a few microseconds per class. You would have to create many thousands of classes to gain a significant fraction of a second.

This work started with your message in https://mail.python.org/pipermail/python-dev/2017-December/151277.html pointing out that we shouldn't add tp_ slots because it makes type creation and Python initialization slower (*), so I'm a bit surprised that you're now claiming speeding up type creation (by up to 3x!) is not important...

(*) To quote:

'''
2. Increased class initialization time. For every class for every slot 
we need to look up corresponding methods in dictionaries of the class 
itself and all its parents (caching doesn't work fine at this stage). 
Significant part of class initialization time is spent on initializing 
slots. This will increase the startup time and the time of creating 
local classes. The relative overhead is more significant in Cython.
'''
msg309850 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-12 11:43
> Note this is really a worst-case benchmark: lots of classes, no methods, no user data beside the classes.

Since benchgcclasses.py doesn't create any dunder methods, the cache
holds no GC-tracked tuples and no reference cycles.

benchgcclasses2.py adds three dunder methods:

$ ./python benchgcclasses2.py
GC time: 13.0 ms
gc: collecting generation 2...
gc: objects in each generation: 1 0 94942
gc: objects in permanent generation: 0
gc: done, 0.0131s elapsed
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
inada-n  29604  0.0  0.1  47052 30692 pts/2    S+   20:42   0:00 ./python benchgcclasses.py

$ ./python-patched benchgcclasses2.py
GC time: 17.2 ms
gc: collecting generation 2...
gc: objects in each generation: 1 0 135220
gc: objects in permanent generation: 0
gc: done, 0.0173s elapsed
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
inada-n  29626  0.0  0.2  49880 33464 pts/2    S+   20:43   0:00 ./python-patched benchgcclasses.py
msg309851 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-12 11:47
> Since benchgcclasses.py doesn't create any dunder methods, the cache
> holds no GC-tracked tuples and no reference cycles.

Hmm, you're right, thank you.  I can also reproduce your numbers here (using benchgcclasses2.py).  There's definitely an impact.
msg309892 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-13 17:38
I think that Inada is right: there's too much impact on memory consumption and garbage collection to call this an optimization.

Serhiy suggested removing the cache.  But, once you remove the cache, you're going to iterate over all values of all parent classes' dicts to find which values are __dunder__ methods.  This risks being much more expensive for classes with tens or hundreds of namespace entries, making the optimization even less interesting.

So in the end I think I'm going to reject this issue.
History
Date User Action Args
2018-01-15 18:34:21pitrousetstatus: pending -> closed
2018-01-13 17:38:50pitrousetstatus: open -> pending
resolution: rejected
messages: + msg309892

stage: patch review -> resolved
2018-01-12 11:47:33pitrousetmessages: + msg309851
2018-01-12 11:43:42inada.naokisetfiles: + benchgcclasses2.py

messages: + msg309850
2018-01-12 11:28:29pitrousetmessages: + msg309849
2018-01-12 11:23:25pitrousetfiles: + benchgcclasses.py

messages: + msg309848
2018-01-12 10:06:27inada.naokisetmessages: + msg309845
2018-01-11 19:46:44pitrousetmessages: + msg309828
2018-01-11 19:25:13serhiy.storchakasetmessages: + msg309824
2018-01-04 18:00:48pitrousetmessages: + msg309476
2018-01-04 17:30:07inada.naokisetmessages: + msg309474
2018-01-04 13:26:00pitrousetmessages: + msg309465
2018-01-03 22:17:59gvanrossumsetmessages: + msg309444
2018-01-03 17:41:13pitrousetnosy: + gvanrossum
messages: + msg309420
2018-01-03 11:53:36inada.naokisetmessages: + msg309411
2018-01-02 16:24:54serhiy.storchakasetmessages: + msg309384
2017-12-21 11:37:33pitrousetmessages: + msg308867
2017-12-21 11:23:21inada.naokisetnosy: + inada.naoki
messages: + msg308866
2017-12-20 16:14:16pitrousetmessages: + msg308747
2017-12-20 16:09:37serhiy.storchakasetmessages: + msg308744
2017-12-20 15:29:11pitrousetmessages: + msg308738
2017-12-16 13:34:53serhiy.storchakasetpriority: normal -> low
2017-12-16 12:32:32pitrousetstage: patch review
2017-12-16 12:32:04pitrousetmessages: + msg308476
2017-12-16 12:26:37pitrousetmessages: + msg308475
stage: patch review -> (no value)
2017-12-16 12:24:09pitrousetkeywords: + patch
stage: patch review
pull_requests: + pull_request4797
2017-12-16 12:21:18pitroucreate