classification
Title: Speed up slot lookup for class creation
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: gvanrossum, inada.naoki, pitrou, scoder, serhiy.storchaka
Priority: low Keywords: patch

Created on 2017-12-16 12:21 by pitrou, last changed 2018-01-15 18:34 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
benchgcclasses.py pitrou, 2018-01-12 11:23
benchgcclasses2.py inada.naoki, 2018-01-12 11:43
Pull Requests
URL Status Linked Edit
PR 4902 closed pitrou, 2017-12-16 12:24
Messages (23)
msg308474 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-16 12:21
As mentioned in https://mail.python.org/pipermail/python-dev/2017-December/151298.html, slot lookup can take 75 to 80% of the runtime when creating a new class.  This was slightly improved in https://bugs.python.org/issue31336 but we're still doing one lookup per possible slot name *and* per class along the MRO.

I propose to speed up this step by caching the known descriptors for each class.  This way, when instantiating a new class, you can simply iterate over the known descriptors of the MRO without wasting time on hundreds of dict lookups.

(and it is reasonably easy to find all slot descriptors in a dict: first select only __dunder__ names, then look them up in a dedicated mapping)
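[Editor's note] The filtering step described above can be sketched in pure Python. This is only an illustrative analogy of the C-level logic in typeobject.c; the names `KNOWN_SLOT_NAMES` and `slot_descriptors` are hypothetical, and the real table covers every tp_/nb_/sq_/mp_ slot name:

```python
# Hypothetical pure-Python sketch: filter a class namespace down to
# __dunder__ names, then keep only those present in a dedicated mapping
# of known slot names.
KNOWN_SLOT_NAMES = frozenset({
    "__init__", "__new__", "__repr__", "__hash__",
    "__call__", "__len__", "__getitem__", "__add__",
})  # illustrative subset only

def slot_descriptors(namespace):
    """Return the subset of a class namespace that maps to C-level slots."""
    return {
        name: value
        for name, value in namespace.items()
        # cheap prefilter: dunder names only
        if name.startswith("__") and name.endswith("__")
        and name in KNOWN_SLOT_NAMES
    }

class Demo:
    def __len__(self):
        return 0
    def helper(self):
        pass

found = slot_descriptors(vars(Demo))
# only __len__ survives; __module__, __qualname__, helper etc. are filtered out
```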
msg308475 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-16 12:26
I posted https://github.com/python/cpython/pull/4902 for this.

This approach has two drawbacks:
- uses an additional tp_ slot (and a corresponding TPFLAGS)
- adds a bit of code complexity (but quite localized)
msg308476 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-16 12:32
Some micro-benchmarks:

$ ./python -m timeit "class Test: pass"
- before: 8.84 usec per loop
- after: 7.03 usec per loop

$ ./python -m timeit "class Test(tuple): pass"
- before: 10.1 usec per loop
- after: 8.4 usec per loop

$ ./python -m timeit -s "from logging import Logger" "class Test(Logger): pass"
- before: 12 usec per loop
- after: 6.25 usec per loop

$ ./python -m timeit -s "from logging.handlers import DatagramHandler" "class Test(DatagramHandler): pass"
- before: 15 usec per loop
- after: 6.68 usec per loop

$ ./python -m timeit -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
- before: 18.2 usec per loop
- after: 6.56 usec per loop

$ ./python -m timeit -s "from shelve import Shelf" "class Test(Shelf): pass"
- before: 28.6 usec per loop
- after: 18.4 usec per loop
msg308738 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 15:29
Updated benchmarks against git master (and using pyperf):

- before:

$ ./env-orig/bin/pyperf timeit -q "class Test: pass"
Mean +- std dev: 8.89 us +- 0.09 us
$ ./env-orig/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
Mean +- std dev: 12.0 us +- 0.2 us
$ ./env-orig/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
Mean +- std dev: 18.6 us +- 0.2 us

- after:

$ ./env/bin/pyperf timeit -q "class Test: pass"
Mean +- std dev: 6.90 us +- 0.11 us
$ ./env/bin/pyperf timeit -q -s "from logging import Logger" "class Test(Logger): pass"
Mean +- std dev: 5.86 us +- 0.08 us
$ ./env/bin/pyperf timeit -q -s "from unittest.mock import MagicMock" "class Test(MagicMock): pass"
Mean +- std dev: 6.13 us +- 0.08 us
msg308744 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-12-20 16:09
Due to its complexity, it will take some time to review this patch.
msg308747 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 16:14
Yes... The patch itself is not very complex, but you have to dive into the intricacies of typeobject.c to understand it :-/
msg308866 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-12-21 11:23
Is a new slot really required?

My idea is:

* Copy the slots from the MRO
* Iterate over the namespace dict and override a slot whenever its dunder is defined there

Is this impossible, or harder than I think?
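[Editor's note] One possible reading of the two-step idea above, modeled in pure Python (this is an assumed interpretation for illustration, not the actual C implementation; `_slots` and `resolve_slots` are hypothetical names):

```python
# Step 1: inherit the already-resolved slots from the bases.
# Step 2: override any slot whose dunder is defined directly in the
# new class's namespace.
SLOT_NAMES = ("__repr__", "__len__", "__getitem__")  # illustrative subset

def resolve_slots(bases, namespace):
    slots = {}
    for base in bases:
        # reuse the parent's resolved slots instead of re-searching the MRO
        slots.update(getattr(base, "_slots", {}))
    for name in SLOT_NAMES:
        if name in namespace:
            slots[name] = namespace[name]
    return slots

class Base:
    _slots = {"__repr__": "Base.__repr__"}

resolved = resolve_slots((Base,), {"__len__": "Sub.__len__"})
# inherits __repr__ from Base, overrides __len__ from the new namespace
```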
msg308867 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-21 11:37
I don't really understand your proposal, so it's hard to answer :-) Perhaps you can try to implement it so that we can compare?
msg309384 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-02 16:24
./python -m timeit -- "class A: pass"
- before: 6.63 usec per loop
- after:  5.41 usec per loop

./python -m timeit -s "class A: pass" -- "class B(A): pass"
- before: 7.04 usec per loop
- after:  4.91 usec per loop

./python -m timeit -s "class A: pass" -s "class B(A): pass" -- "class C(B): pass"
- before: 8.24 usec per loop
- after:  5.09 usec per loop

./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -- "class D(C): pass"
- before: 9.59 usec per loop
- after:  5.29 usec per loop

./python -m timeit -s "class A: pass" -s "class B(A): pass" -s "class C(B): pass" -s "class D(C): pass" -- "class E(D): pass"
- before: 10.9 usec per loop
- after:  5.45 usec per loop
msg309411 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-03 11:53
How about reusing tp_cache instead of adding a new slot?
msg309420 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-03 17:41
What exactly is tp_cache? Apparently it was added by Guido in https://github.com/python/cpython/commit/687ae00460da9cac04eb1ba8f6f5ab4db25fbfc2 but never used (at least officially).

commit 687ae00460da9cac04eb1ba8f6f5ab4db25fbfc2
Author: Guido van Rossum <guido@python.org>
Date:   Mon Oct 15 22:03:32 2001 +0000

    Get rid of __defined__ and tp_defined -- there's no need to
    distinguish __dict__ and __defined__ any more.  In the C structure,
    tp_cache takes its place -- but this hasn't been implemented yet.
msg309444 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-01-03 22:17
I don't recall what I meant to use tp_cache for, but everything I can find about it says it's unused, so let's kill it if we can. I guess the ABI requires that we keep the slot around until we find a new use for it?
msg309465 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-04 13:26
I see, so I should be able to reuse tp_cache for this PR.
msg309474 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-04 17:30
In my environment, python -X importtime -c 'import asyncio' speeds up from 53.2 ms to 51 ms.
While I like the faster startup time, I'm not sure a 4% speed-up is worth this complexity.
msg309476 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-04 18:00
> While I like faster startup time, I don't know 4% speed up is worth enough for this complexity.

There's no magic bullet that I know of to reduce startup time by a large amount.  So we're left with optimizing progressively.  For comparison, https://github.com/python/cpython/pull/3279 had an even smaller effect on startup time.

This change has the nice property that it's a genuine speed-up on a generic operation, and it reduces the cost of adding slots as well as the cost of having long inheritance chains.
msg309824 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-11 19:25
The relative speed-up looks nice, but it is just a few microseconds per class. You would have to create many thousands of classes to gain a significant fraction of a second. And the complexity added by this patch is not tiny.

I'm not sure how this caching works when the parent class is changed after the child class has been created.

Without caching, the benefit is 20-50% smaller. Perhaps this result can be improved. Actually, we don't need to search the dictionaries of all classes in the MRO. We can check the corresponding slot in the parent class (the offsets are constants) and look the name up in the class dict of the first class with a non-zero slot value.

I tried to simplify this change (even at the cost of a smaller speedup), but the result still looks too complex.

Since the benefit will be small in any case, and there seem to be no other issues that depend on this change, I suggest deferring it to 3.8. There are higher-priority changes that should be made before the feature freeze in 3.7.
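[Editor's note] The parent-slot shortcut described above can be modeled in pure Python. This is an assumed reading of the proposal, for illustration only (`resolve_slot` and `parent_slots` are hypothetical names; in C the check would be on the parent's tp_* field being NULL):

```python
# If the dunder is defined directly on the new class, it wins.
# Otherwise, reuse the parent's already-resolved slot: if the parent's
# slot is empty (None here, NULL in C), no class in the MRO defines the
# dunder, so the per-class dict lookups can be skipped entirely.
def resolve_slot(namespace, parent_slots, dunder):
    if dunder in namespace:
        return namespace[dunder]
    return parent_slots.get(dunder)

parent_slots = {"__len__": "A.__len__"}  # the parent resolved this already

resolve_slot({}, parent_slots, "__len__")                    # inherited
resolve_slot({}, parent_slots, "__call__")                   # slot empty: skip
resolve_slot({"__len__": "B.__len__"}, parent_slots, "__len__")  # overridden
```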
msg309828 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-11 19:46
On 11/01/2018 at 20:25, Serhiy Storchaka wrote:
> 
> I'm not sure how this caching works when the parent class is changed after the child class has been created.

The caching is invalidated at the same place the method cache is
invalidated.

> Without caching, the benefit is 20-50% smaller. Perhaps this result can be improved. Actually, we don't need to search the dictionaries of all classes in the MRO. We can check the corresponding slot in the parent class (the offsets are constants) and look the name up in the class dict of the first class with a non-zero slot value.

I'm not sure that's always true.  The relationship between class-dict
attributes and tp_ slots is not one-to-one; there's more complicated
stuff going on for some of the descriptors.  My patch keeps the current
algorithm (ensuring no descriptor gets broken) but optimizes the lookups.

> There are more priority changes that should be made before the feature freeze in 3.7.

I am not working on any of those changes, so deferring this PR will not
have any effect on those changes.
msg309845 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-12 10:06
As I understand it, this patch creates a cache for every class,
not only for parent classes. The caches may hold many tuples,
which are GC-tracked because they contain function descriptors,
and they actually create reference cycles.

Am I correct?

If so, I would like an estimate of the GC overhead and memory overhead
of the cache for a large project. Is it really negligible?
msg309848 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-12 11:23
Here is a simple script creating 10000 classes (a large number, but perhaps not out of sight for a large application importing a lot of libraries (*)).

(*) see the experiment I did in https://mail.python.org/pipermail/python-dev/2017-December/151260.html

Before:
$ ./python-orig -I benchgcclasses.py 
GC time: 6.8 ms
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
antoine  11248  0.0  0.3  41296 24576 pts/1    S+   12:18   0:00 ./python-orig -I benchgcclasses.py

After:
$ ./python -I benchgcclasses.py 
GC time: 6.9 ms
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
antoine  11097  0.0  0.3  41300 24740 pts/1    S+   12:18   0:00 ./python -I benchgcclasses.py


RSS is a bit unstable from run to run, but roughly the patch seems to add 100 to 200KB in this case.

As for full GC time, it is quite stable and there's a 0.1ms increase with the patch.

Note this is really a worst-case benchmark: lots of classes, no methods, no user data beside the classes.
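[Editor's note] The uploaded benchgcclasses.py is not reproduced in the tracker. A minimal script in the same spirit (many empty classes, then timing a full collection) might look like this; the details below are a guess, not the actual uploaded file:

```python
import gc
import time

# create many empty classes, as a worst case: no methods, no user data
classes = [type(f"C{i}", (), {}) for i in range(10000)]

gc.collect()                     # settle the younger generations first
t0 = time.perf_counter()
gc.collect()                     # time one full (generation 2) pass
elapsed = time.perf_counter() - t0
print(f"GC time: {elapsed * 1e3:.1f} ms")
```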
msg309849 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-12 11:28
Serhiy:

> The relative speed-up looks nice, but it is just a few microseconds per class. You would have to create many thousands of classes to gain a significant fraction of a second.

This work started with your message in https://mail.python.org/pipermail/python-dev/2017-December/151277.html pointing out that we shouldn't add tp_ slots because it makes type creation and Python initialization slower (*), so I'm a bit surprised that you're now claiming speeding up type creation (by up to 3x!) is not important...

(*) To quote:

'''
2. Increased class initialization time. For every class for every slot 
we need to look up corresponding methods in dictionaries of the class 
itself and all its parents (caching doesn't work fine at this stage). 
Significant part of class initialization time is spent on initializing 
slots. This will increase the startup time and the time of creating 
local classes. The relative overhead is more significant in Cython.
'''
msg309850 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-01-12 11:43
> Note this is really a worst-case benchmark: lots of classes, no methods, no user data beside the classes.

Since benchgcclasses.py doesn't create any dunder methods, the cache
holds no GC-tracked tuples and no reference cycles.

benchgcclasses2.py adds three dunder methods:

$ ./python benchgcclasses2.py
GC time: 13.0 ms
gc: collecting generation 2...
gc: objects in each generation: 1 0 94942
gc: objects in permanent generation: 0
gc: done, 0.0131s elapsed
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
inada-n  29604  0.0  0.1  47052 30692 pts/2    S+   20:42   0:00 ./python benchgcclasses.py

$ ./python-patched benchgcclasses2.py
GC time: 17.2 ms
gc: collecting generation 2...
gc: objects in each generation: 1 0 135220
gc: objects in permanent generation: 0
gc: done, 0.0173s elapsed
RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
inada-n  29626  0.0  0.2  49880 33464 pts/2    S+   20:43   0:00 ./python-patched benchgcclasses.py
msg309851 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-12 11:47
> Since benchgcclasses.py doesn't create any dunder methods, the cache
> holds no GC-tracked tuples and no reference cycles.

Hmm, you're right, thank you.  I can also reproduce your numbers here (using benchgcclasses2.py).  There's definitely an impact.
msg309892 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-13 17:38
I think that Inada is right: there's too much impact on memory consumption and garbage collection to call this an optimization.

Serhiy suggested removing the cache.  But, once you remove the cache, you're going to iterate over all values of all parent classes' dicts to find which values are __dunder__ methods.  This risks being much more expensive for classes with tens or hundreds of namespace entries, making the optimization even less interesting.

So in the end I think I'm going to reject this issue.
History
Date User Action Args
2018-01-15 18:34:21pitrousetstatus: pending -> closed
2018-01-13 17:38:50pitrousetstatus: open -> pending
resolution: rejected
messages: + msg309892

stage: patch review -> resolved
2018-01-12 11:47:33pitrousetmessages: + msg309851
2018-01-12 11:43:42inada.naokisetfiles: + benchgcclasses2.py

messages: + msg309850
2018-01-12 11:28:29pitrousetmessages: + msg309849
2018-01-12 11:23:25pitrousetfiles: + benchgcclasses.py

messages: + msg309848
2018-01-12 10:06:27inada.naokisetmessages: + msg309845
2018-01-11 19:46:44pitrousetmessages: + msg309828
2018-01-11 19:25:13serhiy.storchakasetmessages: + msg309824
2018-01-04 18:00:48pitrousetmessages: + msg309476
2018-01-04 17:30:07inada.naokisetmessages: + msg309474
2018-01-04 13:26:00pitrousetmessages: + msg309465
2018-01-03 22:17:59gvanrossumsetmessages: + msg309444
2018-01-03 17:41:13pitrousetnosy: + gvanrossum
messages: + msg309420
2018-01-03 11:53:36inada.naokisetmessages: + msg309411
2018-01-02 16:24:54serhiy.storchakasetmessages: + msg309384
2017-12-21 11:37:33pitrousetmessages: + msg308867
2017-12-21 11:23:21inada.naokisetnosy: + inada.naoki
messages: + msg308866
2017-12-20 16:14:16pitrousetmessages: + msg308747
2017-12-20 16:09:37serhiy.storchakasetmessages: + msg308744
2017-12-20 15:29:11pitrousetmessages: + msg308738
2017-12-16 13:34:53serhiy.storchakasetpriority: normal -> low
2017-12-16 12:32:32pitrousetstage: patch review
2017-12-16 12:32:04pitrousetmessages: + msg308476
2017-12-16 12:26:37pitrousetmessages: + msg308475
stage: patch review -> (no value)
2017-12-16 12:24:09pitrousetkeywords: + patch
stage: patch review
pull_requests: + pull_request4797
2017-12-16 12:21:18pitroucreate