classification
Title: Speed up _PyType_Lookup() for class creation
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: haypo Nosy List: christian.heimes, haypo, inada.naoki, pitrou, rhettinger, scoder, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-09-04 11:39 by scoder, last changed 2017-09-14 08:09 by scoder.

Pull Requests
URL Status Linked Edit
PR 3279 open scoder, 2017-09-04 11:39
Messages (26)
msg301213 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 11:39
The method lookup fast path in _PyType_Lookup() does not apply during type creation, which is highly dominated by the performance of the dict lookups along the mro chain. Pre-calculating the name hash speeds up the creation of an empty class (i.e. "class Test: pass") by around 20%.

Will send a pull request shortly.
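The slow path being discussed is essentially a walk over the MRO that probes each class dict for the same name, re-entering the dict machinery on every step. A minimal Python sketch of that lookup logic (illustrative only, not CPython's C implementation):

```python
def type_lookup(cls, name):
    # Walk the MRO and return the first match, like _PyType_Lookup():
    # every iteration is a separate dict probe for the same name,
    # which is where the repeated per-lookup overhead comes from.
    for base in cls.__mro__:
        if name in base.__dict__:
            return base.__dict__[name]
    return None

class Base:
    def ping(self):
        return "pong"

class Child(Base):
    pass

assert type_lookup(Child, "ping") is Base.__dict__["ping"]
assert type_lookup(Child, "missing") is None
```

Precomputing the hash once and passing it to a known-hash lookup saves re-entering the generic probe for each base, which is the core of the patch.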
msg301214 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-04 11:49
Could you please provide benchmarks that you used?
msg301215 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 11:54
I literally just ran timeit on "class Test: pass", but I'll see if I can provide proper numbers.
msg301216 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 12:09
Comparing against CPython master as of 122e88a8354e3f75aeaf6211232dac88ac296d54

I rebuilt my CPython to get clean results, and that still gave me almost 15% overall speedup.

Original:
$ ./python -m timeit 'class Test: pass'
20000 loops, best of 5: 9.55 usec per loop

Patched:
$ ./python -m timeit 'class Test: pass'
50000 loops, best of 5: 8.27 usec per loop

I came across this when benchmarking the class creation in Cython, which seemed slow, and after a couple of tweaks, I ended up with 97% of the runtime inside of CPython, so I took a look over the fence.

According to callgrind, the original version spent over 125M instructions inside of _PyType_Lookup() for the timeit run, whereas the patched version only counts about 92M, that's about 25% less.
msg301220 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-04 12:53
What names are looked up when you create an empty class?

I'm surprised that this change has a measurable performance effect. Aren't the names interned strings with precalculated hashes?
msg301221 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 13:00
It's the slot names in "slotdefs". See "update_one_slot()".

The time that is saved is mostly the overhead of calling PyDict_GetItem(). I actually tried PyDict_GetItemWithError() first, which is faster due to the lower error handling overhead, before I noticed that the real problem is the repeated lookups of the same name, which can be optimised further by pre-calculating the hash and calling the low-level lookup function.
msg301227 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-04 17:18
I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().
msg301232 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-04 17:45
After adding PyErr_Clear() the benefit of this optimization is smaller: only 8% on my 64-bit computer, and no effect at all on 32-bit. I wonder if it is worth complicating the code for such a small benefit. For non-empty classes the relative benefit is even smaller. Maybe there are other opportunities for optimization?
msg301240 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-04 18:47
> I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().

Do you mean for the table of slot descriptions? I'm not sure that the effect would be comparable.

> Maybe there are other opportunities for optimization?

I would guess so. According to callgrind, for the timeit run, the original implementation has the following call fanout for the type creation:

- 1755 calls to type_call()
- 3850 calls to type_new()
- 224000 calls to update_one_slot()
- 317000 calls to _PyType_Lookup()
- 566000 calls to PyDict_GetItem()

That's 147 calls to PyDict_GetItem() per type creation, just inside of _PyType_Lookup().

About 20% of the time in PyDict_GetItem() is spent in PyErr_Clear(), and 23% and 26% respectively in lookdict_unicode_nodummy() (349000 calls) and lookdict_unicode() (278000 calls).

There is probably some space for algorithmic improvements here, in order to reduce the overall number of calls. For example, I wonder if bypassing the method cache while building the type might help. The cache maintenance seems to account for something like 30% of the time spent in _PyType_Lookup(). Or reversing the inspection order, i.e. running over the type attributes and looking up the slots, instead of running over the slots and looking up the attributes. Or using a faster set intersection algorithm for that purpose. Something like that.

OTOH, quick gains might be worth it, but since applications that find their bottleneck in type creation are probably rare, I doubt that it's worth putting weeks into this. Even the application startup time is unlikely to be dominated by *type* creations, rather than *object* instantiations.
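The "reversed inspection order" idea above can be sketched in Python: instead of probing the class dict once per slot name, walk the (usually small) class body namespace once and intersect it with the slot-name set. The slot names below are an illustrative subset, not the real slotdefs table:

```python
# Illustrative subset of slot names; the real table is slotdefs in
# CPython's Objects/typeobject.c.
SLOT_NAMES = frozenset({"__init__", "__repr__", "__len__", "__getitem__"})

def slots_to_update(namespace):
    # One pass over the class body instead of one dict probe per slot:
    # the set intersection yields exactly the slots this class overrides.
    return SLOT_NAMES & set(namespace)

assert slots_to_update({"__init__": None, "helper": None}) == {"__init__"}
assert slots_to_update({"x": 1}) == frozenset()
```

For an empty class body the intersection is computed in one pass, instead of ~150 per-slot dict probes per type creation as counted above.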
msg301300 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-09-05 06:20
There is special internal API in dictobject.c
_PyDict_LoadGlobal(PyDictObject *globals, PyDictObject *builtins, PyObject *key)

Maybe, we can have special API like that
_PyDict_LookupMro(PyObject *mro, PyObject *key);

BTW, the method cache in _PyType_Lookup() is not good for initializing a type.
It always misses, and every specialized method gets cached even if it isn't used anymore.

Stefan, would you benchmark these ideas?
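The "always misses" point can be illustrated with a toy model: a method cache keyed on a per-type version tag. Every slot update during type initialization bumps the tag, so each lookup lands on a fresh key and misses. This is a hypothetical sketch, not CPython's actual cache layout:

```python
class ToyMethodCache:
    # Hypothetical version-tagged lookup cache; CPython's real method
    # cache in typeobject.c differs in layout and invalidation details.
    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, type_version, name, slow_lookup):
        key = (type_version, name)
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = slow_lookup(name)
        self.entries[key] = value  # stale as soon as the version bumps
        return value

cache = ToyMethodCache()
attrs = {"__init__": "slot", "__repr__": "slot"}
version = 0
# During type initialization every slot update bumps the version tag,
# so each lookup sees a fresh key and never hits the cache:
for name in attrs:
    cache.lookup(version, name, attrs.get)
    version += 1
assert cache.misses == 2 and cache.hits == 0
```

This is why bypassing the cache during slot updates (as the updated PR does) avoids pure overhead: the entries it would write are invalidated before they can ever be read.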
msg301347 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-05 18:10
I updated the pull request with a split version of _PyType_Lookup() that bypasses the method cache during slot updates. I also ran the benchmarks with PGO enabled now to get realistic results. The overall gain is around 15%.

Original:
$ ./python -m timeit 'class Test: pass'
20000 loops, best of 5: 7.29 usec per loop

Patched:
$ ./python -m timeit 'class Test: pass'
50000 loops, best of 5: 6.15 usec per loop

Patched with non-trivial bases:
$ ./python -m timeit 'class Test(object): pass'
50000 loops, best of 5: 6.05 usec per loop
$ ./python -m timeit 'class Test(type): pass'
50000 loops, best of 5: 6.08 usec per loop
$ ./python -m timeit 'class Test(int): pass'
50000 loops, best of 5: 9.08 usec per loop

I do not consider the optimisations a code degradation.

There is one semantic change: the new function _PyType_LookupUncached() returns on the first error and might set an exception. I considered that better behaviour than the old function, but it means that if one base class name lookup fails and the next one previously succeeded, it will no longer succeed now. I don't have a strong feeling about this and would change it back if compatibility is considered more valuable. It generally feels wrong to have errors pass silently here, however unlikely they may be in practice.
msg301350 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-05 18:18
Please use perf.timeit not timeit for microbenchmarks: https://perf.readthedocs.io/en/latest/
msg301364 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-05 19:24
Since I'm getting highly reproducible results on re-runs, I tend to trust these numbers.
msg301446 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-06 06:39
BTW, it seems that Yury's dict copy optimisation would also help here. When I use a benchmark scenario with a simple non-empty method/attribute dict (from Cython this time), almost 10% of the creation time is spent copying that dict, which should essentially just be a memcpy() since it doesn't need any resizing at that point.
msg301452 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-09-06 07:18
Confirmed:
$ ./python-patched -m perf timeit --compare-to `pwd`/python -- 'class C: pass'
python: ..................... 11.9 us +- 0.1 us
python-patched: ..................... 10.3 us +- 0.1 us

Mean +- std dev: [python] 11.9 us +- 0.1 us -> [python-patched] 10.3 us +- 0.1 us: 1.15x faster (-13%)
msg301823 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-10 18:01
Any more comments on the proposed implementation? 13-15% seem worth it to me.

@Victor, or are you saying "PyId, or no change at all"?
msg301824 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-09-10 18:16
Good work, Stefan! It's an impressive speedup of class creation.

It looks like you have not yet addressed Serhiy's comment: https://github.com/python/cpython/pull/3279/files#r136815648
msg301825 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-10 18:40
My comment was addressed.

As for using the _Py_IDENTIFIER API, I don't think it is related. The speedup was caused by avoiding a few simple checks and function calls. The _Py_IDENTIFIER API is great, but it has a small overhead which is comparable to the small difference caused by Stefan's patch.
msg301827 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-10 18:42
No, that one was addressed. I think only Victor's comment is still open, that's why I asked back.
msg301830 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-09-10 19:18
Victor is currently travelling and recovering from jetlag. I'm sure he'll reply within a day or two.
msg301878 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-11 15:26
I ran a microbenchmark on the current PR 3279 using:

./python -m perf timeit --inherit=PYTHONPATH 'class C: pass'

Result:

haypo@selma$ ./python -m perf compare_to ref.json patch.json 
Mean +- std dev: [ref] 9.71 us +- 0.38 us -> [patch] 8.74 us +- 0.22 us: 1.11x faster (-10%)

I compiled Python using "./configure && make", no LTO nor PGO.
msg301887 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-11 16:53
Serhiy on the PR: "This is overgeneralization. Can tp_dict be not exact dict at all? I don't think this is possible. In many places concrete dict API is used with tp_dict. If you want to allow tp_dict be not exact dict, please open a separate issue for this."

Using the following code, A.__dict__ type is dict even if the metaclass creates a different type, probably because type_new() calls PyDict_Copy(orig_dict):
---
class mydict(dict):
    def __setitem__(self, name, value):
        if name == "__module__":
            value = "<mock module>"
        super().__setitem__(name, value)

class MetaClass(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        return mydict()

class A(metaclass=MetaClass):
    pass

print(A.__module__)
---
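A self-contained restatement of the snippet above with assertions, assuming that STORE_NAME in the class body goes through the subclass's __setitem__ while type_new() then copies the namespace into an exact dict:

```python
import types

class mydict(dict):
    def __setitem__(self, name, value):
        if name == "__module__":
            value = "<mock module>"
        super().__setitem__(name, value)

class MetaClass(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        return mydict()

class A(metaclass=MetaClass):
    pass

# The overridden __setitem__ did run during class-body execution
# (STORE_NAME uses PyObject_SetItem for non-exact dict namespaces):
assert A.__module__ == "<mock module>"
# ...but tp_dict ends up as an exact dict (exposed via a mappingproxy),
# because type_new() copies the prepared namespace with PyDict_Copy().
assert type(A.__dict__) is types.MappingProxyType
```

So a metaclass can customize how the class body is recorded, but it cannot make tp_dict itself a non-exact dict, which is why using the concrete dict API on tp_dict is safe.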
msg302052 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-13 10:13
On my computer the speedup is 13% without LTO and PGO and around 20% with LTO and PGO. Building with LTO and PGO adds another 20%.
msg302104 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-13 18:42
> Since _PyDict_GetItem_KnownHash() may or may not set an exception, we have to check for a live exception after calling it, and that finds the old exception of the last attribute lookup and decides that its own lookup failed.

Hmm, with PyDict_GetItem() we don't falsely detect a failing lookup when a live exception is set. Is it correct to call _PyType_Lookup() with an exception set? Perhaps we should save the current exception before calling find_name_in_mro() and restore it after. Or raise SystemError if an exception is set. Or just add assert(!PyErr_Occurred()) at the beginning of find_name_in_mro(). I don't know what is more correct.
msg302108 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-13 19:49
> Is it correct to call _PyType_Lookup() with an exception set?

The general rule of thumb is that it's not safe to call any user code with a live exception set, and lookups can call into user code.

I quickly looked through all occurrences (there aren't that many actually), and they all seem to be called from a context where a live exception would be wrong.

> Perhaps we should save a current exception before calling find_name_in_mro() and restore it after.

I thought about that, too, but it feels wrong. This just shouldn't be necessary. It's ok to save away exceptions in exception related code, but general purpose code shouldn't have to assume that it gets called with a live exception.

> Or just add assert(!PyErr_Occurred()) at the beginning of find_name_in_mro().

That would catch the error at hand. The only concern is backwards compatibility. But it's easy to argue that it's the foreign code that is broken in this case (as showcased by cElementTree) and clearly needs fixing if this occurs. An assert() seems like a reasonable way to catch this kind of bug, at least in a debug build.
msg302157 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-14 08:09
One more thing: the fact that the lookup does not propagate exceptions leaves some space for ambiguity. If, in a chain of MRO lookups, one would fail and a later one would succeed, is it correct to keep trying? What if the first failure actually failed to see an otherwise existing attribute in that class? We simply can't know because the lookup failed for *some* reason.

I think, the only way to really avoid that ambiguity would be to allow _PyType_Lookup() to raise exceptions, and to make it fail on the first error. I've created issue 31465 for this.
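The ambiguity can be demonstrated with a toy MRO walk over hypothetical per-class dicts: once a lookup error is swallowed, it is indistinguishable from a plain miss, and a later class silently wins even though an earlier class actually defined the attribute:

```python
class FailingDict(dict):
    # Hypothetical stand-in for a base class dict whose lookup fails
    # for *some* reason (e.g. user code raising during comparison).
    def __getitem__(self, key):
        raise RuntimeError("lookup failed for some reason")

def lookup_swallowing(mro_dicts, name):
    # Old-style error-swallowing walk: any failure is treated as a miss.
    for d in mro_dicts:
        try:
            return d[name]
        except Exception:
            continue  # swallowed: error and absence look identical
    return None

# The first class *does* define "attr", but its failing lookup is
# swallowed, so the later class's value is returned instead.
mro = [FailingDict(attr="hidden"), {"attr": "later"}]
assert lookup_swallowing(mro, "attr") == "later"
```

Letting _PyType_Lookup() propagate the first error instead removes this ambiguity, at the cost of the compatibility question discussed above.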