
Speed up _PyType_Lookup() for class creation #75517

Closed
scoder opened this issue Sep 4, 2017 · 31 comments
Assignees
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@scoder
Contributor

scoder commented Sep 4, 2017

BPO 31336
Nosy @rhettinger, @pitrou, @scoder, @vstinner, @tiran, @methane, @serhiy-storchaka
PRs
  • bpo-31336: Speed up type creation, which is highly dominated by slow dict lookups. #3279
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2017-10-01.08:41:49.922>
    created_at = <Date 2017-09-04.11:39:18.162>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Speed up _PyType_Lookup() for class creation'
    updated_at = <Date 2017-10-01.08:43:26.464>
    user = 'https://github.com/scoder'

    bugs.python.org fields:

    activity = <Date 2017-10-01.08:43:26.464>
    actor = 'scoder'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2017-10-01.08:41:49.922>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2017-09-04.11:39:18.162>
    creator = 'scoder'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 31336
    keywords = []
    message_count = 31.0
    messages = ['301213', '301214', '301215', '301216', '301220', '301221', '301227', '301232', '301240', '301300', '301347', '301350', '301364', '301446', '301452', '301823', '301824', '301825', '301827', '301830', '301878', '301887', '302052', '302104', '302108', '302157', '303010', '303011', '303453', '303454', '303455']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'pitrou', 'scoder', 'vstinner', 'christian.heimes', 'methane', 'serhiy.storchaka']
    pr_nums = ['3279']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue31336'
    versions = ['Python 3.7']

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    The method lookup fast path in _PyType_Lookup() does not apply during type creation, which is highly dominated by the performance of the dict lookups along the mro chain. Pre-calculating the name hash speeds up the creation of an empty class (i.e. "class Test: pass") by around 20%.

    Will send a pull request shortly.
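For context, the lookup in question walks the MRO and queries each class dict in turn. A rough Python rendition of that loop (the real code is C in Objects/typeobject.c; this sketch is illustrative, not CPython's actual implementation):

```python
def find_name_in_mro(cls, name):
    """Return the first binding of `name` along cls.__mro__, or None.

    The C version queried each dict with PyDict_GetItem(), which
    re-hashes `name` on every call; the proposed change hashes the
    name once and passes it to _PyDict_GetItem_KnownHash() for every
    dict in the chain.  Pure Python cannot express the known-hash
    part, but the repeated-lookup structure is the same.
    """
    for base in cls.__mro__:
        if name in base.__dict__:
            return base.__dict__[name]
    return None
```

For example, `find_name_in_mro(bool, '__add__')` skips `bool.__dict__` and finds the binding in `int.__dict__`.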

    @scoder scoder added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Sep 4, 2017
    @serhiy-storchaka
    Member

Could you please provide the benchmarks that you used?

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    I literally just ran timeit on "class Test: pass", but I'll see if I can provide proper numbers.

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    Comparing against CPython master as of 122e88a

    I rebuilt my CPython to get clean results, and that still gave me almost 15% overall speedup.

    Original:
    $ ./python -m timeit 'class Test: pass'
    20000 loops, best of 5: 9.55 usec per loop

    Patched:
    $ ./python -m timeit 'class Test: pass'
    50000 loops, best of 5: 8.27 usec per loop

    I came across this when benchmarking the class creation in Cython, which seemed slow, and after a couple of tweaks, I ended up with 97% of the runtime inside of CPython, so I took a look over the fence.

    According to callgrind, the original version spent over 125M instructions inside of _PyType_Lookup() for the timeit run, whereas the patched version only counts about 92M, that's about 25% less.

    @serhiy-storchaka
    Member

    What names are looked up when you create an empty class?

I'm surprised that this change has a measurable performance effect. Isn't the name an interned string with a precalculated hash?

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    It's the slot names in "slotdefs". See "update_one_slot()".

    The time that is saved is mostly the overhead of calling PyDict_GetItem(). I actually tried PyDict_GetItemWithError() first, which is faster due to the lower error handling overhead, before I noticed that the real problem is the repeated lookups of the same name, which can be optimised further by pre-calculating the hash and calling the low-level lookup function.

    @vstinner
    Member

    vstinner commented Sep 4, 2017

    I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().

    @serhiy-storchaka
    Member

After adding PyErr_Clear() the benefit of this optimization is smaller. Only 8% on my 64-bit computer, and no effect at all on 32-bit. I wonder whether it is worth complicating the code for such a small benefit. For non-empty classes the relative benefit is even smaller. Maybe there are other opportunities for optimization?

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().

    Do you mean for the table of slot descriptions? I'm not sure that the effect would be comparable.

    Maybe there are other opportunities for optimization?

    I would guess so. According to callgrind, for the timeit run, the original implementation has the following call fanout for the type creation:

    • 1755 calls to type_call()
    • 3850 calls to type_new()
    • 224000 calls to update_one_slot()
• 317000 calls to _PyType_Lookup()
    • 566000 calls to PyDict_GetItem()

    That's 147 calls to PyDict_GetItem() per type creation, just inside of _PyType_Lookup().

About 20% of the time in PyDict_GetItem() is spent in PyErr_Clear(), and 23% and 26% respectively in lookdict_unicode_nodummy() (349000 calls) and lookdict_unicode() (278000 calls).

There is probably some space for algorithmic improvements here, in order to reduce the overall number of calls. For example, I wonder if bypassing the method cache while building the type might help. The cache maintenance seems to account for something like 30% of the time spent in _PyType_Lookup(). Or reversing the inspection order, i.e. running over the type attributes and looking up the slots, instead of running over the slots and looking up the attributes. Or using a faster set intersection algorithm for that purpose. Something like that.

OTOH, quick gains might be worth it, but since applications that find their bottleneck in type creation are probably rare, I doubt that it's worth putting weeks into this. Even the application startup time is unlikely to be dominated by *type* creations, rather than *object* instantiations.
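The "reversed inspection order" idea above can be sketched as a set intersection between the slot-name table and the new type's namespace. The `SLOT_NAMES` set and helper below are hypothetical stand-ins for CPython's C-level `slotdefs` table, not real API:

```python
# Hypothetical subset of the names in the C-level slotdefs table.
SLOT_NAMES = {"__init__", "__repr__", "__hash__", "__call__",
              "__getattr__", "__setattr__", "__iter__", "__next__"}

def slots_needing_update(namespace):
    """Instead of probing the class dict once per slotdef entry
    (~147 dict lookups per empty class in the measurements above),
    intersect the namespace keys with the slot-name set and visit
    only the slots the class actually overrides."""
    return SLOT_NAMES & namespace.keys()
```

For an empty class body this intersection is empty, so the per-slot lookups would be skipped entirely; real slot updating would still have to walk the MRO for inherited slots, so this is only a sketch of the idea.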

    @methane
    Member

    methane commented Sep 5, 2017

    There is special internal API in dictobject.c
    _PyDict_LoadGlobal(PyDictObject *globals, PyDictObject *builtins, PyObject *key)

    Maybe, we can have special API like that
    _PyDict_LookupMro(PyObject *mro, PyObject *key);

BTW, the method cache in _PyType_Lookup() is not helpful while initializing a type.
It always misses, and every specialized method gets cached even if it isn't used again afterwards.

    Stefan, would you benchmark these ideas?

    @scoder
    Contributor Author

    scoder commented Sep 5, 2017

    I updated the pull request with a split version of _PyType_Lookup() that bypasses the method cache during slot updates. I also ran the benchmarks with PGO enabled now to get realistic results. The overall gain is around 15%.

    Original:
    $ ./python -m timeit 'class Test: pass'
    20000 loops, best of 5: 7.29 usec per loop

    Patched:
    $ ./python -m timeit 'class Test: pass'
    50000 loops, best of 5: 6.15 usec per loop

    Patched with non-trivial bases:
    $ ./python -m timeit 'class Test(object): pass'
    50000 loops, best of 5: 6.05 usec per loop
    $ ./python -m timeit 'class Test(type): pass'
    50000 loops, best of 5: 6.08 usec per loop
    $ ./python -m timeit 'class Test(int): pass'
    50000 loops, best of 5: 9.08 usec per loop

I do not consider the optimisations a code degradation.

    There is one semantic change: the new function _PyType_LookupUncached() returns on the first error and might set an exception. I considered that better behaviour than the old function, but it means that if one base class name lookup fails and the next one previously succeeded, it will no longer succeed now. I don't have a strong feeling about this and would change it back if compatibility is considered more valuable. It generally feels wrong to have errors pass silently here, however unlikely they may be in practice.

    @vstinner
    Member

    vstinner commented Sep 5, 2017

    Please use perf.timeit not timeit for microbenchmarks: https://perf.readthedocs.io/en/latest/

    @scoder
    Contributor Author

    scoder commented Sep 5, 2017

    Since I'm getting highly reproducible results on re-runs, I tend to trust these numbers.

    @scoder
    Contributor Author

    scoder commented Sep 6, 2017

BTW, it seems that Yury's dict copy optimisation would also help here. When I use a benchmark scenario with a simple non-empty method/attribute dict (from Cython this time), almost 10% of the creation time is spent copying that dict, which should essentially just be a memcpy() since it doesn't need any resizing at that point.

    @methane
    Member

    methane commented Sep 6, 2017

    Confirmed:
$ ./python-patched -m perf timeit --compare-to `pwd`/python -- 'class C: pass'
    python: ..................... 11.9 us +- 0.1 us
    python-patched: ..................... 10.3 us +- 0.1 us

    Mean +- std dev: [python] 11.9 us +- 0.1 us -> [python-patched] 10.3 us +- 0.1 us: 1.15x faster (-13%)

    @scoder
    Contributor Author

    scoder commented Sep 10, 2017

Any more comments on the proposed implementation? 13-15% seems worth it to me.

    @victor, or are you saying "PyId, or no change at all"?

    @tiran
    Member

    tiran commented Sep 10, 2017

    Good work, Stefan! It's an impressive speedup of class creation.

    It looks like you have not yet addressed Serhiy's comment https://github.com/python/cpython/pull/3279/files#r136815648

    @serhiy-storchaka
    Member

    My comment was addressed.

As for using the _Py_IDENTIFIER API, I don't think it is related. The speedup was caused by avoiding a few simple checks and function calls. The _Py_IDENTIFIER API is great, but it has a small overhead which is comparable to the small difference caused by Stefan's patch.

    @scoder
    Contributor Author

    scoder commented Sep 10, 2017

    No, that one was addressed. I think only Victor's comment is still open, that's why I asked back.

    @tiran
    Member

    tiran commented Sep 10, 2017

    Victor is currently travelling and recovering from jetlag. I'm sure he'll reply within a day or two.

    @vstinner
    Member

    I ran a microbenchmark on the current PR 3279 using:

    ./python -m perf timeit --inherit=PYTHONPATH 'class C: pass'

    Result:

    haypo@selma$ ./python -m perf compare_to ref.json patch.json
    Mean +- std dev: [ref] 9.71 us +- 0.38 us -> [patch] 8.74 us +- 0.22 us: 1.11x faster (-10%)

    I compiled Python using "./configure && make", no LTO nor PGO.

    @vstinner
    Member

    Serhiy on the PR: "This is overgeneralization. Can tp_dict be not exact dict at all? I don't think this is possible. In many places concrete dict API is used with tp_dict. If you want to allow tp_dict be not exact dict, please open a separate issue for this."

With the following code, the type of A.__dict__ is dict even though the metaclass prepares a different mapping type, probably because type_new() calls PyDict_Copy(orig_dict):
    ---

    class mydict(dict):
        def __setitem__(self, name, value):
            if name == "__module__":
                value = "<mock module>"
            super().__setitem__(name, value)
    
    class MetaClass(type):
        @classmethod
        def __prepare__(mcl, name, bases):
            return mydict()
    
    class A(metaclass=MetaClass):
        pass
    
    print(A.__module__)
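Running the snippet above confirms the copy: the hook does fire while the class body executes (the compiler's STORE_NAME goes through the prepared mapping's __setitem__, so A.__module__ becomes "<mock module>"), but attribute assignment after creation goes into the plain copied dict and never calls mydict.__setitem__ again. A self-contained check of both facts (the `calls` recording list is my addition for illustration):

```python
class mydict(dict):
    calls = []                      # records every key set through the hook
    def __setitem__(self, name, value):
        mydict.calls.append(name)
        if name == "__module__":
            value = "<mock module>"
        super().__setitem__(name, value)

class MetaClass(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        return mydict()

class A(metaclass=MetaClass):
    pass

# The hook ran during class-body execution...
assert A.__module__ == "<mock module>"
# ...but type_new() copied the namespace into an exact dict, so a
# later attribute assignment bypasses mydict.__setitem__ entirely.
before = len(mydict.calls)
A.x = 1
assert len(mydict.calls) == before and A.x == 1
```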

    @serhiy-storchaka
    Member

On my computer the speedup is 13% without LTO and PGO and around 20% with LTO and PGO. Building with LTO and PGO adds another 20% on top of that.

    @serhiy-storchaka
    Member

Since _PyDict_GetItem_KnownHash() may or may not set an exception, we have to check for a live exception after calling it, and that check finds the stale exception from the last attribute lookup and decides that its own lookup failed.

Hmm, with PyDict_GetItem() we don't falsely detect a failing lookup when a live exception is set. Is it correct to call _PyType_Lookup() with an exception set? Perhaps we should save the current exception before calling find_name_in_mro() and restore it afterwards. Or raise SystemError if an exception is set. Or just add assert(!PyErr_Occurred()) at the beginning of find_name_in_mro(). I don't know what is more correct.

    @scoder
    Contributor Author

    scoder commented Sep 13, 2017

    Is it correct to call _PyType_Lookup() with an exception set?

    The general rule of thumb is that it's not safe to call any user code with a live exception set, and lookups can call into user code.

I quickly looked through all occurrences (there aren't that many, actually), and they all seem to be called from a context where a live exception would be wrong.

    Perhaps we should save a current exception before calling find_name_in_mro() and restore it after.

    I thought about that, too, but it feels wrong. This just shouldn't be necessary. It's ok to save away exceptions in exception related code, but general purpose code shouldn't have to assume that it gets called with a live exception.

    Or just add assert(!PyErr_Occurred()) at the begin of find_name_in_mro().

That would catch the error at hand. The only concern is backwards compatibility. But it's easy to argue that it's the foreign code that is broken in this case (as showcased by cElementTree) and that it clearly needs fixing if this occurs. An assert() seems like a reasonable way to catch this kind of bug, at least in a debug build.

    @scoder
    Contributor Author

    scoder commented Sep 14, 2017

    One more thing: the fact that the lookup does not propagate exceptions leaves some space for ambiguity. If, in a chain of MRO lookups, one would fail and a later one would succeed, is it correct to keep trying? What if the first failure actually failed to see an otherwise existing attribute in that class? We simply can't know because the lookup failed for *some* reason.

    I think, the only way to really avoid that ambiguity would be to allow _PyType_Lookup() to raise exceptions, and to make it fail on the first error. I've created bpo-31465 for this.
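The ambiguity can be seen in a small model of the error-swallowing lookup (illustrative Python, not CPython's actual code):

```python
def lookup_swallow_errors(mro_dicts, name):
    """Models the old behaviour: any per-dict failure is treated as
    'name not here' and the search continues, so a broken mapping
    early in the chain can silently hide whether it held the name."""
    for d in mro_dicts:
        try:
            return d[name]
        except Exception:   # KeyError and real errors are indistinguishable
            continue
    return None

class BrokenMapping(dict):
    def __getitem__(self, key):
        raise RuntimeError("lookup machinery failed")
```

Here `lookup_swallow_errors([BrokenMapping(), {"x": 1}], "x")` happily returns 1, even though the first mapping might have contained an "x" entry that we simply failed to see; making the lookup propagate the first error removes that ambiguity.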

    @scoder
    Contributor Author

    scoder commented Sep 26, 2017

    Still ready for merging :)

    @serhiy-storchaka
    Member

    I'm going to merge this over the next 24 hours if there are no comments from other core developers.

    @serhiy-storchaka
    Member

    New changeset 2102c78 by Serhiy Storchaka (scoder) in branch 'master':
bpo-31336: Speed up type creation. (#3279)
    2102c78

    @serhiy-storchaka
    Member

    Thank you for your contribution Stefan! I had doubts about your initial (much simpler) changes, but changed my mind.

    @scoder
    Contributor Author

    scoder commented Oct 1, 2017

    Thanks for the reviews, and thank you for merging it, Serhiy.

@ezio-melotti transferred this issue from another repository Apr 10, 2022