Speed up _PyType_Lookup() for class creation #75517
Comments
The method lookup fast path in _PyType_Lookup() does not apply during type creation, whose runtime is dominated by the dict lookups along the MRO chain. Pre-calculating the name hash speeds up the creation of an empty class (i.e. "class Test: pass") by around 20%. Will send a pull request shortly. |
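In pure Python, the lookup that _PyType_Lookup() performs can be sketched roughly like this (the helper name `type_lookup` is made up; the real C function also consults the method cache and pre-calculates the name hash, which is what the patch exploits):

```python
def type_lookup(cls, name):
    # Rough pure-Python sketch of _PyType_Lookup(): scan the MRO and
    # return the first matching entry from a class dict. Each iteration
    # is one of the repeated dict lookups of the same name whose hashing
    # the patch pre-calculates.
    for base in cls.__mro__:
        if name in base.__dict__:
            return base.__dict__[name]
    return None
```

For example, `type_lookup(bool, "__add__")` walks `bool.__mro__` and finds the entry on `int`.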
Could you please provide benchmarks that you used? |
I literally just ran timeit on "class Test: pass", but I'll see if I can provide proper numbers. |
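A minimal version of that measurement might look as follows (numbers vary by machine; the perf module suggested later in this thread gives more stable results):

```python
import timeit

# Time the creation of an empty class, as in the original measurement.
n = 100_000
total = timeit.timeit("class Test: pass", number=n)
print(f"{total / n * 1e6:.2f} us per empty class creation")
```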
Comparing against CPython master as of 122e88a, I rebuilt my CPython to get clean results, and that still gave me an overall speedup of almost 15%. Original: Patched: I came across this when benchmarking the class creation in Cython, which seemed slow, and after a couple of tweaks, I ended up with 97% of the runtime inside of CPython, so I took a look over the fence. According to callgrind, the original version spent over 125M instructions inside of _PyType_Lookup() for the timeit run, whereas the patched version only counts about 92M, that's about 25% less. |
What names are looked up when you create an empty class? I'm surprised that this change has a measurable performance effect. Aren't the names interned strings with a precalculated hash? |
It's the slot names in "slotdefs". See "update_one_slot()". The time that is saved is mostly the overhead of calling PyDict_GetItem(). I actually tried PyDict_GetItemWithError() first, which is faster due to the lower error handling overhead, before I noticed that the real problem is the repeated lookups of the same name, which can be optimised further by pre-calculating the hash and calling the low-level lookup function. |
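The Python-visible effect of those slot lookups can be seen by assigning a special method after class creation; the type machinery re-resolves the corresponding C-level slot:

```python
class C:
    pass

obj = C()
try:
    len(obj)                     # no __len__ slot filled in yet
except TypeError:
    pass

# Setting a dunder on the class triggers the slot update machinery
# (update_one_slot() in typeobject.c), so the C-level length slot
# now dispatches to our function.
C.__len__ = lambda self: 42
assert len(obj) == 42
```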
I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash(). |
After adding PyErr_Clear() the benefit of this optimization is smaller. Only 8% on my 64-bit computer, and no effect at all on 32-bit. I wonder if it is worth complicating the code for such a small benefit. For non-empty classes the relative benefit is even smaller. Maybe there are other opportunities for optimization? |
Do you mean for the table of slot descriptions? I'm not sure that the effect would be comparable.
I would guess so. According to callgrind, for the timeit run, the original implementation has the following call fanout for the type creation:
That's 147 calls to PyDict_GetItem() per type creation, just inside of _PyType_Lookup(). About 20% of the time in PyDict_GetItem() is spent in PyErr_Clear(), and 23% and 26% respectively in lookdict_unicode_nodummy() (349000 calls) and lookdict_unicode() (278000 calls). There is probably some room for algorithmic improvements here, in order to reduce the overall number of calls. For example, I wonder if bypassing the method cache while building the type might help. The cache maintenance seems to account for something like 30% of the time spent in _PyType_Lookup(). Or reversing the inspection order, i.e. running over the type attributes and looking up the slots, instead of running over the slots and looking up the attributes. Or using a faster set intersection algorithm for that purpose. Something like that. OTOH, quick gains might be worth it, but since applications that find their bottleneck in type creation are probably rare, I doubt that it's worth putting weeks into this. Even the application startup time is unlikely to be dominated by *type* creations, rather than *object* instantiations. |
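The "reversed inspection order" idea could look roughly like this in pure Python (`SLOT_NAMES` and `slots_to_update` are made-up stand-ins for the real slotdefs table and C logic, only to illustrate the set intersection):

```python
# Hypothetical stand-in for the dunder names in CPython's slotdefs table.
SLOT_NAMES = {"__len__", "__getitem__", "__setitem__", "__iter__", "__call__"}

def slots_to_update(namespace):
    # Walk the (usually small) class namespace and intersect it with the
    # slot-name table, instead of probing every slot name against every
    # dict in the MRO.
    return SLOT_NAMES & namespace.keys()

print(slots_to_update({"__len__": None, "helper": None}))
```

For an empty class this intersection is empty, which is exactly the case the benchmark exercises.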
There is a special internal API in dictobject.c. Maybe we can add a special API like that. BTW, the method cache in _PyType_Lookup() is not good for initializing a type. Stefan, would you benchmark these ideas? |
I updated the pull request with a split version of _PyType_Lookup() that bypasses the method cache during slot updates. I also ran the benchmarks with PGO enabled now to get realistic results. The overall gain is around 15%. Original: Patched: Patched with non-trivial bases: I do not consider the optimisations a code degradation. There is one semantic change: the new function _PyType_LookupUncached() returns on the first error and might set an exception. I considered that better behaviour than the old function, but it means that if one base class name lookup fails and the next one previously succeeded, it will no longer succeed now. I don't have a strong feeling about this and would change it back if compatibility is considered more valuable. It generally feels wrong to have errors pass silently here, however unlikely they may be in practice. |
Please use perf.timeit not timeit for microbenchmarks: https://perf.readthedocs.io/en/latest/ |
Since I'm getting highly reproducible results on re-runs, I tend to trust these numbers. |
BTW, it seems that Yury's dict copy optimisation would also help here. When I use a benchmark scenario with a simple non-empty method/attribute dict (from Cython this time), almost 10% of the creation time is spent copying that dict, which should essentially just be a memcpy() since it doesn't need any resizing at that point. |
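That the class namespace is copied during type creation is observable from Python: mutating the original dict afterwards does not affect the class.

```python
ns = {"x": 1}
C = type("C", (), ns)

ns["y"] = 2                      # mutate the original namespace afterwards
assert C.x == 1
assert not hasattr(C, "y")       # type_new() copied the dict into tp_dict
```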
Confirmed: Mean +- std dev: [python] 11.9 us +- 0.1 us -> [python-patched] 10.3 us +- 0.1 us: 1.15x faster (-13%) |
Any more comments on the proposed implementation? 13-15% seem worth it to me. @victor, or are you saying "PyId, or no change at all"? |
Good work, Stefan! It's an impressive speedup of class creation. It looks like you have not yet addressed Serhiy's comment https://github.com/python/cpython/pull/3279/files#r136815648 |
My comment was addressed. As for using the _Py_IDENTIFIER API, I don't think it is related. The speedup was caused by avoiding a few simple checks and function calls. The _Py_IDENTIFIER API is great, but it has a small overhead which is comparable to the small difference made by Stefan's patch. |
No, that one was addressed. I think only Victor's comment is still open, that's why I asked back. |
Victor is currently travelling and recovering from jetlag. I'm sure he'll reply within a day or two. |
I ran a microbenchmark on the current PR 3279 using: ./python -m perf timeit --inherit=PYTHONPATH 'class C: pass' Result: haypo@selma$ ./python -m perf compare_to ref.json patch.json I compiled Python using "./configure && make", no LTO nor PGO. |
Serhiy on the PR: "This is overgeneralization. Can tp_dict be not exact dict at all? I don't think this is possible. In many places concrete dict API is used with tp_dict. If you want to allow tp_dict be not exact dict, please open a separate issue for this." Using the following code, A.__dict__ type is dict even if the metaclass creates a different type, probably because type_new() calls PyDict_Copy(orig_dict):

```python
class mydict(dict):
    def __setitem__(self, name, value):
        if name == "__module__":
            value = "<mock module>"
        super().__setitem__(name, value)

class MetaClass(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        return mydict()

class A(metaclass=MetaClass):
    pass

print(A.__module__)
```
|
On my computer the speedup is 13% without LTO and PGO, and around 20% with LTO and PGO. (Building with LTO and PGO itself adds another 20%.) |
Hmm, with PyDict_GetItem() we don't falsely detect a lookup failing with a live exception set. Is it correct to call _PyType_Lookup() with an exception set? Perhaps we should save the current exception before calling find_name_in_mro() and restore it after. Or raise SystemError if an exception is set. Or just add assert(!PyErr_Occurred()) at the beginning of find_name_in_mro(). I don't know what is more correct. |
The general rule of thumb is that it's not safe to call any user code with a live exception set, and lookups can call into user code. I quickly looked through all occurrences (there aren't that many, actually), and they all seem to be called from a context where a live exception would be wrong.
I thought about that, too, but it feels wrong. This just shouldn't be necessary. It's ok to save away exceptions in exception related code, but general purpose code shouldn't have to assume that it gets called with a live exception.
That would catch the error at hand. The only concern is backwards compatibility. But it's easy to argue that it's the foreign code that is broken in this case (as showcased by cElementTree) and that it clearly needs fixing if this occurs. An assert() seems like a reasonable way to catch this kind of bug, at least in a debug build. |
One more thing: the fact that the lookup does not propagate exceptions leaves some space for ambiguity. If, in a chain of MRO lookups, one would fail and a later one would succeed, is it correct to keep trying? What if the first failure actually failed to see an otherwise existing attribute in that class? We simply can't know because the lookup failed for *some* reason. I think, the only way to really avoid that ambiguity would be to allow _PyType_Lookup() to raise exceptions, and to make it fail on the first error. I've created bpo-31465 for this. |
Still ready for merging :) |
I'm going to merge this over the next 24 hours if there are no comments from other core developers. |
Thank you for your contribution Stefan! I had doubts about your initial (much simpler) changes, but changed my mind. |
Thanks for the reviews, and thank you for merging it, Serhiy. |