
Speed up _PyType_Lookup() for class creation #75517

Closed
scoder opened this issue Sep 4, 2017 · 31 comments
Assignees
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@scoder
Contributor

scoder commented Sep 4, 2017

BPO 31336
Nosy @rhettinger, @pitrou, @scoder, @vstinner, @tiran, @methane, @serhiy-storchaka
PRs
  • bpo-31336: Speed up type creation, which is highly dominated by slow dict lookups. #3279
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2017-10-01.08:41:49.922>
    created_at = <Date 2017-09-04.11:39:18.162>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Speed up _PyType_Lookup() for class creation'
    updated_at = <Date 2017-10-01.08:43:26.464>
    user = 'https://github.com/scoder'

    bugs.python.org fields:

    activity = <Date 2017-10-01.08:43:26.464>
    actor = 'scoder'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2017-10-01.08:41:49.922>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2017-09-04.11:39:18.162>
    creator = 'scoder'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 31336
    keywords = []
    message_count = 31.0
    messages = ['301213', '301214', '301215', '301216', '301220', '301221', '301227', '301232', '301240', '301300', '301347', '301350', '301364', '301446', '301452', '301823', '301824', '301825', '301827', '301830', '301878', '301887', '302052', '302104', '302108', '302157', '303010', '303011', '303453', '303454', '303455']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'pitrou', 'scoder', 'vstinner', 'christian.heimes', 'methane', 'serhiy.storchaka']
    pr_nums = ['3279']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue31336'
    versions = ['Python 3.7']

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    The method lookup fast path in _PyType_Lookup() does not apply during type creation, which is highly dominated by the performance of the dict lookups along the mro chain. Pre-calculating the name hash speeds up the creation of an empty class (i.e. "class Test: pass") by around 20%.

    Will send a pull request shortly.
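For context, the lookup in question walks the MRO and queries each class dict in turn. A rough Python rendition of that loop (the real code is C in Objects/typeobject.c; this sketch is illustrative, not CPython's actual implementation):

```python
def find_name_in_mro(cls, name):
    """Return the first binding of `name` along cls.__mro__, or None.

    The C version queried each dict with PyDict_GetItem(), which
    re-hashes `name` on every call; the proposed change hashes the
    name once and passes it to _PyDict_GetItem_KnownHash() for every
    dict in the chain.  Pure Python cannot express the known-hash
    part, but the repeated-lookup structure is the same.
    """
    for base in cls.__mro__:
        if name in base.__dict__:
            return base.__dict__[name]
    return None
```

For example, `find_name_in_mro(bool, '__add__')` skips `bool.__dict__` and finds the binding in `int.__dict__`.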

    @scoder scoder added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Sep 4, 2017
    @serhiy-storchaka
    Member

Could you please provide the benchmarks that you used?

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    I literally just ran timeit on "class Test: pass", but I'll see if I can provide proper numbers.

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    Comparing against CPython master as of 122e88a

    I rebuilt my CPython to get clean results, and that still gave me almost 15% overall speedup.

    Original:
    $ ./python -m timeit 'class Test: pass'
    20000 loops, best of 5: 9.55 usec per loop

    Patched:
    $ ./python -m timeit 'class Test: pass'
    50000 loops, best of 5: 8.27 usec per loop

    I came across this when benchmarking the class creation in Cython, which seemed slow, and after a couple of tweaks, I ended up with 97% of the runtime inside of CPython, so I took a look over the fence.

    According to callgrind, the original version spent over 125M instructions inside of _PyType_Lookup() for the timeit run, whereas the patched version only counts about 92M, that's about 25% less.

    @serhiy-storchaka
    Member

    What names are looked up when you create an empty class?

I'm surprised that this change has a measurable performance effect. Isn't the name an interned string with a precalculated hash?

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    It's the slot names in "slotdefs". See "update_one_slot()".

    The time that is saved is mostly the overhead of calling PyDict_GetItem(). I actually tried PyDict_GetItemWithError() first, which is faster due to the lower error handling overhead, before I noticed that the real problem is the repeated lookups of the same name, which can be optimised further by pre-calculating the hash and calling the low-level lookup function.

    @vstinner
    Member

    vstinner commented Sep 4, 2017

    I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().

    @serhiy-storchaka
    Member

After adding PyErr_Clear() the benefit of this optimization is smaller. Only 8% on my 64-bit computer, and no effect at all on 32-bit. I wonder whether it is worth complicating the code for such a small benefit. For non-empty classes the relative benefit is even smaller. Maybe there are other opportunities for optimization?

    @scoder
    Contributor Author

    scoder commented Sep 4, 2017

    I would prefer to use the _Py_IDENTIFIER API rather than using _PyDict_GetItem_KnownHash().

    Do you mean for the table of slot descriptions? I'm not sure that the effect would be comparable.

    Maybe there are other opportunities for optimization?

    I would guess so. According to callgrind, for the timeit run, the original implementation has the following call fanout for the type creation:

    • 1755 calls to type_call()
    • 3850 calls to type_new()
    • 224000 calls to update_one_slot()
• 317000 calls to _PyType_Lookup()
    • 566000 calls to PyDict_GetItem()

    That's 147 calls to PyDict_GetItem() per type creation, just inside of _PyType_Lookup().

About 20% of the time in PyDict_GetItem() is spent in PyErr_Clear(), and 23% and 26% respectively in lookdict_unicode_nodummy() (349000 calls) and lookdict_unicode() (278000 calls).

There is probably some space for algorithmic improvements here, in order to reduce the overall number of calls. For example, I wonder if bypassing the method cache while building the type might help. The cache maintenance seems to account for something like 30% of the time spent in _PyType_Lookup(). Or reversing the inspection order, i.e. running over the type attributes and looking up the slots, instead of running over the slots and looking up the attributes. Or using a faster set intersection algorithm for that purpose. Something like that.

OTOH, quick gains might be worth it, but since applications that find their bottleneck in type creation are probably rare, I doubt that it's worth putting weeks into this. Even the application startup time is unlikely to be dominated by *type* creations, rather than *object* instantiations.
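The "reversed inspection order" idea above can be sketched as a set intersection between the slot-name table and the new type's namespace. The `SLOT_NAMES` set and helper below are hypothetical stand-ins for CPython's C-level `slotdefs` table, not real API:

```python
# Hypothetical subset of the names in the C-level slotdefs table.
SLOT_NAMES = {"__init__", "__repr__", "__hash__", "__call__",
              "__getattr__", "__setattr__", "__iter__", "__next__"}

def slots_needing_update(namespace):
    """Instead of probing the class dict once per slotdef entry
    (~147 dict lookups per empty class in the measurements above),
    intersect the namespace keys with the slot-name set and visit
    only the slots the class actually overrides."""
    return SLOT_NAMES & namespace.keys()
```

For an empty class body this intersection is empty, so the per-slot lookups would be skipped entirely; real slot updating would still have to walk the MRO for inherited slots, so this is only a sketch of the idea.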

    @methane
    Member

    methane commented Sep 5, 2017

    There is special internal API in dictobject.c
    _PyDict_LoadGlobal(PyDictObject *globals, PyDictObject *builtins, PyObject *key)

    Maybe, we can have special API like that
    _PyDict_LookupMro(PyObject *mro, PyObject *key);

BTW, the method cache in _PyType_Lookup() is not helpful while initializing a type.
It always misses, and every specialized method gets cached even if it isn't used again afterwards.

    Stefan, would you benchmark these ideas?

    @scoder
    Contributor Author

    scoder commented Sep 5, 2017

    I updated the pull request with a split version of _PyType_Lookup() that bypasses the method cache during slot updates. I also ran the benchmarks with PGO enabled now to get realistic results. The overall gain is around 15%.

    Original:
    $ ./python -m timeit 'class Test: pass'
    20000 loops, best of 5: 7.29 usec per loop

    Patched:
    $ ./python -m timeit 'class Test: pass'
    50000 loops, best of 5: 6.15 usec per loop

    Patched with non-trivial bases:
    $ ./python -m timeit 'class Test(object): pass'
    50000 loops, best of 5: 6.05 usec per loop
    $ ./python -m timeit 'class Test(type): pass'
    50000 loops, best of 5: 6.08 usec per loop
    $ ./python -m timeit 'class Test(int): pass'
    50000 loops, best of 5: 9.08 usec per loop

I do not consider the optimisations a code degradation.

    There is one semantic change: the new function _PyType_LookupUncached() returns on the first error and might set an exception. I considered that better behaviour than the old function, but it means that if one base class name lookup fails and the next one previously succeeded, it will no longer succeed now. I don't have a strong feeling about this and would change it back if compatibility is considered more valuable. It generally feels wrong to have errors pass silently here, however unlikely they may be in practice.

    @vstinner
    Member

    vstinner commented Sep 5, 2017

    Please use perf.timeit not timeit for microbenchmarks: https://perf.readthedocs.io/en/latest/

    @scoder
    Contributor Author

    scoder commented Sep 5, 2017

    Since I'm getting highly reproducible results on re-runs, I tend to trust these numbers.

    @scoder
    Contributor Author

    scoder commented Sep 6, 2017

BTW, it seems that Yury's dict copy optimisation would also help here. When I use a benchmark scenario with a simple non-empty method/attribute dict (from Cython this time), almost 10% of the creation time is spent copying that dict, which should essentially just be a memcpy() since it doesn't need any resizing at that point.

    @methane
    Member

    methane commented Sep 6, 2017

    Confirmed:
$ ./python-patched -m perf timeit --compare-to `pwd`/python -- 'class C: pass'
    python: ..................... 11.9 us +- 0.1 us
    python-patched: ..................... 10.3 us +- 0.1 us

    Mean +- std dev: [python] 11.9 us +- 0.1 us -> [python-patched] 10.3 us +- 0.1 us: 1.15x faster (-13%)

    @scoder
    Contributor Author

    scoder commented Sep 10, 2017

Any more comments on the proposed implementation? 13-15% seems worth it to me.

    @victor, or are you saying "PyId, or no change at all"?

    @tiran
    Member

    tiran commented Sep 10, 2017

    Good work, Stefan! It's an impressive speedup of class creation.

    It looks like you have not yet addressed Serhiy's comment https://github.com/python/cpython/pull/3279/files#r136815648

    @serhiy-storchaka
    Member

    My comment was addressed.

As for using the _Py_IDENTIFIER API, I don't think it is related. The speedup was caused by avoiding a few simple checks and function calls. The _Py_IDENTIFIER API is great, but it has a small overhead which is comparable to the small difference caused by Stefan's patch.

    @scoder
    Contributor Author

    scoder commented Sep 10, 2017

    No, that one was addressed. I think only Victor's comment is still open, that's why I asked back.

    @tiran
    Member

    tiran commented Sep 10, 2017

    Victor is currently travelling and recovering from jetlag. I'm sure he'll reply within a day or two.

    @vstinner
    Member

    I ran a microbenchmark on the current PR 3279 using:

    ./python -m perf timeit --inherit=PYTHONPATH 'class C: pass'

    Result:

    haypo@selma$ ./python -m perf compare_to ref.json patch.json
    Mean +- std dev: [ref] 9.71 us +- 0.38 us -> [patch] 8.74 us +- 0.22 us: 1.11x faster (-10%)

    I compiled Python using "./configure && make", no LTO nor PGO.

    @vstinner
    Member

    Serhiy on the PR: "This is overgeneralization. Can tp_dict be not exact dict at all? I don't think this is possible. In many places concrete dict API is used with tp_dict. If you want to allow tp_dict be not exact dict, please open a separate issue for this."

With the following code, the type of A.__dict__ is dict even though the metaclass prepares a different mapping type, probably because type_new() calls PyDict_Copy(orig_dict):
    ---

    class mydict(dict):
        def __setitem__(self, name, value):
            if name == "__module__":
                value = "<mock module>"
            super().__setitem__(name, value)
    
    class MetaClass(type):
        @classmethod
        def __prepare__(mcl, name, bases):
            return mydict()
    
    class A(metaclass=MetaClass):
        pass
    
    print(A.__module__)
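Running the snippet above confirms the copy: the hook does fire while the class body executes (the compiler's STORE_NAME goes through the prepared mapping's __setitem__, so A.__module__ becomes "<mock module>"), but attribute assignment after creation goes into the plain copied dict and never calls mydict.__setitem__ again. A self-contained check of both facts (the `calls` recording list is my addition for illustration):

```python
class mydict(dict):
    calls = []                      # records every key set through the hook
    def __setitem__(self, name, value):
        mydict.calls.append(name)
        if name == "__module__":
            value = "<mock module>"
        super().__setitem__(name, value)

class MetaClass(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        return mydict()

class A(metaclass=MetaClass):
    pass

# The hook ran during class-body execution...
assert A.__module__ == "<mock module>"
# ...but type_new() copied the namespace into an exact dict, so a
# later attribute assignment bypasses mydict.__setitem__ entirely.
before = len(mydict.calls)
A.x = 1
assert len(mydict.calls) == before and A.x == 1
```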

    @serhiy-storchaka
    Member

On my computer the speedup is 13% without LTO and PGO and around 20% with LTO and PGO. Building with LTO and PGO adds another 20% on top of that.

    @serhiy-storchaka
    Member

Since _PyDict_GetItem_KnownHash() may or may not set an exception, we have to check for a live exception after calling it, and that check finds the stale exception from the last attribute lookup and decides that its own lookup failed.

Hmm, with PyDict_GetItem() we don't falsely detect a failing lookup when a live exception is set. Is it correct to call _PyType_Lookup() with an exception set? Perhaps we should save the current exception before calling find_name_in_mro() and restore it afterwards. Or raise SystemError if an exception is set. Or just add assert(!PyErr_Occurred()) at the beginning of find_name_in_mro(). I don't know what is more correct.

    @scoder
    Contributor Author

    scoder commented Sep 13, 2017

    Is it correct to call _PyType_Lookup() with an exception set?

    The general rule of thumb is that it's not safe to call any user code with a live exception set, and lookups can call into user code.

I quickly looked through all occurrences (there aren't that many, actually), and they all seem to be called from a context where a live exception would be wrong.

    Perhaps we should save a current exception before calling find_name_in_mro() and restore it after.

    I thought about that, too, but it feels wrong. This just shouldn't be necessary. It's ok to save away exceptions in exception related code, but general purpose code shouldn't have to assume that it gets called with a live exception.

    Or just add assert(!PyErr_Occurred()) at the begin of find_name_in_mro().

That would catch the error at hand. The only concern is backwards compatibility. But it's easy to argue that it's the foreign code that is broken in this case (as showcased by cElementTree) and that it clearly needs fixing if this occurs. An assert() seems like a reasonable way to catch this kind of bug, at least in a debug build.

    @scoder
    Contributor Author

    scoder commented Sep 14, 2017

    One more thing: the fact that the lookup does not propagate exceptions leaves some space for ambiguity. If, in a chain of MRO lookups, one would fail and a later one would succeed, is it correct to keep trying? What if the first failure actually failed to see an otherwise existing attribute in that class? We simply can't know because the lookup failed for *some* reason.

    I think, the only way to really avoid that ambiguity would be to allow _PyType_Lookup() to raise exceptions, and to make it fail on the first error. I've created bpo-31465 for this.
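The ambiguity can be seen in a small model of the error-swallowing lookup (illustrative Python, not CPython's actual code):

```python
def lookup_swallow_errors(mro_dicts, name):
    """Models the old behaviour: any per-dict failure is treated as
    'name not here' and the search continues, so a broken mapping
    early in the chain can silently hide whether it held the name."""
    for d in mro_dicts:
        try:
            return d[name]
        except Exception:   # KeyError and real errors are indistinguishable
            continue
    return None

class BrokenMapping(dict):
    def __getitem__(self, key):
        raise RuntimeError("lookup machinery failed")
```

Here `lookup_swallow_errors([BrokenMapping(), {"x": 1}], "x")` happily returns 1, even though the first mapping might have contained an "x" entry that we simply failed to see; making the lookup propagate the first error removes that ambiguity.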

    @scoder
    Contributor Author

    scoder commented Sep 26, 2017

    Still ready for merging :)

    @serhiy-storchaka
    Member

    I'm going to merge this over the next 24 hours if there are no comments from other core developers.

    @serhiy-storchaka
    Member

    New changeset 2102c78 by Serhiy Storchaka (scoder) in branch 'master':
bpo-31336: Speed up type creation. (#3279)
    2102c78

    @serhiy-storchaka
    Member

    Thank you for your contribution Stefan! I had doubts about your initial (much simpler) changes, but changed my mind.

    @scoder
    Contributor Author

    scoder commented Oct 1, 2017

    Thanks for the reviews, and thank you for merging it, Serhiy.

@ezio-melotti transferred this issue from another repository Apr 10, 2022