Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TODO list when PEP 573 "Module State Access from C Extension Methods" will be implemented #84318

Closed
vstinner opened this issue Apr 1, 2020 · 28 comments
Labels
3.10 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@vstinner
Copy link
Member

vstinner commented Apr 1, 2020

BPO 40137
Nosy @rhettinger, @vstinner, @phsilva, @encukou, @serhiy-storchaka, @corona10, @pablogsal, @miss-islington, @shihai1991, @erlend-aasland
PRs
  • bpo-40077: Convert _jsonmodule to use PyType_FromSpec. #19177
  • [WIP] bpo-40137: Use pep573 in functools extension module. #20055
  • bpo-40137: Convert _functools module to use PyType_FromModuleAndSpec. #23405
  • bpo-40137: Fix refleak in _functools_exec() #24006
  • bpo-40137: Micro-optimize _PyType_GetModuleByDef() #25393
  • bpo-40137: Move state lookups out of the critical path #25492
  • bpo-40137: _PyType_GetModuleByDef() doesn't check tp_flags #25504
  • bpo-40137: Optimize _PyType_GetModuleByDef() loop #25505
  • bpo-40137: Add pycore_moduleobject.h internal header #25507
  • Files
  • patch.diff: Move get_functools_state_by_type() to lru_cache_new()
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-04-21.23:28:53.908>
    created_at = <Date 2020-04-01.15:08:23.822>
    labels = ['type-feature', 'library', '3.10']
    title = 'TODO list when PEP 573 "Module State Access from C Extension Methods" will be implemented'
    updated_at = <Date 2021-05-06.00:04:16.932>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2021-05-06.00:04:16.932>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-04-21.23:28:53.908>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2020-04-01.15:08:23.822>
    creator = 'vstinner'
    dependencies = []
    files = ['49958']
    hgrepos = []
    issue_num = 40137
    keywords = ['patch']
    message_count = 28.0
    messages = ['365482', '365486', '368852', '380668', '380669', '383993', '384036', '384037', '384043', '384048', '384049', '384056', '390756', '390758', '390759', '390768', '390769', '391004', '391005', '391045', '391547', '391549', '391550', '391553', '391557', '391560', '391561', '393056']
    nosy_count = 10.0
    nosy_names = ['rhettinger', 'vstinner', 'phsilva', 'petr.viktorin', 'serhiy.storchaka', 'corona10', 'pablogsal', 'miss-islington', 'shihai1991', 'erlendaasland']
    pr_nums = ['19177', '20055', '23405', '24006', '25393', '25492', '25504', '25505', '25507']
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue40137'
    versions = ['Python 3.10']

    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 1, 2020

    In bpo-1635741, many C extension modules are converted to PEP-489 multiphase initialization and/or modified to get a module state. Problem: the module state cannot be accessed in some functions, and so some of these changes had to workaround the fact that PEP-573 "Module State Access from C Extension Methods" is not implemented yet.

    This issue tracks C extension modules which should be modified once PEP-573 will be implemented:

    • _functools: Py_CLEAR(kwd_mark); is commented in _functools_free()
      See commit eacc074

    • _abc: abc_invalidation_counter is shared by all module instances. abc_data_new() requires access to abc_invalidation_counter but it doesn't have access to the module state. abc_invalidation_counter should be moved to the module state.

    @vstinner vstinner added 3.9 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 1, 2020
    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 1, 2020

    Another minor issue: "assert(PyScanner_Check(self));" and "assert(PyEncoder_Check(self));" were removed by commit 33f15a1 when _json module got a module state.

    scanner_traverse(), scanner_clear(), encoder_traverse() and encoder_clear() cannot get the module state of a module using PEP-489 multiphase initialization.

    Similar issue in scanner_call() and encoder_call().

    I'm not sure that the PEP-573 provide a solution for this exact issue, since the intent of the assertion is to check the type of the argument. The PEP-573 gives access to the module state through types. But here we are not sure that the object type is the expected type... :-)

    Again, it's a minor issue, it can be ignored. That's why I accepted to merge the commit. Technically, I don't see how we could get the wrong type in practice.

    @vstinner
    Copy link
    Member Author

    I consider that all things that could be done have already been done, so I close the issue. Thanks for Hai and Dong-hee who helped ;-)

    • _functools: Py_CLEAR(kwd_mark); is commented in _functools_free()

    Sadly, PEP-573 implementation is not complete enough to use it in _functools: see PR 20055.

    • _abc: abc_invalidation_counter

    Fixed in bpo-40566 by:

    commit 77c6146
    Author: Dong-hee Na <donghee.na92@gmail.com>
    Date: Sat May 9 17:31:40 2020 +0900

    bpo-40566: Apply PEP-573 to abc module (GH-20005)
    

    scanner_traverse(), scanner_clear(), encoder_traverse() and encoder_clear()

    tp_clear slot cannot get the defining type (PEP-573 may be extended later to allow that).

    Using PyType_GetModule(Py_TYPE(self)) to access types stored in the module state and then compare Py_TYPE(self) to these types... looks badly broken :-) If we get the wrong type, PyType_GetModule(Py_TYPE(self)) will return the wrong module and so we cannot get the json state from the wrong module... It's a chicken-and-egg issue :-)

    I think that it's not worth it to attempt to add back these assertions. It's very unlikely to get the wrong types in practice.

    @encukou
    Copy link
    Member

    encukou commented Nov 10, 2020

    It should be possible to get them back using _PyType_GetModuleByDef.

    @vstinner
    Copy link
    Member Author

    Oh nice, in this case, I reopen my issue :-)

    @vstinner vstinner reopened this Nov 10, 2020
    @vstinner vstinner reopened this Nov 10, 2020
    @shihai1991 shihai1991 added 3.10 only security fixes and removed 3.9 only security fixes labels Nov 10, 2020
    @miss-islington
    Copy link
    Contributor

    New changeset dd39123 by Hai Shi in branch 'master':
    bpo-40137: Convert _functools module to use PyType_FromModuleAndSpec. (GH-23405)
    dd39123

    @vstinner
    Copy link
    Member Author

    The following commit introduces a reference leak:

    commit dd39123
    Author: Hai Shi <shihai1992@gmail.com>
    Date: Tue Dec 29 20:45:07 2020 +0800

    bpo-40137: Convert _functools module to use PyType_FromModuleAndSpec. (GH-23405)
    

    Lib/test/test_functools.py | 3 +-
    .../2020-11-19-23-12-57.bpo-40137.bihl9O.rst | 1 +
    Modules/_functoolsmodule.c | 475 +++++++++++----------
    3 files changed, 255 insertions(+), 224 deletions(-)
    create mode 100644 Misc/NEWS.d/next/Core and Builtins/2020-11-19-23-12-57.bpo-40137.bihl9O.rst

    Leaks:

    test_interpreters leaked [2038, 2034, 2038] references, sum=6110
    test_threading leaked [452, 452, 452] references, sum=1356
    test_capi leaked [452, 452, 452] references, sum=1356
    test__xxsubinterpreters leaked [6332, 6328, 6332] references, sum=18992
    test_ast leaked [226, 226, 226] references, sum=678

    https://buildbot.python.org/all/#/builders/129/builds/144

    @vstinner
    Copy link
    Member Author

    I wrote PR 24006 to fix the leak.

    @vstinner
    Copy link
    Member Author

    New changeset ba0e49a by Victor Stinner in branch 'master':
    bpo-40137: Fix refleak in _functools_exec() (GH-24006)
    ba0e49a

    @shihai1991
    Copy link
    Member

    I wrote PR 24006 to fix the leak.
    Oh, thanks for your fix, victor.

    @shihai1991
    Copy link
    Member

    After PR 23405 and PR 24006 merged, I think this bpo can be closed.

    @vstinner
    Copy link
    Member Author

    _abc: abc_invalidation_counter is shared by all module instances.

    For the record, this one has been fixed by:

    commit 77c6146
    Author: Dong-hee Na <donghee.na92@gmail.com>
    Date: Sat May 9 17:31:40 2020 +0900

    bpo-40566: Apply PEP-573 to abc module (GH-20005)
    

    @rhettinger
    Copy link
    Contributor

    State lookups and indirections were added to most performance sensitive code part of the

    _functools_state *state;
    
        state = get_functools_state_by_type(Py_TYPE(self));
        if (state == NULL) {
            return NULL;
        }
        key = lru_cache_make_key(state, args, kwds, self->typed);

    I think it should be removed for Python3.10. Right now, it is of zero value to users (actually, negative value).

    @rhettinger rhettinger reopened this Apr 11, 2021
    @rhettinger rhettinger reopened this Apr 11, 2021
    @pablogsal
    Copy link
    Member

    Here are some benchmarks of using lru_cache in 3.9 and 3.10 (PGO/LTO/CPU isol):

    ❯ ./python -m pyperf timeit "from functools import lru_cache; f = lru_cache(lambda: 42)" "f()" --compare-to ../3.9/python
    /home/pablogsal/github/3.9/python: ..................... 2.60 us +- 0.05 us
    /home/pablogsal/github/cpython/python: ..................... 2.74 us +- 0.06 us

    Mean +- std dev: [/home/pablogsal/github/3.9/python] 2.60 us +- 0.05 us -> [/home/pablogsal/github/cpython/python] 2.74 us +- 0.06 us: 1.06x slower

    Given that lru_cache is normally used to seek speed, this is a bit unfortunate :(

    @pablogsal
    Copy link
    Member

    As a Release Manager, I am not very happy with a 6% reduction in speed, especially in these functions. Can someone take a look to see if we can get the performance closer to Python 3.9?

    @rhettinger
    Copy link
    Contributor

    Serhiy, do you think we could substitute the "self" instance of functools._lru_cache_wrapper object for the kw_mark? It already compares on identity, so it should be just a fast as the instance of object(). AFAICT, kwd_mark doesn't have to be unique across instances.

    To avoid creating a GC only circular reference, we could skip the INCREF, leaving the object weakly referencing itself.

    In bounded_lru_cache_wrapper(), the other state reference reference is in creating a new lru_list_elem. There are several things we can do about this. 1) move the expensive get_functools_state_by_type() inside the code block that creates new links, so that it won't affect a full cache. 2) create a per-instance variable that references the per-type lru_list_elem. This has the downside of making each lru_cache instance slightly larger. 3) prepopulate the lru_cache with maxsize links. This simplifies the code and makes it slightly faster but will waste spaces for use cases that never actually fill-up the cache lazily, also it slows down start-up time.

    Ideally, these PRs should be reverted for Python 3.10. AFAICT, no user of functools will benefit in any way from these changes. Presumably, that is why there are no tests. It is all cost and no benefit.

    BTW, the 6% timing is likely understated because the loop is executed entirely in L1 cache or the CPU's recent write cache rather than real world conditions where there is fierce competition of the limited 32kb L1 data cache. Just looking at the total number of sequentially dependent memory accesses, I expect that the actual cost is more than a third.

    @rhettinger
    Copy link
    Contributor

    FWIW, I've only looked at the lru_cache() code. Someone should go through all the PyModule_GetState calls to see if they are on a critical path. AFAICT a huge number of these changes were applied without regard to whether or not they occurred on an existing fast path.

    This is a considerable number of clock cycles for a type lookup that used to be almost cost free. Whether it is high or low impact greatly depends on whether the surrounding code was doing a lot of work or very little work. In thin functions, the incremental costs will be higher.

    @vstinner
    Copy link
    Member Author

    ./python -m pyperf timeit "from functools import lru_cache; f = lru_cache(lambda: 42)" "f()" --compare-to ../3.9/python
    /home/pablogsal/github/3.9/python: ..................... 2.60 us +- 0.05 us

    You misused pyperf timeit: the two statements are run at each iteration of the benchmark.

    I rerun the benchmark on Linux with PGO+LTO and CPU isolation:

    Mean +- std dev: [py39] 37.5 ns +- 1.0 ns -> [master] 43.2 ns +- 0.7 ns: 1.15x slower

    I understand that adding get_functools_state_by_type() has a cost of +5.7 ns on the performance of functions decorated with @lru_cache.

    I used the commands:

    ./configure --enable-optimizations --with-lto
    make
    ./python -m venv env
    env/bin/python -m pip install pyperf
    ./env/bin/python -m pyperf timeit -s "from functools import lru_cache; f = lru_cache(lambda: 42)" "f()" -o master.json -v --duplicate=4096

    @vstinner
    Copy link
    Member Author

    I wrote PR 25393 to micro-optimize _PyType_GetModuleByDef(). It reduces the overhead from +5.7 ns to +3.4 ns (-2.3 ns).

    Well, I didn't check in depth if _PyType_GetModuleByDef() is the only change in functools.lru_cache() from Python 3.9 to master.

    Compare master to Python 3.9:

    Mean +- std dev: [py39] 37.5 ns +- 1.0 ns -> [master] 43.2 ns +- 0.7 ns: 1.15x slower

    Compare PR 25393 to Python 3.9:

    Mean +- std dev: [py39] 37.5 ns +- 1.0 ns -> [inline] 40.9 ns +- 1.0 ns: 1.09x slower

    Compare PR 25393 to master:

    Mean +- std dev: [master] 43.2 ns +- 0.7 ns -> [inline] 40.9 ns +- 1.0 ns: 1.05x faster

    @erlend-aasland
    Copy link
    Contributor

    Raymond:

    1. move the expensive get_functools_state_by_type() inside the code block that creates new links, so that it won't affect a full cache.

    This should be as easy as adding a _functools_state pointer to struct lru_cache_object, init that pointer in lru_cache_new(), and then fetch it from self in infinite_lru_cache_wrapper() and bounded_lru_cache_wrapper(), no? Diff attached.

    @vstinner
    Copy link
    Member Author

    New changeset d4aaa34 by Victor Stinner in branch 'master':
    bpo-40137: _PyType_GetModuleByDef() doesn't check tp_flags (GH-25504)
    d4aaa34

    @vstinner
    Copy link
    Member Author

    New changeset 760da62 by Victor Stinner in branch 'master':
    bpo-40137: Optimize _PyType_GetModuleByDef() loop (GH-25505)
    760da62

    @rhettinger
    Copy link
    Contributor

    New changeset 139c232 by Raymond Hettinger in branch 'master':
    bpo-40137: Move state lookups out of the critical path (GH-25492)
    139c232

    @vstinner
    Copy link
    Member Author

    New changeset cdad272 by Victor Stinner in branch 'master':
    bpo-40137: Add pycore_moduleobject.h internal header (GH-25507)
    cdad272

    @vstinner
    Copy link
    Member Author

    New microbenchmark on the functools.lru_cache(lambda: 42) function using my 3 optimizations on _PyType_GetModuleByDef():

    • commit d4aaa34
    • commit 760da62
    • commit cdad272
    • I didn't pick Raymond's optimization for this benchmark, so I can still use _functools for my benchmark.
    $ python3 -m pyperf compare_to py39.json master.json 
    Mean +- std dev: [py39] 38.8 ns +- 0.5 ns -> [master] 39.2 ns +- 0.9 ns: 1.01x slower

    The _PyType_GetModuleByDef() overhead in _functools is now about +0.4 ns, it's better than my previous measurement before optimization: +5.7 ns (37.5 ns +- 1.0 ns -> 43.2 ns +- 0.7 ns).

    These timings are really tiny, it's really hard to get reliable timing even with CPU isolation. For example, two measurements on Python 3.9:

    • Old measurement: 37.5 ns +- 1.0 ns
    • New measurement: 38.8 ns +- 0.5 ns

    I guess that for timings under 100 ns, the PGO build is no longer reliable enough. Moreover, the std dev is around to 1 ns on 40 ns.

    @vstinner
    Copy link
    Member Author

    msg391004 benchmark comparing master with Raymond's optimization (commit 139c232) compared to Python 3.9:

    Mean +- std dev: [py39] 38.8 ns +- 0.5 ns -> [obj_cache] 40.3 ns +- 0.1 ns: 1.04x slower

    For me, these numbers make no sense :-D Getting kwd_mark from the instance (Raymond's new code) *must* be faster than calling _PyType_GetModuleByDef() on the instance.

    As I wrote in my previous comment, my bet is that PGO is not reproducible and so we should not pay too much attention to differences of a few nanoseconds.

    @vstinner
    Copy link
    Member Author

    IMO we spent enough time micro-optimizing _functools and _PyType_GetModuleByDef(). I don't think that it's worth it to spend too much time trying to investigate differences between benchmarks around +- 1 ns on ~40 ns. I close the issue.

    Thanks Raymond for fixing the slowdown ;-)

    @rhettinger
    Copy link
    Contributor

    Victor, thanks for the work to mitigate the costs of _PyType_GetModuleByDef(). It now has much lower overhead.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants