Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify _PyObject_FastCall() to reduce stack consumption #73651

Closed
vstinner opened this issue Feb 6, 2017 · 18 comments
Closed

Modify _PyObject_FastCall() to reduce stack consumption #73651

vstinner opened this issue Feb 6, 2017 · 18 comments
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@vstinner
Copy link
Member

vstinner commented Feb 6, 2017

BPO 29465
Nosy @vstinner, @methane, @serhiy-storchaka
Files
  • pyobject_fastcall.patch
  • meth_fastcall_stacksize.patch
  • pyobject_fastcall-2.patch
  • pyobject_fastcall-3.patch
  • pyobject_fastcall-4.patch
  • pyobject_fastcall-5.patch
  • pyobject_fastcall-6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-09-19.23:21:23.237>
    created_at = <Date 2017-02-06.17:08:27.926>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Modify _PyObject_FastCall() to reduce stack consumption'
    updated_at = <Date 2018-09-19.23:21:23.236>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2018-09-19.23:21:23.236>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-09-19.23:21:23.237>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2017-02-06.17:08:27.926>
    creator = 'vstinner'
    dependencies = []
    files = ['46545', '46546', '46553', '46554', '46559', '46599', '46600']
    hgrepos = []
    issue_num = 29465
    keywords = ['patch']
    message_count = 18.0
    messages = ['287153', '287154', '287183', '287184', '287204', '287224', '287234', '287236', '287254', '287257', '287261', '287358', '287360', '287373', '287377', '287531', '288526', '325823']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'methane', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue29465'
    versions = ['Python 3.7']

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 6, 2017

    While testing issue bpo-29464 patch, I failed to see a major enhancement on the stack usage of fast calls without keyword arguments.

    The problem is that functions like _PyObject_FastCallKeywords() and _PyObject_FastCallDict() still have to pass a NULL argument for kwargs/kwnames, and have code to handle keyword arguments.

    Attached patch adds _PyObject_FastCall() to reduce the stack consumption. At the C level, keyword arguments are almost never used. For example, PyObject_CallFunctionObjArgs() is commonly used, whereas it only uses positional arguments.

    The patch changes also _PyObject_FastCallKeywords() and _PyObject_FastCallDict() to move the "slow" path creating temporary tuple and dict in a subfunction which is not inlined. The slow path requires more stack memory.

    Morecall, _PyObject_FastCallKeywords() and _PyObject_FastCallDict() are modified to call _PyObject_FastCall() if there is no keyword.

    The patch might make function calls without keyword arguments faster, I didn't check.

    Stack usage.

    $ ./python -c 'import _testcapi, sys; sys.setrecursionlimit(10**5); n=1000; s=_testcapi.meth_fastcall_stacksize(n); print("%.1f B/call" % (s/n))'
    • Reference: 832.8 B/call
    • Patch: 656.6 B/call (-176.2 B)

    I don't know why the stack usage is not an integer number of bytes?

    Combined with the issue bpo-29464 "Specialize FASTCALL for functions with positional-only parameters", the stack usage can be even more reduced by a few bytes.

    See the issue bpo-28870 for the previous work on reducing stack consumption.

    @vstinner vstinner added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 6, 2017
    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 6, 2017

    meth_fastcall_stacksize.patch: Patch adding
    _testcapi.meth_fastcall_stacksize() to measure the stack usage to call a METH_FASTCALL function.

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    pyobject_fastcall-2.patch: More complete changes. Sorry, the patch also contains unrelated refactoring! It's a more advanced implementation which tries to reduce the depth of the C backtrace. For example, _PyObject_FastCall() is now inlined manually in _PyObject_FastCallDict(). PyObject_Call() is also rewritten. If the overall approach is validated, I will rewriten the patch differently to limit changes, or push some changes in multiple commits.

    Results of testcapi_stacksize.patch + stack_overflow_28870-sp.py (from issue bpo-28870).

    Reference:

    haypo@smithers$ ../default-ref/python stack_overflow_28870-sp.py
    test_python_call: 8586 calls before crash, stack: 976 bytes/call
    test_python_getitem: 9188 calls before crash, stack: 912 bytes/call
    test_python_iterator: 7936 calls before crash, stack: 1056 bytes/call

    => total: 25710 calls, 2944 bytes

    Patch:

    haypo@smithers$ ./python stack_overflow_28870-sp.py
    test_python_call: 9883 calls before crash, stack: 848 bytes/call (-128 B)
    test_python_getitem: 10476 calls before crash, stack: 800 bytes/call (- 112 B)
    test_python_iterator: 8878 calls before crash, stack: 944 bytes/call (- 112 B)

    => total: 29237 calls (+3616), 2592 bytes (- 352 B)

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    Oh, I forgot to rebase my local git branch: patch version 2 contains unrelated changes. Please see instead the path version 3 which was rebased.

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    Benchmarks result. Some are slower (code placement issue?), the faster benchmark are not really much faster.

    haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-3_ref_e06af4027546.json -G --min-speed=4
    Slower (4):

    • scimark_lu: 352 ms +- 11 ms -> 374 ms +- 15 ms: 1.06x slower (+6%)
    • float: 238 ms +- 3 ms -> 251 ms +- 4 ms: 1.05x slower (+5%)
    • logging_silent: 558 ns +- 15 ns -> 585 ns +- 16 ns: 1.05x slower (+5%)
    • raytrace: 1.04 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.05x slower (+5%)

    Faster (5):

    • nbody: 233 ms +- 2 ms -> 216 ms +- 2 ms: 1.08x faster (-7%)
    • pickle_dict: 56.6 us +- 4.3 us -> 52.9 us +- 2.2 us: 1.07x faster (-6%)
    • nqueens: 217 ms +- 2 ms -> 205 ms +- 2 ms: 1.06x faster (-6%)
    • unpickle_pure_python: 670 us +- 11 us -> 642 us +- 8 us: 1.04x faster (-4%)
    • scimark_sor: 405 ms +- 11 ms -> 389 ms +- 10 ms: 1.04x faster (-4%)

    Benchmark hidden because not significant (55): (...)

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    Oh, pyobject_fastcall-3.patch contains a performance bug :-p PyObject_Call() "unpacks" the tuple and then recreates a new tuple to call functions in for functions other than Python and C functions.

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    Ok, I fixed the PyObject_Call() bug, and tried a new approach to help the compiler for code placement: I moved all "call" functions into a new Objects/call.c file. It should help the compiler to inline more code, and I move functions which call themself closer: _PyObject_FastCall(), _PyCFunction_FastCall() and _PyFunction_FastCall() are now very close in the same file for example.

    Not only the stack usage is better, but the performance seems also better: see benchmark results below.

    Again, if you like the _PyObject_FastCall() idea (to reduce the stack consumption), I will split my big patch into smaller patches to have a "smoother" hg/git history.

    Benchmark results on speed-python, CPython compiled with LTO+PGO:

    haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-4_ref_e06af4027546.json -G --min-speed=4
    Faster (11):

    • sympy_expand: 949 ms +- 12 ms -> 878 ms +- 7 ms: 1.08x faster (-8%)
    • chameleon: 21.9 ms +- 0.2 ms -> 20.5 ms +- 0.3 ms: 1.07x faster (-7%)
    • sympy_str: 425 ms +- 5 ms -> 399 ms +- 5 ms: 1.07x faster (-6%)
    • sympy_integrate: 40.8 ms +- 0.3 ms -> 38.3 ms +- 0.4 ms: 1.06x faster (-6%)
    • sympy_sum: 192 ms +- 6 ms -> 181 ms +- 7 ms: 1.06x faster (-6%)
    • scimark_lu: 352 ms +- 11 ms -> 332 ms +- 13 ms: 1.06x faster (-6%)
    • xml_etree_generate: 208 ms +- 2 ms -> 197 ms +- 3 ms: 1.05x faster (-5%)
    • nbody: 233 ms +- 2 ms -> 222 ms +- 3 ms: 1.05x faster (-5%)
    • telco: 14.7 ms +- 0.3 ms -> 14.0 ms +- 0.3 ms: 1.05x faster (-5%)
    • scimark_fft: 662 ms +- 10 ms -> 633 ms +- 8 ms: 1.05x faster (-4%)
    • genshi_text: 71.0 ms +- 0.8 ms -> 68.3 ms +- 0.7 ms: 1.04x faster (-4%)

    Benchmark hidden because not significant (53): (...)

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    Oh, with pyobject_fastcall-4.patch, the stack usage of test_python_iterator is even better. The reference is 1056 bytes/call, pyobject_fastcall-2.patch was 944 bytes/call.

    haypo@smithers$ ./python ../default/stack_overflow_28870-sp.py
    test_python_call: 9883 calls before crash, stack: 848 bytes/call
    test_python_getitem: 10476 calls before crash, stack: 800 bytes/call
    test_python_iterator: 9189 calls before crash, stack: 912 bytes/call

    => total: 29548 calls, 2560 bytes

    @serhiy-storchaka
    Copy link
    Member

    Awesome! This looks fantastic! I need to check the patch very carefully to be sure that we haven't missed something important.

    Isn't the Python directory more appropriate place for call.c?

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    Serhiy Storchaka added the comment:

    Isn't the Python directory more appropriate place for call.c?

    I moved code from other .c files in Objects/, so for me it seems more
    natural to add the code in Objects/ as well. It seems like most of the
    code in Python/ is not "aware" of types defined in Objects/. But I
    don't have a strong opinion on the right directory.

    Objects/call.c is 1500 lines long. IMHO the code became big enough to
    justify to move it to a new file ;-)

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 7, 2017

    pyobject_fastcall-4.patch combines a lot of changes. I wrote it to experiment _PyObject_FastCall().

    I will rewrite these changes as a patch serie with smaller changes.

    The design of PyObject_FastCall*() is to be short and simple.

    Changes:

    • Add _PyObject_FastCall(), _PyFunction_FastCall(), _PyCFunction_FastCall(): similar to the _PyObject_FastCallKeywords() but without keyword arguments. Without keyword arguments, the checks on keyword arguments can be removed, and it allows to reduce the stack usage: one less parameter to C functions.

    • Move the slow path out of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in a new subfunction which is tagged with _Py_NO_INLINE to reduce the stack consumption of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in their fast paths, which are the most code paths.

    • Move all "call" functions into a new Objects/call.c function. This change should help code placement to enhance the usage of the CPU caches (especially the CPU L1 instruction cache). In my long experience with benchmarking last year, I notice huge performance differences caused by code placement. See my blog post:
      https://haypo.github.io/analysis-python-performance-issue.html
      Sadly, _Py_HOT_FUNCTION didn't fix the issue completely.

    • _PyObject_FastCallDict() and _PyObject_FastCallKeywords() "call" _PyObject_FastCall(). In fact, _PyObject_FastCall() is inlined manually in these functions, against, to minimize the stack usage. Similar change in _PyCFunction_FastCallDict() and PyCFunction_Call().

    • Since the slow path is moved to a subfunction, I removed _Py_NO_INLINE from _PyStack_AsTuple() to allow the compiler to inline it if it wants to ;-) Since the stack usage is better with the patch, it seems like this strange has no negative impact on the stack usage.

    • Optimize PyMethodDescr_Type (call _PyMethodDescr_FastCallKeywords()) in _PyObject_FastCallKeywords()

    • I moved Py_EnterRecursiveCall() closer to the final function call, to simplify error handling. It would be nice to fix the issue bpo-29306 before :-/

    • I had to copy/paste the null_error() function from Objects/abstract.c. I don't think that it's worth it to add a _PyErr_NullError() function shared by abstract.c and call.c. Compilers are even able to merge duplicated functions ;-)

    • A few other changes.

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 8, 2017

    I splitted my big patch into smaller changes, see my pull request:
    #74

    If you prefer Rietveld ([Review] button), I also attached the squashed change: pyobject_fastcall-5.patch.

    @vstinner
    Copy link
    Member Author

    vstinner commented Feb 8, 2017

    Crap, pyobject_fastcall-5.patch was not rebased correctly :-/ Please see pyobject_fastcall-6.patch instead.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 9, 2017

    New changeset 31342913fb1e by Victor Stinner in branch 'default':
    Fix PyCFunction_Call() performance issue
    https://hg.python.org/cpython/rev/31342913fb1e

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 9, 2017

    New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master':
    Fix PyCFunction_Call() performance issue
    e9807e3

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 10, 2017

    New changeset f23fa1f7b68f by Victor Stinner in branch 'default':
    Issue bpo-29465: Add Objects/call.c file
    https://hg.python.org/cpython/rev/f23fa1f7b68f

    @vstinner vstinner changed the title Add _PyObject_FastCall() to reduce stack consumption Modify _PyObject_FastCall() to reduce stack consumption Feb 23, 2017
    @vstinner
    Copy link
    Member Author

    Hum, code to call functions changed a lot recently, and other issues already enhanced the stack usage.

    I should measure the stack usage to check if it's still worth it.

    @vstinner
    Copy link
    Member Author

    I should measure the stack usage to check if it's still worth it.

    IMHO we gone far enough in optimizing the stack usage:
    https://vstinner.github.io/contrib-cpython-2017q1.html

    I close the issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants