New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify _PyObject_FastCall() to reduce stack consumption #73651
Comments
While testing issue bpo-29464 patch, I failed to see a major enhancement on the stack usage of fast calls without keyword arguments. The problem is that functions like _PyObject_FastCallKeywords() and _PyObject_FastCallDict() still have to pass a NULL argument for kwargs/kwnames, and have code to handle keyword arguments. Attached patch adds _PyObject_FastCall() to reduce the stack consumption. At the C level, keyword arguments are almost never used. For example, PyObject_CallFunctionObjArgs() is commonly used, whereas it only uses positional arguments. The patch changes also _PyObject_FastCallKeywords() and _PyObject_FastCallDict() to move the "slow" path creating temporary tuple and dict in a subfunction which is not inlined. The slow path requires more stack memory. Morecall, _PyObject_FastCallKeywords() and _PyObject_FastCallDict() are modified to call _PyObject_FastCall() if there is no keyword. The patch might make function calls without keyword arguments faster, I didn't check. Stack usage. $ ./python -c 'import _testcapi, sys; sys.setrecursionlimit(10**5); n=1000; s=_testcapi.meth_fastcall_stacksize(n); print("%.1f B/call" % (s/n))'
I don't know why the stack usage is not an integer number of bytes? Combined with the issue bpo-29464 "Specialize FASTCALL for functions with positional-only parameters", the stack usage can be even more reduced by a few bytes. See the issue bpo-28870 for the previous work on reducing stack consumption. |
meth_fastcall_stacksize.patch: Patch adding |
pyobject_fastcall-2.patch: More complete changes. Sorry, the patch also contains unrelated refactoring! It's a more advanced implementation which tries to reduce the depth of the C backtrace. For example, _PyObject_FastCall() is now inlined manually in _PyObject_FastCallDict(). PyObject_Call() is also rewritten. If the overall approach is validated, I will rewriten the patch differently to limit changes, or push some changes in multiple commits. Results of testcapi_stacksize.patch + stack_overflow_28870-sp.py (from issue bpo-28870). Reference: haypo@smithers$ ../default-ref/python stack_overflow_28870-sp.py => total: 25710 calls, 2944 bytes Patch: haypo@smithers$ ./python stack_overflow_28870-sp.py => total: 29237 calls (+3616), 2592 bytes (- 352 B) |
Oh, I forgot to rebase my local git branch: patch version 2 contains unrelated changes. Please see instead the path version 3 which was rebased. |
Benchmarks result. Some are slower (code placement issue?), the faster benchmark are not really much faster. haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-3_ref_e06af4027546.json -G --min-speed=4
Faster (5):
Benchmark hidden because not significant (55): (...) |
Oh, pyobject_fastcall-3.patch contains a performance bug :-p PyObject_Call() "unpacks" the tuple and then recreates a new tuple to call functions in for functions other than Python and C functions. |
Ok, I fixed the PyObject_Call() bug, and tried a new approach to help the compiler for code placement: I moved all "call" functions into a new Objects/call.c file. It should help the compiler to inline more code, and I move functions which call themself closer: _PyObject_FastCall(), _PyCFunction_FastCall() and _PyFunction_FastCall() are now very close in the same file for example. Not only the stack usage is better, but the performance seems also better: see benchmark results below. Again, if you like the _PyObject_FastCall() idea (to reduce the stack consumption), I will split my big patch into smaller patches to have a "smoother" hg/git history. Benchmark results on speed-python, CPython compiled with LTO+PGO: haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-4_ref_e06af4027546.json -G --min-speed=4
Benchmark hidden because not significant (53): (...) |
Oh, with pyobject_fastcall-4.patch, the stack usage of test_python_iterator is even better. The reference is 1056 bytes/call, pyobject_fastcall-2.patch was 944 bytes/call. haypo@smithers$ ./python ../default/stack_overflow_28870-sp.py => total: 29548 calls, 2560 bytes |
Awesome! This looks fantastic! I need to check the patch very carefully to be sure that we haven't missed something important. Isn't the Python directory more appropriate place for call.c? |
Serhiy Storchaka added the comment:
I moved code from other .c files in Objects/, so for me it seems more Objects/call.c is 1500 lines long. IMHO the code became big enough to |
pyobject_fastcall-4.patch combines a lot of changes. I wrote it to experiment _PyObject_FastCall(). I will rewrite these changes as a patch serie with smaller changes. The design of PyObject_FastCall*() is to be short and simple. Changes:
|
I splitted my big patch into smaller changes, see my pull request: If you prefer Rietveld ([Review] button), I also attached the squashed change: pyobject_fastcall-5.patch. |
Crap, pyobject_fastcall-5.patch was not rebased correctly :-/ Please see pyobject_fastcall-6.patch instead. |
New changeset 31342913fb1e by Victor Stinner in branch 'default': |
New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master': |
New changeset f23fa1f7b68f by Victor Stinner in branch 'default': |
Hum, code to call functions changed a lot recently, and other issues already enhanced the stack usage. I should measure the stack usage to check if it's still worth it. |
IMHO we gone far enough in optimizing the stack usage: I close the issue. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: