msg287153 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-06 17:08 |
While testing issue #29464 patch, I failed to see a major enhancement on the stack usage of fast calls without keyword arguments.
The problem is that functions like _PyObject_FastCallKeywords() and _PyObject_FastCallDict() still have to pass a NULL argument for kwargs/kwnames, and have code to handle keyword arguments.
Attached patch adds _PyObject_FastCall() to reduce the stack consumption. At the C level, keyword arguments are almost never used. For example, PyObject_CallFunctionObjArgs() is commonly used, whereas it only uses positional arguments.
The patch changes also _PyObject_FastCallKeywords() and _PyObject_FastCallDict() to move the "slow" path creating temporary tuple and dict in a subfunction which is not inlined. The slow path requires more stack memory.
Morecall, _PyObject_FastCallKeywords() and _PyObject_FastCallDict() are modified to call _PyObject_FastCall() if there is no keyword.
The patch might make function calls without keyword arguments faster, I didn't check.
Stack usage.
$ ./python -c 'import _testcapi, sys; sys.setrecursionlimit(10**5); n=1000; s=_testcapi.meth_fastcall_stacksize(n); print("%.1f B/call" % (s/n))'
* Reference: 832.8 B/call
* Patch: 656.6 B/call (-176.2 B)
I don't know why the stack usage is not an integer number of bytes?
Combined with the issue #29464 "Specialize FASTCALL for functions with positional-only parameters", the stack usage can be even more reduced by a few bytes.
See the issue #28870 for the previous work on reducing stack consumption.
|
msg287154 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-06 17:11 |
meth_fastcall_stacksize.patch: Patch adding
_testcapi.meth_fastcall_stacksize() to measure the stack usage to call a METH_FASTCALL function.
|
msg287183 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 01:05 |
pyobject_fastcall-2.patch: More complete changes. Sorry, the patch also contains unrelated refactoring! It's a more advanced implementation which tries to reduce the depth of the C backtrace. For example, _PyObject_FastCall() is now inlined manually in _PyObject_FastCallDict(). PyObject_Call() is also rewritten. If the overall approach is validated, I will rewriten the patch differently to limit changes, or push some changes in multiple commits.
Results of testcapi_stacksize.patch + stack_overflow_28870-sp.py (from issue #28870).
Reference:
haypo@smithers$ ../default-ref/python stack_overflow_28870-sp.py
test_python_call: 8586 calls before crash, stack: 976 bytes/call
test_python_getitem: 9188 calls before crash, stack: 912 bytes/call
test_python_iterator: 7936 calls before crash, stack: 1056 bytes/call
=> total: 25710 calls, 2944 bytes
Patch:
haypo@smithers$ ./python stack_overflow_28870-sp.py
test_python_call: 9883 calls before crash, stack: 848 bytes/call (-128 B)
test_python_getitem: 10476 calls before crash, stack: 800 bytes/call (- 112 B)
test_python_iterator: 8878 calls before crash, stack: 944 bytes/call (- 112 B)
=> total: 29237 calls (+3616), 2592 bytes (- 352 B)
|
msg287184 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 01:12 |
Oh, I forgot to rebase my local git branch: patch version 2 contains unrelated changes. Please see instead the path version 3 which was rebased.
|
msg287204 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 08:17 |
Benchmarks result. Some are slower (code placement issue?), the faster benchmark are not really much faster.
haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-3_ref_e06af4027546.json -G --min-speed=4
Slower (4):
- scimark_lu: 352 ms +- 11 ms -> 374 ms +- 15 ms: 1.06x slower (+6%)
- float: 238 ms +- 3 ms -> 251 ms +- 4 ms: 1.05x slower (+5%)
- logging_silent: 558 ns +- 15 ns -> 585 ns +- 16 ns: 1.05x slower (+5%)
- raytrace: 1.04 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.05x slower (+5%)
Faster (5):
- nbody: 233 ms +- 2 ms -> 216 ms +- 2 ms: 1.08x faster (-7%)
- pickle_dict: 56.6 us +- 4.3 us -> 52.9 us +- 2.2 us: 1.07x faster (-6%)
- nqueens: 217 ms +- 2 ms -> 205 ms +- 2 ms: 1.06x faster (-6%)
- unpickle_pure_python: 670 us +- 11 us -> 642 us +- 8 us: 1.04x faster (-4%)
- scimark_sor: 405 ms +- 11 ms -> 389 ms +- 10 ms: 1.04x faster (-4%)
Benchmark hidden because not significant (55): (...)
|
msg287224 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 10:16 |
Oh, pyobject_fastcall-3.patch contains a performance bug :-p PyObject_Call() "unpacks" the tuple and then recreates a new tuple to call functions in for functions other than Python and C functions.
|
msg287234 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 14:46 |
Ok, I fixed the PyObject_Call() bug, and tried a new approach to help the compiler for code placement: I moved all "call" functions into a new Objects/call.c file. It should help the compiler to inline more code, and I move functions which call themself closer: _PyObject_FastCall(), _PyCFunction_FastCall() and _PyFunction_FastCall() are now very close in the same file for example.
Not only the stack usage is better, but the performance seems also better: see benchmark results below.
Again, if you like the _PyObject_FastCall() idea (to reduce the stack consumption), I will split my big patch into smaller patches to have a "smoother" hg/git history.
Benchmark results on speed-python, CPython compiled with LTO+PGO:
haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-4_ref_e06af4027546.json -G --min-speed=4
Faster (11):
- sympy_expand: 949 ms +- 12 ms -> 878 ms +- 7 ms: 1.08x faster (-8%)
- chameleon: 21.9 ms +- 0.2 ms -> 20.5 ms +- 0.3 ms: 1.07x faster (-7%)
- sympy_str: 425 ms +- 5 ms -> 399 ms +- 5 ms: 1.07x faster (-6%)
- sympy_integrate: 40.8 ms +- 0.3 ms -> 38.3 ms +- 0.4 ms: 1.06x faster (-6%)
- sympy_sum: 192 ms +- 6 ms -> 181 ms +- 7 ms: 1.06x faster (-6%)
- scimark_lu: 352 ms +- 11 ms -> 332 ms +- 13 ms: 1.06x faster (-6%)
- xml_etree_generate: 208 ms +- 2 ms -> 197 ms +- 3 ms: 1.05x faster (-5%)
- nbody: 233 ms +- 2 ms -> 222 ms +- 3 ms: 1.05x faster (-5%)
- telco: 14.7 ms +- 0.3 ms -> 14.0 ms +- 0.3 ms: 1.05x faster (-5%)
- scimark_fft: 662 ms +- 10 ms -> 633 ms +- 8 ms: 1.05x faster (-4%)
- genshi_text: 71.0 ms +- 0.8 ms -> 68.3 ms +- 0.7 ms: 1.04x faster (-4%)
Benchmark hidden because not significant (53): (...)
|
msg287236 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 15:05 |
Oh, with pyobject_fastcall-4.patch, the stack usage of test_python_iterator is even better. The reference is 1056 bytes/call, pyobject_fastcall-2.patch was 944 bytes/call.
haypo@smithers$ ./python ../default/stack_overflow_28870-sp.py
test_python_call: 9883 calls before crash, stack: 848 bytes/call
test_python_getitem: 10476 calls before crash, stack: 800 bytes/call
test_python_iterator: 9189 calls before crash, stack: 912 bytes/call
=> total: 29548 calls, 2560 bytes
|
msg287254 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-02-07 19:31 |
Awesome! This looks fantastic! I need to check the patch very carefully to be sure that we haven't missed something important.
Isn't the Python directory more appropriate place for call.c?
|
msg287257 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 21:49 |
Serhiy Storchaka added the comment:
> Isn't the Python directory more appropriate place for call.c?
I moved code from other .c files in Objects/, so for me it seems more
natural to add the code in Objects/ as well. It seems like most of the
code in Python/ is not "aware" of types defined in Objects/. But I
don't have a strong opinion on the right directory.
Objects/call.c is 1500 lines long. IMHO the code became big enough to
justify to move it to a new file ;-)
|
msg287261 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-07 22:42 |
pyobject_fastcall-4.patch combines a lot of changes. I wrote it to experiment _PyObject_FastCall().
I will rewrite these changes as a patch serie with smaller changes.
The design of PyObject_FastCall*() is to be short and simple.
Changes:
* Add _PyObject_FastCall(), _PyFunction_FastCall(), _PyCFunction_FastCall(): similar to the _PyObject_FastCallKeywords() but without keyword arguments. Without keyword arguments, the checks on keyword arguments can be removed, and it allows to reduce the stack usage: one less parameter to C functions.
* Move the slow path out of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in a new subfunction which is tagged with _Py_NO_INLINE to reduce the stack consumption of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in their fast paths, which are the most code paths.
* Move all "call" functions into a new Objects/call.c function. This change should help code placement to enhance the usage of the CPU caches (especially the CPU L1 instruction cache). In my long experience with benchmarking last year, I notice huge performance differences caused by code placement. See my blog post:
https://haypo.github.io/analysis-python-performance-issue.html
Sadly, _Py_HOT_FUNCTION didn't fix the issue completely.
* _PyObject_FastCallDict() and _PyObject_FastCallKeywords() "call" _PyObject_FastCall(). In fact, _PyObject_FastCall() is inlined manually in these functions, against, to minimize the stack usage. Similar change in _PyCFunction_FastCallDict() and PyCFunction_Call().
* Since the slow path is moved to a subfunction, I removed _Py_NO_INLINE from _PyStack_AsTuple() to allow the compiler to inline it if it wants to ;-) Since the stack usage is better with the patch, it seems like this strange has no negative impact on the stack usage.
* Optimize PyMethodDescr_Type (call _PyMethodDescr_FastCallKeywords()) in _PyObject_FastCallKeywords()
* I moved Py_EnterRecursiveCall() closer to the final function call, to simplify error handling. It would be nice to fix the issue #29306 before :-/
* I had to copy/paste the null_error() function from Objects/abstract.c. I don't think that it's worth it to add a _PyErr_NullError() function shared by abstract.c and call.c. Compilers are even able to merge duplicated functions ;-)
* A few other changes.
|
msg287358 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-08 17:35 |
I splitted my big patch into smaller changes, see my pull request:
https://github.com/python/cpython/pull/74
If you prefer Rietveld ([Review] button), I also attached the squashed change: pyobject_fastcall-5.patch.
|
msg287360 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-08 17:38 |
Crap, pyobject_fastcall-5.patch was not rebased correctly :-/ Please see pyobject_fastcall-6.patch instead.
|
msg287373 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2017-02-09 01:03 |
New changeset 31342913fb1e by Victor Stinner in branch 'default':
Fix PyCFunction_Call() performance issue
https://hg.python.org/cpython/rev/31342913fb1e
|
msg287377 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2017-02-09 02:00 |
New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master':
Fix PyCFunction_Call() performance issue
https://github.com/python/cpython/commit/e9807e307f1f561a2dfe3e9f97a38444528dba86
|
msg287531 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2017-02-10 13:15 |
New changeset f23fa1f7b68f by Victor Stinner in branch 'default':
Issue #29465: Add Objects/call.c file
https://hg.python.org/cpython/rev/f23fa1f7b68f
|
msg288526 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-24 15:23 |
Hum, code to call functions changed a lot recently, and other issues already enhanced the stack usage.
I should measure the stack usage to check if it's still worth it.
|
msg325823 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-09-19 23:21 |
> I should measure the stack usage to check if it's still worth it.
IMHO we gone far enough in optimizing the stack usage:
https://vstinner.github.io/contrib-cpython-2017q1.html
I close the issue.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:42 | admin | set | github: 73651 |
2018-09-19 23:21:23 | vstinner | set | status: open -> closed resolution: fixed messages:
+ msg325823
stage: patch review -> resolved |
2017-12-13 10:28:28 | vstinner | set | pull_requests:
- pull_request25 |
2017-02-24 15:23:19 | vstinner | set | messages:
+ msg288526 |
2017-02-23 15:55:50 | vstinner | set | title: Add _PyObject_FastCall() to reduce stack consumption -> Modify _PyObject_FastCall() to reduce stack consumption |
2017-02-10 13:15:39 | python-dev | set | messages:
+ msg287531 |
2017-02-09 02:00:23 | python-dev | set | messages:
+ msg287377 |
2017-02-09 01:03:06 | python-dev | set | nosy:
+ python-dev messages:
+ msg287373
|
2017-02-08 17:38:08 | vstinner | set | files:
+ pyobject_fastcall-6.patch
messages:
+ msg287360 |
2017-02-08 17:35:47 | vstinner | set | files:
+ pyobject_fastcall-5.patch
messages:
+ msg287358 |
2017-02-08 17:33:47 | vstinner | set | pull_requests:
+ pull_request25 |
2017-02-07 22:42:44 | vstinner | set | messages:
+ msg287261 |
2017-02-07 21:49:25 | vstinner | set | messages:
+ msg287257 |
2017-02-07 19:31:17 | serhiy.storchaka | set | messages:
+ msg287254 stage: patch review |
2017-02-07 15:05:00 | vstinner | set | messages:
+ msg287236 |
2017-02-07 14:46:05 | vstinner | set | files:
+ pyobject_fastcall-4.patch
messages:
+ msg287234 |
2017-02-07 10:16:13 | vstinner | set | messages:
+ msg287224 |
2017-02-07 08:17:32 | vstinner | set | messages:
+ msg287204 |
2017-02-07 01:12:06 | vstinner | set | files:
+ pyobject_fastcall-3.patch
messages:
+ msg287184 |
2017-02-07 01:05:56 | vstinner | set | files:
+ pyobject_fastcall-2.patch
messages:
+ msg287183 |
2017-02-06 17:11:05 | vstinner | set | files:
+ meth_fastcall_stacksize.patch
messages:
+ msg287154 |
2017-02-06 17:08:27 | vstinner | create | |