classification
Title: Modify _PyObject_FastCall() to reduce stack consumption
Type: resource usage Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: haypo, inada.naoki, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-02-06 17:08 by haypo, last changed 2017-02-24 15:23 by haypo.

Files
File name Uploaded Description Edit
pyobject_fastcall.patch haypo, 2017-02-06 17:08 review
meth_fastcall_stacksize.patch haypo, 2017-02-06 17:11 review
pyobject_fastcall-2.patch haypo, 2017-02-07 01:05 review
pyobject_fastcall-3.patch haypo, 2017-02-07 01:12 review
pyobject_fastcall-4.patch haypo, 2017-02-07 14:46 review
pyobject_fastcall-5.patch haypo, 2017-02-08 17:35 review
pyobject_fastcall-6.patch haypo, 2017-02-08 17:38 review
Pull Requests
URL Status Linked Edit
PR 74 haypo, 2017-02-08 17:33
Messages (17)
msg287153 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 17:08
While testing issue #29464 patch, I failed to see a major enhancement on the stack usage of fast calls without keyword arguments.

The problem is that functions like _PyObject_FastCallKeywords() and _PyObject_FastCallDict() still have to pass a NULL argument for kwargs/kwnames, and have code to handle keyword arguments.


Attached patch adds _PyObject_FastCall() to reduce the stack consumption. At the C level, keyword arguments are almost never used. For example, PyObject_CallFunctionObjArgs() is commonly used, whereas it only uses positional arguments.

The patch changes also _PyObject_FastCallKeywords() and _PyObject_FastCallDict() to move the "slow" path creating temporary tuple and dict in a subfunction which is not inlined. The slow path requires more stack memory.

Morecall, _PyObject_FastCallKeywords() and _PyObject_FastCallDict() are modified to call _PyObject_FastCall() if there is no keyword.

The patch might make function calls without keyword arguments faster, I didn't check.


Stack usage.

$ ./python -c 'import _testcapi, sys; sys.setrecursionlimit(10**5); n=1000; s=_testcapi.meth_fastcall_stacksize(n); print("%.1f B/call" % (s/n))'

* Reference: 832.8 B/call
* Patch: 656.6 B/call (-176.2 B)

I don't know why the stack usage is not an integer number of bytes?


Combined with the issue #29464 "Specialize FASTCALL for functions with positional-only parameters", the stack usage can be even more reduced by a few bytes.


See the issue #28870 for the previous work on reducing stack consumption.
msg287154 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 17:11
meth_fastcall_stacksize.patch: Patch adding 
_testcapi.meth_fastcall_stacksize() to measure the stack usage to call a METH_FASTCALL function.
msg287183 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 01:05
pyobject_fastcall-2.patch: More complete changes. Sorry, the patch also contains unrelated refactoring! It's a more advanced implementation which tries to reduce the depth of the C backtrace. For example, _PyObject_FastCall() is now inlined manually in _PyObject_FastCallDict(). PyObject_Call() is also rewritten. If the overall approach is validated, I will rewriten the patch differently to limit changes, or push some changes in multiple commits.

Results of testcapi_stacksize.patch + stack_overflow_28870-sp.py (from issue #28870).

Reference:

haypo@smithers$ ../default-ref/python stack_overflow_28870-sp.py 
test_python_call: 8586 calls before crash, stack: 976 bytes/call
test_python_getitem: 9188 calls before crash, stack: 912 bytes/call
test_python_iterator: 7936 calls before crash, stack: 1056 bytes/call

=> total: 25710 calls, 2944 bytes

Patch:

haypo@smithers$ ./python stack_overflow_28870-sp.py 
test_python_call: 9883 calls before crash, stack: 848 bytes/call (-128 B)
test_python_getitem: 10476 calls before crash, stack: 800 bytes/call (- 112 B)
test_python_iterator: 8878 calls before crash, stack: 944 bytes/call (- 112 B)

=> total: 29237 calls (+3616), 2592 bytes (- 352 B)
msg287184 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 01:12
Oh, I forgot to rebase my local git branch: patch version 2 contains unrelated changes. Please see instead the path version 3 which was rebased.
msg287204 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 08:17
Benchmarks result. Some are slower (code placement issue?), the faster benchmark are not really much faster.

haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-3_ref_e06af4027546.json -G  --min-speed=4
Slower (4):
- scimark_lu: 352 ms +- 11 ms -> 374 ms +- 15 ms: 1.06x slower (+6%)
- float: 238 ms +- 3 ms -> 251 ms +- 4 ms: 1.05x slower (+5%)
- logging_silent: 558 ns +- 15 ns -> 585 ns +- 16 ns: 1.05x slower (+5%)
- raytrace: 1.04 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.05x slower (+5%)

Faster (5):
- nbody: 233 ms +- 2 ms -> 216 ms +- 2 ms: 1.08x faster (-7%)
- pickle_dict: 56.6 us +- 4.3 us -> 52.9 us +- 2.2 us: 1.07x faster (-6%)
- nqueens: 217 ms +- 2 ms -> 205 ms +- 2 ms: 1.06x faster (-6%)
- unpickle_pure_python: 670 us +- 11 us -> 642 us +- 8 us: 1.04x faster (-4%)
- scimark_sor: 405 ms +- 11 ms -> 389 ms +- 10 ms: 1.04x faster (-4%)

Benchmark hidden because not significant (55): (...)
msg287224 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 10:16
Oh, pyobject_fastcall-3.patch contains a performance bug :-p PyObject_Call() "unpacks" the tuple and then recreates a new tuple to call functions in for functions other than Python and C functions.
msg287234 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 14:46
Ok, I fixed the PyObject_Call() bug, and tried a new approach to help the compiler for code placement: I moved all "call" functions into a new Objects/call.c file. It should help the compiler to inline more code, and I move functions which call themself closer: _PyObject_FastCall(), _PyCFunction_FastCall() and _PyFunction_FastCall() are now very close in the same file for example.

Not only the stack usage is better, but the performance seems also better: see benchmark results below.

Again, if you like the _PyObject_FastCall() idea (to reduce the stack consumption), I will split my big patch into smaller patches to have a "smoother" hg/git history.

Benchmark results on speed-python, CPython compiled with LTO+PGO:

haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-4_ref_e06af4027546.json -G  --min-speed=4
Faster (11):
- sympy_expand: 949 ms +- 12 ms -> 878 ms +- 7 ms: 1.08x faster (-8%)
- chameleon: 21.9 ms +- 0.2 ms -> 20.5 ms +- 0.3 ms: 1.07x faster (-7%)
- sympy_str: 425 ms +- 5 ms -> 399 ms +- 5 ms: 1.07x faster (-6%)
- sympy_integrate: 40.8 ms +- 0.3 ms -> 38.3 ms +- 0.4 ms: 1.06x faster (-6%)
- sympy_sum: 192 ms +- 6 ms -> 181 ms +- 7 ms: 1.06x faster (-6%)
- scimark_lu: 352 ms +- 11 ms -> 332 ms +- 13 ms: 1.06x faster (-6%)
- xml_etree_generate: 208 ms +- 2 ms -> 197 ms +- 3 ms: 1.05x faster (-5%)
- nbody: 233 ms +- 2 ms -> 222 ms +- 3 ms: 1.05x faster (-5%)
- telco: 14.7 ms +- 0.3 ms -> 14.0 ms +- 0.3 ms: 1.05x faster (-5%)
- scimark_fft: 662 ms +- 10 ms -> 633 ms +- 8 ms: 1.05x faster (-4%)
- genshi_text: 71.0 ms +- 0.8 ms -> 68.3 ms +- 0.7 ms: 1.04x faster (-4%)

Benchmark hidden because not significant (53): (...)
msg287236 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 15:05
Oh, with pyobject_fastcall-4.patch, the stack usage of test_python_iterator is even better. The reference is 1056 bytes/call,  pyobject_fastcall-2.patch  was 944 bytes/call.

haypo@smithers$ ./python ../default/stack_overflow_28870-sp.py 
test_python_call: 9883 calls before crash, stack: 848 bytes/call
test_python_getitem: 10476 calls before crash, stack: 800 bytes/call
test_python_iterator: 9189 calls before crash, stack: 912 bytes/call

=> total: 29548 calls, 2560 bytes
msg287254 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-07 19:31
Awesome! This looks fantastic! I need to check the patch very carefully to be sure that we haven't missed something important.

Isn't the Python directory more appropriate place for call.c?
msg287257 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 21:49
Serhiy Storchaka added the comment:
> Isn't the Python directory more appropriate place for call.c?

I moved code from other .c files in Objects/, so for me it seems more
natural to add the code in Objects/ as well. It seems like most of the
code in Python/ is not "aware" of types defined in Objects/. But I
don't have a strong opinion on the right directory.

Objects/call.c is 1500 lines long. IMHO the code became big enough to
justify to move it to a new file ;-)
msg287261 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-07 22:42
pyobject_fastcall-4.patch combines a lot of changes. I wrote it to experiment _PyObject_FastCall().

I will rewrite these changes as a patch serie with smaller changes.

The design of PyObject_FastCall*() is to be short and simple.

Changes:

* Add _PyObject_FastCall(), _PyFunction_FastCall(), _PyCFunction_FastCall(): similar to the _PyObject_FastCallKeywords() but without keyword arguments. Without keyword arguments, the checks on keyword arguments can be removed, and it allows to reduce the stack usage: one less parameter to C functions.

* Move the slow path out of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in a new subfunction which is tagged with _Py_NO_INLINE to reduce the stack consumption of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in their fast paths, which are the most code paths.

* Move all "call" functions into a new Objects/call.c function. This change should help code placement to enhance the usage of the CPU caches (especially the CPU L1 instruction cache). In my long experience with benchmarking last year, I notice huge performance differences caused by code placement. See my blog post:
https://haypo.github.io/analysis-python-performance-issue.html
Sadly, _Py_HOT_FUNCTION didn't fix the issue completely.

* _PyObject_FastCallDict() and _PyObject_FastCallKeywords() "call" _PyObject_FastCall(). In fact, _PyObject_FastCall() is inlined manually in these functions, against, to minimize the stack usage. Similar change in _PyCFunction_FastCallDict() and PyCFunction_Call().

* Since the slow path is moved to a subfunction, I removed _Py_NO_INLINE from _PyStack_AsTuple() to allow the compiler to inline it if it wants to ;-) Since the stack usage is better with the patch, it seems like this strange has no negative impact on the stack usage.

* Optimize PyMethodDescr_Type (call _PyMethodDescr_FastCallKeywords()) in _PyObject_FastCallKeywords()

* I moved Py_EnterRecursiveCall() closer to the final function call, to simplify error handling. It would be nice to fix the issue #29306 before :-/

* I had to copy/paste the null_error() function from Objects/abstract.c. I don't think that it's worth it to add a _PyErr_NullError() function shared by abstract.c and call.c. Compilers are even able to merge duplicated functions ;-)

* A few other changes.
msg287358 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-08 17:35
I splitted my big patch into smaller changes, see my pull request:
https://github.com/python/cpython/pull/74

If you prefer Rietveld ([Review] button), I also attached the squashed change: pyobject_fastcall-5.patch.
msg287360 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-08 17:38
Crap,  pyobject_fastcall-5.patch was not rebased correctly :-/ Please see pyobject_fastcall-6.patch instead.
msg287373 - (view) Author: Roundup Robot (python-dev) Date: 2017-02-09 01:03
New changeset 31342913fb1e by Victor Stinner in branch 'default':
Fix PyCFunction_Call() performance issue
https://hg.python.org/cpython/rev/31342913fb1e
msg287377 - (view) Author: Roundup Robot (python-dev) Date: 2017-02-09 02:00
New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master':
Fix PyCFunction_Call() performance issue
https://github.com/python/cpython/commit/e9807e307f1f561a2dfe3e9f97a38444528dba86
msg287531 - (view) Author: Roundup Robot (python-dev) Date: 2017-02-10 13:15
New changeset f23fa1f7b68f by Victor Stinner in branch 'default':
Issue #29465: Add Objects/call.c file
https://hg.python.org/cpython/rev/f23fa1f7b68f
msg288526 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-24 15:23
Hum, code to call functions changed a lot recently, and other issues already enhanced the stack usage.

I should measure the stack usage to check if it's still worth it.
History
Date User Action Args
2017-02-24 15:23:19hayposetmessages: + msg288526
2017-02-23 15:55:50hayposettitle: Add _PyObject_FastCall() to reduce stack consumption -> Modify _PyObject_FastCall() to reduce stack consumption
2017-02-10 13:15:39python-devsetmessages: + msg287531
2017-02-09 02:00:23python-devsetmessages: + msg287377
2017-02-09 01:03:06python-devsetnosy: + python-dev
messages: + msg287373
2017-02-08 17:38:08hayposetfiles: + pyobject_fastcall-6.patch

messages: + msg287360
2017-02-08 17:35:47hayposetfiles: + pyobject_fastcall-5.patch

messages: + msg287358
2017-02-08 17:33:47hayposetpull_requests: + pull_request25
2017-02-07 22:42:44hayposetmessages: + msg287261
2017-02-07 21:49:25hayposetmessages: + msg287257
2017-02-07 19:31:17serhiy.storchakasetmessages: + msg287254
stage: patch review
2017-02-07 15:05:00hayposetmessages: + msg287236
2017-02-07 14:46:05hayposetfiles: + pyobject_fastcall-4.patch

messages: + msg287234
2017-02-07 10:16:13hayposetmessages: + msg287224
2017-02-07 08:17:32hayposetmessages: + msg287204
2017-02-07 01:12:06hayposetfiles: + pyobject_fastcall-3.patch

messages: + msg287184
2017-02-07 01:05:56hayposetfiles: + pyobject_fastcall-2.patch

messages: + msg287183
2017-02-06 17:11:05hayposetfiles: + meth_fastcall_stacksize.patch

messages: + msg287154
2017-02-06 17:08:27haypocreate