classification
Title: Reduce C stack consumption in function calls
Type: performance
Stage:
Components: Interpreter Core
Versions: Python 3.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: haypo, python-dev, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2017-01-10 17:50 by haypo, last changed 2017-02-01 17:10 by haypo. This issue is now closed.

Files
File name Uploaded Description
less_stack.patch haypo, 2017-01-10 17:50
bench_recursion.py serhiy.storchaka, 2017-01-10 19:25
Messages (7)
msg285135 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-10 17:50
The attached patch reduces C stack consumption in function calls. It is a follow-up to issue #28870.


Reference (rev a30cdf366c02):

test_python_call: 7175 calls before crash, stack: 1168 bytes/call
test_python_getitem: 6235 calls before crash, stack: 1344 bytes/call
test_python_iterator: 5344 calls before crash, stack: 1568 bytes/call

=> total: 18754 calls, 4080 bytes


With "Inline call_function() in ceval.c":

test_python_call: 7936 calls before crash, stack: 1056 bytes/call
test_python_getitem: 6387 calls before crash, stack: 1312 bytes/call
test_python_iterator: 5755 calls before crash, stack: 1456 bytes/call

=> total: 20078 calls, 3824 bytes


With inline and "_PY_FASTCALL_SMALL_STACK: 5 arg (40 B) => 3 arg (24 B)":

test_python_call: 8058 calls before crash, stack: 1040 bytes/call
test_python_getitem: 6630 calls before crash, stack: 1264 bytes/call
test_python_iterator: 5952 calls before crash, stack: 1408 bytes/call

=> total: 20640 calls, 3712 bytes


I applied testcapi_stack_pointer.patch and ran stack_overflow_28870-sp.py from issue #28870 to produce these statistics.

With the patch, Python 3.7 is still not as good as Python 3.5 (msg285109), but it's a first improvement.
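
The relative gains implied by the totals above can be checked with a few lines of Python. This is only an illustrative sketch: the figures are copied from the three result sets in this message, while the `RESULTS` layout and the `summarize` helper are names invented for the example and are not part of the patch or of stack_overflow_28870-sp.py.

```python
# Figures copied from the three result sets above:
# (calls before crash, stack bytes per call) for each of the three tests.
RESULTS = {
    "reference": [(7175, 1168), (6235, 1344), (5344, 1568)],
    "inline": [(7936, 1056), (6387, 1312), (5755, 1456)],
    "inline + small stack 3": [(8058, 1040), (6630, 1264), (5952, 1408)],
}


def summarize(rows):
    """Return (total calls, total bytes per call) for one result set."""
    total_calls = sum(calls for calls, _ in rows)
    total_bytes = sum(size for _, size in rows)
    return total_calls, total_bytes


ref_calls, ref_bytes = summarize(RESULTS["reference"])
for name, rows in RESULTS.items():
    calls, size = summarize(rows)
    calls_gain = (calls - ref_calls) / ref_calls * 100
    bytes_saved = (ref_bytes - size) / ref_bytes * 100
    print(f"{name}: {calls} calls ({calls_gain:+.1f}%), "
          f"{size} bytes ({bytes_saved:.1f}% less stack)")
```

Under these numbers, the full patch uses about 9% less stack in total and allows about 10% more nested calls than the reference revision.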
msg285147 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-10 19:25
$ ./python -m perf timeit -s "from bench_recursion import test_python_call as test" -- "test(1000)"
Python 2.7:  5.10 ms +- 0.37 ms
Python 3.4:  4.38 ms +- 0.28 ms
Python 3.5:  4.19 ms +- 0.26 ms
Python 3.6:  3.93 ms +- 0.32 ms
Python 3.7:  3.26 ms +- 0.27 ms

$ ./python -m perf timeit -s "from bench_recursion import test_python_getitem as test" -- "test(1000)"
Python 2.7:  4.09 ms +- 0.26 ms
Python 3.4:  4.60 ms +- 0.23 ms
Python 3.5:  4.35 ms +- 0.28 ms
Python 3.6:  4.05 ms +- 0.34 ms
Python 3.7:  3.23 ms +- 0.23 ms

$ ./python -m perf timeit -s "from bench_recursion import test_python_iterator as test" -- "test(1000)"
Python 2.7:  7.85 ms +- 0.66 ms
Python 3.4:  9.31 ms +- 0.55 ms
Python 3.5:  9.83 ms +- 0.71 ms
Python 3.6:  8.99 ms +- 0.66 ms
Python 3.7:  8.58 ms +- 0.73 ms
msg285160 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-10 21:34
Oh wow! I'm impressed that Python 3 is better at each release! On 2 tests, Python 3.7 is faster than Python 2.7, but on test_python_iterator Python 3.7 is still slower. It seems like this specific test became much slower (+19%) on Python 3.4 compared to 2.7.

I guess that your benchmark is on unpatched Python.

I didn't expect less_stack.patch to have an impact on performance, but I measured it anyway because I'm curious. It seems like it's a little bit faster. At least, it's not slower ;-)

test_python_call: Median +- std dev: [ref] 509 us +- 11 us -> [patch] 453 us +- 49 us: 1.12x faster (-11%)
test_python_getitem: Median +- std dev: [ref] 485 us +- 13 us -> [patch] 470 us +- 23 us: 1.03x faster (-3%)
test_python_iterator: Median +- std dev: [ref] 1.15 ms +- 0.05 ms -> [patch] 1.12 ms +- 0.07 ms: 1.03x faster (-3%)
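
The "x faster" and percentage figures above follow from the medians. A minimal sketch of that arithmetic, assuming the rounded medians quoted in this message (perf itself works from unrounded samples, so the last digit could differ); the `compare` helper is a name invented for the example:

```python
def compare(ref, patched):
    """Return (speedup factor, percent change) for two timings in the same unit."""
    speedup = ref / patched
    change = (patched - ref) / ref * 100
    return speedup, change


# Medians in microseconds, copied from the lines above: (ref, patch).
TIMINGS_US = {
    "test_python_call": (509, 453),
    "test_python_getitem": (485, 470),
    "test_python_iterator": (1150, 1120),
}

for name, (ref, patched) in TIMINGS_US.items():
    speedup, change = compare(ref, patched)
    print(f"{name}: {speedup:.2f}x faster ({change:+.0f}%)")
```

The same helper reproduces the "+19%" mentioned earlier for test_python_iterator: `compare(7.85, 9.31)` gives a change of roughly +19% between Python 2.7 and 3.4.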
msg285163 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-10 23:00
I didn't provide results with less_stack.patch because they were almost the same, just 1-3% faster. That might be just random noise or a compiler artifact, but it may also be an effect of inlining call_function().

Could you run the full Python benchmarks? Decreasing the size of the small stack doesn't affect performance in these cases, but it may affect the performance of calls with a larger number of arguments. AFAIK the size of some small stacks had already been decreased from 8 to 5.
msg285164 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-10 23:03
I plan to run a benchmark when all my patches to reduce the stack consumption will be ready. I'm still trying all the various options to reduce the stack consumption. I'm trying to avoid hacks and reduce the number of changes. I'm already better than Python 2.7 and 3.5 on my local branch.
msg285171 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-11 00:28
New changeset 8481c379e2da by Victor Stinner in branch 'default':
Inline call_function()
https://hg.python.org/cpython/rev/8481c379e2da
msg286658 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-01 17:10
Victor: "I plan to run a benchmark when all my patches to reduce the stack consumption will be ready."

msg285200 of issue #28870: "I also ran the reliable performance benchmark suite with LTO+PGO. There is no significant performance change on these benchmarks (...)"

less_stack.patch:

-#define _PY_FASTCALL_SMALL_STACK 5
+#define _PY_FASTCALL_SMALL_STACK 3

With issue #28870, reducing the value of _PY_FASTCALL_SMALL_STACK is no longer needed. A larger _PY_FASTCALL_SMALL_STACK means better performance, so I prefer to keep the value 5 (arguments).

The main change, inlining call_function(), was merged, so I'm closing the issue.
History
Date User Action Args
2017-02-01 17:10:46 haypo set status: open -> closed; resolution: fixed; messages: + msg286658
2017-01-11 00:28:06 python-dev set nosy: + python-dev; messages: + msg285171
2017-01-10 23:03:47 haypo set messages: + msg285164
2017-01-10 23:00:51 serhiy.storchaka set messages: + msg285163
2017-01-10 21:34:55 haypo set messages: + msg285160
2017-01-10 19:25:41 serhiy.storchaka set files: + bench_recursion.py; messages: + msg285147
2017-01-10 17:50:46 haypo create