classification
Title: Reduce stack consumption of PyObject_CallFunctionObjArgs() and like
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: python-dev, serhiy.storchaka, vstinner, xiang.zhang
Priority: normal Keywords: patch

Created on 2016-12-04 22:50 by serhiy.storchaka, last changed 2017-02-06 15:00 by vstinner. This issue is now closed.

Files
File name Uploaded
PyObject_CallFunctionObjArgs.patch serhiy.storchaka, 2016-12-04 22:50
less_stack.patch vstinner, 2016-12-15 13:30
alloca.patch vstinner, 2016-12-15 13:45
subfunc.patch vstinner, 2016-12-15 13:49
testcapi_stacksize.patch vstinner, 2017-01-03 01:31
no_small_stack.patch vstinner, 2017-01-03 01:40
stack_overflow_28870.py vstinner, 2017-01-09 17:10
testcapi_stack_pointer.patch vstinner, 2017-01-10 11:40
stack_overflow_28870-sp.py vstinner, 2017-01-10 11:45
no_small_stack-2.patch vstinner, 2017-01-10 11:55
bench_recursion-2.py vstinner, 2017-01-11 00:50
bench_recursion-2.py vstinner, 2017-01-11 00:50
Messages (35)
msg282374 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-04 22:50
I wrote the following patch in an attempt to decrease the stack consumption of PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs() and _PyObject_CallMethodIdObjArgs(), but it doesn't affect the stack consumption. I haven't measured its effect on performance yet. It does seem to make the code a little cleaner.
msg282379 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-04 23:30
What do you think of using alloca() instead of a "PyObject *small_stack[5];" which has a fixed size?

Note: About your patch, try to avoid _PyObject_CallArg1() if you care about the usage of the C stack, see issue #28858.
msg282380 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-04 23:38
> But it doesn't affect a stack consumption.

How do you check the stack consumption of PyObject_CallFunctionObjArgs()?
msg282384 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-05 05:52
> What do you think of using alloca() instead of an "PyObject *small_stack[5];" which has a fixed size?

alloca() is not in POSIX.1. I'm afraid it would make CPython less portable.

> Note: About your patch, try to avoid _PyObject_CallArg1() if you care of the usage of the C stack, see issue #28858.

I don't understand how I can avoid it.

> How do you check the stack consumption of PyObject_CallFunctionObjArgs()?

Using a script from issue28858.
msg283287 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-12-15 11:55
New changeset 71876e4abce4 by Victor Stinner in branch 'default':
Add _PY_FASTCALL_SMALL_STACK constant
https://hg.python.org/cpython/rev/71876e4abce4
msg283297 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-15 13:30
I reworked abstract.c to prepare work for this issue:

* change 455169e87bb3: Add _PyObject_CallFunctionVa() helper
* change 6e748eb79038: Add _PyObject_VaCallFunctionObjArgs() private function
* change 71876e4abce4: Add _PY_FASTCALL_SMALL_STACK constant


I wrote a _testcapi function to measure the C stack consumption. I was surprised by the results: calling PyObject_CallFunctionObjArgs(func, arg1, arg2, NULL) consumes 560 bytes! I measured on a Python compiled in release mode.

The attached less_stack.patch rewrites _PyObject_VaCallFunctionObjArgs(); it reduces the stack consumption from 560 bytes to 384 bytes (-176 bytes!).

Changes:

* Remove "va_list countva" variable: the va_list variable itself, va_copy(), etc. consume stack memory. First I tried to move code to a subfunction, it helps. With my patch, it's even simpler.

* Reduce _PY_FASTCALL_SMALL_STACK from 5 to 3. Stack usage is not directly _PY_FASTCALL_SMALL_STACK*sizeof(PyObject*), it's much more, probably because of complex memory alignment rules.

* Use Py_LOCAL_INLINE(). It seems like depending on the size of the object_vacall() function body, the function is inlined or not. If it's not inlined, the stack usage increases from 384 bytes to 544 bytes!? Use Py_LOCAL_INLINE() to force inlining.


Effect of _PY_FASTCALL_SMALL_STACK:

* 1: 368 bytes
* 2: 384 bytes
* 3: 384 bytes -- value chosen in my patch
* 4: 400 bytes
* 5: 416 bytes
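
The list above can be checked against the naive estimate of one pointer per extra slot. This is only a sketch built from the figures in this message; the 8-byte pointer size and the 16-byte stack alignment are assumptions about the x86-64 build, not something measured here.

```python
# Measured stack usage (bytes/call) for each _PY_FASTCALL_SMALL_STACK value,
# copied from the list above.
measured = {1: 368, 2: 384, 3: 384, 4: 400, 5: 416}

POINTER_SIZE = 8   # assumption: sizeof(PyObject*) on a 64-bit build
ALIGNMENT = 16     # assumption: frames are rounded to 16 bytes


def naive_growth(slots: int) -> int:
    """Bytes the buffer alone would add compared with a 1-slot buffer."""
    return (slots - 1) * POINTER_SIZE


for slots, usage in measured.items():
    observed = usage - measured[1]
    aligned = usage % ALIGNMENT == 0
    print(f"{slots} slots: naive +{naive_growth(slots)}, "
          f"observed +{observed}, multiple of {ALIGNMENT}: {aligned}")
```

Every measured value is a multiple of 16, and the observed growth moves in 16-byte steps rather than 8-byte steps, which fits the alignment explanation without proving it.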
msg283298 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-15 13:33
I don't propose to add _testcapi.pyobjectl_callfunctionobjargs_stacksize(). It's just to test the patch. I'm using it with:

$ ./python -c 'import _testcapi; n=100; print(_testcapi.pyobjectl_callfunctionobjargs_stacksize(n) / (n+1))'
384.0

The value of n has no impact on the stack, it gives the same value with n=0.
msg283302 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-15 13:45
I also tried to use alloca(): see attached alloca.patch. But the result is quite bad: 528 bytes of stack memory per call. I only attach the patch to discuss the issue, but I now dislike the option: the result is bad, it's less portable and more dangerous.
msg283303 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-15 13:49
I also tried Serhiy's approach, split the function into subfunctions, but the result is not as good as expected: 496 bytes. See attached subfunc.patch.
msg283309 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-15 14:26
For comparison, Python 3.5 (before fast calls) uses 448 bytes of C stack per call. Python 3.5 uses a tuple allocated in the heap memory.
msg283336 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-15 16:28
I have tested all three patches with the stack_overflow.py script. The only tests affected are the recursive Python implementations of __call__, __getitem__ and __iter__.

                        unpatched   less_stack  alloca      subfunc

test_python_call        9696        9876        9880        9876
test_python_getitem     9884        10264       9880        10688
test_python_iterator    7812        8052        8312        8872
msg284524 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-03 01:31
testcapi_stacksize.patch: add _testcapi.pyobjectl_callfunctionobjargs_stacksize(), function used to measure the stack consumption.
msg284527 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-03 01:40
no_small_stack.patch: And now something completely different, a patch to remove the "small stack" allocated on the C stack and always use the heap memory. FYI I created no_small_stack.patch from less_stack.patch.

As expected, the stack usage is lower:

* less_stack.patch: 384 bytes/call
* no_small_stack.patch: 368 bytes/call

I didn't check the performance of no_small_stack.patch yet.
msg284528 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-03 01:42
In Python 3.5, PyObject_CallFunctionObjArgs() calls objargs_mktuple() which uses Py_VA_COPY(countva, va) and creates a tuple. The tuple constructor uses a free list to reduce the cost of heap memory allocations.
msg285055 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-09 17:10
I modified Serhiy's stack_overflow.py of #28858:
* re-run each test 10 times and show the maximum depth
* only test: ['test_python_call', 'test_python_getitem', 'test_python_iterator']

Maximum number of Python calls before a crash.

(*) Reference (unpatched): 560 bytes/call

test_python_call 7172
test_python_getitem 6232
test_python_iterator 5344
=> total: 18 748

(1) no_small_stack.patch: 368 bytes/call

test_python_call 7172 (=)
test_python_getitem 6544 (+312)
test_python_iterator 5572 (+228)
=> total: 19 288

(2) less_stack.patch: 384 bytes/call

test_python_call 7272 (+100)
test_python_getitem 6384 (+152)
test_python_iterator 5456 (+112)
=> total: 19 112

(3) subfunc.patch: 496 bytes

test_python_call 7272 (+100)
test_python_getitem 6712 (+480)
test_python_iterator 6020 (+676)
=> total: 20 004

(4) alloca.patch: 528 bytes/call

test_python_call 7272 (+100)
test_python_getitem 6464 (+232)
test_python_iterator 5752 (+408)
=> total: 19 488

Patches sorted by bytes/call, from best to worst: no_small_stack.patch (368) > less_stack.patch (384) > subfunc.patch (496) > alloca.patch (528) > reference (560).

Patches sorted by number of calls before crash: subfunc.patch (20 004) > alloca.patch (19 488) > no_small_stack.patch (19 288) > less_stack.patch (19 112) > reference (18 748).

I expected a correlation between the bytes/call value measured by testcapi_stacksize.patch and the number of calls before a crash, but I fail to see an obvious correlation :-/

Maybe the compiler is smarter than what I would expect and emits efficient code to be able to use less stack memory?

Maybe the Linux kernel does weird things which makes the behaviour on stack-overflow non-obvious :-)

At least, I would expect that no_small_stack.patch would be the clear winner, since it has the smallest usage of C stack.
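
One way to put a number on the missing correlation is a rank correlation between bytes/call and total calls before crash. The sketch below is not part of any attached script: it recomputes each total from the per-test counts listed above and uses a hand-written Spearman formula that assumes there are no ties, which holds for this data.

```python
# Calls before crash (call, getitem, iterator), copied from the results above.
calls = {
    "reference":            (7172, 6232, 5344),
    "no_small_stack.patch": (7172, 6544, 5572),
    "less_stack.patch":     (7272, 6384, 5456),
    "subfunc.patch":        (7272, 6712, 6020),
    "alloca.patch":         (7272, 6464, 5752),
}
# Bytes/call reported for each variant.
bytes_per_call = {
    "reference": 560,
    "no_small_stack.patch": 368,
    "less_stack.patch": 384,
    "subfunc.patch": 496,
    "alloca.patch": 528,
}


def ranks(values):
    """Rank each value from 1 (smallest); assumes all values are distinct."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman rank correlation for two equally long lists without ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


names = list(calls)
totals = [sum(calls[name]) for name in names]
sizes = [bytes_per_call[name] for name in names]
rho = spearman(sizes, totals)
print(f"Spearman rank correlation: {rho:+.2f}")  # prints -0.20
```

A value close to -1 would mean that fewer bytes per call reliably gives more calls before a crash. The result of about -0.2 is weak, which matches the impression that the two measurements are not obviously related.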
msg285057 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-09 17:28
Impact of the _PY_FASTCALL_SMALL_STACK constant:

* _PY_FASTCALL_SMALL_STACK=1: 528 bytes/call

test_python_call 7376
test_python_getitem 6544
test_python_iterator 5572
=> total: 19 492

* _PY_FASTCALL_SMALL_STACK=3: 528 bytes/call

test_python_call 7272
test_python_getitem 6464
test_python_iterator 5512
=> total: 19 248

* _PY_FASTCALL_SMALL_STACK=5 (current value): 560 bytes/call

test_python_call 7172
test_python_getitem 6232
test_python_iterator 5344
=> total: 18 748

* _PY_FASTCALL_SMALL_STACK=10: 592 bytes/call

test_python_call 6984
test_python_getitem 5952
test_python_iterator 5132
=> total: 18 068

Increasing _PY_FASTCALL_SMALL_STACK has a clear effect on the total. Total decreases when _PY_FASTCALL_SMALL_STACK increases.


---

no_small_stack.patch with _PY_FASTCALL_SMALL_STACK=3: 368 bytes/call

test_python_call 7272
test_python_getitem 6628
test_python_iterator 5632
=> total: 19 532
msg285060 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-09 17:45
I'm not sure that the result of pyobjectl_callfunctionobjargs_stacksize() has a direct relation to the stack consumption in test_python_call, test_python_getitem and test_python_iterator. Try to measure the stack consumption in these cases. This can be done with a _testcapi helper that just returns the value of the stack pointer. Run all three tests with a fixed level of recursion and measure the difference between the stack pointers.

It would also be nice to measure the performance effect of the patches.
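
The stack-pointer measurement suggested here boils down to a subtraction and a division. A minimal sketch, assuming two stack pointer readings taken at a known recursion distance; the function name and the example addresses are hypothetical, and no _testcapi helper is called.

```python
def bytes_per_call(sp_shallow: int, sp_deep: int, depth: int) -> float:
    """Average C stack bytes used per recursion level.

    sp_shallow and sp_deep are stack pointer values read before and after
    `depth` nested calls. abs() is used so the result does not depend on
    whether the stack grows down (the usual case) or up.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return abs(sp_shallow - sp_deep) / depth


# Hypothetical readings: 100 nested calls that each use 1168 bytes.
base = 0x7FFD00000000
print(bytes_per_call(base, base - 1168 * 100, 100))  # 1168.0
```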
msg285105 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 11:40
testcapi_stack_pointer.patch: add _testcapi.stack_pointer() function.
msg285106 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 11:45
stack_overflow_28870-sp.py: script using testcapi_stack_pointer.patch to compute the usage of the C stack. Results of this script:

(*) Reference

test_python_call: 7175 calls before crash, stack: 1168 bytes/call
test_python_getitem: 6235 calls before crash, stack: 1344 bytes/call
test_python_iterator: 5344 calls before crash, stack: 1568 bytes/call

=> total: 18754 calls, 4080 bytes

(1) no_small_stack.patch

test_python_call: 7175 calls before crash, stack: 1168 bytes/call
test_python_getitem: 6547 calls before crash, stack: 1280 bytes/call
test_python_iterator: 5572 calls before crash, stack: 1504 bytes/call

=> total: 19294 calls, 3952 bytes

test_python_call is clearly not impacted by no_small_stack.patch.

test_python_call loops on method_call():

method_call()
=> _PyObject_Call_Prepend()
=> _PyObject_FastCallDict()
=> _PyFunction_FastCallDict()
=> _PyEval_EvalCodeWithName()
=> PyEval_EvalFrameEx()
=> _PyEval_EvalFrameDefault()
=> call_function()
=> _PyObject_FastCallKeywords()
=> slot_tp_call()
=> PyObject_Call()
=> method_call()
=> (...)

_PyObject_Call_Prepend() is in the middle of the chain. This function uses a "small stack" of _PY_FASTCALL_SMALL_STACK "PyObject*" items. We can clearly see the impact of modifying _PY_FASTCALL_SMALL_STACK on the maximum number of test_python_call calls before crash in msg285057.
msg285107 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 11:55
no_small_stack-2.patch: Remove all "small_stack" buffers.

Reference

test_python_call: 7175 calls before crash, stack: 1168 bytes/call
test_python_getitem: 6235 calls before crash, stack: 1344 bytes/call
test_python_iterator: 5344 calls before crash, stack: 1568 bytes/call

=> total: 18754 calls, 4080 bytes

no_small_stack.patch

test_python_call: 7482 calls (+307) before crash, stack: 1120 bytes/call (-48)
test_python_getitem: 6715 calls (+480) before crash, stack: 1248 bytes/call (-96)
test_python_iterator: 5693 calls (+349) before crash, stack: 1472 bytes/call (-96)

=> total: 19890 calls (+1136), 3840 bytes (-240)

The total gain is the removal of 5 small buffers of 48 bytes: 240 bytes.
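
The buffer count can be recovered from the per-test savings. This is a sketch using only the numbers in this message; reading 48 bytes as 5 pointers of 8 bytes rounded up to a 16-byte boundary is an assumption, not something the measurements show directly.

```python
# Stack usage in bytes/call, copied from the two result blocks above.
reference = {
    "test_python_call": 1168,
    "test_python_getitem": 1344,
    "test_python_iterator": 1568,
}
patched = {
    "test_python_call": 1120,
    "test_python_getitem": 1248,
    "test_python_iterator": 1472,
}

BUFFER_BYTES = 48  # size of one removed buffer, as stated above

# Bytes saved per test, and how many 48-byte buffers that corresponds to.
saved = {name: reference[name] - patched[name] for name in reference}
buffers = {name: amount // BUFFER_BYTES for name, amount in saved.items()}

for name in reference:
    print(f"{name}: -{saved[name]} bytes = {buffers[name]} buffer(s)")
print(f"total: -{sum(saved.values())} bytes, {sum(buffers.values())} buffers")
```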
msg285108 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 12:08
> no_small_stack.patch:

Oops, you should read no_small_stack-2.patch in my previous message ;-)
msg285109 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 12:15
Python 3.5 (revision 8125d9a8152b), before all fastcall changes:

test_python_call: 8314 calls before crash, stack: 1008 bytes/call
test_python_getitem: 7483 calls before crash, stack: 1120 bytes/call
test_python_iterator: 6802 calls before crash, stack: 1232 bytes/call

=> total: 22599 calls, 3360 bytes
msg285110 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-10 12:23
What are the results with 3.4? There were several issues about stack overflow in 3.5 (issue25222, issue28179, issue28913).
msg285113 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 14:09
Python 3.4 (rev 6340c9fcc111):

test_python_call: 9700 calls before crash, stack: 864 bytes/call
test_python_getitem: 8314 calls before crash, stack: 1008 bytes/call
test_python_iterator: 7818 calls before crash, stack: 1072 bytes/call

=> total: 25832 calls, 2944 bytes

Python 2.7 (rev 0d4e0a736688):

test_python_call: 6162 calls before crash, stack: 1360 bytes/call
test_python_getitem: 5952 calls before crash, stack: 1408 bytes/call
test_python_iterator: 5885 calls before crash, stack: 1424 bytes/call

=> total: 17999 calls, 4192 bytes

Nice. At least, Python 3.7 is better than Python 2.7 (4080 bytes < 4192 bytes) :-) Python 3.4 stack usage was very low, and lower than Python 3.5.
msg285122 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 15:03
no_small_stack-2.patch has a very bad impact on performance:

haypo@speed-python$ python3 -m perf compare_to 2017-01-04_12-02-default-ee1390c9b585.json no_small_stack-2_refee1390c9b585.json -G --min-speed=5

Slower (59):
- telco: 15.7 ms +- 0.5 ms -> 23.4 ms +- 0.3 ms: 1.49x slower (+49%)
- scimark_sor: 393 ms +- 6 ms -> 579 ms +- 10 ms: 1.47x slower (+47%)
- json_loads: 56.9 us +- 0.9 us -> 83.1 us +- 2.4 us: 1.46x slower (+46%)
- unpickle_pure_python: 698 us +- 10 us -> 984 us +- 10 us: 1.41x slower (+41%)
- scimark_lu: 424 ms +- 22 ms -> 585 ms +- 33 ms: 1.38x slower (+38%)
- chameleon: 22.4 ms +- 0.2 ms -> 30.8 ms +- 0.3 ms: 1.38x slower (+38%)
- xml_etree_generate: 212 ms +- 3 ms -> 291 ms +- 4 ms: 1.37x slower (+37%)
- xml_etree_process: 177 ms +- 3 ms -> 240 ms +- 3 ms: 1.35x slower (+35%)
- raytrace: 1.04 sec +- 0.01 sec -> 1.40 sec +- 0.02 sec: 1.35x slower (+35%)
- logging_simple: 27.9 us +- 0.4 us -> 37.4 us +- 0.5 us: 1.34x slower (+34%)
- pickle_pure_python: 1.02 ms +- 0.01 ms -> 1.37 ms +- 0.02 ms: 1.34x slower (+34%)
- logging_format: 33.3 us +- 0.4 us -> 44.5 us +- 0.7 us: 1.34x slower (+34%)
- xml_etree_iterparse: 195 ms +- 5 ms -> 259 ms +- 7 ms: 1.32x slower (+32%)
- chaos: 236 ms +- 3 ms -> 306 ms +- 3 ms: 1.30x slower (+30%)
- regex_compile: 380 ms +- 3 ms -> 494 ms +- 5 ms: 1.30x slower (+30%)
- pathlib: 42.3 ms +- 0.5 ms -> 55.0 ms +- 0.6 ms: 1.30x slower (+30%)
- django_template: 364 ms +- 5 ms -> 471 ms +- 4 ms: 1.29x slower (+29%)
- call_method: 11.2 ms +- 0.2 ms -> 14.4 ms +- 0.2 ms: 1.29x slower (+29%)
- hexiom: 18.4 ms +- 0.2 ms -> 23.7 ms +- 0.2 ms: 1.29x slower (+29%)
- call_method_slots: 11.0 ms +- 0.3 ms -> 14.1 ms +- 0.1 ms: 1.28x slower (+28%)
- richards: 147 ms +- 4 ms -> 188 ms +- 5 ms: 1.28x slower (+28%)
- html5lib: 207 ms +- 7 ms -> 262 ms +- 6 ms: 1.27x slower (+27%)
- genshi_text: 71.5 ms +- 1.3 ms -> 90.3 ms +- 1.1 ms: 1.26x slower (+26%)
- deltablue: 14.2 ms +- 0.2 ms -> 17.9 ms +- 0.4 ms: 1.26x slower (+26%)
- genshi_xml: 164 ms +- 2 ms -> 207 ms +- 3 ms: 1.26x slower (+26%)
- sympy_str: 429 ms +- 5 ms -> 539 ms +- 4 ms: 1.25x slower (+25%)
- go: 493 ms +- 5 ms -> 619 ms +- 7 ms: 1.25x slower (+25%)
- mako: 35.4 ms +- 1.5 ms -> 44.2 ms +- 1.2 ms: 1.25x slower (+25%)
- sympy_expand: 959 ms +- 10 ms -> 1.19 sec +- 0.01 sec: 1.24x slower (+24%)
- nqueens: 215 ms +- 2 ms -> 268 ms +- 1 ms: 1.24x slower (+24%)
(...)

Benchmark ran on speed-python with PGO+LTO, Linux configured for benchmarks using python3 -m perf system tune.
msg285123 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-10 15:06
Thus Python 3.6 stack usage is about 20% larger than Python 3.5 and about 40% larger than Python 3.4. This is significant. :-(

no_small_stack-2.patch decreases it only by 6% (with possible performance loss).
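
These percentages follow from the totals reported earlier in the thread. A small sketch of the arithmetic, with the dictionary labels chosen here for illustration:

```python
# Sum of bytes/call over the three tests, copied from the earlier messages.
total_bytes = {"Python 3.4": 2944, "Python 3.5": 3360, "reference": 4080}
saved_by_patch = 240  # total gain reported for no_small_stack-2.patch


def percent_more(new: int, old: int) -> float:
    """How much larger `new` is than `old`, in percent."""
    return (new - old) / old * 100


vs_35 = percent_more(total_bytes["reference"], total_bytes["Python 3.5"])
vs_34 = percent_more(total_bytes["reference"], total_bytes["Python 3.4"])
patch_gain = saved_by_patch / total_bytes["reference"] * 100

print(f"vs 3.5: +{vs_35:.1f}%")          # +21.4%
print(f"vs 3.4: +{vs_34:.1f}%")          # +38.6%
print(f"patch gain: -{patch_gain:.1f}%")  # -5.9%
```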
msg285124 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 15:09
> no_small_stack-2.patch decreases it only by 6% (with possible performance loss).

Yeah, if we want to come back to Python 3.4 efficiency, we need to find the other functions which now use more stack memory ;-) The discussed "small stack" buffers are only responsible for 96 bytes, not a big deal compared to the total of 4080 bytes.
msg285128 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 16:02
Stack used by each C function of test_python_call.

3.4:

(a) method_call: 64

(b) PyObject_Call: 48
(b) function_call: 160
(b) PyEval_EvalCodeEx: 176

(c) PyEval_EvalFrameEx: 256
(c) call_function: 0
(c) do_call: 0
(c) PyObject_Call: 48

(d) slot_tp_call: 64
(d) PyObject_Call: 48

=> total: 864


default:

(a) method_call: 80

(b) _PyObject_FastCallDict: 64
(b) _PyFunction_FastCallDict: 208
(b) _PyEval_EvalCodeWithName: 176

(c) _PyEval_EvalFrameDefault: 320
(c) call_function: 80
(c) _PyObject_FastCallKeywords: 80

(d) slot_tp_call: 64
(d) PyObject_Call: 48

=> total: 1120


Groups of functions, 3.4 => default:

(a) 64 => 80 (+16)
(b) 384 => 448 (+64)
(c) 304 => 480 (+176)
(d) 112 => 112 (=)
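
The group figures are plain sums of the per-function sizes listed above. A short sketch that recomputes them, with the group letters used as dictionary keys:

```python
# Per-function stack sizes in bytes, grouped as in the lists above.
frames_34 = {
    "a": [64],               # method_call
    "b": [48, 160, 176],     # PyObject_Call, function_call, PyEval_EvalCodeEx
    "c": [256, 0, 0, 48],    # PyEval_EvalFrameEx, call_function, do_call, PyObject_Call
    "d": [64, 48],           # slot_tp_call, PyObject_Call
}
frames_default = {
    "a": [80],               # method_call
    "b": [64, 208, 176],     # _PyObject_FastCallDict, _PyFunction_FastCallDict, _PyEval_EvalCodeWithName
    "c": [320, 80, 80],      # _PyEval_EvalFrameDefault, call_function, _PyObject_FastCallKeywords
    "d": [64, 48],           # slot_tp_call, PyObject_Call
}

# Sum each group, then compare 3.4 with default.
groups_34 = {key: sum(sizes) for key, sizes in frames_34.items()}
groups_default = {key: sum(sizes) for key, sizes in frames_default.items()}

for key in groups_34:
    delta = groups_default[key] - groups_34[key]
    print(f"({key}) {groups_34[key]} => {groups_default[key]} ({delta:+d})")
print(f"total: {sum(groups_34.values())} => {sum(groups_default.values())}")
```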


I used gdb:

(gdb) set $last=0
(gdb) define size
> print $last - (uintptr_t)$rsp
> set $last = (uintptr_t)$rsp
> down
> end
(gdb) up
(gdb) up
(gdb) up
(... until a first method_call ...)
(gdb) size
(gdb) size
...
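
What the gdb macro computes at each step is the distance between the stack pointer of the previous (outer) frame and the current one. A sketch of the same subtraction, assuming a stack that grows downward; the function name and the addresses are hypothetical and only chosen to reproduce the first three 3.4 values.

```python
def frame_sizes(frames):
    """Return (function, bytes) for each frame after the first.

    `frames` is a list of (function_name, rsp) ordered from the outermost
    frame to the innermost one, as visited by repeated `down` commands.
    """
    sizes = []
    for (_, outer_rsp), (name, rsp) in zip(frames, frames[1:]):
        # Stack grows downward, so the outer frame has the higher address.
        sizes.append((name, outer_rsp - rsp))
    return sizes


# Hypothetical rsp values for a caller followed by three nested frames.
base = 0x7FFC00001000
example = [
    ("caller", base),
    ("method_call", base - 64),
    ("PyObject_Call", base - 64 - 48),
    ("function_call", base - 64 - 48 - 160),
]
for name, size in frame_sizes(example):
    print(f"{name}: {size}")
```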
msg285136 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 17:51
I created the issue #29227 "Reduce C stack consumption in function calls" which contains a first simple patch with a significant effect on the C stack.
msg285137 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-10 17:57
It seems like the subfunc.patch approach using the "no inline" attribute helps.
msg285169 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 00:20
I pushed 3 changes:

* rev b9404639a18c: Issue #29233: call_method() now uses _PyObject_FastCall()
* rev 8481c379e2da: Issue #29227: inline call_function()
* rev 6478e6d0476f: Issue #29234: disable _PyStack_AsTuple() inlining


Before (rev a30cdf366c02):

test_python_call: 7175 calls before crash, stack: 1168 bytes/call
test_python_getitem: 6235 calls before crash, stack: 1344 bytes/call
test_python_iterator: 5344 calls before crash, stack: 1568 bytes/call

=> total: 18754 calls, 4080 bytes


With these 3 changes (rev 6478e6d0476f):

test_python_call: 8587 calls before crash, stack: 976 bytes/call
test_python_getitem: 9189 calls before crash, stack: 912 bytes/call
test_python_iterator: 7936 calls before crash, stack: 1056 bytes/call

=> total: 25712 calls, 2944 bytes


The default branch is now as good as Python 3.4, in terms of stack consumption, and Python 3.4 was the Python version which used the least stack memory according to my tests.

I didn't touch the _PY_FASTCALL_SMALL_STACK value, it's still 5 arguments (40 bytes). So my changes should not impact performance.
msg285173 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 00:50
Result of attached bench_recursion-2.py comparing before/after the 3 changes reducing the stack consumption:

test_python_call: Median +- std dev: [a30cdf366c02] 512 us +- 12 us -> [6478e6d0476f] 467 us +- 21 us: 1.10x faster (-9%)
test_python_getitem: Median +- std dev: [a30cdf366c02] 485 us +- 26 us -> [6478e6d0476f] 437 us +- 18 us: 1.11x faster (-10%)
test_python_iterator: Median +- std dev: [a30cdf366c02] 1.15 ms +- 0.04 ms -> [6478e6d0476f] 1.03 ms +- 0.06 ms: 1.12x faster (-10%)

At least, it doesn't seem to be slower. Maybe the speedup comes from call_function() inlining. This function was probably already inlined when using a PGO build.

The script was written by Serhiy in the issue #29227, I modified it to use the Runner.timeit() API for convenience.
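
The "x faster (-N%)" figures above can be reproduced from the rounded timings. This is a sketch of the arithmetic only, not the perf module's own code, which works on unrounded values and may therefore differ in the last digit for other inputs.

```python
# Timings in microseconds (before, after), copied from the results above.
timings_us = {
    "test_python_call": (512, 467),
    "test_python_getitem": (485, 437),
    "test_python_iterator": (1150, 1030),
}


def summarize(before: float, after: float) -> str:
    """Format a change as '<ratio>x faster|slower (<signed percent>%)'."""
    if after <= before:
        ratio, word = before / after, "faster"
    else:
        ratio, word = after / before, "slower"
    percent = (after - before) / before * 100
    return f"{ratio:.2f}x {word} ({percent:+.0f}%)"


for name, (before, after) in timings_us.items():
    print(f"{name}: {summarize(before, after)}")
```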
msg285192 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-11 06:51
Awesome! You are great Victor!
msg285200 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-11 08:04
I also ran the reliable performance benchmark suite with LTO+PGO. There is no significant performance change on these benchmarks:
https://speed.python.org/changes/?rev=b9404639a18c&exe=5&env=speed-python

The largest change is on scimark_lu (-13%), but there was a hiccup on the previous change which is probably a small instability in the benchmark. It's not a speedup of these changes.

The second largest change is on spectral_norm: +9%. But this benchmark is known to be unstable, there was already a small peak previously. Again, I don't think that it's related to the changes.
msg286657 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-01 17:05
"The default branch is now as good as Python 3.4, in term of stack consumption, and Python 3.4 was the Python version which used the least stack memory according to my tests."

I consider that the initial issue is now fixed, so I close the issue.

Thanks Serhiy for the tests, reviews, ideas and obviously the bug report ;-) I never looked at the stack usage before.
History
Date User Action Args
2017-02-06 15:00:00 vstinner set status: open -> closed; resolution: fixed; stage: resolved
2017-02-01 17:05:28 vstinner set messages: + msg286657
2017-01-11 08:04:18 vstinner set messages: + msg285200
2017-01-11 06:51:48 serhiy.storchaka set messages: + msg285192
2017-01-11 00:50:29 vstinner set files: + bench_recursion-2.py; messages: + msg285173
2017-01-11 00:20:39 vstinner set messages: + msg285169
2017-01-10 17:57:16 vstinner set messages: + msg285137
2017-01-10 17:51:18 vstinner set messages: + msg285136
2017-01-10 16:02:07 vstinner set messages: + msg285128
2017-01-10 15:09:44 vstinner set messages: + msg285124
2017-01-10 15:06:19 serhiy.storchaka set messages: + msg285123
2017-01-10 15:03:59 vstinner set messages: + msg285122
2017-01-10 14:09:34 vstinner set messages: + msg285113
2017-01-10 12:23:56 serhiy.storchaka set messages: + msg285110
2017-01-10 12:15:10 vstinner set messages: + msg285109
2017-01-10 12:08:38 vstinner set messages: + msg285108
2017-01-10 11:55:07 vstinner set files: + no_small_stack-2.patch; messages: + msg285107
2017-01-10 11:45:51 vstinner set files: + stack_overflow_28870-sp.py; messages: + msg285106
2017-01-10 11:40:26 vstinner set files: + testcapi_stack_pointer.patch; messages: + msg285105
2017-01-09 17:45:03 serhiy.storchaka set messages: + msg285060
2017-01-09 17:28:49 vstinner set messages: + msg285057
2017-01-09 17:10:22 vstinner set files: + stack_overflow_28870.py; messages: + msg285055
2017-01-09 10:56:13 xiang.zhang set nosy: + xiang.zhang
2017-01-03 01:42:39 vstinner set messages: + msg284528
2017-01-03 01:40:46 vstinner set files: + no_small_stack.patch; messages: + msg284527
2017-01-03 01:31:35 vstinner set files: + testcapi_stacksize.patch; messages: + msg284524
2016-12-15 16:28:33 serhiy.storchaka set messages: + msg283336
2016-12-15 14:26:23 vstinner set messages: + msg283309
2016-12-15 13:49:04 vstinner set files: + subfunc.patch; messages: + msg283303
2016-12-15 13:45:37 vstinner set files: + alloca.patch; messages: + msg283302
2016-12-15 13:33:59 vstinner set messages: + msg283298
2016-12-15 13:31:26 vstinner set title: Refactor PyObject_CallFunctionObjArgs() and like -> Reduce stack consumption of PyObject_CallFunctionObjArgs() and like
2016-12-15 13:30:48 vstinner set files: + less_stack.patch; messages: + msg283297
2016-12-15 11:55:10 python-dev set nosy: + python-dev; messages: + msg283287
2016-12-05 05:52:58 serhiy.storchaka set messages: + msg282384
2016-12-04 23:38:10 vstinner set messages: + msg282380
2016-12-04 23:30:50 vstinner set messages: + msg282379
2016-12-04 22:50:13 serhiy.storchaka create