
Reduce stack consumption of PyObject_CallFunctionObjArgs() and like #73056

Closed
serhiy-storchaka opened this issue Dec 4, 2016 · 35 comments
Labels
3.7 (EOL) end of life · interpreter-core (Objects, Python, Grammar, and Parser dirs) · type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Member

BPO 28870
Nosy @vstinner, @serhiy-storchaka, @zhangyangyu
Files
  • PyObject_CallFunctionObjArgs.patch
  • less_stack.patch
  • alloca.patch
  • subfunc.patch
  • testcapi_stacksize.patch
  • no_small_stack.patch
  • stack_overflow_28870.py
  • testcapi_stack_pointer.patch
  • stack_overflow_28870-sp.py
  • no_small_stack-2.patch
  • bench_recursion-2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-02-06.15:00:00.815>
    created_at = <Date 2016-12-04.22:50:13.895>
    labels = ['interpreter-core', 'type-feature', '3.7']
    title = 'Reduce stack consumption of PyObject_CallFunctionObjArgs() and like'
    updated_at = <Date 2017-02-06.15:00:00.814>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-02-06.15:00:00.814>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-02-06.15:00:00.815>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2016-12-04.22:50:13.895>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['45758', '45915', '45917', '45918', '46119', '46120', '46230', '46238', '46239', '46240', '46249']
    hgrepos = []
    issue_num = 28870
    keywords = ['patch']
    message_count = 35.0
    messages = ['282374', '282379', '282380', '282384', '283287', '283297', '283298', '283302', '283303', '283309', '283336', '284524', '284527', '284528', '285055', '285057', '285060', '285105', '285106', '285107', '285108', '285109', '285110', '285113', '285122', '285123', '285124', '285128', '285136', '285137', '285169', '285173', '285192', '285200', '286657']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'python-dev', 'serhiy.storchaka', 'xiang.zhang']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue28870'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Member Author

    I wrote the following patch in an attempt to decrease the stack consumption of PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs() and _PyObject_CallMethodIdObjArgs(). But it doesn't affect the stack consumption. I haven't measured what performance effect it has yet. It seems to make the code a little cleaner.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Dec 4, 2016
    @vstinner
    Member

    vstinner commented Dec 4, 2016

    What do you think of using alloca() instead of a "PyObject *small_stack[5];" buffer, which has a fixed size?

    Note: About your patch, try to avoid _PyObject_CallArg1() if you care about C stack usage; see issue bpo-28858.
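
    For reference, a minimal sketch of the "small stack" pattern under discussion, assuming a simplified helper signature (the real helpers live in Objects/abstract.c in this era and go through the private fastcall API): small argument counts use a fixed on-stack array, larger counts fall back to the heap, so the C stack cost per call is bounded by the constant.

        #include <Python.h>
        #include <string.h>

        #define SMALL_STACK_SIZE 5   /* the role played by _PY_FASTCALL_SMALL_STACK */

        static PyObject *
        call_with_small_stack(PyObject *callable, PyObject **args, Py_ssize_t nargs)
        {
            PyObject *small_stack[SMALL_STACK_SIZE];   /* fixed size, always on the C stack */
            PyObject **stack;
            PyObject *result;

            if (nargs <= SMALL_STACK_SIZE) {
                stack = small_stack;                   /* common case: no heap allocation */
            }
            else {
                stack = PyMem_Malloc(nargs * sizeof(PyObject *));
                if (stack == NULL) {
                    return PyErr_NoMemory();
                }
            }
            memcpy(stack, args, nargs * sizeof(PyObject *));
            result = _PyObject_FastCall(callable, stack, nargs);
            if (stack != small_stack) {
                PyMem_Free(stack);
            }
            return result;
        }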

    @vstinner
    Member

    vstinner commented Dec 4, 2016

    But it doesn't affect the stack consumption.

    How do you check the stack consumption of PyObject_CallFunctionObjArgs()?

    @serhiy-storchaka
    Member Author

    What do you think of using alloca() instead of a "PyObject *small_stack[5];" buffer, which has a fixed size?

    alloca() is not in POSIX.1. I'm afraid it would make CPython less portable.

    Note: About your patch, try to avoid _PyObject_CallArg1() if you care about C stack usage; see issue bpo-28858.

    I don't understand how I can avoid it.

    How do you check the stack consumption of PyObject_CallFunctionObjArgs()?

    Using a script from bpo-28858.

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 15, 2016

    New changeset 71876e4abce4 by Victor Stinner in branch 'default':
    Add _PY_FASTCALL_SMALL_STACK constant
    https://hg.python.org/cpython/rev/71876e4abce4

    @vstinner
    Member

    I reworked abstract.c to prepare work for this issue:

    • change 455169e87bb3: Add _PyObject_CallFunctionVa() helper
    • change 6e748eb79038: Add _PyObject_VaCallFunctionObjArgs() private function
    • change 71876e4abce4: Add _PY_FASTCALL_SMALL_STACK constant

    I wrote a _testcapi function to measure the C stack consumption of this code. I was surprised by the results: calling PyObject_CallFunctionObjArgs(func, arg1, arg2, NULL) consumes 560 bytes! I measured this on a Python compiled in release mode.

    Attached less_stack.patch rewrites _PyObject_VaCallFunctionObjArgs(); it reduces the stack consumption from 560 bytes to 384 bytes (-176 bytes!).

    Changes:

    • Remove the "va_list countva" variable: the va_list variable itself, va_copy(), etc. consume stack memory. First I tried to move the code to a subfunction, which helps. With my patch, it's even simpler.

    • Reduce _PY_FASTCALL_SMALL_STACK from 5 to 3. Stack usage is not directly _PY_FASTCALL_SMALL_STACK*sizeof(PyObject*); it's much more, probably because of complex memory alignment rules.

    • Use Py_LOCAL_INLINE(). It seems that, depending on the size of the object_vacall() function body, the function is inlined or not. If it's not inlined, the stack usage increases from 384 bytes to 544 bytes!? Use Py_LOCAL_INLINE() to force inlining.

    Effect of _PY_FASTCALL_SMALL_STACK:

    • 1: 368 bytes
    • 2: 384 bytes
    • 3: 384 bytes -- value chosen in my patch
    • 4: 400 bytes
    • 5: 416 bytes

    @vstinner vstinner changed the title Refactor PyObject_CallFunctionObjArgs() and like Reduce stack consumption of PyObject_CallFunctionObjArgs() and like Dec 15, 2016
    @vstinner
    Member

    I don't propose to add _testcapi.pyobjectl_callfunctionobjargs_stacksize(). It's just to test the patch. I'm using it with:

    $ ./python -c 'import _testcapi; n=100; print(_testcapi.pyobjectl_callfunctionobjargs_stacksize(n) / (n+1))'
    384.0

    The value of n has no impact on the measured bytes/call; it gives the same value with n=0.
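
    A standalone sketch of the measurement principle (not the actual _testcapi code): record the address of a stack-allocated local at recursion depth 0 and at depth n, then divide the difference by the number of nested calls. The real helper presumably drives the recursion through PyObject_CallFunctionObjArgs() rather than calling itself directly, so its result includes the stack cost of that call path.

        #include <stdint.h>
        #include <stdio.h>

        static uintptr_t top_sp;                 /* address of a local at depth 0 */

        static uintptr_t recurse(int depth, int n)
        {
            volatile char marker;                /* lives in this call's stack frame */
            if (depth == 0)
                top_sp = (uintptr_t)&marker;
            if (depth == n)                      /* assumes a downward-growing stack */
                return top_sp - (uintptr_t)&marker;
            return recurse(depth + 1, n);
        }

        int main(void)
        {
            int n = 100;
            printf("%.1f bytes/call\n", (double)recurse(0, n) / n);
            return 0;
        }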

    @vstinner
    Member

    I also tried to use alloca(): see attached alloca.patch. But the result is quite bad: 528 bytes of stack memory per call. I only attach the patch to discuss the issue, but I now dislike the option: the result is bad, it's less portable and more dangerous.
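
    For the record, the general shape of an alloca()-based buffer (a sketch of the idea, not the content of alloca.patch): the array matches the argument count exactly, but it still lives on the C stack and its size is unchecked, which is part of what makes the option dangerous.

        #include <Python.h>
        #include <string.h>
        #ifdef _MSC_VER
        #  include <malloc.h>                    /* _alloca() */
        #  define ALLOCA _alloca
        #else
        #  include <alloca.h>
        #  define ALLOCA alloca
        #endif

        static PyObject *
        call_with_alloca(PyObject *callable, PyObject **args, Py_ssize_t nargs)
        {
            /* variable-size buffer on the C stack; no failure path, no bound check */
            PyObject **stack = ALLOCA(nargs * sizeof(PyObject *));
            memcpy(stack, args, nargs * sizeof(PyObject *));
            return _PyObject_FastCall(callable, stack, nargs);
        }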

    @vstinner
    Member

    I also tried Serhiy's approach of splitting the function into subfunctions, but the result is not as good as expected: 496 bytes. See attached subfunc.patch.

    @vstinner
    Member

    For comparison, Python 3.5 (before fast calls) uses 448 bytes of C stack per call. Python 3.5 uses a tuple allocated in the heap memory.

    @serhiy-storchaka
    Member Author

    I have tested all three patches with the stack_overflow.py script. The only tests affected are the recursive Python implementations of __call__, __getitem__ and __iter__.

                         unpatched  less_stack  alloca  subfunc
    test_python_call          9696        9876    9880     9876
    test_python_getitem       9884       10264    9880    10688
    test_python_iterator      7812        8052    8312     8872

    @vstinner
    Member

    vstinner commented Jan 3, 2017

    testcapi_stacksize.patch: add _testcapi.pyobjectl_callfunctionobjargs_stacksize(), a function used to measure the stack consumption.

    @vstinner
    Member

    vstinner commented Jan 3, 2017

    no_small_stack.patch: And now for something completely different, a patch to remove the "small stack" allocated on the C stack and always use heap memory instead. FYI I created no_small_stack.patch from less_stack.patch.

    As expected, the stack usage is lower:

    • less_stack.patch: 384 bytes/call
    • no_small_stack.patch: 368 bytes/call

    I didn't check the performance of no_small_stack.patch yet.

    @vstinner
    Member

    vstinner commented Jan 3, 2017

    In Python 3.5, PyObject_CallFunctionObjArgs() calls objargs_mktuple() which uses Py_VA_COPY(countva, va) and creates a tuple. The tuple constructor uses a free list to reduce the cost of heap memory allocations.
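
    That 3.5-era path looks roughly like this (a sketch of the general shape of objargs_mktuple(), not its exact code): the va_list is copied to count the NULL-terminated arguments, then the arguments are packed into a tuple.

        #include <Python.h>
        #include <stdarg.h>

        static PyObject *
        objargs_as_tuple(va_list va)
        {
            va_list countva;
            Py_ssize_t i, n = 0;
            PyObject *tuple;

            va_copy(countva, va);                /* Py_VA_COPY() in the 3.5 source */
            while (va_arg(countva, PyObject *) != NULL)
                n++;
            va_end(countva);

            tuple = PyTuple_New(n);              /* may be served from the tuple free list */
            if (tuple == NULL)
                return NULL;
            for (i = 0; i < n; i++) {
                PyObject *arg = va_arg(va, PyObject *);
                Py_INCREF(arg);
                PyTuple_SET_ITEM(tuple, i, arg);
            }
            return tuple;
        }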

    @vstinner
    Member

    vstinner commented Jan 9, 2017

    I modified Serhiy's stack_overflow.py of bpo-28858:

    • re-run each test 10 times and show the maximum depth
    • only test: ['test_python_call', 'test_python_getitem', 'test_python_iterator']

    Maximum number of Python calls before a crash.

    (*) Reference (unpatched): 560 bytes/call

    test_python_call 7172
    test_python_getitem 6232
    test_python_iterator 5344
    => total: 18 838

    (1) no_small_stack.patch: 368 bytes/call

    test_python_call 7172 (=)
    test_python_getitem 6544 (+312)
    test_python_iterator 5572 (+228)
    => total: 19 288

    (2) less_stack.patch: 384 bytes/call

    test_python_call 7272 (+100)
    test_python_getitem 6384 (+152)
    test_python_iterator 5456 (+112)
    => total: 19 112

    (3) subfunc.patch: 496 bytes

    test_python_call 7272 (+100)
    test_python_getitem 6712 (+480)
    test_python_iterator 6020 (+678)
    => total: 20 004

    (4) alloca.patch: 528 bytes/call

    test_python_call 7272 (+100)
    test_python_getitem 6464 (+232)
    test_python_iterator 5752 (+408)
    => total: 19 488

    Patches sorted by bytes/call, from best to worst: no_small_stack.patch (368) > less_stack.patch (384) > subfunc.patch (496) > alloca.patch (528) > reference (560).

    Patches sorted by number of calls before crash: subfunc.patch (20 004) > alloca.patch (19 488) > no_small_stack.patch (19 288) > less_stack.patch (19 112) > reference (18 838).

    I expected a correlation between the bytes/call measured by testcapi_stacksize.patch and the number of calls before a crash, but I fail to see an obvious correlation :-/

    Maybe the compiler is smarter than what I would expect and emits efficient code to be able to use less stack memory?

    Maybe the Linux kernel does weird things which make the behaviour on stack overflow non-obvious :-)

    At least, I would expect that no_small_stack.patch would be the clear winner, since it has the smallest usage of C stack.

    @vstinner
    Member

    vstinner commented Jan 9, 2017

    Impact of the _PY_FASTCALL_SMALL_STACK constant:

    • _PY_FASTCALL_SMALL_STACK=1: 528 bytes/call

    test_python_call 7376
    test_python_getitem 6544
    test_python_iterator 5572
    => total: 19 492

    • _PY_FASTCALL_SMALL_STACK=3: 528 bytes/call

    test_python_call 7272
    test_python_getitem 6464
    test_python_iterator 5512
    => total: 19 248

    • _PY_FASTCALL_SMALL_STACK=5 (current value): 560 bytes/call

    test_python_call 7172
    test_python_getitem 6232
    test_python_iterator 5344
    => total: 19 636

    • _PY_FASTCALL_SMALL_STACK=10: 592 bytes/call

    test_python_call 6984
    test_python_getitem 5952
    test_python_iterator 5132
    => total: 18 068

    Increasing _PY_FASTCALL_SMALL_STACK has a clear effect on the total: the total decreases when _PY_FASTCALL_SMALL_STACK increases.

    ---

    no_small_stack.patch with _PY_FASTCALL_SMALL_STACK=3: 368 bytes/call

    test_python_call 7272
    test_python_getitem 6628
    test_python_iterator 5632
    => total: 19 532

    @serhiy-storchaka
    Member Author

    I'm not sure that the result of pyobjectl_callfunctionobjargs_stacksize() has a direct relation to the stack consumption in test_python_call, test_python_getitem and test_python_iterator. Try to measure the stack consumption in these cases. This can be done with a _testcapi helper that just returns the value of the stack pointer. Run all three tests with a fixed level of recursion and measure the difference between stack pointers.

    It would also be nice to measure the performance effect of the patches.

    @vstinner
    Member

    testcapi_stack_pointer.patch: add _testcapi.stack_pointer() function.
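
    A plausible shape for such a helper (a sketch, not the patch itself): return the address of a local variable, which is close enough to the current C stack pointer for computing differences between two recursion depths.

        #include <Python.h>

        static PyObject *
        stack_pointer(PyObject *self, PyObject *args)   /* METH_NOARGS: args is ignored */
        {
            char local;                                 /* allocated in this frame */
            return PyLong_FromVoidPtr(&local);
        }

    The measurement script can then subtract the values returned at two fixed recursion depths and divide by the depth difference.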

    @vstinner
    Member

    stack_overflow_28870-sp.py: script using testcapi_stack_pointer.patch to compute the usage of the C stack. Results of this script.

    (*) Reference

    test_python_call: 7175 calls before crash, stack: 1168 bytes/call
    test_python_getitem: 6235 calls before crash, stack: 1344 bytes/call
    test_python_iterator: 5344 calls before crash, stack: 1568 bytes/call

    => total: 18754 calls, 4080 bytes

    (1) no_small_stack.patch

    test_python_call: 7175 calls before crash, stack: 1168 bytes/call
    test_python_getitem: 6547 calls before crash, stack: 1280 bytes/call
    test_python_iterator: 5572 calls before crash, stack: 1504 bytes/call

    => total: 19294 calls, 3952 bytes

    test_python_call is clearly not impacted by no_small_stack.patch.

    test_python_call loops on method_call():

    method_call()
    => _PyObject_Call_Prepend()
    => _PyObject_FastCallDict()
    => _PyFunction_FastCallDict()
    => _PyEval_EvalCodeWithName()
    => PyEval_EvalFrameEx()
    => _PyEval_EvalFrameDefault()
    => call_function()
    => _PyObject_FastCallKeywords()
    => slot_tp_call()
    => PyObject_Call()
    => method_call()
    => (...)

    _PyObject_Call_Prepend() is in the middle of the chain. This function uses a "small stack" of _PY_FASTCALL_SMALL_STACK "PyObject*" items. We can clearly see the impact of modifying _PY_FASTCALL_SMALL_STACK on the maximum number of test_python_call calls before crash in msg285057.

    @vstinner
    Member

    no_small_stack-2.patch: Remove all "small_stack" buffers.

    Reference

    test_python_call: 7175 calls before crash, stack: 1168 bytes/call
    test_python_getitem: 6235 calls before crash, stack: 1344 bytes/call
    test_python_iterator: 5344 calls before crash, stack: 1568 bytes/call

    => total: 18754 calls, 4080 bytes

    no_small_stack.patch

    test_python_call: 7482 calls (+307) before crash, stack: 1120 bytes/call (-48)
    test_python_getitem: 6715 calls (+480) before crash, stack: 1248 bytes/call (-96)
    test_python_iterator: 5693 calls (+349) before crash, stack: 1472 bytes/call (-96)

    => total: 19890 calls (+1136), 3840 bytes (-240)

    The total gain is the removal of 5 small buffers of 48 bytes: 240 bytes.

    @vstinner
    Member

    no_small_stack.patch:

    Oops, you should read no_small_stack-2.patch in my previous message ;-)

    @vstinner
    Member

    Python 3.5 (revision 8125d9a8152b), before all fastcall changes:

    test_python_call: 8314 calls before crash, stack: 1008 bytes/call
    test_python_getitem: 7483 calls before crash, stack: 1120 bytes/call
    test_python_iterator: 6802 calls before crash, stack: 1232 bytes/call

    => total: 22599 calls, 3360 bytes

    @serhiy-storchaka
    Member Author

    What are the results with 3.4? There were several issues about stack overflow in 3.5 (bpo-25222, bpo-28179, bpo-28913).

    @vstinner
    Member

    Python 3.4 (rev 6340c9fcc111):

    test_python_call: 9700 calls before crash, stack: 864 bytes/call
    test_python_getitem: 8314 calls before crash, stack: 1008 bytes/call
    test_python_iterator: 7818 calls before crash, stack: 1072 bytes/call

    => total: 25832 calls, 2944 bytes

    Python 2.7 (rev 0d4e0a736688):

    test_python_call: 6162 calls before crash, stack: 1360 bytes/call
    test_python_getitem: 5952 calls before crash, stack: 1408 bytes/call
    test_python_iterator: 5885 calls before crash, stack: 1424 bytes/call

    => total: 17999 calls, 4192 bytes

    Nice. At least, Python 3.7 is better than Python 2.7 (4080 bytes < 4192 bytes) :-) Python 3.4 stack usage was very low, and lower than Python 3.5.

    @vstinner
    Member

    no_small_stack-2.patch has a very bad impact on performance:

    haypo@speed-python$ python3 -m perf compare_to 2017-01-04_12-02-default-ee1390c9b585.json no_small_stack-2_refee1390c9b585.json -G --min-speed=5

    Slower (59):

    • telco: 15.7 ms +- 0.5 ms -> 23.4 ms +- 0.3 ms: 1.49x slower (+49%)
    • scimark_sor: 393 ms +- 6 ms -> 579 ms +- 10 ms: 1.47x slower (+47%)
    • json_loads: 56.9 us +- 0.9 us -> 83.1 us +- 2.4 us: 1.46x slower (+46%)
    • unpickle_pure_python: 698 us +- 10 us -> 984 us +- 10 us: 1.41x slower (+41%)
    • scimark_lu: 424 ms +- 22 ms -> 585 ms +- 33 ms: 1.38x slower (+38%)
    • chameleon: 22.4 ms +- 0.2 ms -> 30.8 ms +- 0.3 ms: 1.38x slower (+38%)
    • xml_etree_generate: 212 ms +- 3 ms -> 291 ms +- 4 ms: 1.37x slower (+37%)
    • xml_etree_process: 177 ms +- 3 ms -> 240 ms +- 3 ms: 1.35x slower (+35%)
    • raytrace: 1.04 sec +- 0.01 sec -> 1.40 sec +- 0.02 sec: 1.35x slower (+35%)
    • logging_simple: 27.9 us +- 0.4 us -> 37.4 us +- 0.5 us: 1.34x slower (+34%)
    • pickle_pure_python: 1.02 ms +- 0.01 ms -> 1.37 ms +- 0.02 ms: 1.34x slower (+34%)
    • logging_format: 33.3 us +- 0.4 us -> 44.5 us +- 0.7 us: 1.34x slower (+34%)
    • xml_etree_iterparse: 195 ms +- 5 ms -> 259 ms +- 7 ms: 1.32x slower (+32%)
    • chaos: 236 ms +- 3 ms -> 306 ms +- 3 ms: 1.30x slower (+30%)
    • regex_compile: 380 ms +- 3 ms -> 494 ms +- 5 ms: 1.30x slower (+30%)
    • pathlib: 42.3 ms +- 0.5 ms -> 55.0 ms +- 0.6 ms: 1.30x slower (+30%)
    • django_template: 364 ms +- 5 ms -> 471 ms +- 4 ms: 1.29x slower (+29%)
    • call_method: 11.2 ms +- 0.2 ms -> 14.4 ms +- 0.2 ms: 1.29x slower (+29%)
    • hexiom: 18.4 ms +- 0.2 ms -> 23.7 ms +- 0.2 ms: 1.29x slower (+29%)
    • call_method_slots: 11.0 ms +- 0.3 ms -> 14.1 ms +- 0.1 ms: 1.28x slower (+28%)
    • richards: 147 ms +- 4 ms -> 188 ms +- 5 ms: 1.28x slower (+28%)
    • html5lib: 207 ms +- 7 ms -> 262 ms +- 6 ms: 1.27x slower (+27%)
    • genshi_text: 71.5 ms +- 1.3 ms -> 90.3 ms +- 1.1 ms: 1.26x slower (+26%)
    • deltablue: 14.2 ms +- 0.2 ms -> 17.9 ms +- 0.4 ms: 1.26x slower (+26%)
    • genshi_xml: 164 ms +- 2 ms -> 207 ms +- 3 ms: 1.26x slower (+26%)
    • sympy_str: 429 ms +- 5 ms -> 539 ms +- 4 ms: 1.25x slower (+25%)
    • go: 493 ms +- 5 ms -> 619 ms +- 7 ms: 1.25x slower (+25%)
    • mako: 35.4 ms +- 1.5 ms -> 44.2 ms +- 1.2 ms: 1.25x slower (+25%)
    • sympy_expand: 959 ms +- 10 ms -> 1.19 sec +- 0.01 sec: 1.24x slower (+24%)
    • nqueens: 215 ms +- 2 ms -> 268 ms +- 1 ms: 1.24x slower (+24%)
      (...)

    The benchmark ran on speed-python with PGO+LTO, with Linux configured for benchmarks using python3 -m perf system tune.

    @serhiy-storchaka
    Member Author

    Thus Python 3.6 stack usage is about 20% larger than Python 3.5 and about 40% larger than Python 3.4. This is significant. :-(

    no_small_stack-2.patch decreases it only by 6% (with possible performance loss).

    @vstinner
    Member

    no_small_stack-2.patch decreases it only by 6% (with possible performance loss).

    Yeah, if we want to come back to Python 3.4 efficiency, we need to find the other functions which now use more stack memory ;-) The discussed "small stack" buffers are only responsible for 96 bytes, not a big deal compared to the total of 4080 bytes.

    @vstinner
    Member

    Stack used by each C function of test_python_call.

    3.4:

    (a) method_call: 64

    (b) PyObject_Call: 48
    (b) function_call: 160
    (b) PyEval_EvalCodeEx: 176

    (c) PyEval_EvalFrameEx: 256
    (c) call_function: 0
    (c) do_call: 0
    (c) PyObject_Call: 48

    (d) slot_tp_call: 64
    (d) PyObject_Call: 48

    => total: 864

    default:

    (a) method_call: 80

    (b) _PyObject_FastCallDict: 64
    (b) _PyFunction_FastCallDict: 208
    (b) _PyEval_EvalCodeWithName: 176

    (c) _PyEval_EvalFrameDefault: 320
    (c) call_function: 80
    (c) _PyObject_FastCallKeywords: 80

    (d) slot_tp_call: 64
    (d) PyObject_Call: 48

    => total: 1120

    Groups of functions, 3.4 => default:

    (a) 64 => 80 (+16)
    (b) 384 => 448 (+64)
    (c) 304 => 480 (+176)
    (d) 112 => 112 (=)

    I used gdb:

    (gdb) set $last=0
    (gdb) define size
    print $last - (uintptr_t)$rsp
    set $last = (uintptr_t)$rsp
    down
    end
    (gdb) up
    (gdb) up
    (gdb) up
    (... until the first method_call ...)
    (gdb) size
    (gdb) size
    ...

    @vstinner
    Member

    I created the issue bpo-29227 "Reduce C stack consumption in function calls" which contains a first simple patch with a significant effect on the C stack.

    @vstinner
    Member

    It seems like the subfunc.patch approach of using the "no inline" attribute helps.
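
    A sketch of that technique, assuming the GCC/Clang and MSVC attribute spellings: keeping a helper such as _PyStack_AsTuple() out of line means its locals only consume C stack while it actually runs, instead of being folded into every caller's frame.

        #include <Python.h>

        #if defined(__GNUC__) || defined(__clang__)
        #  define NO_INLINE __attribute__((noinline))
        #elif defined(_MSC_VER)
        #  define NO_INLINE __declspec(noinline)
        #else
        #  define NO_INLINE
        #endif

        /* Hypothetical helper in the spirit of _PyStack_AsTuple(): build a tuple
           from a C array of objects, deliberately kept out of line. */
        static NO_INLINE PyObject *
        stack_as_tuple(PyObject **stack, Py_ssize_t nargs)
        {
            Py_ssize_t i;
            PyObject *tuple = PyTuple_New(nargs);
            if (tuple == NULL)
                return NULL;
            for (i = 0; i < nargs; i++) {
                PyObject *item = stack[i];
                Py_INCREF(item);
                PyTuple_SET_ITEM(tuple, i, item);
            }
            return tuple;
        }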

    @vstinner
    Member

    I pushed 3 changes:

    • rev b9404639a18c: Issue bpo-29233: call_method() now uses _PyObject_FastCall()
    • rev 8481c379e2da: Issue bpo-29227: inline call_function()
    • rev 6478e6d0476f: Issue bpo-29234: disable _PyStack_AsTuple() inlining

    Before (rev a30cdf366c02):

    test_python_call: 7175 calls before crash, stack: 1168 bytes/call
    test_python_getitem: 6235 calls before crash, stack: 1344 bytes/call
    test_python_iterator: 5344 calls before crash, stack: 1568 bytes/call

    => total: 18754 calls, 4080 bytes

    With these 3 changes (rev 6478e6d0476f):

    test_python_call: 8587 calls before crash, stack: 976 bytes/call
    test_python_getitem: 9189 calls before crash, stack: 912 bytes/call
    test_python_iterator: 7936 calls before crash, stack: 1056 bytes/call

    => total: 25712 calls, 2944 bytes

    The default branch is now as good as Python 3.4 in terms of stack consumption, and Python 3.4 was the Python version which used the least stack memory according to my tests.

    I didn't touch the _PY_FASTCALL_SMALL_STACK value; it's still 5 arguments (40 bytes). So my changes should not impact performance.

    @vstinner
    Member

    Result of attached bench_recursion-2.py comparing before/after the 3 changes reducing the stack consumption:

    test_python_call: Median +- std dev: [a30cdf366c02] 512 us +- 12 us -> [6478e6d0476f] 467 us +- 21 us: 1.10x faster (-9%)
    test_python_getitem: Median +- std dev: [a30cdf366c02] 485 us +- 26 us -> [6478e6d0476f] 437 us +- 18 us: 1.11x faster (-10%)
    test_python_iterator: Median +- std dev: [a30cdf366c02] 1.15 ms +- 0.04 ms -> [6478e6d0476f] 1.03 ms +- 0.06 ms: 1.12x faster (-10%)

    At least, it doesn't seem to be slower. Maybe the speedup comes from call_function() inlining. This function was probably already inlined when using a PGO build.

    The script was written by Serhiy in the issue bpo-29227, I modified it to use the Runner.timeit() API for convenience.

    @serhiy-storchaka
    Member Author

    Awesome! You are great Victor!

    @vstinner
    Member

    I also ran the reliable performance benchmark suite with LTO+PGO. There is no significant performance change on these benchmarks:
    https://speed.python.org/changes/?rev=b9404639a18c&exe=5&env=speed-python

    The largest change is on scimark_lu (-13%), but there was a hiccup on the previous change, which is probably a small instability in the benchmark. It's not a speedup from these changes.

    The second largest change is on spectral_norm: +9%. But this benchmark is known to be unstable; there was already a small peak previously. Again, I don't think that it's related to the changes.

    @vstinner
    Member

    vstinner commented Feb 1, 2017

    "The default branch is now as good as Python 3.4, in term of stack consumption, and Python 3.4 was the Python version which used the least stack memory according to my tests."

    I consider that the initial issue is now fixed, so I close the issue.

    Thanks Serhiy for the tests, reviews, ideas and obviously the bug report ;-) I had never looked at the stack usage before.

    @vstinner vstinner closed this as completed Feb 6, 2017
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022