classification
Title: [C API] Move usage of tp_vectorcall_offset from public headers to the internal C API
Type: Stage: resolved
Components: C API Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: vstinner
Priority: normal Keywords: patch

Created on 2021-10-11 22:18 by vstinner, last changed 2021-10-15 00:42 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
test_bench.patch vstinner, 2021-10-12 06:58
bench.py vstinner, 2021-10-12 06:59
bench_no_args.patch vstinner, 2021-10-12 07:38
bench_no_args_inline.py vstinner, 2021-10-12 07:38
bench_no_args_public.py vstinner, 2021-10-12 07:38
test_bench2.patch vstinner, 2021-10-12 20:18
bench2.py vstinner, 2021-10-12 20:18
sys_call.patch vstinner, 2021-10-13 21:54
stack_overflow-4.py vstinner, 2021-10-13 22:03
Pull Requests
URL Status Linked Edit
PR 28890 merged vstinner, 2021-10-11 22:18
PR 28891 merged vstinner, 2021-10-11 22:19
PR 28893 merged vstinner, 2021-10-11 23:04
PR 28895 merged vstinner, 2021-10-12 00:58
Messages (10)
msg403695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-11 22:18
The public C API should avoid accessing directly PyTypeObject members: see bpo-40170.

I propose to move static inline functions to the internal C API, and only expose opaque function calls to the public C API.
msg403696 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-11 22:18
New changeset fb8f208a4ddb38eedee71f9ecd0f22058802dab1 by Victor Stinner in branch 'main':
bpo-45439: _PyObject_Call() only checks tp_vectorcall_offset once (GH-28890)
https://github.com/python/cpython/commit/fb8f208a4ddb38eedee71f9ecd0f22058802dab1
msg403698 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-11 22:42
New changeset ce3489cfdb9f0e050bdc45ce5d3902c2577ea683 by Victor Stinner in branch 'main':
bpo-45439: Rename _PyObject_CallNoArg() to _PyObject_CallNoArgs() (GH-28891)
https://github.com/python/cpython/commit/ce3489cfdb9f0e050bdc45ce5d3902c2577ea683
msg403708 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-12 06:38
New changeset d943d19172aa93ce88bade15b9f23a0ce3bc72ff by Victor Stinner in branch 'main':
bpo-45439: Move _PyObject_CallNoArgs() to pycore_call.h (GH-28895)
https://github.com/python/cpython/commit/d943d19172aa93ce88bade15b9f23a0ce3bc72ff
msg403768 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-13 00:28
I should also check again the stack consumption. Old issues:

* bpo-29465: Modify _PyObject_FastCall() to reduce stack consumption
* bpo-29234: Disable inlining of _PyStack_AsTuple() to reduce the stack consumption
* bpo-29227: Reduce C stack consumption in function calls
* bpo-28858: Fastcall uses more C stack

See also: "Stack consumption" of https://vstinner.github.io/contrib-cpython-2017q1.html
msg403770 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-13 00:31
5 years ago, I added _PyObject_CallArg1() (similar to PyObject_CallOneArg()) and then I removed it since it consumed more stack memory than existing function, whereas I added _PyObject_CallArg1() to reduce the stack consumption.

commit 7bfb42d5b7721ca26e33050d025fec5c43c00058
Author: Victor Stinner <victor.stinner@gmail.com>
Date:   Mon Dec 5 17:04:32 2016 +0100

    Issue #28858: Remove _PyObject_CallArg1() macro
    
    Replace
       _PyObject_CallArg1(func, arg)
    with
       PyObject_CallFunctionObjArgs(func, arg, NULL)
    
    Using the _PyObject_CallArg1() macro increases the usage of the C stack, which
    was unexpected and unwanted. PyObject_CallFunctionObjArgs() doesn't have this
    issue.
msg403877 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-13 22:03
I measured the stack consumption using attached sys_call.patch and stack_overflow-4.py.

Using gcc -O3, the stack consumption with PR 28893 is *way better* on the 6 benchmarks (6 ways to call functions), especially:

PyObject_CallOneArg(): 624 bytes/call => 528 bytes/call (-96 bytes)
PyObject_CallNoArg(): 608 bytes/call => 512 bytes/call (-96 bytes)
_PyObject_CallNoArg(): 608 bytes/call => 512 bytes/call (-96 bytes)


Python built in release mode with gcc -O3:

   ./configure && make

=== ref ===

$ ./python stack_overflow-4.py
test_python_call: 10070 calls before crash, stack: 832 bytes/call
test_python_getitem: 16894 calls before crash, stack: 496 bytes/call
test_python_iterator: 12773 calls before crash, stack: 656 bytes/call
test_callonearg: 13428 calls before crash, stack: 624 bytes/call
test_callnoargs: 13779 calls before crash, stack: 608 bytes/call
test_callnoargs_inline: 13782 calls before crash, stack: 608 bytes/call

=> total: 80726 calls, 3824 bytes


=== PR ===

$ ./python stack_overflow-4.py
test_python_call: 11901 calls before crash, stack: 704 bytes/call
test_python_getitem: 18703 calls before crash, stack: 448 bytes/call
test_python_iterator: 14961 calls before crash, stack: 560 bytes/call
test_callonearg: 15868 calls before crash, stack: 528 bytes/call
test_callnoargs: 16366 calls before crash, stack: 512 bytes/call
test_callnoargs_inline: 16365 calls before crash, stack: 512 bytes/call

=> total: 94164 calls, 3264 bytes
msg403878 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-13 22:37
Using LTO, the PR 28893 *increases* the stack memory usage. It's the opposite :-)


PyObject_CallOneArg(): 672 bytes/call => 688 bytes/call (+16 bytes)
PyObject_CallNoArg(): 640 bytes/call => 672 bytes/call (+32 bytes)
_PyObject_CallNoArg(): 640 bytes/call => 672 bytes/call (+32 bytes)


clang with LTO:

    ./configure --with-lto CC=clang LD=lld LDFLAGS="-fuse-ld=lld"
    make

=== ref ===

$ ./python stack_overflow-4.py
test_python_call: 9187 calls before crash, stack: 912 bytes/call
test_python_getitem: 15868 calls before crash, stack: 528 bytes/call
test_python_iterator: 11901 calls before crash, stack: 704 bytes/call
test_callonearg: 12468 calls before crash, stack: 672 bytes/call
test_callnoargs: 13091 calls before crash, stack: 640 bytes/call
test_callnoargs_inline: 13092 calls before crash, stack: 640 bytes/call

=> total: 75607 calls, 4096 bytes

=== PR ===

$ ./python stack_overflow-4.py
test_python_call: 9186 calls before crash, stack: 912 bytes/call
test_python_getitem: 15400 calls before crash, stack: 544 bytes/call
test_python_iterator: 11384 calls before crash, stack: 736 bytes/call
test_callonearg: 12177 calls before crash, stack: 688 bytes/call
test_callnoargs: 12468 calls before crash, stack: 672 bytes/call
test_callnoargs_inline: 12467 calls before crash, stack: 672 bytes/call

=> total: 73082 calls, 4224 bytes
msg403937 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-14 19:53
New changeset 3cc56c828d2d8f8659ea49447234bf0d2b87cd64 by Victor Stinner in branch 'main':
bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h (GH-28893)
https://github.com/python/cpython/commit/3cc56c828d2d8f8659ea49447234bf0d2b87cd64
msg403939 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-14 19:59
I decided to merge my PR to address https://bugs.python.org/issue45439 initial issue: "[C API] Move usage of **tp_vectorcall_offset** from public headers to the internal C API".

Last years, I added `tstate` parameters to internal C functions. The agreement was that only internal functions should use it, and indirectly that this `tstate` parameter should be hidden. I'm now sure exactly, but `tstate` started to pop up in `Include/cpython/abstract.h` around "call" functions. This PR fix this issue.

About the impact on performances: well, it's really hard to draw a clear conclusion. Inlining, LTO and PGO give different results on runtime performance and stack memory usage.

IMO the fact that public C API functions are now regular functions should not prevent us to continue (micro) optimizing Python. We can always add a variant to the internal C API using an API a little bit different (e.g. add `tstate` parameter) or defined as a static inline function, rather than a regular function.

The unclear part is if PyObject_CallOneArg() (regular function call) is faster than _PyObject_CallOneArg() (static inline function, inlined). The performance may depend if it's called in the Python executable or in a dynamic library (PLT indirection which may be avoided by `gcc -fno-semantic-interposition`).

Well, happy hacking and let's continue *continuous* benchmarking Python!
History
Date User Action Args
2021-10-15 00:42:16vstinnersetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2021-10-14 19:59:41vstinnersetmessages: + msg403939
2021-10-14 19:53:12vstinnersetmessages: + msg403937
2021-10-13 22:37:51vstinnersetmessages: + msg403878
2021-10-13 22:03:50vstinnersetfiles: + stack_overflow-4.py

messages: + msg403877
2021-10-13 21:54:14vstinnersetfiles: + sys_call.patch
2021-10-13 00:31:56vstinnersetmessages: + msg403770
2021-10-13 00:28:53vstinnersetmessages: + msg403768
2021-10-12 20:18:48vstinnersetfiles: + bench2.py
2021-10-12 20:18:41vstinnersetfiles: + test_bench2.patch
2021-10-12 07:38:16vstinnersetfiles: + bench_no_args_public.py
2021-10-12 07:38:10vstinnersetfiles: + bench_no_args_inline.py
2021-10-12 07:38:04vstinnersetfiles: + bench_no_args.patch
2021-10-12 06:59:00vstinnersetfiles: + bench.py
2021-10-12 06:58:51vstinnersetfiles: + test_bench.patch
2021-10-12 06:38:27vstinnersetmessages: + msg403708
2021-10-12 00:58:49vstinnersetpull_requests: + pull_request27190
2021-10-11 23:04:51vstinnersetpull_requests: + pull_request27188
2021-10-11 22:42:26vstinnersetmessages: + msg403698
2021-10-11 22:19:46vstinnersetpull_requests: + pull_request27186
2021-10-11 22:18:36vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request27185
2021-10-11 22:18:33vstinnersetmessages: + msg403696
2021-10-11 22:18:05vstinnercreate