This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use _PyObject_Vectorcall in Modules/_ctypes/callbacks.c
Type: performance Stage: resolved
Components: ctypes Versions: Python 3.11
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: corona10 Nosy List: corona10, erlendaasland, hydroflask, vstinner
Priority: normal Keywords: patch

Created on 2022-01-10 04:01 by hydroflask, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
bench_callback.py corona10, 2022-02-05 06:15
bench_callback_v2.py hydroflask, 2022-02-07 06:45
bench_callback_v3.py corona10, 2022-02-07 07:08
Pull Requests
URL Status Linked Edit
PR 31138 merged corona10, 2022-02-05 05:47
PR 31188 merged vstinner, 2022-02-07 12:37
PR 31209 merged corona10, 2022-02-08 04:59
PR 31224 merged corona10, 2022-02-08 22:54
PR 31249 merged corona10, 2022-02-10 09:47
PR 31272 merged corona10, 2022-02-11 08:26
Messages (39)
msg410185 - (view) Author: (hydroflask) Date: 2022-01-10 04:01
_CallPythonObject() in Modules/_ctypes/callbacks.c still uses the old API (PyObject_CallObject()) to invoke methods. It should use _PyObject_Vectorcall(). Creating an issue for this since this method is a little more performance sensitive than normal Python library code.
msg410190 - (view) Author: Kumar Aditya (kumaraditya) * (Python triager) Date: 2022-01-10 06:51
Targeting Python 3.11 as it a performance improvement.
msg412689 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-06 23:49
@hydroflask

It does not show impressive performance enhancement, so I decided not to change it
See PR 31138
msg412692 - (view) Author: (hydroflask) Date: 2022-02-07 00:19
@corona10

Thank you for looking into it.

The benchmark you did is not reflective of my use-case. I have a codebase that I have completely converted into a C extension using mypyc. So the ctypes callback does not call into interpreted Python code but actually another C-extension. I have annotated types aggressively to avoid invoking malloc(). In this use-case the extra overhead from the tuple allocation would be significant.

Additionally your benchmark may not be as micro as intended. You call some math in the callback function which may have made the savings from avoiding the tuple allocation seem more like noise.

Since you already did the work I still think it's worth putting it in since PyObject_Vectorcall is an established API and it would have been used anyway if this were written from scratch.
msg412699 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-07 01:03
@hydroflask

I am one of the big fans of applying vectorcall :)

> The benchmark you did is not reflective of my use-case. I have a codebase that I have completely converted into a C extension using mypyc.

Can you provide me a minimal benchmarkable script(if you need to include mypyc it's okay) to persuade other core dev?
msg412710 - (view) Author: (hydroflask) Date: 2022-02-07 06:45
Okay I put together a benchmark that better exemplifies my use case and results are in favor of adding patch.

When bench_callback_v2.py is not compiled with mypyc I see a ~10% reduction in runtime:

    $ unpatched-env/bin/python bench_callback_v2.py
    1.1262263769749552
    $ patched-env/bin/python bench_callback_v2.py
    1.0174838998354971

When bench_callback_v2.py is compiled with mypyc, I am getting ~6% reduction in runtime:

    $ unpatched-env/bin/python -c "import bench_callback_v2"
    1.0056699379347265
    $ patched-env/bin/python -c "import bench_callback_v2"
    0.9415687420405447

bench_callback_v2.py is attached.
msg412711 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-07 06:54
@vstinner

I got a similar result from hydroflask's benchmark.
I think that is worth reopening this issue WDYT?
msg412712 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-07 07:00
@hydroflask

Would you like to run your benchmark with my latest PR?
msg412713 - (view) Author: (hydroflask) Date: 2022-02-07 07:05
Vanilla bench_callback_v2.py, not compiled with mypyc:

    $ env/bin/python bench_callback_v2.py
    1.114047016017139
    $ env2/bin/python bench_callback_v2.py
    0.9644024870358407

~13% reduction in runtime. Nice job!
msg412714 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-07 07:08
With v3 benchmark

Mean +- std dev: [base] 1.32 us +- 0.02 us -> [opt] 1.18 us +- 0.02 us: 1.13x faster
msg412744 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-07 13:45
A benchmark on calling C functions using ctypes sounds better than a benchmark calling Python functions. For Python functions, Python objects are converted to C types, and then C types are convered back to Python objects, the Python result is converted to a C type, and then the C type is converted back to a Python object for the final result...

It would be nice to have a benchmark on different number of arguments. Like: 0, 1, 3, 5, 8, 10.
msg412745 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-07 13:53
New changeset 4cce1352bb47babaeefb68fcfcc48ffa073745c3 by Victor Stinner in branch 'main':
bpo-46323: _ctypes.CFuncPtr fails if _argtypes_ is too long (GH-31188)
https://github.com/python/cpython/commit/4cce1352bb47babaeefb68fcfcc48ffa073745c3
msg412759 - (view) Author: (hydroflask) Date: 2022-02-07 15:32
> A benchmark on calling C functions using ctypes sounds better than a benchmark calling Python functions.

Calling C functions from Python is not the code path handled by _CallPythonObject() so no difference in run-time would theoretically be observed by that benchmark for this patch. This bug report pertains to code paths where a C function calls back into a Python function. A practical example is using Python with an event loop library written in C.
msg412760 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-07 15:59
Oh ok, thanks for the clarification :-)
msg412765 - (view) Author: (hydroflask) Date: 2022-02-07 16:20
Just to clarify further, the original benchmark by corona10 did indeed lead to `_CallPythonObject` being invoked but it was not quite the normal use-case. In the original benchmark it invoked the generated ctypes thunk via Python so as vstinner said it was doing this:

    Python -> converters -> thunk-> _CallPythonObject -> converters-> Python

Namely using `PyCFuncPtr_call` to invoke the thunk generated by `_ctypes_alloc_callback`. Normally when invoking C functions via `PyCFuncPtr_call` it looks like this:

    Python -> converters -> C_function

In the original benchmark setup no significant reduction in runtime was observed by this patch.  I noticed that it was not a typical use of `_CallPythonObject`, where instead it would be a top-level C function calling back into Python. Something like this:

    C -> thunk -> _CallPythonObject() -> Python

The benchmark I provided exercises that use case and the >=10% reduction in runtime was observed. Thanks to both corona10 and vstinner, I appreciate their effort in this issue.
msg412806 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 02:59
@vstinner

PR 31188 cause reference leak, please check

Raised RLIMIT_NOFILE: 256 -> 1024
0:00:00 load avg: 5.38 Run tests sequentially
0:00:00 load avg: 5.38 [1/1] test_ctypes
beginning 6 repetitions
123456
......
test_ctypes leaked [1028, 1026, 1028] references, sum=3082
test_ctypes leaked [2, 1, 3] memory blocks, sum=6
test_ctypes failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_ctypes

Total duration: 3.2 sec
Tests result: FAILURE
msg412807 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 03:01
No leak after if I revert PR 31188 on my local
msg412810 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 04:59
@vstinner

PR 31209 is for fixing the leak.
msg412812 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 05:22
New changeset e959dd9f5c1d8865d4e10c49eb30ee0a4fe2735d by Dong-hee Na in branch 'main':
bpo-46323 Fix ref leak if ctypes.CFuncPtr raises an error. (GH-31209)
https://github.com/python/cpython/commit/e959dd9f5c1d8865d4e10c49eb30ee0a4fe2735d
msg412837 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 13:09
New changeset b5527688aae11d0b5af58176267a9943576e71e5 by Dong-hee Na in branch 'main':
bpo-46323: Use PyObject_Vectorcall while calling ctypes callback function (GH-31138)
https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5
msg412839 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 13:25
@hydroflask

All patches are merged!
Thanks for all your suggestions :)
msg412872 - (view) Author: (hydroflask) Date: 2022-02-08 21:06
Two things, one is a nit pick the other is more serious. I think vstinner mentioned this in his review of your patch but on this line:

https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR180

Instead of using PySequence_Fast_ITEMS(), you can just use PyTuple_GET_ITEM() since you know the converters are a tuple. In terms of runtime efficiency it doesn't change anything but is consistent with this line: https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR157

Though on second thought I think PySequence_Fast_ITEMS() is probably a better API overall in terms of efficiency if PyTuple_GET_ITEM() would eventually become a real function call given the long term push toward a stable API.

The other issue is more serious, you are always allocating an array of CTYPES_MAX_ARGCOUNT pointers on the stack on every C callback. This could cause stack overflows in situations where a relatively deep set of callbacks are invoked. This usage of CTYPES_MAX_ARGCOUNT differs its usage in callproc.c, where in that case `alloca()` is used to allocate the specific number of array entries on the stack. To avoid an unintended stack overflow I would use alloca() or if you don't like alloca() I would only allocate a relatively small constant number on the stack.
msg412878 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 23:08
@hydroflask

Sound reasonable, I submitted a new patch to use alloca which is already used in _ctypes module.
msg412879 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-08 23:08
And it even better from performance view
msg412882 - (view) Author: (hydroflask) Date: 2022-02-09 00:15
I don't have github so I can't comment there but just as a nitpick, alloca() never returns NULL, so you don't have to check that. I know that is inconsistent with its use in callproc.c so for consistency's sake the NULL check should stay but I would remove both checks in a future change.

More importantly you can completely avoid the args_stack variable since alloca() has the same end affect. I would also add an assert that the number of args is <= CTYPES_MAX_ARGCOUNT, that should be the case since GH 31188
msg412884 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-09 00:39
I suggest to use PyTuple_GET_ITEM(), but PySequence_Fast_ITEMS() is more effecient. It's hard to beat an array in C.
msg412936 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-09 18:10
New changeset d18120cd67b4297f78bfc9bf7b36774cf0bf15f2 by Dong-hee Na in branch 'main':
bpo-46323: Reduce stack usage of ctypes python callback function. (GH-31224)
https://github.com/python/cpython/commit/d18120cd67b4297f78bfc9bf7b36774cf0bf15f2
msg412940 - (view) Author: (hydroflask) Date: 2022-02-09 20:17
@corona10

I really hope I am not being annoying at this point :) One final nit, in this line:

https://github.com/python/cpython/pull/31224/files#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR168

You do not have to check if nargs > 0. alloca() can handle the case when nargs == 0. The usage of alloca() in callproc.c does not check for nargs > 0 either.

Otherwise thanks for this enhancement!
msg412941 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-09 20:23
I failed to find the doc about alloca(0). It seems like the existing _ctypes_callproc() function *can* call alloca(0) if argtuple is empty and pIunk is 0 (pIunk is specific to Windows).
msg412942 - (view) Author: (hydroflask) Date: 2022-02-09 20:31
For reference, here are MSDN, Linux, OpenBSD, and GCC docs on alloca:

https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/alloca?view=msvc-170

https://www.man7.org/linux/man-pages/man3/alloca.3.html

https://man.openbsd.org/alloca.3

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_002a_005f_005fbuiltin_005falloca
msg412981 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-10 08:18
@hydroflask @vstinner

Okay let's remove if statement, I will send a patch soon.
msg412989 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-10 10:10
New changeset db052851a70fd95d047c6263fc16a75e4d47b3ed by Dong-hee Na in branch 'main':
bpo-46323: Allow alloca(0) for python callback function of ctypes (GH-31249)
https://github.com/python/cpython/commit/db052851a70fd95d047c6263fc16a75e4d47b3ed
msg413018 - (view) Author: (hydroflask) Date: 2022-02-10 18:05
Thanks again everyone, very much appreciated.
msg413044 - (view) Author: (hydroflask) Date: 2022-02-11 08:15
Was reviewing the code and noticed that a double-free was introduced in the recent changes:

https://github.com/python/cpython/blob/dd76b3f7d332dd6eced5cbc2ad2adfc397700b3d/Modules/_ctypes/callbacks.c#L192

That line should have been removed in https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5 but it was missed!
msg413045 - (view) Author: (hydroflask) Date: 2022-02-11 08:21
Another bug, the array returned by alloca() should be zero-initialized. If an early return happens because of an intermediate error, Py_DECREF() will be called on uninitialized memory. Also it should be Py_XDECREF() I think.
msg413046 - (view) Author: (hydroflask) Date: 2022-02-11 08:24
Ignore the previous comment :)
msg413048 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-11 08:37
> Was reviewing the code and noticed that a double-free was introduced in the recent changes:

Thanks, that looks like my mistake
msg413049 - (view) Author: (hydroflask) Date: 2022-02-11 08:46
Easy one to make, might be worth adding a test that would have exercised that code path.
msg413050 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2022-02-11 08:49
New changeset 0ac5372bf6b937ed44a8f9c4e402d024fcd80870 by Dong-hee Na in branch 'main':
bpo-46323: Fix double-free issue for borrowed refs (GH-31272)
https://github.com/python/cpython/commit/0ac5372bf6b937ed44a8f9c4e402d024fcd80870
History
Date User Action Args
2022-04-11 14:59:54adminsetgithub: 90481
2022-02-11 08:49:08corona10setmessages: + msg413050
2022-02-11 08:46:13hydroflasksetmessages: + msg413049
2022-02-11 08:37:50corona10setmessages: + msg413048
2022-02-11 08:26:42corona10setpull_requests: + pull_request29434
2022-02-11 08:24:31hydroflasksetmessages: + msg413046
2022-02-11 08:21:24hydroflasksetmessages: + msg413045
2022-02-11 08:15:42hydroflasksetmessages: + msg413044
2022-02-10 18:05:49hydroflasksetmessages: + msg413018
2022-02-10 10:10:11corona10setmessages: + msg412989
2022-02-10 09:47:13corona10setpull_requests: + pull_request29418
2022-02-10 08:18:12corona10setmessages: + msg412981
2022-02-09 20:31:20hydroflasksetmessages: + msg412942
2022-02-09 20:23:45vstinnersetmessages: + msg412941
2022-02-09 20:17:07hydroflasksetmessages: + msg412940
2022-02-09 18:10:20corona10setmessages: + msg412936
2022-02-09 00:39:04vstinnersetmessages: + msg412884
2022-02-09 00:15:07hydroflasksetmessages: + msg412882
2022-02-08 23:08:31corona10setmessages: + msg412879
2022-02-08 23:08:14corona10setmessages: + msg412878
2022-02-08 22:54:16corona10setpull_requests: + pull_request29395
2022-02-08 21:06:57hydroflasksetmessages: + msg412872
2022-02-08 13:25:11corona10setstatus: open -> closed

messages: + msg412839
stage: patch review -> resolved
2022-02-08 13:09:29corona10setmessages: + msg412837
2022-02-08 05:22:37corona10setmessages: + msg412812
2022-02-08 04:59:55corona10setmessages: + msg412810
2022-02-08 04:59:03corona10setpull_requests: + pull_request29379
2022-02-08 03:01:45corona10setmessages: + msg412807
2022-02-08 02:59:15corona10setmessages: + msg412806
2022-02-07 16:20:33hydroflasksetmessages: + msg412765
2022-02-07 15:59:09vstinnersetmessages: + msg412760
2022-02-07 15:32:31hydroflasksetmessages: + msg412759
2022-02-07 13:53:23vstinnersetmessages: + msg412745
2022-02-07 13:45:45vstinnersetmessages: + msg412744
2022-02-07 12:37:59vstinnersetpull_requests: + pull_request29359
2022-02-07 07:10:37corona10setstage: resolved -> patch review
2022-02-07 07:10:32corona10setstatus: closed -> open
resolution: wont fix ->
2022-02-07 07:08:29corona10setmessages: + msg412714
2022-02-07 07:08:12corona10setfiles: + bench_callback_v3.py
2022-02-07 07:05:50hydroflasksetmessages: + msg412713
2022-02-07 07:00:19corona10setmessages: + msg412712
2022-02-07 06:54:50corona10setnosy: + vstinner, erlendaasland
messages: + msg412711
2022-02-07 06:45:49hydroflasksetfiles: + bench_callback_v2.py

messages: + msg412710
2022-02-07 01:03:55corona10setmessages: + msg412699
2022-02-07 00:19:02hydroflasksetmessages: + msg412692
2022-02-06 23:49:34corona10setstatus: open -> closed
resolution: wont fix
messages: + msg412689

stage: patch review -> resolved
2022-02-05 06:15:20corona10setfiles: + bench_callback.py
2022-02-05 05:47:24corona10setkeywords: + patch
stage: patch review
pull_requests: + pull_request29316
2022-02-05 05:12:57corona10setassignee: corona10

nosy: + corona10
2022-01-18 11:22:16kumaradityasetnosy: - kumaraditya
2022-01-10 11:42:50vstinnersetnosy: - vstinner
2022-01-10 06:51:18kumaradityasetnosy: + vstinner, kumaraditya

messages: + msg410190
versions: - Python 3.7, Python 3.8, Python 3.9, Python 3.10
2022-01-10 04:01:21hydroflaskcreate