New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use _PyObject_Vectorcall in Modules/_ctypes/callbacks.c #90481
Comments
_CallPythonObject() in Modules/_ctypes/callbacks.c still uses the old API (PyObject_CallObject()) to invoke methods. It should use _PyObject_Vectorcall(). Creating an issue for this since this method is a little more performance sensitive than normal Python library code. |
Targeting Python 3.11 as it a performance improvement. |
It does not show impressive performance enhancement, so I decided not to change it |
Thank you for looking into it. The benchmark you did is not reflective of my use-case. I have a codebase that I have completely converted into a C extension using mypyc. So the ctypes callback does not call into interpreted Python code but actually another C-extension. I have annotated types aggressively to avoid invoking malloc(). In this use-case the extra overhead from the tuple allocation would be significant. Additionally your benchmark may not be as micro as intended. You call some math in the callback function which may have made the savings from avoiding the tuple allocation seem more like noise. Since you already did the work I still think it's worth putting it in since PyObject_Vectorcall is an established API and it would have been used anyway if this were written from scratch. |
Calling C functions from Python is not the code path handled by _CallPythonObject() so no difference in run-time would theoretically be observed by that benchmark for this patch. This bug report pertains to code paths where a C function calls back into a Python function. A practical example is using Python with an event loop library written in C. |
Oh ok, thanks for the clarification :-) |
Just to clarify further, the original benchmark by corona10 did indeed lead to
Namely using
In the original benchmark setup no significant reduction in runtime was observed by this patch. I noticed that it was not a typical use of
The benchmark I provided exercises that use case and the >=10% reduction in runtime was observed. Thanks to both corona10 and vstinner, I appreciate their effort in this issue. |
PR 31188 cause reference leak, please check Raised RLIMIT_NOFILE: 256 -> 1024 == Tests result: FAILURE == 1 test failed: Total duration: 3.2 sec |
No leak after if I revert PR 31188 on my local |
All patches are merged! |
Two things, one is a nit pick the other is more serious. I think vstinner mentioned this in his review of your patch but on this line: b552768#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR180 Instead of using PySequence_Fast_ITEMS(), you can just use PyTuple_GET_ITEM() since you know the converters are a tuple. In terms of runtime efficiency it doesn't change anything but is consistent with this line: b552768#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR157 Though on second thought I think PySequence_Fast_ITEMS() is probably a better API overall in terms of efficiency if PyTuple_GET_ITEM() would eventually become a real function call given the long term push toward a stable API. The other issue is more serious, you are always allocating an array of CTYPES_MAX_ARGCOUNT pointers on the stack on every C callback. This could cause stack overflows in situations where a relatively deep set of callbacks are invoked. This usage of CTYPES_MAX_ARGCOUNT differs its usage in callproc.c, where in that case |
Sound reasonable, I submitted a new patch to use alloca which is already used in _ctypes module. |
And it even better from performance view |
I don't have github so I can't comment there but just as a nitpick, alloca() never returns NULL, so you don't have to check that. I know that is inconsistent with its use in callproc.c so for consistency's sake the NULL check should stay but I would remove both checks in a future change. More importantly you can completely avoid the args_stack variable since alloca() has the same end affect. I would also add an assert that the number of args is <= CTYPES_MAX_ARGCOUNT, that should be the case since GH 31188 |
I suggest to use PyTuple_GET_ITEM(), but PySequence_Fast_ITEMS() is more effecient. It's hard to beat an array in C. |
I really hope I am not being annoying at this point :) One final nit, in this line: You do not have to check if nargs > 0. alloca() can handle the case when nargs == 0. The usage of alloca() in callproc.c does not check for nargs > 0 either. Otherwise thanks for this enhancement! |
I failed to find the doc about alloca(0). It seems like the existing _ctypes_callproc() function *can* call alloca(0) if argtuple is empty and pIunk is 0 (pIunk is specific to Windows). |
For reference, here are MSDN, Linux, OpenBSD, and GCC docs on alloca: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/alloca?view=msvc-170 https://www.man7.org/linux/man-pages/man3/alloca.3.html https://man.openbsd.org/alloca.3 https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_002a_005f_005fbuiltin_005falloca |
Okay let's remove if statement, I will send a patch soon. |
Thanks again everyone, very much appreciated. |
Was reviewing the code and noticed that a double-free was introduced in the recent changes: cpython/Modules/_ctypes/callbacks.c Line 192 in dd76b3f
That line should have been removed in b552768 but it was missed! |
Another bug, the array returned by alloca() should be zero-initialized. If an early return happens because of an intermediate error, Py_DECREF() will be called on uninitialized memory. Also it should be Py_XDECREF() I think. |
Ignore the previous comment :) |
Thanks, that looks like my mistake |
Easy one to make, might be worth adding a test that would have exercised that code path. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: