msg410185 - (view) |
Author: (hydroflask) |
Date: 2022-01-10 04:01 |
_CallPythonObject() in Modules/_ctypes/callbacks.c still uses the old API (PyObject_CallObject()) to invoke methods. It should use _PyObject_Vectorcall(). Creating an issue for this since this method is a little more performance sensitive than normal Python library code.
|
msg410190 - (view) |
Author: Kumar Aditya (kumaraditya) * |
Date: 2022-01-10 06:51 |
Targeting Python 3.11 as it a performance improvement.
|
msg412689 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-06 23:49 |
@hydroflask
It does not show impressive performance enhancement, so I decided not to change it
See PR 31138
|
msg412692 - (view) |
Author: (hydroflask) |
Date: 2022-02-07 00:19 |
@corona10
Thank you for looking into it.
The benchmark you did is not reflective of my use-case. I have a codebase that I have completely converted into a C extension using mypyc. So the ctypes callback does not call into interpreted Python code but actually another C-extension. I have annotated types aggressively to avoid invoking malloc(). In this use-case the extra overhead from the tuple allocation would be significant.
Additionally your benchmark may not be as micro as intended. You call some math in the callback function which may have made the savings from avoiding the tuple allocation seem more like noise.
Since you already did the work I still think it's worth putting it in since PyObject_Vectorcall is an established API and it would have been used anyway if this were written from scratch.
|
msg412699 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-07 01:03 |
@hydroflask
I am one of the big fans of applying vectorcall :)
> The benchmark you did is not reflective of my use-case. I have a codebase that I have completely converted into a C extension using mypyc.
Can you provide me a minimal benchmarkable script(if you need to include mypyc it's okay) to persuade other core dev?
|
msg412710 - (view) |
Author: (hydroflask) |
Date: 2022-02-07 06:45 |
Okay I put together a benchmark that better exemplifies my use case and results are in favor of adding patch.
When bench_callback_v2.py is not compiled with mypyc I see a ~10% reduction in runtime:
$ unpatched-env/bin/python bench_callback_v2.py
1.1262263769749552
$ patched-env/bin/python bench_callback_v2.py
1.0174838998354971
When bench_callback_v2.py is compiled with mypyc, I am getting ~6% reduction in runtime:
$ unpatched-env/bin/python -c "import bench_callback_v2"
1.0056699379347265
$ patched-env/bin/python -c "import bench_callback_v2"
0.9415687420405447
bench_callback_v2.py is attached.
|
msg412711 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-07 06:54 |
@vstinner
I got a similar result from hydroflask's benchmark.
I think that is worth reopening this issue WDYT?
|
msg412712 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-07 07:00 |
@hydroflask
Would you like to run your benchmark with my latest PR?
|
msg412713 - (view) |
Author: (hydroflask) |
Date: 2022-02-07 07:05 |
Vanilla bench_callback_v2.py, not compiled with mypyc:
$ env/bin/python bench_callback_v2.py
1.114047016017139
$ env2/bin/python bench_callback_v2.py
0.9644024870358407
~13% reduction in runtime. Nice job!
|
msg412714 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-07 07:08 |
With v3 benchmark
Mean +- std dev: [base] 1.32 us +- 0.02 us -> [opt] 1.18 us +- 0.02 us: 1.13x faster
|
msg412744 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2022-02-07 13:45 |
A benchmark on calling C functions using ctypes sounds better than a benchmark calling Python functions. For Python functions, Python objects are converted to C types, and then C types are convered back to Python objects, the Python result is converted to a C type, and then the C type is converted back to a Python object for the final result...
It would be nice to have a benchmark on different number of arguments. Like: 0, 1, 3, 5, 8, 10.
|
msg412745 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2022-02-07 13:53 |
New changeset 4cce1352bb47babaeefb68fcfcc48ffa073745c3 by Victor Stinner in branch 'main':
bpo-46323: _ctypes.CFuncPtr fails if _argtypes_ is too long (GH-31188)
https://github.com/python/cpython/commit/4cce1352bb47babaeefb68fcfcc48ffa073745c3
|
msg412759 - (view) |
Author: (hydroflask) |
Date: 2022-02-07 15:32 |
> A benchmark on calling C functions using ctypes sounds better than a benchmark calling Python functions.
Calling C functions from Python is not the code path handled by _CallPythonObject() so no difference in run-time would theoretically be observed by that benchmark for this patch. This bug report pertains to code paths where a C function calls back into a Python function. A practical example is using Python with an event loop library written in C.
|
msg412760 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2022-02-07 15:59 |
Oh ok, thanks for the clarification :-)
|
msg412765 - (view) |
Author: (hydroflask) |
Date: 2022-02-07 16:20 |
Just to clarify further, the original benchmark by corona10 did indeed lead to `_CallPythonObject` being invoked but it was not quite the normal use-case. In the original benchmark it invoked the generated ctypes thunk via Python so as vstinner said it was doing this:
Python -> converters -> thunk-> _CallPythonObject -> converters-> Python
Namely using `PyCFuncPtr_call` to invoke the thunk generated by `_ctypes_alloc_callback`. Normally when invoking C functions via `PyCFuncPtr_call` it looks like this:
Python -> converters -> C_function
In the original benchmark setup no significant reduction in runtime was observed by this patch. I noticed that it was not a typical use of `_CallPythonObject`, where instead it would be a top-level C function calling back into Python. Something like this:
C -> thunk -> _CallPythonObject() -> Python
The benchmark I provided exercises that use case and the >=10% reduction in runtime was observed. Thanks to both corona10 and vstinner, I appreciate their effort in this issue.
|
msg412806 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 02:59 |
@vstinner
PR 31188 cause reference leak, please check
Raised RLIMIT_NOFILE: 256 -> 1024
0:00:00 load avg: 5.38 Run tests sequentially
0:00:00 load avg: 5.38 [1/1] test_ctypes
beginning 6 repetitions
123456
......
test_ctypes leaked [1028, 1026, 1028] references, sum=3082
test_ctypes leaked [2, 1, 3] memory blocks, sum=6
test_ctypes failed (reference leak)
== Tests result: FAILURE ==
1 test failed:
test_ctypes
Total duration: 3.2 sec
Tests result: FAILURE
|
msg412807 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 03:01 |
No leak after if I revert PR 31188 on my local
|
msg412810 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 04:59 |
@vstinner
PR 31209 is for fixing the leak.
|
msg412812 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 05:22 |
New changeset e959dd9f5c1d8865d4e10c49eb30ee0a4fe2735d by Dong-hee Na in branch 'main':
bpo-46323 Fix ref leak if ctypes.CFuncPtr raises an error. (GH-31209)
https://github.com/python/cpython/commit/e959dd9f5c1d8865d4e10c49eb30ee0a4fe2735d
|
msg412837 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 13:09 |
New changeset b5527688aae11d0b5af58176267a9943576e71e5 by Dong-hee Na in branch 'main':
bpo-46323: Use PyObject_Vectorcall while calling ctypes callback function (GH-31138)
https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5
|
msg412839 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 13:25 |
@hydroflask
All patches are merged!
Thanks for all your suggestions :)
|
msg412872 - (view) |
Author: (hydroflask) |
Date: 2022-02-08 21:06 |
Two things, one is a nit pick the other is more serious. I think vstinner mentioned this in his review of your patch but on this line:
https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR180
Instead of using PySequence_Fast_ITEMS(), you can just use PyTuple_GET_ITEM() since you know the converters are a tuple. In terms of runtime efficiency it doesn't change anything but is consistent with this line: https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR157
Though on second thought I think PySequence_Fast_ITEMS() is probably a better API overall in terms of efficiency if PyTuple_GET_ITEM() would eventually become a real function call given the long term push toward a stable API.
The other issue is more serious, you are always allocating an array of CTYPES_MAX_ARGCOUNT pointers on the stack on every C callback. This could cause stack overflows in situations where a relatively deep set of callbacks are invoked. This usage of CTYPES_MAX_ARGCOUNT differs its usage in callproc.c, where in that case `alloca()` is used to allocate the specific number of array entries on the stack. To avoid an unintended stack overflow I would use alloca() or if you don't like alloca() I would only allocate a relatively small constant number on the stack.
|
msg412878 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 23:08 |
@hydroflask
Sound reasonable, I submitted a new patch to use alloca which is already used in _ctypes module.
|
msg412879 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-08 23:08 |
And it even better from performance view
|
msg412882 - (view) |
Author: (hydroflask) |
Date: 2022-02-09 00:15 |
I don't have github so I can't comment there but just as a nitpick, alloca() never returns NULL, so you don't have to check that. I know that is inconsistent with its use in callproc.c so for consistency's sake the NULL check should stay but I would remove both checks in a future change.
More importantly you can completely avoid the args_stack variable since alloca() has the same end affect. I would also add an assert that the number of args is <= CTYPES_MAX_ARGCOUNT, that should be the case since GH 31188
|
msg412884 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2022-02-09 00:39 |
I suggest to use PyTuple_GET_ITEM(), but PySequence_Fast_ITEMS() is more effecient. It's hard to beat an array in C.
|
msg412936 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-09 18:10 |
New changeset d18120cd67b4297f78bfc9bf7b36774cf0bf15f2 by Dong-hee Na in branch 'main':
bpo-46323: Reduce stack usage of ctypes python callback function. (GH-31224)
https://github.com/python/cpython/commit/d18120cd67b4297f78bfc9bf7b36774cf0bf15f2
|
msg412940 - (view) |
Author: (hydroflask) |
Date: 2022-02-09 20:17 |
@corona10
I really hope I am not being annoying at this point :) One final nit, in this line:
https://github.com/python/cpython/pull/31224/files#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR168
You do not have to check if nargs > 0. alloca() can handle the case when nargs == 0. The usage of alloca() in callproc.c does not check for nargs > 0 either.
Otherwise thanks for this enhancement!
|
msg412941 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2022-02-09 20:23 |
I failed to find the doc about alloca(0). It seems like the existing _ctypes_callproc() function *can* call alloca(0) if argtuple is empty and pIunk is 0 (pIunk is specific to Windows).
|
msg412942 - (view) |
Author: (hydroflask) |
Date: 2022-02-09 20:31 |
For reference, here are MSDN, Linux, OpenBSD, and GCC docs on alloca:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/alloca?view=msvc-170
https://www.man7.org/linux/man-pages/man3/alloca.3.html
https://man.openbsd.org/alloca.3
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_002a_005f_005fbuiltin_005falloca
|
msg412981 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-10 08:18 |
@hydroflask @vstinner
Okay let's remove if statement, I will send a patch soon.
|
msg412989 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-10 10:10 |
New changeset db052851a70fd95d047c6263fc16a75e4d47b3ed by Dong-hee Na in branch 'main':
bpo-46323: Allow alloca(0) for python callback function of ctypes (GH-31249)
https://github.com/python/cpython/commit/db052851a70fd95d047c6263fc16a75e4d47b3ed
|
msg413018 - (view) |
Author: (hydroflask) |
Date: 2022-02-10 18:05 |
Thanks again everyone, very much appreciated.
|
msg413044 - (view) |
Author: (hydroflask) |
Date: 2022-02-11 08:15 |
Was reviewing the code and noticed that a double-free was introduced in the recent changes:
https://github.com/python/cpython/blob/dd76b3f7d332dd6eced5cbc2ad2adfc397700b3d/Modules/_ctypes/callbacks.c#L192
That line should have been removed in https://github.com/python/cpython/commit/b5527688aae11d0b5af58176267a9943576e71e5 but it was missed!
|
msg413045 - (view) |
Author: (hydroflask) |
Date: 2022-02-11 08:21 |
Another bug, the array returned by alloca() should be zero-initialized. If an early return happens because of an intermediate error, Py_DECREF() will be called on uninitialized memory. Also it should be Py_XDECREF() I think.
|
msg413046 - (view) |
Author: (hydroflask) |
Date: 2022-02-11 08:24 |
Ignore the previous comment :)
|
msg413048 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-11 08:37 |
> Was reviewing the code and noticed that a double-free was introduced in the recent changes:
Thanks, that looks like my mistake
|
msg413049 - (view) |
Author: (hydroflask) |
Date: 2022-02-11 08:46 |
Easy one to make, might be worth adding a test that would have exercised that code path.
|
msg413050 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2022-02-11 08:49 |
New changeset 0ac5372bf6b937ed44a8f9c4e402d024fcd80870 by Dong-hee Na in branch 'main':
bpo-46323: Fix double-free issue for borrowed refs (GH-31272)
https://github.com/python/cpython/commit/0ac5372bf6b937ed44a8f9c4e402d024fcd80870
|
|
Date |
User |
Action |
Args |
2022-04-11 14:59:54 | admin | set | github: 90481 |
2022-02-11 08:49:08 | corona10 | set | messages:
+ msg413050 |
2022-02-11 08:46:13 | hydroflask | set | messages:
+ msg413049 |
2022-02-11 08:37:50 | corona10 | set | messages:
+ msg413048 |
2022-02-11 08:26:42 | corona10 | set | pull_requests:
+ pull_request29434 |
2022-02-11 08:24:31 | hydroflask | set | messages:
+ msg413046 |
2022-02-11 08:21:24 | hydroflask | set | messages:
+ msg413045 |
2022-02-11 08:15:42 | hydroflask | set | messages:
+ msg413044 |
2022-02-10 18:05:49 | hydroflask | set | messages:
+ msg413018 |
2022-02-10 10:10:11 | corona10 | set | messages:
+ msg412989 |
2022-02-10 09:47:13 | corona10 | set | pull_requests:
+ pull_request29418 |
2022-02-10 08:18:12 | corona10 | set | messages:
+ msg412981 |
2022-02-09 20:31:20 | hydroflask | set | messages:
+ msg412942 |
2022-02-09 20:23:45 | vstinner | set | messages:
+ msg412941 |
2022-02-09 20:17:07 | hydroflask | set | messages:
+ msg412940 |
2022-02-09 18:10:20 | corona10 | set | messages:
+ msg412936 |
2022-02-09 00:39:04 | vstinner | set | messages:
+ msg412884 |
2022-02-09 00:15:07 | hydroflask | set | messages:
+ msg412882 |
2022-02-08 23:08:31 | corona10 | set | messages:
+ msg412879 |
2022-02-08 23:08:14 | corona10 | set | messages:
+ msg412878 |
2022-02-08 22:54:16 | corona10 | set | pull_requests:
+ pull_request29395 |
2022-02-08 21:06:57 | hydroflask | set | messages:
+ msg412872 |
2022-02-08 13:25:11 | corona10 | set | status: open -> closed
messages:
+ msg412839 stage: patch review -> resolved |
2022-02-08 13:09:29 | corona10 | set | messages:
+ msg412837 |
2022-02-08 05:22:37 | corona10 | set | messages:
+ msg412812 |
2022-02-08 04:59:55 | corona10 | set | messages:
+ msg412810 |
2022-02-08 04:59:03 | corona10 | set | pull_requests:
+ pull_request29379 |
2022-02-08 03:01:45 | corona10 | set | messages:
+ msg412807 |
2022-02-08 02:59:15 | corona10 | set | messages:
+ msg412806 |
2022-02-07 16:20:33 | hydroflask | set | messages:
+ msg412765 |
2022-02-07 15:59:09 | vstinner | set | messages:
+ msg412760 |
2022-02-07 15:32:31 | hydroflask | set | messages:
+ msg412759 |
2022-02-07 13:53:23 | vstinner | set | messages:
+ msg412745 |
2022-02-07 13:45:45 | vstinner | set | messages:
+ msg412744 |
2022-02-07 12:37:59 | vstinner | set | pull_requests:
+ pull_request29359 |
2022-02-07 07:10:37 | corona10 | set | stage: resolved -> patch review |
2022-02-07 07:10:32 | corona10 | set | status: closed -> open resolution: wont fix -> |
2022-02-07 07:08:29 | corona10 | set | messages:
+ msg412714 |
2022-02-07 07:08:12 | corona10 | set | files:
+ bench_callback_v3.py |
2022-02-07 07:05:50 | hydroflask | set | messages:
+ msg412713 |
2022-02-07 07:00:19 | corona10 | set | messages:
+ msg412712 |
2022-02-07 06:54:50 | corona10 | set | nosy:
+ vstinner, erlendaasland messages:
+ msg412711
|
2022-02-07 06:45:49 | hydroflask | set | files:
+ bench_callback_v2.py
messages:
+ msg412710 |
2022-02-07 01:03:55 | corona10 | set | messages:
+ msg412699 |
2022-02-07 00:19:02 | hydroflask | set | messages:
+ msg412692 |
2022-02-06 23:49:34 | corona10 | set | status: open -> closed resolution: wont fix messages:
+ msg412689
stage: patch review -> resolved |
2022-02-05 06:15:20 | corona10 | set | files:
+ bench_callback.py |
2022-02-05 05:47:24 | corona10 | set | keywords:
+ patch stage: patch review pull_requests:
+ pull_request29316 |
2022-02-05 05:12:57 | corona10 | set | assignee: corona10
nosy:
+ corona10 |
2022-01-18 11:22:16 | kumaraditya | set | nosy:
- kumaraditya
|
2022-01-10 11:42:50 | vstinner | set | nosy:
- vstinner
|
2022-01-10 06:51:18 | kumaraditya | set | nosy:
+ vstinner, kumaraditya
messages:
+ msg410190 versions:
- Python 3.7, Python 3.8, Python 3.9, Python 3.10 |
2022-01-10 04:01:21 | hydroflask | create | |