Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use _PyObject_Vectorcall in Modules/_ctypes/callbacks.c #90481

Closed
hydroflask mannequin opened this issue Jan 10, 2022 · 39 comments
Closed

Use _PyObject_Vectorcall in Modules/_ctypes/callbacks.c #90481

hydroflask mannequin opened this issue Jan 10, 2022 · 39 comments
Assignees
Labels
3.11 bug and security fixes performance Performance or resource usage topic-ctypes

Comments

@hydroflask
Copy link
Mannequin

hydroflask mannequin commented Jan 10, 2022

BPO 46323
Nosy @vstinner, @corona10, @erlend-aasland
PRs
  • bpo-46323: Use PyObject_Vectorcall while calling ctypes callback function #31138
  • bpo-46323: _ctypes.CFuncPtr fails if _argtypes_ is too long #31188
  • bpo-46323 Fix ref leak if ctypes.CFuncPtr raises an error. #31209
  • bpo-46323: Reduce stack usage of ctypes python callback function. #31224
  • bpo-46323: Allow alloca(0) for python callback function of ctypes #31249
  • bpo-46323: Fix double-free issue for borrowed refs #31272
  • Files
  • bench_callback.py
  • bench_callback_v2.py
  • bench_callback_v3.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/corona10'
    closed_at = <Date 2022-02-08.13:25:11.271>
    created_at = <Date 2022-01-10.04:01:21.420>
    labels = ['ctypes', '3.11', 'performance']
    title = 'Use _PyObject_Vectorcall in Modules/_ctypes/callbacks.c'
    updated_at = <Date 2022-02-11.08:49:08.498>
    user = 'https://bugs.python.org/hydroflask'

    bugs.python.org fields:

    activity = <Date 2022-02-11.08:49:08.498>
    actor = 'corona10'
    assignee = 'corona10'
    closed = True
    closed_date = <Date 2022-02-08.13:25:11.271>
    closer = 'corona10'
    components = ['ctypes']
    creation = <Date 2022-01-10.04:01:21.420>
    creator = 'hydroflask'
    dependencies = []
    files = ['50604', '50609', '50610']
    hgrepos = []
    issue_num = 46323
    keywords = ['patch']
    message_count = 39.0
    messages = ['410185', '410190', '412689', '412692', '412699', '412710', '412711', '412712', '412713', '412714', '412744', '412745', '412759', '412760', '412765', '412806', '412807', '412810', '412812', '412837', '412839', '412872', '412878', '412879', '412882', '412884', '412936', '412940', '412941', '412942', '412981', '412989', '413018', '413044', '413045', '413046', '413048', '413049', '413050']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'hydroflask', 'corona10', 'erlendaasland']
    pr_nums = ['31138', '31188', '31209', '31224', '31249', '31272']
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46323'
    versions = ['Python 3.11']

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Jan 10, 2022

    _CallPythonObject() in Modules/_ctypes/callbacks.c still uses the old API (PyObject_CallObject()) to invoke methods. It should use _PyObject_Vectorcall(). Creating an issue for this since this method is a little more performance sensitive than normal Python library code.

    @hydroflask hydroflask mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes 3.11 bug and security fixes topic-ctypes performance Performance or resource usage labels Jan 10, 2022
    @kumaraditya303
    Copy link
    Contributor

    Targeting Python 3.11 as it a performance improvement.

    @kumaraditya303 kumaraditya303 removed 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels Jan 10, 2022
    @corona10
    Copy link
    Member

    corona10 commented Feb 6, 2022

    @HydroFlask

    It does not show impressive performance enhancement, so I decided not to change it
    See PR 31138

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 7, 2022

    @corona10

    Thank you for looking into it.

    The benchmark you did is not reflective of my use-case. I have a codebase that I have completely converted into a C extension using mypyc. So the ctypes callback does not call into interpreted Python code but actually another C-extension. I have annotated types aggressively to avoid invoking malloc(). In this use-case the extra overhead from the tuple allocation would be significant.

    Additionally your benchmark may not be as micro as intended. You call some math in the callback function which may have made the savings from avoiding the tuple allocation seem more like noise.

    Since you already did the work I still think it's worth putting it in since PyObject_Vectorcall is an established API and it would have been used anyway if this were written from scratch.

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 7, 2022

    A benchmark on calling C functions using ctypes sounds better than a benchmark calling Python functions.

    Calling C functions from Python is not the code path handled by _CallPythonObject() so no difference in run-time would theoretically be observed by that benchmark for this patch. This bug report pertains to code paths where a C function calls back into a Python function. A practical example is using Python with an event loop library written in C.

    @vstinner
    Copy link
    Member

    vstinner commented Feb 7, 2022

    Oh ok, thanks for the clarification :-)

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 7, 2022

    Just to clarify further, the original benchmark by corona10 did indeed lead to _CallPythonObject being invoked but it was not quite the normal use-case. In the original benchmark it invoked the generated ctypes thunk via Python so as vstinner said it was doing this:

    Python -> converters -> thunk-> _CallPythonObject -> converters-> Python
    

    Namely using PyCFuncPtr_call to invoke the thunk generated by _ctypes_alloc_callback. Normally when invoking C functions via PyCFuncPtr_call it looks like this:

    Python -> converters -> C_function
    

    In the original benchmark setup no significant reduction in runtime was observed by this patch. I noticed that it was not a typical use of _CallPythonObject, where instead it would be a top-level C function calling back into Python. Something like this:

    C -> thunk -> _CallPythonObject() -> Python
    

    The benchmark I provided exercises that use case and the >=10% reduction in runtime was observed. Thanks to both corona10 and vstinner, I appreciate their effort in this issue.

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    @vstinner

    PR 31188 cause reference leak, please check

    Raised RLIMIT_NOFILE: 256 -> 1024
    0:00:00 load avg: 5.38 Run tests sequentially
    0:00:00 load avg: 5.38 [1/1] test_ctypes
    beginning 6 repetitions
    123456
    ......
    test_ctypes leaked [1028, 1026, 1028] references, sum=3082
    test_ctypes leaked [2, 1, 3] memory blocks, sum=6
    test_ctypes failed (reference leak)

    == Tests result: FAILURE ==

    1 test failed:
    test_ctypes

    Total duration: 3.2 sec
    Tests result: FAILURE

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    No leak after if I revert PR 31188 on my local

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    @vstinner

    PR 31209 is for fixing the leak.

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    New changeset e959dd9 by Dong-hee Na in branch 'main':
    bpo-46323 Fix ref leak if ctypes.CFuncPtr raises an error. (GH-31209)
    e959dd9

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    New changeset b552768 by Dong-hee Na in branch 'main':
    bpo-46323: Use PyObject_Vectorcall while calling ctypes callback function (GH-31138)
    b552768

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    @HydroFlask

    All patches are merged!
    Thanks for all your suggestions :)

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 8, 2022

    Two things, one is a nit pick the other is more serious. I think vstinner mentioned this in his review of your patch but on this line:

    b552768#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR180

    Instead of using PySequence_Fast_ITEMS(), you can just use PyTuple_GET_ITEM() since you know the converters are a tuple. In terms of runtime efficiency it doesn't change anything but is consistent with this line: b552768#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR157

    Though on second thought I think PySequence_Fast_ITEMS() is probably a better API overall in terms of efficiency if PyTuple_GET_ITEM() would eventually become a real function call given the long term push toward a stable API.

    The other issue is more serious, you are always allocating an array of CTYPES_MAX_ARGCOUNT pointers on the stack on every C callback. This could cause stack overflows in situations where a relatively deep set of callbacks are invoked. This usage of CTYPES_MAX_ARGCOUNT differs its usage in callproc.c, where in that case alloca() is used to allocate the specific number of array entries on the stack. To avoid an unintended stack overflow I would use alloca() or if you don't like alloca() I would only allocate a relatively small constant number on the stack.

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    @HydroFlask

    Sound reasonable, I submitted a new patch to use alloca which is already used in _ctypes module.

    @corona10
    Copy link
    Member

    corona10 commented Feb 8, 2022

    And it even better from performance view

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 9, 2022

    I don't have github so I can't comment there but just as a nitpick, alloca() never returns NULL, so you don't have to check that. I know that is inconsistent with its use in callproc.c so for consistency's sake the NULL check should stay but I would remove both checks in a future change.

    More importantly you can completely avoid the args_stack variable since alloca() has the same end affect. I would also add an assert that the number of args is <= CTYPES_MAX_ARGCOUNT, that should be the case since GH 31188

    @vstinner
    Copy link
    Member

    vstinner commented Feb 9, 2022

    I suggest to use PyTuple_GET_ITEM(), but PySequence_Fast_ITEMS() is more effecient. It's hard to beat an array in C.

    @corona10
    Copy link
    Member

    corona10 commented Feb 9, 2022

    New changeset d18120c by Dong-hee Na in branch 'main':
    bpo-46323: Reduce stack usage of ctypes python callback function. (GH-31224)
    d18120c

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 9, 2022

    @corona10

    I really hope I am not being annoying at this point :) One final nit, in this line:

    https://github.com/python/cpython/pull/31224/files#diff-706e65ee28911740bf638707e19578b8182e57c6a8a9a4a91105d825f95a139dR168

    You do not have to check if nargs > 0. alloca() can handle the case when nargs == 0. The usage of alloca() in callproc.c does not check for nargs > 0 either.

    Otherwise thanks for this enhancement!

    @vstinner
    Copy link
    Member

    vstinner commented Feb 9, 2022

    I failed to find the doc about alloca(0). It seems like the existing _ctypes_callproc() function *can* call alloca(0) if argtuple is empty and pIunk is 0 (pIunk is specific to Windows).

    @corona10
    Copy link
    Member

    @HydroFlask @vstinner

    Okay let's remove if statement, I will send a patch soon.

    @corona10
    Copy link
    Member

    New changeset db05285 by Dong-hee Na in branch 'main':
    bpo-46323: Allow alloca(0) for python callback function of ctypes (GH-31249)
    db05285

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 10, 2022

    Thanks again everyone, very much appreciated.

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 11, 2022

    Was reviewing the code and noticed that a double-free was introduced in the recent changes:

    Py_DECREF(cnv);

    That line should have been removed in b552768 but it was missed!

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 11, 2022

    Another bug, the array returned by alloca() should be zero-initialized. If an early return happens because of an intermediate error, Py_DECREF() will be called on uninitialized memory. Also it should be Py_XDECREF() I think.

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 11, 2022

    Ignore the previous comment :)

    @corona10
    Copy link
    Member

    Was reviewing the code and noticed that a double-free was introduced in the recent changes:

    Thanks, that looks like my mistake

    @hydroflask
    Copy link
    Mannequin Author

    hydroflask mannequin commented Feb 11, 2022

    Easy one to make, might be worth adding a test that would have exercised that code path.

    @corona10
    Copy link
    Member

    New changeset 0ac5372 by Dong-hee Na in branch 'main':
    bpo-46323: Fix double-free issue for borrowed refs (GH-31272)
    0ac5372

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 bug and security fixes performance Performance or resource usage topic-ctypes
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants