Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function locals and evaluation stack should be stored in a contiguous, per-thread stack #88198

Closed
markshannon opened this issue May 4, 2021 · 10 comments
Assignees
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@markshannon
Copy link
Member

BPO 44032
Nosy @vstinner, @markshannon, @pablogsal, @erlend-aasland, @Fidget-Spinner
PRs
  • bpo-44032: Move data stack to thread from FrameObject. #26076
  • bpo-44032: Defer clearing last thread state until last GC has been run. #26285
  • bpo-44032: Fix downcast conversion in frameobject.c #26291
  • bpo-44032: Move pointer to code object from frame-object to frame specials array. #26771
  • Files
  • fix.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/markshannon'
    closed_at = <Date 2021-05-25.09:56:55.025>
    created_at = <Date 2021-05-04.14:54:31.350>
    labels = ['interpreter-core', '3.11', 'performance']
    title = 'Function locals and evaluation stack should be stored in a contiguous, per-thread stack'
    updated_at = <Date 2022-02-23.15:14:24.569>
    user = 'https://github.com/markshannon'

    bugs.python.org fields:

    activity = <Date 2022-02-23.15:14:24.569>
    actor = 'vstinner'
    assignee = 'Mark.Shannon'
    closed = True
    closed_date = <Date 2021-05-25.09:56:55.025>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2021-05-04.14:54:31.350>
    creator = 'Mark.Shannon'
    dependencies = []
    files = ['50058']
    hgrepos = []
    issue_num = 44032
    keywords = ['patch']
    message_count = 10.0
    messages = ['392906', '394104', '394105', '394133', '394140', '394141', '394306', '396037', '413721', '413799']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'Mark.Shannon', 'pablogsal', 'erlendaasland', 'kj']
    pr_nums = ['26076', '26285', '26291', '26771']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue44032'
    versions = ['Python 3.11']

    @markshannon
    Copy link
    Member Author

    Currently, the local variables (inc. cell and free vars) for functions and the evaluation stack are kept in per-activation chunks in the frame object. This is not a good design for modern hardware.
    The local variables and stack are amongst the hottest memory in the VM and should be stored in a contiguous stack in memory.

    Allocating a per-thread stack would improve memory locality considerably, and pave the way to allocating heap objects for function activations lazily, only when needed for debugging and introspection.

    Generators would still need heap allocated locals and stack, but that would be no worse than currently.

    @markshannon markshannon self-assigned this May 4, 2021
    @markshannon markshannon added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels May 4, 2021
    @markshannon markshannon self-assigned this May 4, 2021
    @markshannon markshannon added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels May 4, 2021
    @markshannon markshannon changed the title Store locals and evaluation stack should be stored in a contiguous, per-thread stack Function locals and evaluation stack should be stored in a contiguous, per-thread stack May 4, 2021
    @markshannon markshannon changed the title Store locals and evaluation stack should be stored in a contiguous, per-thread stack Function locals and evaluation stack should be stored in a contiguous, per-thread stack May 4, 2021
    @erlend-aasland
    Copy link
    Contributor

    Using Pablo's (or Victor's) reproducer from bpo-44184, I'm getting a crash, apparently because _PyThreadState_PushLocals() is called after PyThreadState_Clear().

    In Python/pystate.c interpreter_clear(), we're first calling PyThreadState_Clear() (line 295), and a few lines later (line 326) we're running _PyGC_CollectNoFail(). Clearing tstate _after_ the last GC collect resolves the issue (see attached patch), but it might introduce other problems; I'm not at all familiar with this part of the codebase.

    FYI, git HEAD is at b11a951 on main.

    =================================================================
    ==22475==ERROR: AddressSanitizer: heap-use-after-free on address 0x629000000220 at pc 0x00010c985aa8 bp 0x7ffee3bab620 sp 0x7ffee3bab618
    WRITE of size 8 at 0x629000000220 thread T0
    #0 0x10c985aa7 in _PyThreadState_PushLocals pystate.c:2030
    #1 0x10c7fc8b6 in _PyEval_Vector ceval.c:5164
    #2 0x10c397ebe in _PyFunction_Vectorcall call.c:342
    #3 0x10c5591b2 in _PyObject_VectorcallTstate abstract.h:114
    #4 0x10c558f6b in PyObject_CallOneArg abstract.h:184
    #5 0x10c5586c9 in call_unbound_noarg typeobject.c:1625
    #6 0x10c57d578 in slot_tp_finalize typeobject.c:7762
    #7 0x10ca2ea0b in finalize_garbage gcmodule.c:982
    #8 0x10ca2694d in gc_collect_main gcmodule.c:1287
    #9 0x10ca25f9c in _PyGC_CollectNoFail gcmodule.c:2123
    #10 0x10c97abed in interpreter_clear pystate.c:326
    #11 0x10c97ae81 in _PyInterpreterState_Clear pystate.c:358
    #12 0x10c9697d9 in finalize_interp_clear pylifecycle.c:1634
    #13 0x10c96925b in Py_FinalizeEx pylifecycle.c:1812
    #14 0x10ca1f143 in Py_RunMain main.c:668
    #15 0x10ca203e4 in pymain_main main.c:696
    #16 0x10ca20694 in Py_BytesMain main.c:720
    #17 0x10c055a91 in main python.c:15
    #18 0x7fff20582f3c in start+0x0 (libdyld.dylib:x86_64+0x15f3c)

    0x629000000220 is located 32 bytes inside of 16384-byte region [0x629000000200,0x629000004200)
    freed by thread T0 here:
    #0 0x10d518609 in wrap_free+0xa9 (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x48609)
    #1 0x10c50a7fc in _PyObject_ArenaFree obmalloc.c:174
    #2 0x10c5098e7 in _PyObject_VirtualFree obmalloc.c:564
    #3 0x10c9808cf in PyThreadState_Clear pystate.c:904
    #4 0x10c97a178 in interpreter_clear pystate.c:295
    #5 0x10c97ae81 in _PyInterpreterState_Clear pystate.c:358
    #6 0x10c9697d9 in finalize_interp_clear pylifecycle.c:1634
    #7 0x10c96925b in Py_FinalizeEx pylifecycle.c:1812
    #8 0x10ca1f143 in Py_RunMain main.c:668
    #9 0x10ca203e4 in pymain_main main.c:696
    #10 0x10ca20694 in Py_BytesMain main.c:720
    #11 0x10c055a91 in main python.c:15
    #12 0x7fff20582f3c in start+0x0 (libdyld.dylib:x86_64+0x15f3c)

    previously allocated by thread T0 here:
    #0 0x10d5184c0 in wrap_malloc+0xa0 (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x484c0)
    #1 0x10c50a7d8 in _PyObject_ArenaMalloc obmalloc.c:168
    #2 0x10c5098af in _PyObject_VirtualAlloc obmalloc.c:558
    #3 0x10c985bb7 in allocate_chunk pystate.c:617
    #4 0x10c97d6ee in new_threadstate pystate.c:678
    #5 0x10c97cb49 in PyThreadState_New pystate.c:706
    #6 0x10c96e26b in pycore_create_interpreter pylifecycle.c:614
    #7 0x10c96ca7c in pyinit_config pylifecycle.c:834
    #8 0x10c967c3a in pyinit_core pylifecycle.c:1003
    #9 0x10c967429 in Py_InitializeFromConfig pylifecycle.c:1188
    #10 0x10ca2409f in pymain_init main.c:66
    #11 0x10ca201ee in pymain_main main.c:687
    #12 0x10ca20694 in Py_BytesMain main.c:720
    #13 0x10c055a91 in main python.c:15
    #14 0x7fff20582f3c in start+0x0 (libdyld.dylib:x86_64+0x15f3c)

    SUMMARY: AddressSanitizer: heap-use-after-free pystate.c:2030 in _PyThreadState_PushLocals
    Shadow bytes around the buggy address:
    0x1c51fffffff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x1c5200000000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x1c5200000010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x1c5200000020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x1c5200000030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    =>0x1c5200000040: fd fd fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd
    0x1c5200000050: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x1c5200000060: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x1c5200000070: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x1c5200000080: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    0x1c5200000090: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
    Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable: 00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone: fa
    Freed heap region: fd
    Stack left redzone: f1
    Stack mid redzone: f2
    Stack right redzone: f3
    Stack after return: f5
    Stack use after scope: f8
    Global redzone: f9
    Global init order: f6
    Poisoned by user: f7
    Container overflow: fc
    Array cookie: ac
    Intra object redzone: bb
    ASan internal: fe
    Left alloca redzone: ca
    Right alloca redzone: cb
    Shadow gap: cc
    ==22475==ABORTING

    @erlend-aasland
    Copy link
    Contributor

    cc. Pablo, Victor

    @pablogsal
    Copy link
    Member

    Mark, can you check the failure? Otherwise is going to start failing on the buildbots when we add the test case for the issue Erlend mentions.

    @markshannon
    Copy link
    Member Author

    What's the test case, exactly?

    ref.py for the other issue doesn't crash if I change "func.py" to "ref.py"
    otherwise it just complains that "func.py" doesn't exist.

    @pablogsal
    Copy link
    Member

    Is the test added here:

    #26274

    import ast
    import builtins
    import sys
    import os
    
    tree = ast.parse("pass")
    x = [tree]
    x.append(x)
    
    # Put the cycle somewhere to survive until the *last* GC collection
    def callback(*args):
        pass
    callback.a = x
    os.register_at_fork(after_in_child=callback)

    If this doesn't work maybe Erlend may help you reproducing this.

    @vstinner
    Copy link
    Member

    bpo-44184 has been fixed, it was unrelated. Mark fixed a bug in commit af5d497. It seems like all issues are solved, I close the issue.

    Mark: please set the version field to 3.11 for work in 3.11.

    Commits:

    commit b11a951
    Author: Mark Shannon <mark@hotpy.org>
    Date: Fri May 21 10:57:35 2021 +0100

    bpo-44032: Move data stack to thread from FrameObject. (GH-26076)
    
    * Remove 'zombie' frames. We won't need them once we are allocating fixed-size frames.
    
    * Add co_nlocalplus field to code object to avoid recomputing size of locals + frees + cells.
    
    * Move locals, cells and freevars out of frame object into separate memory buffer.
    
    * Use per-threadstate allocated memory chunks for local variables.
    
    * Move globals and builtins from frame object to per-thread stack.
    
    * Move (slow) locals frame object to per-thread stack.
    
    * Move internal frame functions to internal header.
    

    commit 498383c
    Author: Pablo Galindo <Pablogsal@gmail.com>
    Date: Fri May 21 19:15:39 2021 +0100

    bpo-44032: Fix downcast conversion in frameobject.c (GH-26291)
    

    commit af5d497
    Author: Mark Shannon <mark@hotpy.org>
    Date: Mon May 24 16:22:02 2021 +0100

    bpo-44032: Delay deletion of stack chunks until thread state is deleted. (GH-26285)
    

    @vstinner vstinner added the 3.11 only security fixes label May 25, 2021
    @vstinner vstinner added the 3.11 only security fixes label May 25, 2021
    @markshannon
    Copy link
    Member Author

    New changeset 0982ded by Mark Shannon in branch 'main':
    bpo-44032: Move pointer to code object from frame-object to frame specials array. (GH-26771)
    0982ded

    @vstinner
    Copy link
    Member

    Removing the PyFrameObject.f_code member (commit b11a951) broke by gevent project:

    @vstinner
    Copy link
    Member

    I created bpo-46836: "[C API] Move PyFrameObject to the internal C API".

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants