Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Daemon thread is crashing in PyEval_RestoreThread() while the main thread is exiting the process #84058

Closed
vstinner opened this issue Mar 6, 2020 · 20 comments
Labels
3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@vstinner
Copy link
Member

vstinner commented Mar 6, 2020

BPO 39877
Nosy @ncoghlan, @vstinner, @ericsnowcurrently, @pablogsal, @nanjekyejoannah
PRs
  • bpo-39877: Fix PyEval_RestoreThread() for daemon threads #18808
  • bpo-39877: Fix PyEval_RestoreThread() for daemon threads #18811
  • bpo-39877: _PyRuntimeState.finalizing becomes atomic #18816
  • bpo-19466: Py_Finalize() clears daemon threads earlier #18848
  • bpo-39877: take_gil() now exits the thread if finalizing #18854
  • bpo-39877: Remove useless PyEval_InitThreads() calls #18883
  • bpo-39877: Py_Initialize() pass tstate to PyEval_InitThreads() #18884
  • bpo-39877: Refactor take_gil() function #18885
  • bpo-39877: take_gil() checks tstate_must_exit() twice #18888
  • bpo-39877: take_gil() checks tstate_must_exit() twice #18890
  • bpo-39877: PyGILState_Ensure() don't call PyEval_InitThreads() #18891
  • bpo-39877: Deprecate PyEval_InitThreads() #18892
  • bpo-39877: Fix take_gil() for daemon threads #19054
  • bpo-39877: 4th take_gil() fix for daemon threads #19080
  • Files
  • daemon_threads_exit.py
  • slow_exit.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-03-11.23:14:10.705>
    created_at = <Date 2020-03-06.14:22:28.647>
    labels = ['interpreter-core', '3.9']
    title = 'Daemon thread is crashing in PyEval_RestoreThread() while the main thread is exiting the process'
    updated_at = <Date 2020-03-19.18:48:34.523>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-03-19.18:48:34.523>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-03-11.23:14:10.705>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2020-03-06.14:22:28.647>
    creator = 'vstinner'
    dependencies = []
    files = ['48958', '48959']
    hgrepos = []
    issue_num = 39877
    keywords = ['patch']
    message_count = 20.0
    messages = ['363512', '363517', '363519', '363520', '363566', '363649', '363653', '363667', '363764', '363767', '363773', '363776', '363790', '363791', '363792', '363795', '363981', '363983', '364496', '364629']
    nosy_count = 5.0
    nosy_names = ['ncoghlan', 'vstinner', 'eric.snow', 'pablogsal', 'nanjekyejoannah']
    pr_nums = ['18808', '18811', '18816', '18848', '18854', '18883', '18884', '18885', '18888', '18890', '18891', '18892', '19054', '19080']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue39877'
    versions = ['Python 3.9']

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 6, 2020

    Sometimes, test_multiprocessing_spawn does crash in PyEval_RestoreThread() on FreeBSD with a coredump. This issue should be the root cause of bpo-39088: "test_concurrent_futures crashed with python.core core dump on AMD64 FreeBSD Shared 3.x", where the second comment is a test_multiprocessing_spawn failure with "... After: ['python.core'] ..."

    # Thread 1

    (gdb) frame
    #0 0x00000000003b518c in PyEval_RestoreThread (tstate=0x801f23790) at Python/ceval.c:387
    387 _PyRuntimeState *runtime = tstate->interp->runtime;
    (gdb) p tstate->interp
    $3 = (PyInterpreterState *) 0xdddddddddddddddd

    (gdb) info threads
    Id Target Id Frame

    • 1 LWP 100839 0x00000000003b518c in PyEval_RestoreThread (tstate=0x801f23790) at Python/ceval.c:387
      2 LWP 100230 0x00000008006fbcfc in _fini () from /lib/libm.so.5
      3 LWP 100192 _accept4 () at _accept4.S:3

    # Thread 2

    (gdb) thread 2
    [Switching to thread 2 (LWP 100230)]
    #0 0x00000008006fbcfc in _fini () from /lib/libm.so.5

    (gdb) where
    (...)
    #4 0x0000000800859431 in exit (status=0) at /usr/src/lib/libc/stdlib/exit.c:74
    #5 0x000000000048f3d8 in Py_Exit (sts=0) at Python/pylifecycle.c:2349
    (...)

    The problem is that Python already freed the memory of all PyThreadState structures, whereas PyEval_RestoreThread(tstate) dereferences tstate to get the _PyRuntimeState structure:

    void
    PyEval_RestoreThread(PyThreadState *tstate)
    {
        assert(tstate != NULL);
    _PyRuntimeState *runtime = tstate->interp->runtime;  // <==== HERE \===
    
        struct _ceval_runtime_state *ceval = &runtime->ceval;
        assert(gil_created(&ceval->gil));
    
        int err = errno;
        take_gil(ceval, tstate);
        exit_thread_if_finalizing(tstate);
        errno = err;
    \_PyThreadState_Swap(&runtime-\>gilstate, tstate);
    

    }

    I modified PyEval_RestoreThread(tstate) to get runtime from tstate: commit 01b1cc1. Extract of the change:

    diff --git a/Python/ceval.c b/Python/ceval.c
    index 9f4b43615e..ee13fd1ad7 100644
    --- a/Python/ceval.c
    +++ b/Python/ceval.c
    @@ -384,7 +386,7 @@ PyEval_SaveThread(void)
     void
     PyEval_RestoreThread(PyThreadState *tstate)
     {
    -    _PyRuntimeState *runtime = &_PyRuntime;
    +    _PyRuntimeState *runtime = tstate->interp->runtime;
         struct _ceval_runtime_state *ceval = &runtime->ceval;
     
         if (tstate == NULL) {
    @@ -394,7 +396,7 @@ PyEval_RestoreThread(PyThreadState *tstate)
     
         int err = errno;
         take_gil(ceval, tstate);
    -    exit_thread_if_finalizing(runtime, tstate);
    +    exit_thread_if_finalizing(tstate);
         errno = err;
     
         _PyThreadState_Swap(&runtime->gilstate, tstate);

    By the way, exit_thread_if_finalizing(tstate) also get runtime from state.

    Before 01b1cc1, it was possible to call PyEval_RestoreThread() from a daemon thread even if tstate was a dangling pointer, since tstate wasn't dereferenced: _PyRuntime variable was accessed directly.

    --

    One simple fix is to access directly _PyRuntime in PyEval_RestoreThread() with a comment explaining why runtime is not get from tstate.

    I'm concerned by the fact that only FreeBSD buildbot spotted the crash. test_multiprocessing_spawn seems to silently ignore crashes. The bug was only spotted because Python produced a coredump in the current directory. My Fedora 31 doesn't write coredump files in the current files, and so the issue is silently ignored even when using --fail-env-changed.

    IMHO the most reliable solution is to drop support for daemon threads: they are dangerous by design. But that would be an incompatible change. Maybe we should at least deprecate daemon threads. Python 3.9 now denies spawning a daemon thread in a Python subinterpreter: bpo-37266.

    @vstinner vstinner added 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Mar 6, 2020
    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 6, 2020

    To reproduce the bug on Linux:

    • apply attached slow_exit.patch to sleep 1 second just before exiting the process, in the main() function
    • run: ./python daemon_threads_exit.py # attached file

    Example:
    ---

    $ ./python daemon_threads_exit.py 
    Start 100 daemon threads
    Exit Python main thread
    Erreur de segmentation (core dumped)

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 6, 2020

    This issue should be the root cause of bpo-39088: "test_concurrent_futures crashed with python.core core dump on AMD64 FreeBSD Shared 3.x", (...)

    I marked bpo-39088 as a duplicate of this issue.

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 6, 2020

    Patch to get a cleaner error if the bug occurs, but also to make the bug more reliable:

    diff --git a/Python/ceval.c b/Python/ceval.c
    index ef4aac2f9a..8bf1e4766d 100644
    --- a/Python/ceval.c
    +++ b/Python/ceval.c
    @@ -382,7 +382,8 @@ PyEval_SaveThread(void)
     void
     PyEval_RestoreThread(PyThreadState *tstate)
     {
    -    assert(tstate != NULL);
    +    assert(!_PyMem_IsPtrFreed(tstate));
    +    assert(!_PyMem_IsPtrFreed(tstate->interp));
     
         _PyRuntimeState *runtime = tstate->interp->runtime;
         struct _ceval_runtime_state *ceval = &runtime->ceval;

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 6, 2020

    New changeset 7b3c252 by Victor Stinner in branch 'master':
    bpo-39877: _PyRuntimeState.finalizing becomes atomic (GH-18816)
    7b3c252

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 8, 2020

    New changeset eb4e2ae by Victor Stinner in branch 'master':
    bpo-39877: Fix PyEval_RestoreThread() for daemon threads (GH-18811)
    eb4e2ae

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 8, 2020

    Ok, it should now be fixed.

    Let me try to revisit the old bpo-19466 issue.

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 8, 2020

    I reopen the issue, my change introduced a *new* issue.

    While trying to fix bpo-19466, work on PR 18848, I noticed that my commit eb4e2ae introduced a race condition :-(

    The problem is that while the main thread is executing Py_FinalizeEx(), daemon threads can be waiting in take_gil(). Py_FinalizeEx() calls _PyRuntimeState_SetFinalizing(runtime, tstate). Later, Py_FinalizeEx() executes arbitrary Python code in _PyImport_Cleanup(tstate) which releases the GIL to give a chance to other threads to execute:

                if (_Py_atomic_load_relaxed(&ceval->gil_drop_request)) {
                    /* Give another thread a chance */
                    if (_PyThreadState_Swap(&runtime->gilstate, NULL) != tstate) {
                        Py_FatalError("tstate mix-up");
                    }
                    drop_gil(ceval, tstate);
                /* Other threads may run now */
    
                /* Check if we should make a quick exit. */
                exit_thread_if_finalizing(tstate);
    
                take_gil(ceval, tstate);
    
                    if (_PyThreadState_Swap(&runtime->gilstate, tstate) != NULL) {
                        Py_FatalError("orphan tstate");
                    }
                }

    At this point, one daemon thread manages to get the GIL: take_gil() completes... even if runtime->finalizing is not NULL. I expected that exit_thread_if_finalizing() would exit the thread, but exit_thread_if_finalizing() is now called *after* take_gil().

    --

    It's unclear to me when the GIL (the lock object) is destroyed and how we are supposed to destroy it, if an unknown number of daemon threads are waiting for it.

    The GIL lock is created by PyEval_InitThreads(): create_gil() with MUTEX_INIT(gil->mutex).

    The GIL lock is destroyed by _PyEval_FiniThreads(): destroy_gil() with MUTEX_FINI(gil->mutex).

    @vstinner vstinner reopened this Mar 8, 2020
    @vstinner vstinner reopened this Mar 8, 2020
    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2020

    New changeset 3225b9f by Victor Stinner in branch 'master':
    bpo-39877: Remove useless PyEval_InitThreads() calls (GH-18883)
    3225b9f

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2020

    New changeset 111e4ee by Victor Stinner in branch 'master':
    bpo-39877: Py_Initialize() pass tstate to PyEval_InitThreads() (GH-18884)
    111e4ee

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2020

    New changeset 85f5a69 by Victor Stinner in branch 'master':
    bpo-39877: Refactor take_gil() function (GH-18885)
    85f5a69

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2020

    New changeset 9229eee by Victor Stinner in branch 'master':
    bpo-39877: take_gil() checks tstate_must_exit() twice (GH-18890)
    9229eee

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2020

    The problem is that Python already freed the memory of all PyThreadState structures, whereas PyEval_RestoreThread(tstate) dereferences tstate to get the _PyRuntimeState structure:

    Funny/not funny, bpo-36818 added a similar bug with commit 396e0a8 in June 2019: bpo-37135. I reverted the change to fix the issue.

    Hopefully, it should now be fixed and the rationale for accessing directly _PyRuntime should now be better documented.

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2020

    I tested (run multiple times) daemon_threads_exit.py with slow_exit.patch: no crash.

    I also tested (run multiple times) stress.py + sleep_at_exit.patch of bpo-37135: no crash.

    And I tested asyncio_gc.py of bpo-19466: no crash neither.

    Python finalization now looks reliable. I'm not sure if it's "more" reliable than previously, but at least, I cannot get a crash anymore, even after bpo-19466 has been fixed (clear Python thread states of daemon threads earlier).

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2020

    New changeset 175a704 by Victor Stinner in branch 'master':
    bpo-39877: PyGILState_Ensure() don't call PyEval_InitThreads() (GH-18891)
    175a704

    @vstinner
    Copy link
    Member Author

    New changeset b4698ec by Victor Stinner in branch 'master':
    bpo-39877: Deprecate PyEval_InitThreads() (GH-18892)
    b4698ec

    @vstinner
    Copy link
    Member Author

    I created bpo-39941 "multiprocessing: Process.join() should emit a warning if the process is killed by a signal" which may help to detect the issue earlier on any platform, not only on FreeBSD.

    @vstinner
    Copy link
    Member Author

    The initial issue is now fixed. I close the issue.

    take_gil() only checks if the thread must exit once the GIL is acquired. Maybe it would be able to exit earlier, but I took the safe approach. If we must exit, drop the GIL and then exit. That's basically Python 3.8 behavior.

    If someone wants to optimize/enhance take_gil() for daemon thread, I suggest to open a new issue.

    Note: Supporting daemon threads require tricky code in take_gil(). I would prefer to deprecate daemon threads and prepare their removal.

    @vstinner
    Copy link
    Member Author

    New changeset 29356e0 by Victor Stinner in branch 'master':
    bpo-39877: Fix take_gil() for daemon threads (GH-19054)
    29356e0

    @vstinner
    Copy link
    Member Author

    New changeset a36adfa by Victor Stinner in branch 'master':
    bpo-39877: 4th take_gil() fix for daemon threads (GH-19080)
    a36adfa

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant