classification
Title: Daemon thread is crashing in PyEval_RestoreThread() while the main thread is exiting the process
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eric.snow, nanjekyejoannah, ncoghlan, pablogsal, vstinner
Priority: normal Keywords: patch

Created on 2020-03-06 14:22 by vstinner, last changed 2020-03-19 18:48 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
daemon_threads_exit.py vstinner, 2020-03-06 14:49
slow_exit.patch vstinner, 2020-03-06 14:51
Pull Requests
URL Status Linked Edit
PR 18808 closed vstinner, 2020-03-06 16:13
PR 18811 merged vstinner, 2020-03-06 17:31
PR 18816 merged vstinner, 2020-03-06 21:59
PR 18848 merged vstinner, 2020-03-08 16:40
PR 18854 closed vstinner, 2020-03-08 17:24
PR 18883 merged vstinner, 2020-03-09 19:31
PR 18884 merged vstinner, 2020-03-09 19:59
PR 18885 merged vstinner, 2020-03-09 20:45
PR 18888 closed vstinner, 2020-03-09 21:22
PR 18890 merged vstinner, 2020-03-09 21:50
PR 18891 merged vstinner, 2020-03-09 23:15
PR 18892 merged vstinner, 2020-03-10 00:00
PR 19054 merged vstinner, 2020-03-18 01:42
PR 19080 merged vstinner, 2020-03-19 18:28
Messages (20)
msg363512 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-06 14:22
Sometimes, test_multiprocessing_spawn does crash in PyEval_RestoreThread() on FreeBSD with a coredump. This issue should be the root cause of bpo-39088: "test_concurrent_futures crashed with python.core core dump on AMD64 FreeBSD Shared 3.x", where the second comment is a test_multiprocessing_spawn failure with "...  After:  ['python.core'] ..."

# Thread 1

(gdb) frame
#0  0x00000000003b518c in PyEval_RestoreThread (tstate=0x801f23790) at Python/ceval.c:387
387         _PyRuntimeState *runtime = tstate->interp->runtime;
(gdb) p tstate->interp
$3 = (PyInterpreterState *) 0xdddddddddddddddd

(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 100839        0x00000000003b518c in PyEval_RestoreThread (tstate=0x801f23790) at Python/ceval.c:387
  2    LWP 100230        0x00000008006fbcfc in _fini () from /lib/libm.so.5
  3    LWP 100192        _accept4 () at _accept4.S:3

# Thread 2

(gdb) thread 2
[Switching to thread 2 (LWP 100230)]
#0  0x00000008006fbcfc in _fini () from /lib/libm.so.5

(gdb) where
(...)
#4  0x0000000800859431 in exit (status=0) at /usr/src/lib/libc/stdlib/exit.c:74
#5  0x000000000048f3d8 in Py_Exit (sts=0) at Python/pylifecycle.c:2349
(...)

The problem is that Python already freed the memory of all PyThreadState structures, whereas PyEval_RestoreThread(tstate) dereferences tstate to get the _PyRuntimeState structure:

void
PyEval_RestoreThread(PyThreadState *tstate)
{
    assert(tstate != NULL);

    _PyRuntimeState *runtime = tstate->interp->runtime;  // <==== HERE ===
    struct _ceval_runtime_state *ceval = &runtime->ceval;
    assert(gil_created(&ceval->gil));

    int err = errno;
    take_gil(ceval, tstate);
    exit_thread_if_finalizing(tstate);
    errno = err;

    _PyThreadState_Swap(&runtime->gilstate, tstate);
}


I modified PyEval_RestoreThread(tstate) to get runtime from tstate: commit 01b1cc12e7c6a3d6a3d27ba7c731687d57aae92a. Extract of the change:

diff --git a/Python/ceval.c b/Python/ceval.c
index 9f4b43615e..ee13fd1ad7 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -384,7 +386,7 @@ PyEval_SaveThread(void)
 void
 PyEval_RestoreThread(PyThreadState *tstate)
 {
-    _PyRuntimeState *runtime = &_PyRuntime;
+    _PyRuntimeState *runtime = tstate->interp->runtime;
     struct _ceval_runtime_state *ceval = &runtime->ceval;
 
     if (tstate == NULL) {
@@ -394,7 +396,7 @@ PyEval_RestoreThread(PyThreadState *tstate)
 
     int err = errno;
     take_gil(ceval, tstate);
-    exit_thread_if_finalizing(runtime, tstate);
+    exit_thread_if_finalizing(tstate);
     errno = err;
 
     _PyThreadState_Swap(&runtime->gilstate, tstate);


By the way, exit_thread_if_finalizing(tstate) also get runtime from state.


Before 01b1cc12e7c6a3d6a3d27ba7c731687d57aae92a, it was possible to call PyEval_RestoreThread() from a daemon thread even if tstate was a dangling pointer, since tstate wasn't dereferenced: _PyRuntime variable was accessed directly.

--

One simple fix is to access directly _PyRuntime in PyEval_RestoreThread() with a comment explaining why runtime is not get from tstate.

I'm concerned by the fact that only FreeBSD buildbot spotted the crash. test_multiprocessing_spawn seems to silently ignore crashes. The bug was only spotted because Python produced a coredump in the current directory. My Fedora 31 doesn't write coredump files in the current files, and so the issue is silently ignored even when using --fail-env-changed.

IMHO the most reliable solution is to drop support for daemon threads: they are dangerous by design. But that would be an incompatible change. Maybe we should at least deprecate daemon threads. Python 3.9 now denies spawning a daemon thread in a Python subinterpreter: bpo-37266.
msg363517 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-06 14:51
To reproduce the bug on Linux:

* apply attached slow_exit.patch to sleep 1 second just before exiting the process, in the main() function
* run: ./python daemon_threads_exit.py # attached file

Example:
---
$ ./python daemon_threads_exit.py 
Start 100 daemon threads
Exit Python main thread
Erreur de segmentation (core dumped)
---
msg363519 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-06 14:52
> This issue should be the root cause of bpo-39088: "test_concurrent_futures crashed with python.core core dump on AMD64 FreeBSD Shared 3.x", (...)

I marked bpo-39088 as a duplicate of this issue.
msg363520 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-06 14:53
Patch to get a cleaner error if the bug occurs, but also to make the bug more reliable:

diff --git a/Python/ceval.c b/Python/ceval.c
index ef4aac2f9a..8bf1e4766d 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -382,7 +382,8 @@ PyEval_SaveThread(void)
 void
 PyEval_RestoreThread(PyThreadState *tstate)
 {
-    assert(tstate != NULL);
+    assert(!_PyMem_IsPtrFreed(tstate));
+    assert(!_PyMem_IsPtrFreed(tstate->interp));
 
     _PyRuntimeState *runtime = tstate->interp->runtime;
     struct _ceval_runtime_state *ceval = &runtime->ceval;
msg363566 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-06 23:24
New changeset 7b3c252dc7f44d4bdc4c7c82d225ebd09c78f520 by Victor Stinner in branch 'master':
bpo-39877: _PyRuntimeState.finalizing becomes atomic (GH-18816)
https://github.com/python/cpython/commit/7b3c252dc7f44d4bdc4c7c82d225ebd09c78f520
msg363649 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-08 10:57
New changeset eb4e2ae2b8486e8ee4249218b95d94a9f0cc513e by Victor Stinner in branch 'master':
bpo-39877: Fix PyEval_RestoreThread() for daemon threads (GH-18811)
https://github.com/python/cpython/commit/eb4e2ae2b8486e8ee4249218b95d94a9f0cc513e
msg363653 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-08 11:19
Ok, it should now be fixed.

Let me try to revisit the old bpo-19466 issue.
msg363667 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-08 16:47
I reopen the issue, my change introduced a *new* issue.

While trying to fix bpo-19466, work on PR 18848, I noticed that my commit eb4e2ae2b8486e8ee4249218b95d94a9f0cc513e introduced a race condition :-(

The problem is that while the main thread is executing Py_FinalizeEx(), daemon threads can be waiting in take_gil(). Py_FinalizeEx() calls _PyRuntimeState_SetFinalizing(runtime, tstate). Later, Py_FinalizeEx() executes arbitrary Python code in _PyImport_Cleanup(tstate) which releases the GIL to give a chance to other threads to execute:

            if (_Py_atomic_load_relaxed(&ceval->gil_drop_request)) {
                /* Give another thread a chance */
                if (_PyThreadState_Swap(&runtime->gilstate, NULL) != tstate) {
                    Py_FatalError("tstate mix-up");
                }
                drop_gil(ceval, tstate);

                /* Other threads may run now */

                /* Check if we should make a quick exit. */
                exit_thread_if_finalizing(tstate);

                take_gil(ceval, tstate);

                if (_PyThreadState_Swap(&runtime->gilstate, tstate) != NULL) {
                    Py_FatalError("orphan tstate");
                }
            }

At this point, one daemon thread manages to get the GIL: take_gil() completes... even if runtime->finalizing is not NULL. I expected that exit_thread_if_finalizing() would exit the thread, but exit_thread_if_finalizing() is now called *after* take_gil().

--

It's unclear to me when the GIL (the lock object) is destroyed and how we are supposed to destroy it, if an unknown number of daemon threads are waiting for it.

The GIL lock is created by PyEval_InitThreads(): create_gil() with MUTEX_INIT(gil->mutex).

The GIL lock is destroyed by _PyEval_FiniThreads(): destroy_gil() with MUTEX_FINI(gil->mutex).
msg363764 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-09 19:57
New changeset 3225b9f9739cd4bcca372d0fa939cea1ae5c6402 by Victor Stinner in branch 'master':
bpo-39877: Remove useless PyEval_InitThreads() calls (GH-18883)
https://github.com/python/cpython/commit/3225b9f9739cd4bcca372d0fa939cea1ae5c6402
msg363767 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-09 20:24
New changeset 111e4ee52a1739e7c7221adde2fc364ef4954af2 by Victor Stinner in branch 'master':
bpo-39877: Py_Initialize() pass tstate to PyEval_InitThreads() (GH-18884)
https://github.com/python/cpython/commit/111e4ee52a1739e7c7221adde2fc364ef4954af2
msg363773 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-09 21:12
New changeset 85f5a69ae1541271286bb0f0e0303aabf792dd5c by Victor Stinner in branch 'master':
bpo-39877: Refactor take_gil() function (GH-18885)
https://github.com/python/cpython/commit/85f5a69ae1541271286bb0f0e0303aabf792dd5c
msg363776 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-09 22:10
New changeset 9229eeee105f19705f72e553cf066751ac47c7b7 by Victor Stinner in branch 'master':
bpo-39877: take_gil() checks tstate_must_exit() twice (GH-18890)
https://github.com/python/cpython/commit/9229eeee105f19705f72e553cf066751ac47c7b7
msg363790 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-09 23:30
> The problem is that Python already freed the memory of all PyThreadState structures, whereas PyEval_RestoreThread(tstate) dereferences tstate to get the _PyRuntimeState structure:

Funny/not funny, bpo-36818 added a similar bug with commit 396e0a8d9dc65453cb9d53500d0a620602656cfe in June 2019: bpo-37135. I reverted the change to fix the issue.

Hopefully, it should now be fixed and the rationale for accessing directly _PyRuntime should now be better documented.
msg363791 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-09 23:36
I tested (run multiple times) daemon_threads_exit.py with slow_exit.patch: no crash.

I also tested (run multiple times) stress.py + sleep_at_exit.patch of bpo-37135: no crash.

And I tested  asyncio_gc.py of bpo-19466: no crash neither.

Python finalization now looks reliable. I'm not sure if it's "more" reliable than previously, but at least, I cannot get a crash anymore, even after bpo-19466 has been fixed (clear Python thread states of daemon threads earlier).
msg363792 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-09 23:37
New changeset 175a704abfcb3400aaeb66d4f098d92ca7e30892 by Victor Stinner in branch 'master':
bpo-39877: PyGILState_Ensure() don't call PyEval_InitThreads() (GH-18891)
https://github.com/python/cpython/commit/175a704abfcb3400aaeb66d4f098d92ca7e30892
msg363795 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-10 00:29
New changeset b4698ecfdb526e0a9f5fa6ef0f8e1d8cca500203 by Victor Stinner in branch 'master':
bpo-39877: Deprecate PyEval_InitThreads() (GH-18892)
https://github.com/python/cpython/commit/b4698ecfdb526e0a9f5fa6ef0f8e1d8cca500203
msg363981 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-11 23:11
I created bpo-39941 "multiprocessing: Process.join() should emit a warning if the process is killed by a signal" which may help to detect the issue earlier on any platform, not only on FreeBSD.
msg363983 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-11 23:14
The initial issue is now fixed. I close the issue.

take_gil() only checks if the thread must exit once the GIL is acquired. Maybe it would be able to exit earlier, but I took the safe approach. If we must exit, drop the GIL and then exit. That's basically Python 3.8 behavior.

If someone wants to optimize/enhance take_gil() for daemon thread, I suggest to open a new issue.

Note: Supporting daemon threads require tricky code in take_gil(). I would prefer to deprecate daemon threads and prepare their removal.
msg364496 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 02:04
New changeset 29356e03d4f8800b04f799efe7a10e3ce8b16f61 by Victor Stinner in branch 'master':
bpo-39877: Fix take_gil() for daemon threads (GH-19054)
https://github.com/python/cpython/commit/29356e03d4f8800b04f799efe7a10e3ce8b16f61
msg364629 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-19 18:48
New changeset a36adfa6bbf5e612a4d4639124502135690899b8 by Victor Stinner in branch 'master':
bpo-39877: 4th take_gil() fix for daemon threads (GH-19080)
https://github.com/python/cpython/commit/a36adfa6bbf5e612a4d4639124502135690899b8
History
Date User Action Args
2020-03-19 18:48:34vstinnersetmessages: + msg364629
2020-03-19 18:28:08vstinnersetpull_requests: + pull_request18437
2020-03-18 02:04:39vstinnersetmessages: + msg364496
2020-03-18 01:42:59vstinnersetpull_requests: + pull_request18405
2020-03-11 23:14:10vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg363983

stage: patch review -> resolved
2020-03-11 23:11:29vstinnersetmessages: + msg363981
2020-03-10 00:29:00vstinnersetmessages: + msg363795
2020-03-10 00:00:55vstinnersetpull_requests: + pull_request18249
2020-03-09 23:37:52vstinnersetmessages: + msg363792
2020-03-09 23:36:28vstinnersetmessages: + msg363791
2020-03-09 23:30:42vstinnersetmessages: + msg363790
2020-03-09 23:15:14vstinnersetpull_requests: + pull_request18248
2020-03-09 22:10:56vstinnersetmessages: + msg363776
2020-03-09 21:50:58vstinnersetpull_requests: + pull_request18247
2020-03-09 21:22:24vstinnersetpull_requests: + pull_request18245
2020-03-09 21:12:11vstinnersetmessages: + msg363773
2020-03-09 20:45:15vstinnersetpull_requests: + pull_request18242
2020-03-09 20:24:23vstinnersetmessages: + msg363767
2020-03-09 19:59:07vstinnersetpull_requests: + pull_request18241
2020-03-09 19:57:05vstinnersetmessages: + msg363764
2020-03-09 19:31:36vstinnersetpull_requests: + pull_request18240
2020-03-08 17:24:25vstinnersetstage: resolved -> patch review
pull_requests: + pull_request18211
2020-03-08 16:47:22vstinnersetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg363667
2020-03-08 16:40:19vstinnersetpull_requests: + pull_request18210
2020-03-08 11:19:38vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg363653

stage: patch review -> resolved
2020-03-08 10:57:49vstinnersetmessages: + msg363649
2020-03-06 23:24:26vstinnersetmessages: + msg363566
2020-03-06 21:59:06vstinnersetpull_requests: + pull_request18174
2020-03-06 17:31:16vstinnersetpull_requests: + pull_request18169
2020-03-06 16:13:24vstinnersetstage: patch review
pull_requests: + pull_request18166
2020-03-06 14:53:16vstinnersetmessages: + msg363520
2020-03-06 14:52:22vstinnersetmessages: + msg363519
2020-03-06 14:51:46vstinnerlinkissue39088 superseder
2020-03-06 14:51:15vstinnersetfiles: + slow_exit.patch
keywords: + patch
messages: + msg363517
2020-03-06 14:49:43vstinnersetfiles: + daemon_threads_exit.py
2020-03-06 14:22:28vstinnercreate