classification
Title: Handling pending calls during runtime finalization may cause problems.
Type: crash Stage: resolved
Components: Interpreter Core Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: eric.snow Nosy List: eric.snow, vstinner
Priority: normal Keywords: patch

Created on 2019-06-01 19:11 by eric.snow, last changed 2020-04-08 21:19 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 19061 merged vstinner, 2020-03-18 18:03
PR 19439 merged vstinner, 2020-04-08 17:12
Messages (12)
msg344202 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2019-06-01 19:11
In Python/lifecycle.c (Py_FinalizeEx) we call _Py_FinishPendingCalls(), right after we stop all non-daemon Python threads but before we've actually started finalizing the runtime state.  That call looks for any remaining pending calls (for the main interpreter) and runs them.  There's some evidence of a bug there.

In bpo-33608 I moved the pending calls to per-interpreter state.  We saw failures (sometimes sporadic) on a few buildbots (e.g. FreeBSD) during runtime finalization.  However, nearly all of the buildbots were fine, so it may be a question of architecture or slow hardware.  See bpo-33608 for details on the failures.

There are a number of possibilities, but it's been tricky reproducing the problem in order to investigate.  Here are some theories:

* daemon threads (a known weak point in runtime finalization) block pending calls from happening until some time after portions of the runtime have already been cleaned up
* there's a race that causes the pending calls machinery to get caught in some sort infinite loop (e.g. a pending call fails and gets re-queued)
* a corner case in the pending calls logic that triggers only during finalization

Here are some other points to consider:

* do we have the same problem during subinterpreter finalization (Py_EndInterpreter() rather than runtime finalization)?
* perhaps the problem extends beyond finalization, but the conditions are more likely there
* the change for bpo-33608 could have introduced the bug rather that exposing an existing one
msg344207 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2019-06-01 19:21
Also, someone did manage to investigate and identify a likely cause:

https://bugs.python.org/issue33608#msg342791
msg364489 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 00:42
In the master branch of Python, trip_signal() calls _PyEval_AddPendingCall(tstate) and tstate is get using _PyRuntimeState_GetThreadState(runtime).

trip_signal() can be called while the GIL is not held: tstate is NULL in this case. For example, it's the case when calling signal.raise_signal().

Problem: _PyEval_AddPendingCall() uses tstate if pending->finishing is non-zero.

    if (pending->finishing) {
        ...
        _PyErr_Fetch(tstate, &exc, &val, &tb);
        _PyErr_SetString(tstate, PyExc_SystemError,
                        "Py_AddPendingCall: cannot add pending calls "
                        "(Python shutting down)");
        ...
    }

pending->finishing was addd in bpo-33608 by the change:

commit 842a2f07f2f08a935ef470bfdaeef40f87490cfc
Author: Eric Snow <ericsnowcurrently@gmail.com>
Date:   Fri Mar 15 15:47:51 2019 -0600

    bpo-33608: Deal with pending calls relative to runtime shutdown. (gh-12246)

I found this issue while trying to make pending calls per interpreter in bpo-39984: move pending from _PyRuntimeState to PyInterpreterState.
msg364490 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 00:48
> In bpo-33608 I moved the pending calls to per-interpreter state.  We saw failures (sometimes sporadic) on a few buildbots (e.g. FreeBSD) during runtime finalization.  However, nearly all of the buildbots were fine, so it may be a question of architecture or slow hardware.  See bpo-33608 for details on the failures.

That was a crash in multiprocessing tests. PyEval_RestoreThread(tstate) was called with a freed PyThreadState: tstate->interp = 0xdbdbdbdbdbdbdbdb for example.

I recently fixed PyEval_RestoreThread() in bpo-39877 to exit properly daemon theads without dereferencing tstate which points to freed memory.

To reproduce the bug, it helped me to add a sleep at Python exit. Fixed value (ex: 1 second) or better: random sleep (between 1 ms and 1 sec).
msg364543 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 18:10
> pending->finishing was addd in bpo-33608 by the change: commit 842a2f07f2f08a935ef470bfdaeef40f87490cfc

Since this change, Py_AddPendingCall() now requires the thread to have a Python thread state: it is used if pending->finishing is non-zero.

The function documentation says:

"This function doesn’t need a current thread state to run, and it doesn’t need the global interpreter lock."

https://docs.python.org/dev/c-api/init.html#c.Py_AddPendingCall

The current implementation seems to contradict the documentation.

We may update the documentation to replace "doesn't need" with "requires" (a Python thread state).

Or the implementation can use PyGILState_Ensure() and PyGILState_Release() to create a temporary Python thread state if the thread doesn't have one.

Removing pending->finishing would only partially fix the issue. With PR 19061, tstate is required to access "ceval" structure.

trip_signal() called by the C signal handler has a similar issue: it only requires non-NULL tstate if pending->finishing is non-zero.
msg364544 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 18:21
trip_signal() has another problem. If pending->finishing is non-zero, _PyEval_AddPendingCall() uses the C API whereas the current thread may not hold the GIL. That's forbidden by the C API.

The more I think about it, the more I think that pending->finishing is fragile and should be removed.

I understood that pending->finishing was added to workaround crashes during Python finalization. I fixed multiple crashes related to daemon threads during Python finalization in bpo-39877. Maybe the bugs that I fixed were the ones which pending->finishing was supposed to work around.
msg364545 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 18:21
My notes on Python finalization: https://pythondev.readthedocs.io/finalization.html
msg364550 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 18:28
New changeset 8849e5962ba481d5d414b3467a256aba2134b4da by Victor Stinner in branch 'master':
bpo-39984: trip_signal() uses PyGILState_GetThisThreadState() (GH-19061)
https://github.com/python/cpython/commit/8849e5962ba481d5d414b3467a256aba2134b4da
msg365997 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-08 17:44
See also bpo-40082: trip_signal() gets NULL tstate on Windows on CTRL+C.
msg366004 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-08 20:10
New changeset cfc3c2f8b34d3864717ab584c5b6c260014ba55a by Victor Stinner in branch 'master':
bpo-37127: Remove _pending_calls.finishing (GH-19439)
https://github.com/python/cpython/commit/cfc3c2f8b34d3864717ab584c5b6c260014ba55a
msg366007 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-08 21:18
I removed _pending_calls.finishing for multiple reasons:

* _PyEval_AddPendingCall() used the C API whereas the caller may not hold the GIL.
* bpo-40082: trip_signal() can be called from a thread which has no Python thread state. On Windows, CTRL+C calls trip_signal() in a new thread a each call.

I rewrote trip_signal() to only use the PyInterpreterState ("interp") and avoid PyThreadState ("tstate") in PR 19441 to fix bpo-40082.

trip_signal() should read and set atomtic variables: don't modify globals without a lock. _PyEval_AddPendingCall() is not fully async-signal safe yet :-/ Using a lock is unsafe in a signal handler.
msg366008 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-08 21:19
> Also, someone did manage to investigate and identify a likely cause:
> https://bugs.python.org/issue33608#msg342791

I made many changes in Python internals since bpo-33608 was reported. I am not aware of any recent issue with pending calls and so close the issue.

If you get new crashes/hangs with pending calls during Python finalization, please open a new issue.
History
Date User Action Args
2020-04-08 21:19:44vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg366008

stage: patch review -> resolved
2020-04-08 21:18:03vstinnersetmessages: + msg366007
2020-04-08 20:10:59vstinnersetmessages: + msg366004
2020-04-08 17:44:08vstinnersetmessages: + msg365997
2020-04-08 17:12:53vstinnersetpull_requests: + pull_request18793
2020-03-18 18:28:57vstinnersetmessages: + msg364550
2020-03-18 18:21:42vstinnersetmessages: + msg364545
2020-03-18 18:21:15vstinnersetmessages: + msg364544
2020-03-18 18:10:40vstinnersetmessages: + msg364543
2020-03-18 18:03:56vstinnersetkeywords: + patch
stage: test needed -> patch review
pull_requests: + pull_request18415
2020-03-18 00:48:47vstinnersetmessages: + msg364490
2020-03-18 00:42:09vstinnersetmessages: + msg364489
2020-03-09 23:40:31vstinnersetnosy: + vstinner
2019-06-01 19:21:38eric.snowsetmessages: + msg344207
2019-06-01 19:11:46eric.snowcreate