Title: Handling pending calls during runtime finalization may cause problems.
Type: crash Stage: test needed
Components: Interpreter Core Versions:
Status: open Resolution:
Dependencies: Superseder:
Assigned To: eric.snow Nosy List: eric.snow
Priority: normal Keywords:

Created on 2019-06-01 19:11 by eric.snow, last changed 2019-06-01 19:21 by eric.snow.

Messages (2)
msg344202 - Author: Eric Snow (eric.snow) * (Python committer) Date: 2019-06-01 19:11
In Python/pylifecycle.c (Py_FinalizeEx) we call _Py_FinishPendingCalls(), right after we stop all non-daemon Python threads but before we've actually started finalizing the runtime state.  That call looks for any remaining pending calls (for the main interpreter) and runs them.  There's some evidence of a bug there.
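
To make the ordering concrete, here is a rough toy model in Python. It is only a sketch: ToyRuntime, finish_pending_calls, and the late_call parameter are hypothetical names invented for illustration and do not mirror the actual C code. It shows that anything queued after the drain step is still sitting in the queue once the state has been torn down.

```python
class ToyRuntime:
    """Hypothetical model of the shutdown ordering; all names are illustrative."""

    def __init__(self):
        self.pending = []              # queued callables
        self.state = {"alive": True}   # stand-in for the runtime state
        self.log = []

    def add_pending_call(self, func):
        self.pending.append(func)

    def finish_pending_calls(self):
        # Run whatever is queued right now; anything queued later is missed.
        while self.pending:
            self.pending.pop(0)(self)

    def finalize(self, late_call=None):
        self.log.append("threads joined")      # stand-in for stopping non-daemon threads
        self.finish_pending_calls()
        self.log.append("pending calls finished")
        if late_call is not None:
            # Assumption: models a daemon thread queueing work after the drain.
            self.add_pending_call(late_call)
        self.state["alive"] = False            # stand-in for finalizing the runtime state
        self.log.append("state finalized")
        return len(self.pending)               # calls that never got to run


def touch_state(rt):
    rt.log.append(f"call ran, alive={rt.state['alive']}")


rt = ToyRuntime()
rt.add_pending_call(touch_state)
leftover = rt.finalize(late_call=touch_state)
print(rt.log)
print("leftover calls:", leftover)
```

In this model the first call runs while the state is still alive, and the late one is left over. If something were to run it afterwards, it would see torn-down state.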

In bpo-33608 I moved the pending calls to per-interpreter state.  We saw failures (sometimes sporadic) on a few buildbots (e.g. FreeBSD) during runtime finalization.  However, nearly all of the buildbots were fine, so it may be a question of architecture or slow hardware.  See bpo-33608 for details on the failures.

There are a number of possibilities, but it's been tricky reproducing the problem in order to investigate.  Here are some theories:

* daemon threads (a known weak point in runtime finalization) block pending calls from happening until some time after portions of the runtime have already been cleaned up
* there's a race that causes the pending calls machinery to get caught in some sort of infinite loop (e.g. a pending call fails and gets re-queued)
* a corner case in the pending calls logic that triggers only during finalization
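
The re-queue theory can be sketched with a toy queue in Python. This is a minimal illustration under assumptions, not the real machinery: the PendingCalls class, the "return False means failed" convention, and the max_calls cap are all hypothetical. Without the cap, a call that always fails would keep the drain loop spinning forever.

```python
from collections import deque


class PendingCalls:
    """Toy pending-calls queue (hypothetical, not CPython's implementation)."""

    def __init__(self):
        self._queue = deque()

    def add(self, func):
        self._queue.append(func)

    def drain(self, max_calls=None):
        """Run queued calls, re-queueing any call that returns False.

        Returns the number of calls executed. With max_calls=None, a call
        that always fails makes this loop run forever.
        """
        executed = 0
        while self._queue:
            if max_calls is not None and executed >= max_calls:
                break                      # give up instead of spinning
            func = self._queue.popleft()
            executed += 1
            if not func():
                self._queue.append(func)   # failed: put it back on the queue
        return executed

    def __len__(self):
        return len(self._queue)


def always_fails():
    return False


calls = PendingCalls()
calls.add(lambda: True)    # succeeds once and is dropped
calls.add(always_fails)    # fails every time and is re-queued
ran = calls.drain(max_calls=10)
print(ran, len(calls))
```

With the cap the loop stops after ten executions and the failing call is still queued, which is the state an unbounded loop would never leave.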

Here are some other points to consider:

* do we have the same problem during subinterpreter finalization (Py_EndInterpreter() rather than runtime finalization)?
* perhaps the problem extends beyond finalization, but the conditions are more likely there
* the change for bpo-33608 could have introduced the bug rather than exposing an existing one
msg344207 - Author: Eric Snow (eric.snow) * (Python committer) Date: 2019-06-01 19:21
Also, someone did manage to investigate and identify a likely cause:
History
Date                 User       Action  Args
2019-06-01 19:21:38  eric.snow  set     messages: + msg344207
2019-06-01 19:11:46  eric.snow  create