classification
Title: Crash in _PyThreadState_DeleteExcept() at fork in the process child
Type: Stage:
Components: Interpreter Core Versions: Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Michael.Felt, shihai1991, vstinner
Priority: normal Keywords:

Created on 2020-03-27 17:16 by vstinner, last changed 2021-04-07 09:08 by Michael.Felt.

Messages (11)
msg365173 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-27 17:16
At fork, Python calls PyOS_AfterFork_Child() in the child process which indirectly calls _PyThreadState_DeleteExcept() whichs calls release_sentinel() of the thread which releases the thread state lock (threading.Thread._tstate_lock).

Problem: using a lock after fork is unsafe and can crash.

That's exactly what happens randomly on AIX when stressing ThreadJoinOnShutdown.test_reinit_tls_after_fork() of test_threading:
https://bugs.python.org/issue40068#msg365031

There are different options to solve this issue:

* Reset _tstate_lock before using it... not sure that it's worth it, since we are going to delete the threading.Thread object with its _tstate_lock object anymore. After calling fork, the child process has exactly 1 thread: all other threads have been removed.

* Modify release_sentinel() to not use the lock: avoid PyThread_release_lock() call.
msg365175 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-27 17:18
See also bpo-40089: "Add _at_fork_reinit() method to locks".
msg367277 - (view) Author: hai shi (shihai1991) * (Python triager) Date: 2020-04-25 17:54
>since we are going to delete the threading.Thread object with its _tstate_lock object anymore.

Do we have any pep or discuss record about this plan?

> * Modify release_sentinel() to not use the lock: avoid PyThread_release_lock() call.

Hm, I am not sure about it. Looks like we can not remove it directly.
msg367422 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-27 13:42
> Do we have any pep or discuss record about this plan?

"since we are going to delete the threading.Thread object with its _tstate_lock object anyway" sentence is a description of the current _PyThreadState_DeleteExcept() implementation.
msg367496 - (view) Author: hai shi (shihai1991) * (Python triager) Date: 2020-04-28 05:12
> "since we are going to delete the threading.Thread object with its _tstate_lock object anyway" sentence is a description of the current _PyThreadState_DeleteExcept() implementation.

Oh, got it. thanks for your explanation :)
msg389937 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-31 22:21
I marked bpo-43665 "AIX: test_importlib regression (ENV change)" as a duplicate of this issue.

test_multiprocessing_pool_circular_import() of test_importlib now triggers this crash as well.
msg389966 - (view) Author: Michael Felt (Michael.Felt) * Date: 2021-04-01 09:52
Adding 3.10.

While a (sort of) duplicate I also would like to add that before revision "7cb033c423b65def1632d6c3c747111543b342a2" this was not showing up as an issue with test_importlib.

my issue was with test_importlib suddenly going into error.

As this seem to be a long-standing issue is it perhaps a possibility to change the test so that the bot can go green again?
msg389968 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-01 09:59
> As this seem to be a long-standing issue is it perhaps a possibility to change the test so that the bot can go green again?

This issue is a real crash and it seems quite easy to get it. I prefer to not hide the bug in the test suite. The bug must be fixed.
msg389969 - (view) Author: Michael Felt (Michael.Felt) * Date: 2021-04-01 09:59
OK: further.

Two options are suggested:

There are different options to solve this issue:

* Reset _tstate_lock before using it... not sure that it's worth it, since we are going to delete the threading.Thread object with its _tstate_lock object anymore. After calling fork, the child process has exactly 1 thread: all other threads have been removed.

* Modify release_sentinel() to not use the lock: avoid PyThread_release_lock() call.

** as to option 1 - it is 'worth it' if it stops the crashes

** This is deeper than I usually go in Python code - but I'll make an effort - help is appreciated.
msg389972 - (view) Author: Michael Felt (Michael.Felt) * Date: 2021-04-01 11:20
OK. Please explain. Looking at tstate assignment

In posixmodule.c:PyOSAfterFork_Child()
    PyStatus status;
    _PyRuntimeState *runtime = &_PyRuntime;

...
    PyThreadState *tstate = _PyThreadState_GET();

and later calls
    status = _PyRuntimeState_ReInitThreads(runtime);

Yet in Posix/ceval.c
PyStatus
_PyEval_ReInitThreads(PyThreadState *tstate)
{
    _PyRuntimeState *runtime = tstate->interp->runtime;

** this looks like runtime->interp->runtime

And then we get down to:
    /* Destroy all threads except the current one */
    _PyThreadState_DeleteExcept(runtime, tstate);

Is this correct - as it looks like:
_PyThreadState_DeleteExcept(runtime->interp->runtime, runtime) -- where runtime == &_PyRuntime;
msg390407 - (view) Author: Michael Felt (Michael.Felt) * Date: 2021-04-07 09:08
Willing to spend more time on this - but the variable names chosen blind me - looks like a circle.

And, thinking about the address in the core dump starting with 0xd (segment 13) - confuses me somewhat - as from memory - I thought the shared library code/data converged on segment 14 (ie, addresses would begin with 0xe).

So, perhaps it is the 0xd* address that is killing everything.

See: https://bugs.python.org/issue43665#msg389923 -- as I am also still confused re: this line - looks suspect: with all arguments at 0xdddddddd

PyVectorcall_Call(callable = 0xdddddddd, tuple = 0xdddddddd, kwargs = 

0xdddddddd), line 255 in "call.c"

And also this line, and lines like it - notice the value for what looks like a variable for argument_count:

_PyEval_Vector(tstate = 0x22023cf0, con = 0x2000120c, locals = 
0x22023960, args = 0x220239f0, argcount = 3508213100, kwnames = 
0x22023cf0), line 46 in "pycore_ceval.h"
_PyFunction_Vectorcall(func = 0x220239f0, stack = (nil), nargsf = 
570572016, kwnames = 0x220239f0), line 361 in "call.c"
method_vectorcall(method = 0x10103bdc, args = (nil), nargsf = 0, kwnames 
= 0x20045994), line 119 in "abstract.h"
History
Date User Action Args
2021-04-07 09:08:26Michael.Feltsetmessages: + msg390407
2021-04-01 11:20:55Michael.Feltsetmessages: + msg389972
2021-04-01 09:59:19Michael.Feltsetmessages: + msg389969
2021-04-01 09:59:09vstinnersetmessages: + msg389968
2021-04-01 09:52:42Michael.Feltsetnosy: + Michael.Felt

messages: + msg389966
versions: + Python 3.10
2021-03-31 22:21:37vstinnersetmessages: + msg389937
2021-03-31 22:20:40vstinnerlinkissue43665 superseder
2020-04-28 05:12:33shihai1991setmessages: + msg367496
2020-04-27 13:42:26vstinnersetmessages: + msg367422
2020-04-25 17:54:07shihai1991setmessages: + msg367277
2020-04-17 17:13:22shihai1991setnosy: + shihai1991
2020-03-27 17:18:10vstinnersetmessages: + msg365175
2020-03-27 17:16:55vstinnercreate