classification
Title: PyGILState_Ensure on non-Python thread causes fatal error
Type: Stage: patch review
Components: Extension Modules Versions: Python 3.7, Python 3.6, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mekk, eric.snow, gvanrossum, kchen, ncoghlan, pitrou, steve.dower, vstinner, xcombelle
Priority: normal Keywords: patch

Created on 2014-03-11 18:40 by steve.dower, last changed 2017-12-12 04:29 by kchen.

Files
File name Uploaded Description Edit
test.c steve.dower, 2014-03-11 18:40
ptest.c vstinner, 2016-03-15 16:55
PyGILState_Ensure.patch vstinner, 2016-03-15 16:56 review
Pull Requests
URL Status Linked Edit
PR 4650 merged vstinner, 2017-11-30 17:14
PR 4655 merged vstinner, 2017-11-30 21:38
PR 4657 merged vstinner, 2017-11-30 21:48
PR 4700 open vstinner, 2017-12-04 17:35
Messages (21)
msg213160 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2014-03-11 18:40
In Python 3.4rc3, calling PyGILState_Ensure() from a thread that was not created by Python and without any calls to PyEval_InitThreads() will cause a fatal exit:

Fatal Python error: take_gil: NULL tstate

I believe this was added in http://hg.python.org/cpython/rev/dc4e805ec68a. The attached file has a minimal repro (for Windows, sorry, but someone familiar with low-level *nix threading APIs should be able to translate it easily).

Basically, the sequence looks like this:

1. main() calls Py_Initialize()
2. main() starts new C-level thread
3. main() calls Py_BEGIN_ALLOW_THREADS and waits for the new thread
4. New thread calls PyGILState_Ensure()
5. PyGILState_Ensure() sees that there is no thread state and calls PyEval_InitThreads() (this is what changed in the linked changeset)
6. PyEval_InitThreads() creates the GIL
7. PyEval_InitThreads() calls take_gil(PyThreadState_GET()) (this is what aborts - there is still no thread state)
8. (expected) PyGILState_Ensure() calls PyThreadState_New() and PyEval_RestoreThread().

The attached repro has two workarounds (commented out). The first is to call PyThreadState_New()/PyEval_RestoreThread() from the new thread, and the second is to call PyEval_InitThreads() from the main thread.

The latter is inappropriate for my use, since I am writing a debugger that can inject itself into a running CPython process - there is no way to require that PyEval_InitThreads() has been called and I can't reliably inject it onto the main thread. The first workaround is okay, but I'd rather not have to special-case 3.4 by rewriting PyGILState_Ensure() so that I can safely call PyGILState_Ensure().
msg213161 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2014-03-11 18:49
Should have linked to #19576 as well, which is the issue associated with that changeset.
msg213398 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-03-13 13:48
IMO it's a bug in PyEval_InitThreads().
msg261820 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-15 16:55
ptest.c: Portable example using Python API.
msg261821 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-15 16:56
Attached patch should fix the issue.
msg307038 - (view) Author: Marcin Kasperski (Mekk) Date: 2017-11-27 08:56
Is this fix released? I can't find it in the changelog…

(I faced this bug on 3.5.2, released a couple of months after this bug was closed…)
msg307330 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-30 17:20
> Is this fix released? I can't find it in the changelog…

Oops, I lost track of this issue, but it wasn't fixed, no.

I just proposed my old fix as a pull requets: PR 4650.

> (I faced this bug on 3.5.2, released a couple of months after this bug was closed…)

The workaround is to call PyEval_InitThreads() before spawning your first non-Python thread.
https://docs.python.org/dev/c-api/init.html#c.PyEval_InitThreads
msg307331 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-30 17:20
I have to check if Python 2.7 is impacted as well.
msg307342 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-30 21:05
New changeset b4d1e1f7c1af6ae33f0e371576c8bcafedb099db by Victor Stinner in branch 'master':
bpo-20891: Fix PyGILState_Ensure() (#4650)
https://github.com/python/cpython/commit/b4d1e1f7c1af6ae33f0e371576c8bcafedb099db
msg307349 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-30 22:35
New changeset be6b74c0795b709c7a04e2187a7e32d08f5155f6 by Victor Stinner in branch '2.7':
bpo-20891: Fix PyGILState_Ensure() (#4650) (#4657)
https://github.com/python/cpython/commit/be6b74c0795b709c7a04e2187a7e32d08f5155f6
msg307350 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-30 22:36
New changeset e10c9de9d74fd4c26b32e6719d96f04a5be6987d by Victor Stinner in branch '3.6':
bpo-20891: Fix PyGILState_Ensure() (#4650) (#4655)
https://github.com/python/cpython/commit/e10c9de9d74fd4c26b32e6719d96f04a5be6987d
msg307351 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-30 22:46
Ok, the bug is now fixed in Python 2.7, 3.6 and master (future 3.7). On 3.6 and master, the fix comes with an unit test.

Thanks Steve Dower for the bug report, sorry for the long delay, I completely forgot this old bug!

Marcin Kasperski: Thanks for the reminder. Sadly, we will have to wait for the next release to get the fix. In the meanwhile, you can workaround the bug by calling PyEval_InitThreads() after Py_Initialize() but before running actual code.
msg307550 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-04 11:51
"./Program/_testembed.exe bpo20891" fails randomly on macOS:
---
macbook:master haypo$ while true; do ./Programs/_testembed bpo20891 ||break; date; done
Lun  4 déc 2017 12:46:34 CET
Lun  4 déc 2017 12:46:34 CET
Lun  4 déc 2017 12:46:34 CET
Fatal Python error: PyEval_SaveThread: NULL tstate

Current thread 0x00007fffa5dff3c0 (most recent call first):
Abort trap: 6
---

In test_bpo20891() of Program/_testembed.c, Py_BEGIN_ALLOW_THREADS calls PyEval_SaveThread() which fails with a fatal error:

    PyThreadState *tstate = PyThreadState_Swap(NULL);
    if (tstate == NULL)
        Py_FatalError("PyEval_SaveThread: NULL tstate"); <~~~ HERE

I'm not sure that it's safe to create the GIL in PyGILState_Ensure() in a "non-Python" thread, while the main thread (which is a Python thread) "is running".

I found a working fix: call PyEval_InitThreads() in PyThread_start_new_thread(). So the GIL is created as soon as a second thread is spawned. The GIL cannot be created anymore while two threads are running. At least, with the "python" binary. It doesn't fix the issue if a thread is not spawned by Python, but this thread calls PyGILState_Ensure().

Maybe we need to better document how threads have to be initialized to prevent this race condition / bug?
msg307551 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-04 11:52
Why not *always* call PyEval_InitThreads() at interpreter initialization?  Are there any downsides?
msg307552 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-04 11:59
> Why not *always* call PyEval_InitThreads() at interpreter initialization?  Are there any downsides?

The code is quite old:

commit 1984f1e1c6306d4e8073c28d2395638f80ea509b
Author: Guido van Rossum <guido@python.org>
Date:   Tue Aug 4 12:41:02 1992 +0000

    * Makefile adapted to changes below.
    * split pythonmain.c in two: most stuff goes to pythonrun.c, in the library.
    * new optional built-in threadmodule.c, build upon Sjoerd's thread.{c,h}.
    * new module from Sjoerd: mmmodule.c (dynamically loaded).
    * new module from Sjoerd: sv (svgen.py, svmodule.c.proto).
    * new files thread.{c,h} (from Sjoerd).
    * new xxmodule.c (example only).
    * myselect.h: bzero -> memset
    * select.c: bzero -> memset; removed global variable

(...)

+void
+init_save_thread()
+{
+#ifdef USE_THREAD
+       if (interpreter_lock)
+               fatal("2nd call to init_save_thread");
+       interpreter_lock = allocate_lock();
+       acquire_lock(interpreter_lock, 1);
+#endif
+}
+#endif


Current version of the code:

void
PyEval_InitThreads(void)
{
    if (gil_created())
        return;
    create_gil();
    take_gil(PyThreadState_GET());
    _PyRuntime.ceval.pending.main_thread = PyThread_get_thread_ident();
    if (!_PyRuntime.ceval.pending.lock)
        _PyRuntime.ceval.pending.lock = PyThread_allocate_lock();
}
msg307553 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-04 12:05
The "if (gil_created())" check can be found in the Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros which call PyEval_SaveThread() / PyEval_RestoreThread():

PyThreadState *
PyEval_SaveThread(void)
{
    PyThreadState *tstate = PyThreadState_Swap(NULL);
    if (tstate == NULL)
        Py_FatalError("PyEval_SaveThread: NULL tstate");
    if (gil_created()) <~~~~ HERE
        drop_gil(tstate);
    return tstate;
}

void
PyEval_RestoreThread(PyThreadState *tstate)
{
    if (tstate == NULL)
        Py_FatalError("PyEval_RestoreThread: NULL tstate");
    if (gil_created()) { <~~~~ HERE
        int err = errno;
        take_gil(tstate);
        /* _Py_Finalizing is protected by the GIL */
        if (_Py_IsFinalizing() && !_Py_CURRENTLY_FINALIZING(tstate)) {
            drop_gil(tstate);
            PyThread_exit_thread();
            Py_UNREACHABLE();
        }
        errno = err;
    }
    PyThreadState_Swap(tstate);
}


I guess that the intent of dynamically created GIL is to reduce the "overhead" of the GIL when 100% of the code is run in single thread.

I understand that this potential overhead must be measured to decide if it's ok to always create the GIL.

This bug was reported once in 2014, then Marcin Kasperski asked for an update last month: 3 years later. It's not a common bug impacting many users. So if there is an overhead, I'm not sure that it's a good idea to change Python to fix this rare use case.
msg307554 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-04 12:14
> I guess that the intent of dynamically created GIL is to reduce the "overhead" of the GIL when 100% of the code is run in single thread.

I'm not sure it is ok to guess here.  The original code was written in 1992, when system threading APIs were buggy and not always available.  Also the original GIL logic released the lock every N opcodes, which is a fragile heuristic.  The current GIL shouldn't have any visible performance cost in single-threaded mode, and in even in multi-threaded mode it's been a long time since anyone complained about a real-world performance degradation due to the GIL's overhead.

Perhaps Guido remembers the original rationale?

> This bug was reported once in 2014, then Marcin Kasperski asked for an update last month: 3 years later. It's not a common bug impacting many users.

From a maintenance POV, initializing the GIL eagerly also makes maintenance easier for us, instead of making all those dynamic checks.
msg307571 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2017-12-04 15:38
Yeah, the original reasoning was that threads were something esoteric and
not used by most code, and at the time we definitely felt that always using
the GIL would cause a (tiny) slowdown and increase the risk of crashes due
to bugs in the GIL code. I'd be happy to learn that we no longer need to
worry about this and can just always initialize it.

(Note: I haven't read the entire thread, just the first and last message.)
msg307583 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-04 17:36
I wrote the PR 4700 to create the GIL in Py_Initialize().

Would you be ok to backport this fix to Python 2.7 and 3.6 as well?
msg307586 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2017-12-04 17:53
For 2.7 I hesitate to OK this, who knows what skeletons are still in that
closet. It doesn't sound like a security fix to me. For 3.6 I'm fine with
it as a bugfix.
msg307628 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-12-05 03:39
+1 for making this change 3.6+ only.

Victor, could you run your patch through the performance benchmarks? While I suspect Antoine is right that our current GIL management heuristics will mean we don't need the lazy initialisation optimisation any more, it's still preferable to have the specific numbers before merging it.
History
Date User Action Args
2017-12-12 04:29:10kchensetnosy: + kchen
2017-12-05 03:39:53ncoghlansetmessages: + msg307628
2017-12-04 17:53:15gvanrossumsetmessages: + msg307586
2017-12-04 17:36:29vstinnersetmessages: + msg307583
2017-12-04 17:35:43vstinnersetstage: resolved -> patch review
pull_requests: + pull_request4612
2017-12-04 15:38:00gvanrossumsetmessages: + msg307571
2017-12-04 12:14:31pitrousetnosy: + gvanrossum
messages: + msg307554
2017-12-04 12:05:04vstinnersetmessages: + msg307553
2017-12-04 11:59:07vstinnersetmessages: + msg307552
2017-12-04 11:52:38pitrousetmessages: + msg307551
2017-12-04 11:51:47vstinnersetstatus: closed -> open

nosy: + ncoghlan, eric.snow
messages: + msg307550

resolution: fixed ->
2017-11-30 22:46:57vstinnersetstatus: open -> closed
versions: + Python 2.7
messages: + msg307351

resolution: fixed
stage: patch review -> resolved
2017-11-30 22:36:51vstinnersetmessages: + msg307350
2017-11-30 22:35:16vstinnersetmessages: + msg307349
2017-11-30 21:48:38vstinnersetpull_requests: + pull_request4569
2017-11-30 21:38:43vstinnersetpull_requests: + pull_request4567
2017-11-30 21:05:02vstinnersetmessages: + msg307342
2017-11-30 17:20:45vstinnersetmessages: + msg307331
versions: + Python 3.6, Python 3.7, - Python 3.4
2017-11-30 17:20:05vstinnersetmessages: + msg307330
2017-11-30 17:14:43vstinnersetstage: patch review
pull_requests: + pull_request4562
2017-11-27 08:56:13Mekksetnosy: + Mekk
messages: + msg307038
2016-03-15 16:56:16vstinnersetfiles: + PyGILState_Ensure.patch
keywords: + patch
messages: + msg261821
2016-03-15 16:55:58vstinnersetfiles: + ptest.c

messages: + msg261820
2014-03-13 13:48:45vstinnersetmessages: + msg213398
2014-03-13 10:52:45xcombellesetnosy: + xcombelle
2014-03-11 18:49:59steve.dowersetmessages: + msg213161
2014-03-11 18:40:13steve.dowercreate