classification
Title: _PyGILState_Reinit assumes auto thread state will always exist which is not true.
Type: crash Stage: resolved
Components: Extension Modules Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: dfranke, grahamd, haypo, jcea, neologix, pitrou, python-dev
Priority: normal Keywords: patch

Created on 2011-10-12 06:33 by grahamd, last changed 2013-02-04 17:09 by jcea. This issue is now closed.

Files
File name Uploaded Description Edit
pystate.c.diff grahamd, 2011-10-12 06:33
tstate_after_fork.diff neologix, 2011-10-12 20:14 review
Messages (14)
msg145383 - (view) Author: Graham Dumpleton (grahamd) Date: 2011-10-12 06:33
This is a followup bug report to fix wrong implementation of _PyGILState_Reinit() introduced by http://bugs.python.org/issue10517.

I don't have a standalone test case yet. Problem occurs under mod_wsgi with Python 2.7.2 and thus similarly 3.2 where _PyGILState_Reinit() was also added.

The Python code part which triggers the problem is:

    pid = os.fork()
    if pid:
       sys.stderr.write('Fork succeeded (PID=%s)\n' % pid)
    else:
       sys.stderr.write('Fork succeeded (child PID=%s)\n' % os.getpid())
       time.sleep(60.0)
       os._exit(0)

To trigger the problem requires this code be executed from a thread originally created outside of Python and then calling into a sub interpreter.

This thread would have created its own thread state object for the sub interpreter call since auto thread states don't work for sub interpreters. Further, it would never have called into the main interpreter so auto thread state simply doesn't exist for main interpreter.

When this thread has a fork() occur and _PyGILState_Reinit() is called, the call of PyGILState_GetThisThreadState() returns NULL because auto thread state for main interpreter was never initialised for this thread. When it then calls into PyThread_set_key_value() it is value of NULL and rather than set it, it thinks internally in find_key() you are doing a get which results in PyThread_set_key_value() returning -1 and so the fatal error.

So _PyGILState_Reinit() is broken because it assumes that an auto thread state will always exist for the thread for it to reinit, which will not always be the case.

The simple fix may be that if PyGILState_GetThisThreadState() returns NULL then don't do any reinit. Making that change does seem to fix the problem. Code that works then is:

void
_PyGILState_Reinit(void)
{
    PyThreadState *tstate = PyGILState_GetThisThreadState();

    if (tstate) {
        PyThread_delete_key(autoTLSkey);
        if ((autoTLSkey = PyThread_create_key()) == -1)
            Py_FatalError("Could not allocate TLS entry");

        /* re-associate the current thread state with the new key */
        if (PyThread_set_key_value(autoTLSkey, (void *)tstate) < 0)
            Py_FatalError("Couldn't create autoTLSkey mapping");
    }
}

Diff file also attached.
msg145384 - (view) Author: Graham Dumpleton (grahamd) Date: 2011-10-12 06:37
Whoops. Missed the error. The fatal error that occurs is:

Fatal Python error: Couldn't create autoTLSkey mapping
msg145385 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-10-12 07:10
Note that this doesn't apply to default: the problem is that 2.7 and 3.2 don't use native TLS, and with the ad-hoc TLS implementation, a NULL value isn't supported:
"""
 /* Internal helper.
  * If the current thread has a mapping for key, the appropriate struct key*
  * is returned.  NB:  value is ignored in this case!
  * If there is no mapping for key in the current thread, then:
  *     If value is NULL, NULL is returned.
  *     Else a mapping of key to value is created for the current thread,
  *     and a pointer to a new struct key* is returned; except that if
  *     malloc() can't find room for a new struct key*, NULL is returned.
  * So when value==NULL, this acts like a pure lookup routine, and when
  * value!=NULL, this acts like dict.setdefault(), returning an existing
  * mapping if one exists, else creating a new mapping.
"""

So PyThread_set_key_value() has different semantics between 2.7/3.2 and default...


> So _PyGILState_Reinit() is broken because it assumes that an auto 
> thread state will always exist for the thread for it to reinit, which 
> will not always be the case.

Hmm...
Please see http://docs.python.org/c-api/init.html#non-python-created-threads
"""
When threads are created using the dedicated Python APIs (such as the threading module), a thread state is automatically associated to them and the code showed above is therefore correct. However, when threads are created from C (for example by a third-party library with its own thread management), they don’t hold the GIL, nor is there a thread state structure for them.

If you need to call Python code from these threads (often this will be part of a callback API provided by the aforementioned third-party library), you must first register these threads with the interpreter by creating a thread state data structure, then acquiring the GIL, and finally storing their thread state pointer, before you can start using the Python/C API. When you are done, you should reset the thread state pointer, release the GIL, and finally free the thread state data structure.

The PyGILState_Ensure() and PyGILState_Release() functions do all of the above automatically. The typical idiom for calling into Python from a C thread is:
"""

I think you should be calling call PyGILState_Ensure() before (whoch does associate the thread state to the autoTLS key.
I let Antoine answer, he's got much more experience than me.
msg145386 - (view) Author: Graham Dumpleton (grahamd) Date: 2011-10-12 08:04
The PyGILState_Ensure() function is only for when working with the main interpreter. These external threads are not calling into the main interpreter.

Because they are external threads, calling PyGILState_Ensure() and then PyGILState_Release() will cause a thread state to be created for the main interpreter, but it will also be destroyed on the PyGILState_Release().

The only way to avoid that situation and ensure that the thread state for the main interpreter is therefore maintained would be to call PyGILState_Ensure() and then call PyThreadState_Swap() to change to thread state for the sub interpreter. Problem is that you aren't supposed to use PyThreadState_Swap() any more and recollect that newer Python 3.X even prohibits it in some way through some checks.

So, the documentation you quote is only to do with the main interpreter and is not how things work for sub interpreters.
msg145387 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-10-12 08:32
> So, the documentation you quote is only to do with the main
> interpreter and is not how things work for sub interpreters.

You're right, my bad.
However, it would probably be better to destroy/reset the autoTLSkey
even if the current thread doesn't have an associated TLS key (to
avoid stumbling upon the original libc bug of issue #10517):

"""
void
_PyGILState_Reinit(void)
{
    PyThreadState *tstate = PyGILState_GetThisThreadState();

        PyThread_delete_key(autoTLSkey);
        if ((autoTLSkey = PyThread_create_key()) == -1)
            Py_FatalError("Could not allocate TLS entry");

        /* re-associate the current thread state with the new key */
        if (tstate && PyThread_set_key_value(autoTLSkey, (void *)tstate) < 0)
            Py_FatalError("Couldn't create autoTLSkey mapping");
}
"""

Now that i think about it, the problem is even simpler: this patch
shouldn't have been applied to 2.7 and 3.2, it was only relevant for
native pthread TLS implementation (which does allow NULL values).
So the solution would be simply to backout this patch on 2.7 and 3.2.
msg145388 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-10-12 08:39
> So the solution would be simply to backout this patch on 2.7 and 3.2.

Actually, I just checked, and the native TLS implementation is present in 3.2, so this problem shouldn't show up: did you test it with 3.2?
AFAICT, this should only affect 2.7 (for which this patch wasn't relevant).
msg145389 - (view) Author: Graham Dumpleton (grahamd) Date: 2011-10-12 09:51
True. Doesn't appear to be an issue with Python 3.2.2. Only Python 2.7.2.

I was not aware that the TLS mechanism was changed in Python 3.X so assumed would also affect it.

So, looks like the change shouldn't have been applied to Python 2.7.

How many moons before Python 2.7.3 though?
msg145391 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-12 11:51
> How many moons before Python 2.7.3 though?

If you convince python-dev that's it's a critical bug (is it?) I suppose it could happen soon.
msg145421 - (view) Author: Roundup Robot (python-dev) Date: 2011-10-12 19:07
New changeset ee4fe16d9b48 by Charles-François Natali in branch '2.7':
Issue #13156: revert changeset f6feed6ec3f9, which was only relevant for native
http://hg.python.org/cpython/rev/ee4fe16d9b48
msg145424 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-10-12 20:14
Here's a patch for 3.2 and default which calls
PyThread_set_key_value() only if there was an auto thread state
previously associated (while the current code works with pthread TLS,
there are other implementations which may behave strangely, and
there's still the ad-hoc implementation in Python/thread.c).
msg145429 - (view) Author: Roundup Robot (python-dev) Date: 2011-10-12 20:26
New changeset 3313ce92cef7 by Charles-François Natali in branch '2.7':
Issue #13156: Add an entry in Misc/NEWS.
http://hg.python.org/cpython/rev/3313ce92cef7
msg148134 - (view) Author: Roundup Robot (python-dev) Date: 2011-11-22 18:52
New changeset ba90839c4993 by Charles-François Natali in branch '3.2':
Issue #13156: _PyGILState_Reinit(): Re-associate the auto thread state with the
http://hg.python.org/cpython/rev/ba90839c4993

New changeset aa6ce09d2350 by Charles-François Natali in branch 'default':
Issue #13156: _PyGILState_Reinit(): Re-associate the auto thread state with the
http://hg.python.org/cpython/rev/aa6ce09d2350
msg148135 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-11-22 18:54
Alright, should be fixed now.
Graham, thanks for the report!
msg160761 - (view) Author: Daniel Franke (dfranke) Date: 2012-05-15 20:39
Just FYI, I've lately been trying to track down a different threading-related bug, and I think the fix for this one also fixed that one by accident.

Leave the following program running for a while under python-2.7.2:

---snip---
import subprocess
import threading
import random

class Forkfuzz(threading.Thread):
    def run(self):
        while random.randint(0,100) > threading.active_count():
            Forkfuzz().start()

        p = subprocess.Popen(['/bin/sleep', '1'])
        p.communicate()

if __name__ == "__main__":
    while True:
        Forkfuzz().run()
---snip---

Eventually, you'll see a bunch of python process that are deadlocked in the following state:

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x00000000004e02ad in PyThread_acquire_lock (lock=<value optimized out>, waitflag=<value optimized out>) at Python/thread_pthread.h:321
#2  0x00000000004e0447 in find_key (key=1, value=0x0) at Python/thread.c:268
#3  0x00000000004e053b in PyThread_get_key_value (key=38356064) at Python/thread.c:360
#4  0x00000000004cb2f8 in PyGILState_GetThisThreadState () at Python/pystate.c:598
#5  _PyGILState_Reinit () at Python/pystate.c:547
#6  0x00000000004e43f9 in PyOS_AfterFork () at ./Modules/signalmodule.c:979
...

The problem happens when one thread forks while another thread holds a lock on the TLS mutex. The child process then tries to acquire this mutex, and can never do so because the thread that holds it only exists in the parent.

When the same program is run under python-2.7.3, the problem goes away.
History
Date User Action Args
2013-02-04 17:09:40jceasetnosy: + jcea
2012-05-15 20:39:19dfrankesetnosy: + dfranke
messages: + msg160761
2011-11-22 18:54:50neologixsetstatus: open -> closed
resolution: fixed
messages: + msg148135

stage: resolved
2011-11-22 18:52:20python-devsetmessages: + msg148134
2011-10-12 20:26:27python-devsetmessages: + msg145429
2011-10-12 20:14:08neologixsetfiles: + tstate_after_fork.diff

messages: + msg145424
2011-10-12 19:07:34python-devsetnosy: + python-dev
messages: + msg145421
2011-10-12 11:51:16pitrousetmessages: + msg145391
2011-10-12 09:51:26grahamdsetmessages: + msg145389
versions: - Python 3.2
2011-10-12 08:39:31neologixsetmessages: + msg145388
2011-10-12 08:33:07hayposetnosy: + haypo
2011-10-12 08:32:30neologixsetmessages: + msg145387
2011-10-12 08:04:01grahamdsetmessages: + msg145386
2011-10-12 07:10:03neologixsetmessages: + msg145385
2011-10-12 06:57:48neologixsetnosy: + pitrou
2011-10-12 06:37:31grahamdsetmessages: + msg145384
2011-10-12 06:33:46grahamdcreate