classification
Title: deadlocked child process after forking on pystate.c's head_mutex
Type: enhancement
Stage: needs patch
Components: Interpreter Core
Versions: Python 3.6, Python 3.7
process
Status: open
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: Louis Brandy, Piotr Wysocki, cooperlees, drothera, gregory.p.smith, haypo, lukasz.langa, phantez, pixelbeat
Priority: normal
Keywords:

Created on 2017-05-17 23:25 by Louis Brandy, last changed 2017-05-24 17:40 by gregory.p.smith.

Pull Requests
URL Status Linked Edit
PR 1734 merged fried, 2017-05-22 23:18
PR 1740 merged lukasz.langa, 2017-05-23 00:03
Messages (9)
msg293906 - (view) Author: Louis Brandy (Louis Brandy) Date: 2017-05-17 23:25
A forked process (via os.fork) can inherit a locked `head_mutex` from its parent process and will promptly deadlock in this stack (on my Linux box):

Child Process (deadlocked):

#0  0x00007f1a4da82e3c in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7f1a4c2964e0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  0x00007f1a4da82e3c in do_futex_wait (sem=sem@entry=0x7f1a4c2964e0, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f1a4da82efc in __new_sem_wait_slow (sem=0x7f1a4c2964e0, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f1a4da82fb9 in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4  0x00007f1a4de4c605 in PyThread_acquire_lock_timed (lock=0x7f1a4c2964e0, microseconds=-1, intr_flag=0) at Python/thread_pthread.h:352
#5  0x00007f1a4de4c4b4 in PyThread_acquire_lock (lock=<optimized out>, waitflag=waitflag@entry=1) at Python/thread_pthread.h:556
#6  0x00007f1a4dd59e08 in _PyThreadState_DeleteExcept (tstate=0x7f19f2301800) at Python/pystate.c:483
#7  0x00007f1a4dd46af4 in PyEval_ReInitThreads () at Python/ceval.c:326
#8  0x00007f1a4dd78b0b in PyOS_AfterFork () at ./Modules/signalmodule.c:1608


The parent process has a race between one thread calling `os.fork` (while holding the GIL) and another thread (in our case, a C++ thread) trying to use PyGILState_Ensure. PyGILState_Ensure grabs head_mutex before it tries to take the GIL, so if a fork happens inside that critical section, the child process inherits a locked head_mutex.

The path from PyGILState_Ensure -> head_mutex looks like this:

#0  new_threadstate (interp=0x7fb5fd483d80, init=init@entry=1) at Python/pystate.c:183
#1  0x00007fb5ff149027 in PyThreadState_New (interp=<optimized out>) at Python/pystate.c:250
#2  0x00007fb5ff006ac7 in PyGILState_Ensure () at Python/pystate.c:838


----

Possible fix?

A simple fix would be to reset/unlock pystate.c's head_mutex inside PyOS_AfterFork if it is still locked.

Unclear if this is related to: https://bugs.python.org/issue28812
msg294165 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-05-22 17:49
You cannot safely use Python's os.fork() in a process that has threads.  Because of POSIX.

The CPython interpreter is by definition not async-signal-safe, so no Python code may safely be executed after the fork() system call if the process contained _any_ threads at fork() time.

The only reliable recommendation: *never use os.fork()* (if your process _might ever_ contain any threads, now or in the future, started by you or by any of your dependencies).

The closest thing to a fix is a bunch of band-aids: set up atfork functions to clear the state of known locks in the forked child. But even that can't guarantee anything, because it can't know about all locks.

The C standard library can contain locks, malloc() for example. Any C/C++ extension module can contain locks of its own. The problem isn't solvable within CPython. Adding a clear of pystate.c's head_mutex after forking makes sense, and may even get you further, but you will encounter other problems of the same nature in the future.

related: https://bugs.python.org/issue6721 and https://bugs.python.org/issue16500
msg294205 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2017-05-22 23:58
New changeset f82c951d1c5416f3550d544e50ff5662d3836e73 by Łukasz Langa (Jason Fried) in branch 'master':
bpo-30395 _PyGILState_Reinit deadlock fix (#1734)
https://github.com/python/cpython/commit/f82c951d1c5416f3550d544e50ff5662d3836e73
msg294222 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2017-05-23 05:23
New changeset d29feccec3ce3dcd9ab3100f8956171c70ec3027 by Łukasz Langa in branch '3.6':
[3.6] bpo-30395 _PyGILState_Reinit deadlock fix (GH-1734) (#1740)
https://github.com/python/cpython/commit/d29feccec3ce3dcd9ab3100f8956171c70ec3027
msg294256 - (view) Author: Louis Brandy (Louis Brandy) Date: 2017-05-23 15:37
Thanks to everyone jumping in.

I need no convincing that mixing forks and threads isn't just a problem but a problem factory. Given that the rest of this code seems to try to avoid similar deadlocks with similar mutexes, I figured we'd want to include this mutex to make a best-effort at being safe here. That made it worth reporting. To be sure, I still believe that the application code that led us here needs deeper fixes to address the fork/thread problems.
msg294316 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-24 04:23
+ head_mutex = NULL;

Shouldn't we free memory here?
msg294326 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-05-24 05:57
Would PyThread_free_lock (effectively sem_destroy()) work without (additional) problems?
msg294331 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-24 07:22
Gregory P. Smith added the comment:

Would PyThread_free_lock (effectively sem_destroy()) work without
(additional) problems?

If I recall correctly, no, you can get issues if the lock is still
acquired... But I should check.
msg294366 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-05-24 17:40
Another alternative *might* be to check whether the lock is locked (via a non-blocking acquire?) and release it if so, under the usual assumption that we are the only thread running immediately post-fork().

I'm not sure that can be guaranteed reliable given that other C code could've used pthread_atfork to register an "after" fork handler that spawns threads.  But that should be rare, and nothing here can really fix the underlying "programmer has mixed fork and threads" issue. Only ameliorate it.
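A minimal sketch of that idea, assuming a sem_t-based lock (as in CPython's thread_pthread.h on Linux) and a single-threaded child; release_inherited_lock is a hypothetical helper, not CPython API. Unlike a plain mutex, a semaphore may legally be posted by a thread other than the one that took it, so the probe-then-release sequence is well-defined:

```c
#include <semaphore.h>

/* Post-fork cleanup: whether the lock was free (sem_trywait succeeds)
   or was left locked by a now-gone parent thread (sem_trywait fails),
   a single sem_post leaves it in the released state. Only safe while
   the child is single-threaded. */
void release_inherited_lock(sem_t *lock)
{
    sem_trywait(lock);   /* takes the lock iff it was free */
    sem_post(lock);      /* either way, release it         */
}
```

The caveat above still applies: a pthread_atfork child handler registered by other C code could already have spawned threads by the time this runs.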

But I'm not sure this post-fork memory leak is really a problem, other than potentially showing up in memory sanitizer runs involving forked children.  The scenario where it would be an actual leak is a process doing serial fork() calls, with the parent(s) dying rather than forking new children from a common parent.  That would grow the leak, as each child would have an additional lock allocated (a few bytes).  I don't _believe_ that kind of process pattern is common (but people do everything).
History
Date User Action Args
2017-05-24 17:40:04 gregory.p.smith set messages: + msg294366
2017-05-24 07:22:27 haypo set messages: + msg294331
2017-05-24 05:57:12 gregory.p.smith set messages: + msg294326
2017-05-24 04:23:06 haypo set messages: + msg294316
2017-05-23 15:37:09 Louis Brandy set messages: + msg294256
2017-05-23 05:23:07 lukasz.langa set messages: + msg294222
2017-05-23 00:03:46 lukasz.langa set pull_requests: + pull_request1830
2017-05-22 23:58:57 lukasz.langa set nosy: + lukasz.langa; messages: + msg294205
2017-05-22 23:18:32 fried set pull_requests: + pull_request1825
2017-05-22 17:49:51 gregory.p.smith set type: enhancement; versions: + Python 3.7; nosy: + gregory.p.smith; messages: + msg294165; resolution: not a bug; stage: needs patch
2017-05-22 16:08:44 cooperlees set nosy: + cooperlees
2017-05-18 14:55:41 Piotr Wysocki set nosy: + Piotr Wysocki
2017-05-18 07:12:34 phantez set nosy: + phantez
2017-05-18 06:23:17 drothera set nosy: + drothera
2017-05-17 23:41:42 pixelbeat set nosy: + pixelbeat
2017-05-17 23:28:37 haypo set nosy: + haypo
2017-05-17 23:25:58 Louis Brandy create