classification
Title: make faulthandler dump traceback of child processes
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: neologix, sbt, vstinner
Priority: normal Keywords:

Created on 2011-06-25 23:53 by neologix, last changed 2013-10-12 10:53 by sbt.

Messages (6)
msg139132 - Author: Charles-François Natali (neologix) (Python committer) Date: 2011-06-25 23:53
As noted in issue #11870, making faulthandler capable of dumping child processes' tracebacks could be a great aid in debugging tricky deadlocks involving for example multiprocessing and subprocess.
Since there's no portable way to enumerate child processes, one possible idea is to make the handler send a signal to its process group if the current process is the process group leader.
Advantages:
- simple
- async-safe
Drawbacks:
- since all the processes receive the signal at the same time, their outputs will be interleaved (we could maybe add a random sleep before dumping the traceback?)
- children not part of the same process group (for example those who called setsid() or setpgrp()) won't be handled
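
A rough sketch of the process-group idea, written in Python for readability. A real faulthandler change would have to do this in C from the handler, and `signal_own_group` and its `killpg` parameter are hypothetical names used only for illustration:

```python
import os
import signal


def signal_own_group(signum, killpg=os.killpg):
    """Send signum to the caller's process group, but only if the caller leads it.

    Returns True if the signal was sent. The caller is a member of its own
    group, so it receives the signal as well.
    """
    pgid = os.getpgrp()
    if pgid != os.getpid():
        return False  # not the group leader: do nothing
    killpg(pgid, signum)
    return True
```

The `killpg` argument exists only so the function can be exercised without signalling real processes. Children that moved to another group via setsid() or setpgrp() are not reached, as noted in the drawbacks.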

I'm not sure how this would work out on Windows, but I don't even know whether Windows has a notion of child processes or process groups...
msg139133 - Author: STINNER Victor (vstinner) (Python committer) Date: 2011-06-26 00:27
Oh oh. I already thought about this feature, but its implementation is not trivial.

> As noted in issue #11870 ...

You mean that the tracebacks of children should be dumped on a timeout of the parent? Or do you also want them on a segfault of the parent? In my experience, the most common problem with the multiprocessing and subprocess modules is the hang.

The timeout is implemented using a (C) thread in faulthandler. You can do more in a thread than in a signal handler ;-) A hook may be added to faulthandler to execute code specific to multiprocessing / subprocess.

> send a signal to its process group if the current process
> is the process group leader

In which case is Python the leader of the group? Is it the case by default? Can we do something to ensure that in regrtest, in multiprocessing tests or the multiprocessing module?

See also #5115 (for the subprocess module).

The subprocess module maintains a list of the created subprocesses: subprocess.alive, but you need a reference to this list (or you can access it using the Python namespace, but it requires the GIL and you cannot trust the GIL on a crash).

subprocess can execute any program, not only Python. Sending an arbitrary signal to a child process can cause issues.

Does multiprocessing maintain a list of child processes?

--

By the way, which signal do you want to send to the child processes? A test may replace the signal handler of your signal (most tests use SIGALRM and SIGUSR1).

faulthandler.register() is not available on Windows.
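
For reference, a small sketch of what a user-signal dump looks like with the existing faulthandler.register() API (POSIX only). The choice of SIGUSR1, the `CHILD_CODE` script and the `request_child_traceback` helper are assumptions made for this example, not something proposed in this issue:

```python
import signal
import subprocess
import sys

# Code run by the child: dump the traceback to stderr whenever SIGUSR1 arrives.
CHILD_CODE = (
    "import faulthandler, signal, sys\n"
    "faulthandler.register(signal.SIGUSR1)\n"
    "print('ready', flush=True)\n"
    "sys.stdin.read()\n"  # block until the parent closes stdin
)


def request_child_traceback():
    """Start a child, ask it for a traceback via SIGUSR1, return its stderr."""
    with subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        stderr=subprocess.PIPE, text=True,
    ) as child:
        try:
            child.stdout.readline()        # wait until the handler is registered
            child.send_signal(signal.SIGUSR1)
            dump = child.stderr.readline() # blocks until the dump starts
        finally:
            child.stdin.close()            # lets the child exit normally
        dump += child.stderr.read()
    return dump
```

If a test in the child replaces the SIGUSR1 handler, this sketch would block in `child.stderr.readline()`, which is the kind of interference mentioned above.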

--

crier ( https://gist.github.com/737056 ) is a tool similar to faulthandler, but it is implemented in Python and so is less reliable. It uses a different trigger: it checks if a file (e.g. /tmp/crier-<pid>) exists.

A file (e.g. a pipe) can be used with a thread watching the file to send the "please dump your traceback" request to the child processes.
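
A minimal sketch of the pipe-plus-watcher-thread idea, assuming each child owns the read end of a pipe and the parent keeps the write end. `watch_for_dump_requests` is a hypothetical helper. It uses a Python thread, so unlike faulthandler's C watchdog thread it needs the GIL and will not fire if the interpreter itself is stuck:

```python
import faulthandler
import os
import threading


def watch_for_dump_requests(request_fd, out_file):
    """Dump all thread tracebacks to out_file each time a byte arrives on request_fd.

    The watcher thread stops when the write end of the pipe is closed.
    out_file must have a fileno(), as faulthandler writes to the descriptor.
    """
    def run():
        while True:
            data = os.read(request_fd, 1)
            if not data:  # EOF: the requesting side closed the pipe
                break
            faulthandler.dump_traceback(file=out_file, all_threads=True)

    watcher = threading.Thread(target=run, daemon=True)
    watcher.start()
    return watcher
```

The parent would then write one byte to the pipe of each child whose traceback it wants, which also avoids interleaved output if it waits between writes.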
msg139136 - Author: STINNER Victor (vstinner) (Python committer) Date: 2011-06-26 00:43
> since all the processes receive the signal at the same time,
> their outputs will be interleaved (we could maybe add a random
> sleep before dumping the traceback?)

If we have the pid list of the children, we can use an arbitrary sleep (e.g. 1 second) before sending a signal to the next pid. Anyway, a sleep is the most reliable synchronization code after a crash/timeout.
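
Something along these lines, as a Python sketch only (the real code would live in faulthandler's C thread). `signal_each` and its `kill`/`sleep` parameters are hypothetical and are there so the loop can be tried without touching real processes:

```python
import os
import signal
import time


def signal_each(pids, signum, delay=1.0, kill=os.kill, sleep=time.sleep):
    """Send signum to each pid in turn, sleeping `delay` seconds between sends.

    Returns the list of pids that no longer existed.
    """
    missing = []
    for index, pid in enumerate(pids):
        if index:
            sleep(delay)  # give the previous child time to finish its dump
        try:
            kill(pid, signum)
        except ProcessLookupError:
            missing.append(pid)  # the child already exited
    return missing
```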
msg139161 - Author: Charles-François Natali (neologix) (Python committer) Date: 2011-06-26 10:25
> You mean that the tracebacks of children should be dumped on a timeout of the parent? Or do you also want them on a segfault of the parent? In my experience, the most common problem with the multiprocessing and subprocess modules is the hang.
>

Well, a segfault is due to the current process (or sometimes to
external conditions like OOM, but that's not the point here), so it's
not really useful to dump tracebacks of child processes in that case.
I was more thinking about timeouts.

> The timeout is implemented using a (C) thread in faulthandler. You can do more in a thread than in a signal handler ;-) A hook may be added to faulthandler to execute code specific to multiprocessing / subprocess.
>

Yes, but when the timeout expires, there's no guarantee about the
state of the interpreter (for example in issue #12352 it was the GC
that deadlocked), so I guess we can't do anything too fancy.

> In which case is Python the leader of the group? Is it the case by default? Can we do something to ensure that in regrtest, in multiprocessing tests or the multiprocessing module?
>

Yes, it's the case by default when you launch a process through a shell.

> The subprocess module maintains a list of the created subprocesses: subprocess.alive, but you need a reference to this list (or you can access it using the Python namespace, but it requires the GIL and you cannot trust the GIL on a crash).
> Does multiprocessing maintain a list of child processes?

Yes, we don't have any guarantee about the interpreter's state, and
furthermore this won't work for processes calling fork() directly.

> subprocess can execute any program, not only Python. Sending an arbitrary signal to a child process can cause issues.
>

Well, faulthandler is disabled by default, no?

> By the way, which signal do you want to send to the child processes? A test may replace the signal handler of your signal (most tests use SIGALRM and SIGUSR1).
>

Hum, SIGTERM maybe? Don't you register some fatal signals by default?
msg139210 - Author: STINNER Victor (vstinner) (Python committer) Date: 2011-06-26 20:11
> > In which case is Python the leader of the group? ...
> 
> Yes, it's the case by default when you launch a process through a shell.

subprocess doesn't use a shell by default, and I don't think that
multiprocessing uses a shell to start Python.

> > The subprocess module maintains a list of the created subprocesses: subprocess.alive, but you need a reference to this list (or you can access it using the Python namespace, but it requires the GIL and you cannot trust the GIL on a crash).
> > Does multiprocessing maintain a list of child processes?
> 
> Yes, we don't have any guarantee about the interpreter's state, and
> furthermore this won't work for processes calling fork() directly.

I don't think that we can have a reliable, generic and portable solution
for this issue. I suggest focusing on only one use case (debugging the
multiprocessing and/or subprocess module), and later trying to support
more cases.

I agree that the interpreter state can be inconsistent, but faulthandler
already reads the interpreter state. We cannot do better than
"best effort". Well, it doesn't really matter if faulthandler crashes,
the program is already dying ;-)

To simplify the implementation, I propose to patch multiprocessing
and/or subprocess to register the pid of the child process in a list in
the faulthandler module.

It would be better if these modules unregistered the pid when a subprocess
exits, but it's not mandatory. We can send a signal to a nonexistent
process. In the worst case, on a heavily loaded computer, another process
may get the same pid, but it's unlikely. I'm quite sure that
multiprocessing and subprocess already handle the subprocess exit, so it
should be quite simple to add a hook.
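
To illustrate the shape of such a registry, here is a Python-only sketch. faulthandler has no such API today, so `ChildPidRegistry` and its methods are purely hypothetical, and a C implementation would more likely use a fixed-size array that can be read without the GIL:

```python
import threading


class ChildPidRegistry:
    """Hypothetical list of child pids that should be asked for a traceback."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pids = set()

    def register(self, pid):
        """Called by subprocess/multiprocessing right after a child is started."""
        with self._lock:
            self._pids.add(pid)

    def unregister(self, pid):
        """Called when the child is reaped; unknown pids are ignored."""
        with self._lock:
            self._pids.discard(pid)

    def snapshot(self):
        """Return the registered pids as a sorted list."""
        with self._lock:
            return sorted(self._pids)
```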

> > subprocess can execute any program, not only Python.
> > Sending an arbitrary signal to a child process can cause issues.
> Well, faulthandler is disabled by default, no?

Yes, but I prefer not to interfere with unrelated processes if it's possible.

> > By the way, which signal do you want to send to the child processes?
> 
> Hum, SIGTERM maybe? Don't you register some fatal signals by default?

faulthandler.enable() installs a signal handler for SIGSEGV, SIGFPE,
SIGABRT, SIGBUS and SIGILL signals. (SIGKILL cannot be handled by the
application.)

> > A test may replace the signal handler of your signal

Well, it doesn't really matter. If one child process doesn't print the
traceback, you have less information, but it is unlikely and we may get
the information manually or by temporarily changing the signal number.
msg139790 - Author: Charles-François Natali (neologix) (Python committer) Date: 2011-07-04 17:07
> subprocess doesn't use a shell by default, and I don't think that
> multiprocessing uses a shell to start Python.
>

No, but we precisely want subprocess/multiprocessing-created processes
to be in the same process group.

> To simplify the implementation, I propose to patch multiprocessing
> and/or subprocess to register the pid of the child process in a list in
> the faulthandler module.
>
> It would be better if these modules unregistered the pid when a subprocess
> exits, but it's not mandatory. We can send a signal to a nonexistent
> process. In the worst case, on a heavily loaded computer, another process
> may get the same pid, but it's unlikely. I'm quite sure that
> multiprocessing and subprocess already handle the subprocess exit, so it
> should be quite simple to add a hook.
>

It'll be intrusive and error-prone: for example, you'll have to reset
this list upon fork().
And sending a signal to an unrelated process is risky, no?

>> > subprocess can execute any program, not only Python.
>> > Sending an arbitrary signal to a child process can cause issues.
>> Well, faulthandler is disabled by default, no?
>
> Yes, but I prefer not to interfere with unrelated processes if it's possible.
>

Well, those processes are started by subprocess, and this would be
enabled only on demand. I find it less risky than sending a signal to
a completely unrelated process.

> faulthandler.enable() installs a signal handler for SIGSEGV, SIGFPE,
> SIGABRT, SIGBUS and SIGILL signals. (SIGKILL cannot be handled by the
> application.)
>

We could use one of these signals.
History
Date                 User      Action  Args
2013-10-12 10:53:18  sbt       set     nosy: + sbt
2011-07-04 17:07:49  neologix  set     messages: + msg139790
2011-06-26 20:11:07  vstinner  set     messages: + msg139210
2011-06-26 10:25:41  neologix  set     messages: + msg139161
2011-06-26 00:43:39  vstinner  set     messages: + msg139136
2011-06-26 00:43:09  vstinner  set     messages: - msg139135
2011-06-26 00:42:55  vstinner  set     messages: + msg139135
2011-06-26 00:27:11  vstinner  set     messages: + msg139133
2011-06-25 23:53:30  neologix  create