
make faulthandler dump traceback of child processes #56622

Closed
neologix mannequin opened this issue Jun 25, 2011 · 7 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

neologix mannequin commented Jun 25, 2011

BPO 12413
Nosy @vstinner

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2021-12-06.16:12:57.501>
created_at = <Date 2011-06-25.23:53:30.992>
labels = ['type-feature', 'library']
title = 'make faulthandler dump traceback of child processes'
updated_at = <Date 2021-12-06.16:12:57.500>
user = 'https://bugs.python.org/neologix'

bugs.python.org fields:

activity = <Date 2021-12-06.16:12:57.500>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2021-12-06.16:12:57.501>
closer = 'vstinner'
components = ['Library (Lib)']
creation = <Date 2011-06-25.23:53:30.992>
creator = 'neologix'
dependencies = []
files = []
hgrepos = []
issue_num = 12413
keywords = []
message_count = 7.0
messages = ['139132', '139133', '139136', '139161', '139210', '139790', '407826']
nosy_count = 3.0
nosy_names = ['vstinner', 'neologix', 'sbt']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue12413'
versions = []

neologix mannequin commented Jun 25, 2011

As noted in issue bpo-11870, making faulthandler capable of dumping child processes' tracebacks could be a great aid in debugging tricky deadlocks involving for example multiprocessing and subprocess.
Since there's no portable way to find out child processes, a possible idea would be to make the handler send a signal to its process group if the current process is the process group leader.
Advantages:

  • simple
  • async-safe

Drawbacks:

  • since all the processes receive the signal at the same time, their outputs will be interleaved (we could maybe add a random sleep before dumping the traceback?)
  • children that are not part of the same process group (for example those that called setsid() or setpgrp()) won't be handled

I'm not sure how this would work on Windows, but I don't even know whether Windows has a notion of child processes or process groups...
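The process-group idea above can be sketched in a few lines. This is a hypothetical handler, not anything faulthandler implements; the function name is invented, and it only illustrates the leader check and group-wide forwarding:

```python
import os
import signal
import faulthandler

def dump_group_tracebacks(signum, frame):
    # Hypothetical handler sketching the proposal: dump our own
    # traceback, then forward the signal to the whole process group
    # if we are its leader, so children dump theirs too.
    faulthandler.dump_traceback()
    if os.getpid() == os.getpgrp():
        # Avoid re-entering this handler when the group-wide signal
        # is delivered back to us.
        signal.signal(signum, signal.SIG_IGN)
        # Children that called setsid()/setpgrp() are missed, and
        # the children's outputs may interleave, as noted above.
        os.killpg(os.getpgrp(), signum)
```

The handler is only defined here; installing it with signal.signal() and triggering it is left out on purpose, since sending signals to a whole group is exactly the risky part under discussion.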

@neologix neologix mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jun 25, 2011
vstinner commented

Oh, I have already thought about this feature, but its implementation is not trivial.

> As noted in issue bpo-11870 ...

You mean that the tracebacks of children should be dumped on a timeout of the parent? Or do you also want them on a segfault of the parent? In my experience, the most common problem with the multiprocessing and subprocess modules is the hang.

The timeout is implemented using a (C) thread in faulthandler. You can do more in a thread than in a signal handler ;-) A hook may be added to faulthandler to execute code specific to multiprocessing / subprocess.
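That watchdog thread is what faulthandler.dump_traceback_later() exposes; a minimal sketch of arming it around code suspected of hanging:

```python
import sys
import faulthandler

# Arm the watchdog: a C thread (independent of the GIL) dumps the
# tracebacks of all Python threads if we are still running in 10 s.
faulthandler.dump_traceback_later(10, repeat=False, file=sys.stderr, exit=False)

# ... code that might deadlock would run here ...

# Disarm the watchdog once the critical section finished in time.
faulthandler.cancel_dump_traceback_later()
```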

> send a signal to its process group if the current process
> is the process group leader

In which case is Python the leader of the group? Is it the case by default? Can we do something to ensure that in regrtest, in multiprocessing tests or the multiprocessing module?

See also bpo-5115 (for the subprocess module).

The subprocess module maintains a list of the created subprocesses, subprocess.alive, but you need a reference to this list (or you can access it through the Python namespace, but that requires the GIL, and you cannot trust the GIL after a crash).

subprocess can execute any program, not only Python. Sending an arbitrary signal to a child process can cause issues.

Does multiprocessing maintain a list of child processes?

--

By the way, which signal do you want to send to the child processes? A test may replace the signal handler of your signal (most tests use SIGALRM and SIGUSR1).

faulthandler.register() is not available on Windows.

--

crier ( https://gist.github.com/737056 ) is a tool similar to faulthandler, but it is implemented in Python and so is less reliable. It uses a different trigger: it checks whether a file (e.g. /tmp/crier-<pid>) exists.

A file (e.g. a pipe) can be used with a thread watching the file to send the "please dump your traceback" request to the child processes.
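A sketch of that file-trigger approach, assuming a polling daemon thread per process (the function name and trigger path are invented for illustration):

```python
import os
import time
import threading
import faulthandler

def watch_trigger_file(path, interval=1.0):
    # Poll for a trigger file; when it appears, dump the tracebacks
    # of all threads and remove the file so the request is one-shot.
    while True:
        if os.path.exists(path):
            faulthandler.dump_traceback()
            os.unlink(path)
        time.sleep(interval)

# Each process (parent and children) would start one daemon watcher:
# threading.Thread(target=watch_trigger_file,
#                  args=("/tmp/crier-%d" % os.getpid(),),
#                  daemon=True).start()
```

Touching the per-pid file from outside would then make that process dump its traceback, without any signal being sent.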

vstinner commented

> since all the processes receive the signal at the same time,
> their outputs will be interleaved (we could maybe add a random
> sleep before dumping the traceback?)

If we have the pid list of the children, we can use an arbitrary sleep (e.g. 1 second) before sending a signal to the next pid. Anyway, a sleep is the most reliable synchronization code after a crash/timeout.
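That staggered loop could look roughly like this, assuming we somehow obtained the children's pids (the helper name and the SIGUSR1 choice are arbitrary):

```python
import os
import time
import signal

def signal_children_staggered(pids, signum=signal.SIGUSR1, delay=1.0):
    # Signal each child in turn, pausing between them so that the
    # tracebacks they dump do not interleave on the shared stderr.
    for pid in pids:
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            pass  # the child already exited; nothing to dump
        time.sleep(delay)
```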


neologix mannequin commented Jun 26, 2011

> You mean that the tracebacks of children should be dumped on a timeout of the parent? Or do you also want them on a segfault of the parent? In my experience, the most common problem with the multiprocessing and subprocess modules is the hang.

Well, a segfault is due to the current process (or sometimes to
external conditions like OOM, but that's not the point here), so it's
not really useful to dump tracebacks of child processes in that case.
I was more thinking about timeouts.

> The timeout is implemented using a (C) thread in faulthandler. You can do more in a thread than in a signal handler ;-) A hook may be added to faulthandler to execute code specific to multiprocessing / subprocess.

Yes, but when the timeout expires, there's no guarantee about the
state of the interpreter (for example in issue bpo-12352 it was the GC
that deadlocked), so I guess we can't do anything too fancy.

> In which case is Python the leader of the group? Is it the case by default? Can we do something to ensure that in regrtest, in multiprocessing tests or the multiprocessing module?

Yes, it's the case by default when you launch a process through a shell.

> The subprocess module maintains a list of the created subprocesses, subprocess.alive, but you need a reference to this list (or you can access it through the Python namespace, but that requires the GIL, and you cannot trust the GIL after a crash).
> Does multiprocessing maintain a list of child processes?

Yes, we don't have any guarantee about the interpreter's state, and
furthermore this won't work for processes calling fork() directly.

> subprocess can execute any program, not only Python. Sending an arbitrary signal to a child process can cause issues.

Well, faulthandler is disabled by default, no?

> By the way, which signal do you want to send to the child processes? A test may replace the signal handler of your signal (most tests use SIGALRM and SIGUSR1).

Hum, SIGTERM maybe? Don't you register some fatal signals by default?

@vstinner
Copy link
Member

>> In which case is Python the leader of the group? ...

> Yes, it's the case by default when you launch a process through a shell.

subprocess doesn't use a shell by default, and I don't think that
multiprocessing uses a shell to start Python.

>> The subprocess module maintains a list of the created subprocesses, subprocess.alive, but you need a reference to this list (or you can access it through the Python namespace, but that requires the GIL, and you cannot trust the GIL after a crash).
>> Does multiprocessing maintain a list of child processes?

> Yes, we don't have any guarantee about the interpreter's state, and
> furthermore this won't work for processes calling fork() directly.

I don't think that we can have a reliable, generic and portable solution
for this issue. I suggest focusing on only one use case (debugging the
multiprocessing and/or subprocess module), and later trying to support
more cases.

I agree that the interpreter state can be inconsistent, but faulthandler
already reads the interpreter state. We cannot do better than
"best effort". Well, it doesn't really matter if faulthandler crashes,
the program is already dying ;-)

To simplify the implementation, I propose to patch multiprocessing
and/or subprocess to register the pid of the child process in a list in
the faulthandler module.

It would be better if these modules unregistered pids when a subprocess
exits, but it's not mandatory. We can send a signal to a nonexistent
process. In the worst case, on a heavily loaded computer, another process
may get the same pid, but that's unlikely. I'm quite sure that
multiprocessing and subprocess already handle subprocess exit, so it
should be quite simple to add a hook.

>> subprocess can execute any program, not only Python.
>> Sending an arbitrary signal to a child process can cause issues.
> Well, faulthandler is disabled by default, no?

Yes, but I prefer not to interfere with unrelated processes if possible.

>> By the way, which signal do you want to send to the child processes?

> Hum, SIGTERM maybe? Don't you register some fatal signals by default?

faulthandler.enable() installs a signal handler for SIGSEGV, SIGBUS,
SIGILL and SIGABRT signals. (SIGKILL cannot be handled by the
application.)
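For reference, enabling and disabling those default handlers looks like this (faulthandler.enable() also covers SIGFPE, in addition to the signals listed above):

```python
import faulthandler

faulthandler.enable()   # install handlers for SIGSEGV, SIGFPE,
                        # SIGABRT, SIGBUS and SIGILL
assert faulthandler.is_enabled()
faulthandler.disable()  # restore the previous handlers
```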

>> A test may replace the signal handler of your signal

Well, it doesn't really matter. If one child process doesn't print the
traceback, you have less information, but it is unlikely, and we may get
the information manually or by temporarily changing the signal number.

neologix mannequin commented Jul 4, 2011

> subprocess doesn't use a shell by default, and I don't think that
> multiprocessing uses a shell to start Python.

No, but we precisely want subprocess/multiprocessing-created processes
to be in the same process group.

> To simplify the implementation, I propose to patch multiprocessing
> and/or subprocess to register the pid of the child process in a list in
> the faulthandler module.

> It would be better if these modules unregistered pids when a subprocess
> exits, but it's not mandatory. We can send a signal to a nonexistent
> process. In the worst case, on a heavily loaded computer, another process
> may get the same pid, but that's unlikely. I'm quite sure that
> multiprocessing and subprocess already handle subprocess exit, so it
> should be quite simple to add a hook.

It'll be intrusive and error-prone: for example, you'll have to reset
this list upon fork().
And sending a signal to an unrelated process is risky, no?

>>> subprocess can execute any program, not only Python.
>>> Sending an arbitrary signal to a child process can cause issues.
>> Well, faulthandler is disabled by default, no?

> Yes, but I prefer not to interfere with unrelated processes if possible.

Well, those processes are started by subprocess, and this would be
enabled only on demand. I find it less risky than sending a signal to
a completely unrelated process.

> faulthandler.enable() installs a signal handler for SIGSEGV, SIGBUS,
> SIGILL and SIGABRT signals. (SIGKILL cannot be handled by the
> application.)

We could use one of these signals.


vstinner commented Dec 6, 2021

There has been no activity for 10 years. I consider that this feature is not really needed, so I am rejecting this feature request.

@vstinner vstinner closed this as completed Dec 6, 2021
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022