This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Classification
Title: make faulthandler dump traceback of child processes
Type: enhancement Stage: resolved
Components: Library (Lib) Versions:

Process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: neologix, sbt, vstinner
Priority: normal Keywords:

Created on 2011-06-25 23:53 by neologix, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (7)
msg139132 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-06-25 23:53
As noted in issue #11870, making faulthandler capable of dumping child processes' tracebacks could be a great aid in debugging tricky deadlocks involving for example multiprocessing and subprocess.
Since there's no portable way to enumerate child processes, one possible approach would be to have the handler send a signal to its process group when the current process is the process group leader.
Advantages:
- simple
- async-safe
Drawbacks:
- since all the processes receive the signal at the same time, their output will be interleaved (we could maybe add a random sleep before dumping the traceback?)
- children that are not part of the same process group (for example those that called setsid() or setpgrp()) won't be handled
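The group-leader idea above could be sketched roughly as follows (a minimal sketch: `is_group_leader` and `dump_group_tracebacks` are hypothetical names, every child is assumed to have registered the same faulthandler signal, and faulthandler.register() is POSIX-only):

```python
import faulthandler
import os
import signal

def is_group_leader(pid: int, pgid: int) -> bool:
    """A process leads its group exactly when its pid equals the group id."""
    return pid == pgid

def dump_group_tracebacks(signum: int = signal.SIGUSR1) -> None:
    """Make `signum` dump this process's tracebacks; if we also lead the
    process group, fan the signal out to the whole group so every member
    (children included) dumps its own traceback."""
    faulthandler.register(signum, all_threads=True)  # POSIX only
    if is_group_leader(os.getpid(), os.getpgrp()):
        os.killpg(os.getpgrp(), signum)
```

As noted in the drawbacks, children that moved to their own group via setsid()/setpgrp() would never see the signal.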

I'm not sure how this would work out on Windows, but I don't even know whether Windows has a notion of child processes or process groups...
msg139133 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-06-26 00:27
Oh oh. I have already thought about this feature, but its implementation is not trivial.

> As noted in issue #11870 ...

You mean that the tracebacks of children should be dumped on a timeout of the parent? Or do you also want them on a segfault of the parent? In my experience, the most common problem with the multiprocessing and subprocess modules is the hang.

The timeout is implemented using a (C) thread in faulthandler. You can do more in a thread than in a signal handler ;-) A hook could be added to faulthandler to execute code specific to multiprocessing / subprocess.
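In modern CPython that watchdog thread is exposed as faulthandler.dump_traceback_later(); a hook wrapped around it might look like this (`run_with_watchdog` is a hypothetical wrapper, not an existing API):

```python
import faulthandler

def run_with_watchdog(func, timeout: float) -> bool:
    """Arm faulthandler's C watchdog thread: if `func` has not returned
    after `timeout` seconds, the tracebacks of all Python threads are
    dumped to stderr (without killing the process, since exit=False).
    Returns True when the work finished in time."""
    faulthandler.dump_traceback_later(timeout, exit=False)
    try:
        func()
    finally:
        faulthandler.cancel_dump_traceback_later()  # disarm the watchdog
    return True
```

A multiprocessing/subprocess-specific hook could then run extra code (such as signaling children) from that same thread, where far more is allowed than in a signal handler.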

> send a signal to its process group if the current process
> is the process group leader

In which case is Python the leader of the group? Is it the case by default? Can we do something to ensure that in regrtest, in multiprocessing tests or the multiprocessing module?

See also #5115 (for the subprocess module).

subprocess maintains a list of the created child processes (subprocess.alive), but you need a reference to this list (or you can access it through the Python namespace, but that requires the GIL, and you cannot trust the GIL after a crash).

subprocess can execute any program, not only Python. Sending an arbitrary signal to a child process can cause issues.

Does multiprocessing maintain a list of child processes?

--

By the way, which signal do you want to send to the child processes? A test may replace the signal handler of your signal (most test use SIGALRM and SIGUSR1).

faulthandler.register() is not available on Windows.

--

crier ( https://gist.github.com/737056 ) is a tool similar to faulthandler, but it is implemented in Python and so is less reliable. It uses a different trigger: it checks whether a file (e.g. /tmp/crier-<pid>) exists.

A file (e.g. a pipe), with a thread watching it, can be used to send the "please dump your traceback" request to the child processes.
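A pure-Python variant of that trigger could look like this (a sketch; `check_trigger` and `watch_trigger` are made-up names following crier's file-based convention):

```python
import faulthandler
import os
import threading
import time

def check_trigger(path: str) -> bool:
    """One poll step: if the trigger file exists, dump the tracebacks of
    all threads to stderr, remove the file, and report that we dumped."""
    if not os.path.exists(path):
        return False
    faulthandler.dump_traceback()
    os.unlink(path)
    return True

def watch_trigger(path: str, interval: float = 1.0) -> threading.Thread:
    """Start a daemon thread that polls for `path` (e.g. /tmp/crier-<pid>)
    and dumps tracebacks whenever the file appears."""
    def poll():
        while True:
            check_trigger(path)
            time.sleep(interval)
    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    return thread
```

Being pure Python, this watcher needs the interpreter (and the GIL) to be healthy, which is exactly the reliability caveat mentioned above.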
msg139136 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-06-26 00:43
> since all the processes receive the signal at the same time,
> their outputs will be interleaved (we could maybe add a random
> sleep before dumping the traceback?)

If we have the pid list of the children, we can use an arbitrary sleep (e.g. 1 second) before sending the signal to the next pid. Anyway, a sleep is the most reliable synchronization we can get after a crash/timeout.
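With a pid list, the staggering could be as simple as this (a sketch; `signal_children_staggered` is a hypothetical helper and the 1-second delay is arbitrary):

```python
import os
import signal
import time

def signal_children_staggered(pids, signum=signal.SIGUSR1, delay=1.0):
    """Send `signum` to each child in turn, sleeping between sends so the
    children's traceback dumps don't interleave on a shared stderr.
    Returns how many processes were actually signaled."""
    sent = 0
    for pid in pids:
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            continue  # child already exited; skip the stale pid
        sent += 1
        time.sleep(delay)
    return sent
```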
msg139161 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-06-26 10:25
> You mean that the tracebacks of children should be dumped on a timeout of the parent? Or do you also want them on a segfault of the parent? In my experience, the most common problem with the multiprocessing and subprocess modules is the hang.
>

Well, a segfault is due to the current process (or sometimes to
external conditions like OOM, but that's not the point here), so it's
not really useful to dump tracebacks of child processes in that case.
I was more thinking about timeouts.

> The timeout is implemeted using a (C) thread in faulthandler. You can do more in a thread than in a signal handler ;-) A hook may be added to faulthandler to execute code specific to multiprocessing / subprocess.
>

Yes, but when the timeout expires, there's no guarantee about the
state of the interpreter (for example in issue #12352 it was the GC
that deadlocked), so I guess we can't do anything too fancy.

> In which case is Python the leader of the group? Is it the case by default? Can we do something to ensure that in regrtest, in multiprocessing tests or the multiprocessing module?
>

Yes, it's the case by default when you launch a process through a shell.

> subprocess maintains a list of the created child processes (subprocess.alive), but you need a reference to this list (or you can access it through the Python namespace, but that requires the GIL, and you cannot trust the GIL after a crash).
> Does multiprocessing maintain a list of child processes?

Yes, we don't have any guarantee about the interpreter's state, and
furthermore this won't work for processes calling fork() directly.

> subprocess can execute any program, not only Python. Send an arbitrary signal to a child process can cause issues.
>

Well, faulthandler is disabled by default, no?

> By the way, which signal do you want to send to the child processes? A test may replace the signal handler of your signal (most test use SIGALRM and SIGUSR1).
>

Hum, SIGTERM maybe? Don't you register some fatal signals by default?
msg139210 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-06-26 20:11
> > In which case is Python the leader of the group? ...
> 
> Yes, it's the case by default when you launch a process through a shell.

subprocess doesn't use a shell by default, and I don't think that
multiprocessing uses a shell to start Python.

> > subprocess maintains a list of the created child processes (subprocess.alive), but you need a reference to this list (or you can access it through the Python namespace, but that requires the GIL, and you cannot trust the GIL after a crash).
> > Does multiprocessing maintain a list of child processes?
> 
> Yes, we don't have any guarantee about the interpreter's state, and
> furthermore this won't work for processes calling fork() directly.

I don't think that we can have a reliable, generic and portable solution
for this issue. I suggest focusing on one use case (debugging the
multiprocessing and/or subprocess modules) and later trying to support
more cases.

I agree that the interpreter state can be inconsistent, but faulthandler
already reads the interpreter state. We cannot do better than
"best effort". Well, it doesn't really matter if faulthandler crashes,
the program is already dying ;-)

To simplify the implementation, I propose to patch multiprocessing
and/or subprocess to register the pid of the child process in a list in
the faulthandler module.

It would be better if these modules unregistered pids when a subprocess
exits, but it's not mandatory. We can send a signal to a nonexistent
process. In the worst case, on a heavily loaded computer, another process
may get the same pid, but that's unlikely. I'm quite sure that
multiprocessing and subprocess already handle subprocess exit, so it
should be quite simple to add a hook.

> > subprocess can execute any program, not only Python.
> > Send an arbitrary signal to a child process can cause issues.
> Well, faulthandler is disabled by default, no ?

Yes, but I prefer not to interfere with unrelated processes if possible.

> > By the way, which signal do you want to send to the child processes?
> 
> Hum, SIGTERM maybe? Don't you register some fatal signals by default?

faulthandler.enable() installs a signal handler for SIGSEGV, SIGBUS,
SIGILL and SIGABRT signals. (SIGKILL cannot be handled by the
application.)
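For reference, the non-fatal, on-demand case is what faulthandler.register() covers (POSIX only; SIGUSR1 here is just one reasonable choice of signal):

```python
import faulthandler
import signal

# enable() handles the fatal signals listed above; for an on-demand dump,
# a separate, normally-harmless signal can be hooked explicitly:
faulthandler.register(signal.SIGUSR1, all_threads=True)
# `kill -USR1 <pid>` now makes this process dump all thread tracebacks
# to stderr without terminating it.
```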

> > A test may replace the signal handler of your signal

Well, it doesn't really matter. If one child process doesn't print its
traceback, you get less information, but this is unlikely, and we can get
the information manually or by temporarily changing the signal number.
msg139790 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-07-04 17:07
> subprocess doesn't use a shell by default, and I don't think that
> multiprocessing uses a shell to start Python.
>

No, but we precisely want subprocess/multiprocessing-created processes
to be in the same process group.

> To simplify the implementation, I propose to patch multiprocessing
> and/or subprocess to register the pid of the child process in a list in
> the faulthandler module.
>
> It would be better if these modules unregistered pids when a subprocess
> exits, but it's not mandatory. We can send a signal to a nonexistent
> process. In the worst case, on a heavily loaded computer, another process
> may get the same pid, but that's unlikely. I'm quite sure that
> multiprocessing and subprocess already handle subprocess exit, so it
> should be quite simple to add a hook.
>

It'll be intrusive and error-prone: for example, you'll have to reset
this list upon fork().
And sending a signal to an unrelated process is risky, no?

>> > subprocess can execute any program, not only Python.
>> > Send an arbitrary signal to a child process can cause issues.
>> Well, faulthandler is disabled by default, no ?
>
> Yes, but I prefer not to interfere with unrelated processes if possible.
>

Well, those processes are started by subprocess, and this would be
enabled only on demand. I find it less risky than sending a signal to
a completely unrelated process.

> faulthandler.enable() installs a signal handler for SIGSEGV, SIGBUS,
> SIGILL and SIGABRT signals. (SIGKILL cannot be handled by the
> application.)
>

We could use one of these signals.
msg407826 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-12-06 16:12
There has been no activity for 10 years. I consider that this feature is not really needed, so I am rejecting this feature request.
History
Date User Action Args
2022-04-11 14:57:19  admin     set     github: 56622
2021-12-06 16:12:57  vstinner  set     status: open -> closed
                                       resolution: rejected
                                       messages: + msg407826
                                       stage: needs patch -> resolved
2013-10-12 10:53:18  sbt       set     nosy: + sbt
2011-07-04 17:07:49  neologix  set     messages: + msg139790
2011-06-26 20:11:07  vstinner  set     messages: + msg139210
2011-06-26 10:25:41  neologix  set     messages: + msg139161
2011-06-26 00:43:39  vstinner  set     messages: + msg139136
2011-06-26 00:43:09  vstinner  set     messages: - msg139135
2011-06-26 00:42:55  vstinner  set     messages: + msg139135
2011-06-26 00:27:11  vstinner  set     messages: + msg139133
2011-06-25 23:53:30  neologix  create