classification
Title: Option to kill "stuck" workers in a multiprocessing pool
Type: enhancement
Components: Library (Lib)
Versions: Python 3.3, Python 3.4

process
Status: open
Priority: normal
Nosy List: bquinlan, neologix, paul.moore, pitrou

Created on 2012-02-28 11:42 by paul.moore, last changed 2022-04-11 14:57 by admin.

Messages (3)
msg154549 - Author: Paul Moore (paul.moore) (Python committer) Date: 2012-02-28 11:42
I have an application which fires off a number of database connections via a multiprocessing pool. Unfortunately, the database software occasionally gets "stuck" and a connection request hangs indefinitely. This locks up the worker process making the connection, and the hang cannot be interrupted except by killing that process.

It would be useful to have a facility to restart "stuck" workers in this case.

As an interface, I would suggest an additional argument to the AsyncResult.get method, kill_on_timeout. If this argument is true, and the get times out, the worker servicing the result will be killed and restarted.

Alternatively, provide a method on an AsyncResult to access the worker process that is servicing the request. I could then wait on the result and kill the worker manually if it does not respond in time.
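A minimal sketch of how the first proposal might look in use; the kill_on_timeout argument is hypothetical and does not exist in multiprocessing today, and connect is a stand-in for the hanging database call:

    from multiprocessing import Pool, TimeoutError

    def connect(dsn):
        ...  # stand-in for a database call that may hang indefinitely

    if __name__ == "__main__":
        pool = Pool(processes=4)
        result = pool.apply_async(connect, ("db://example",))
        try:
            # Hypothetical flag: if the get times out, kill and restart
            # the worker servicing this result, then raise TimeoutError
            # as usual.
            result.get(timeout=30, kill_on_timeout=True)
        except TimeoutError:
            pass  # the pool keeps its full complement of workers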

Without a facility like this, the pool can be starved of workers if multiple connections hang.
msg154573 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-02-28 21:42
The problem is that queues and other synchronization objects can end up in an inconsistent state when a worker crashes, hangs or gets killed.
That's why, in concurrent.futures, a crashed worker makes the ProcessPoolExecutor become "broken". A similar thing should be done for multiprocessing.Pool but it's a more complex object.
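For illustration, a small sketch of the concurrent.futures behaviour described above: when a worker dies abruptly, pending futures fail with BrokenProcessPool and the executor refuses further work.

    import os
    from concurrent.futures import ProcessPoolExecutor
    from concurrent.futures.process import BrokenProcessPool

    def crash():
        os._exit(1)  # simulate a worker dying abruptly

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=1) as pool:
            future = pool.submit(crash)
            try:
                future.result()
            except BrokenProcessPool:
                # The executor is now unusable; later submits fail too.
                print("pool is broken")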
msg154575 - Author: Paul Moore (paul.moore) (Python committer) Date: 2012-02-28 22:24
As an alternative, maybe leave the "stuck" worker, but allow the pool to recognise when a worker has not processed new messages for a long period and spawn an extra worker to replace it. That would avoid the starvation issue, and the stuck workers would die when the pool is terminated.
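Until something along these lines exists, the coarse workaround available today is to time out the get and discard the whole pool. A minimal sketch, where connect again stands in for the hanging database call:

    import multiprocessing
    import time

    def connect(dsn):
        time.sleep(3600)  # stand-in for a database call that hangs

    if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=4)
        result = pool.apply_async(connect, ("db://example",))
        try:
            result.get(timeout=5)
        except multiprocessing.TimeoutError:
            # terminate() kills every worker, stuck ones included, at
            # the cost of losing the pool and any in-flight tasks.
            pool.terminate()
        else:
            pool.close()
        pool.join()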
History
Date                 | User       | Action | Args
2022-04-11 14:57:27  | admin      | set    | github: 58356
2012-03-07 20:12:38  | bquinlan   | set    | nosy: + bquinlan
2012-02-28 22:24:47  | paul.moore | set    | messages: + msg154575
2012-02-28 21:42:13  | pitrou     | set    | nosy: + neologix, pitrou; messages: + msg154573
2012-02-28 11:42:59  | paul.moore | create |