Title: multiprocessing.Pool shouldn't hang forever if a worker process dies unexpectedly
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.5
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Francis Bolduc, brianboonstra, cvrebert, dan.oreilly, davin, jnoller, pitrou, sbt
Priority: normal Keywords: patch

Created on 2014-09-11 22:33 by dan.oreilly, last changed 2017-06-01 20:53 by Francis Bolduc.

File name Uploaded Description
multiproc_broken_pool.diff dan.oreilly, 2014-09-11 22:33 Abort running task and close down a pool if a worker is unexpectedly terminated.
Messages (2)
msg226805 - Author: Dan O'Reilly (dan.oreilly) Date: 2014-09-11 22:33
This is essentially a dupe of issue9205, but it was suggested I open a new issue, since that one ended up being used to fix this same problem in concurrent.futures, and was subsequently closed.

Right now, should a worker process in a Pool be unexpectedly terminated while a blocking Pool method (e.g. apply, map) is running, the method will hang forever. This isn't a normal occurrence, but it does occasionally happen (either because someone sends a SIGTERM, or because of a bug in the interpreter or a C extension). It would be preferable for multiprocessing to follow the lead of concurrent.futures.ProcessPoolExecutor when this happens: abort all running tasks and close down the Pool.

Attached is a patch that implements this behavior. Should a process in a Pool unexpectedly exit (meaning, *not* because of hitting the maxtasksperchild limit), the Pool will be closed/terminated and all cached/running tasks will raise a BrokenProcessPool exception. These changes also prevent the Pool from going into a bad state if the "initializer" function raises an exception (previously, the pool would end up infinitely starting new processes, which would immediately die because of the exception).

One concern with the patch: The way timings are altered by these changes, the Pool seems to be particularly susceptible to issue6721 in certain cases. If processes in the Pool are being restarted due to maxtasksperchild just as the worker is being closed or joined, there is a chance the worker will be forked while some of the debug logging inside Pool is running (and holding locks on either sys.stdout or sys.stderr). When this happens, the worker deadlocks on startup, which will hang the whole program. I believe the current implementation is susceptible to this as well, but I could reproduce it much more consistently with this patch. I think it's rare enough in practice that it shouldn't prevent the patch from being accepted, but thought I should point it out.

(I do think issue6721 should be addressed, or at the very least internal I/O locks should always be reset after forking.)
msg294968 - Author: Francis Bolduc (Francis Bolduc) Date: 2017-06-01 20:53
This problem also happens simply by calling sys.exit from one of the child processes.

The following script exhibits the problem:

import multiprocessing
import sys
def test(value):
    if value:
        sys.exit(1)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    cases = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    pool.map(test, cases)
Date User Action Args
2017-06-01 20:53:39Francis Bolducsetnosy: + Francis Bolduc
messages: + msg294968
2015-12-27 17:11:24davinlinkissue25908 dependencies
2015-10-11 17:21:58davinsetnosy: + davin
2015-09-16 17:01:10berker.peksaglinkissue24927 superseder
2015-09-16 12:22:35brianboonstrasetnosy: + brianboonstra
2014-09-12 16:57:22cvrebertsetnosy: + cvrebert
2014-09-11 22:33:06dan.oreillycreate