
Classification
Title: Deadlock in multiprocessing.pool.Pool on terminate
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7

Process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: davin, ionelmc, mapozyan, pitrou
Priority: normal
Keywords:

Created on 2017-03-08 17:50 by mapozyan, last changed 2022-04-11 14:58 by admin.

Messages (5)
msg289248 - Author: Michael (mapozyan) Date: 2017-03-08 17:50
The following code snippet causes a deadlock on Linux:

"""
import multiprocessing.pool
import signal


def signal_handler(signum, frame):
    pass

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, signal_handler)
    pool = multiprocessing.pool.Pool(processes=1)
    pool.terminate() # alternatively - raise Exception("EXCEPTION")
"""

The reason is that the termination code starts before the worker processes are fully initialized.

Here, the parent process acquires a lock that it never releases:

"""
    @staticmethod
    def _help_stuff_finish(inqueue, task_handler, size):
        # task_handler may be blocked trying to put items on inqueue
        util.debug('removing tasks from inqueue until task handler finished')
        inqueue._rlock.acquire()          # <-- acquired here and never released
        while task_handler.is_alive() and inqueue._reader.poll():
            inqueue._reader.recv()
            time.sleep(0)
"""

The worker processes then get stuck here:

"""
def worker(...):

    while maxtasks is None or (maxtasks and completed < maxtasks):
        try:
            task = get()                  # <-- blocks trying to acquire the same lock
        except (EOFError, OSError):
            util.debug('worker got EOFError or OSError -- exiting')
            break


"""

What's going on then? Since the default process start method is 'fork', worker subprocesses inherit the parent's signal handler. As a result, trying to terminate the workers from _terminate_pool() has no effect: the SIGTERM is caught by the inherited no-op handler. Finally, the processes deadlock when the parent join()-s the workers.
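
The handler inheritance described above can be checked in isolation. The following is a minimal sketch, not taken from the report: the helper names (`noop_handler`, `report_handler`, `child_has_default_sigterm`) are made up for illustration, and it assumes a platform where the 'fork' start method is available (for example Linux).

```python
import multiprocessing
import signal


def noop_handler(signum, frame):
    # Same shape as the handler in the report: swallow the signal.
    pass


def report_handler(conn):
    # Runs in the child: report whether SIGTERM still has its default action.
    conn.send(signal.getsignal(signal.SIGTERM) is signal.SIG_DFL)
    conn.close()


def child_has_default_sigterm():
    # Explicitly request 'fork' so the child starts as a copy of the parent.
    ctx = multiprocessing.get_context('fork')
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=report_handler, args=(child_conn,))
    proc.start()
    result = parent_conn.recv()
    proc.join()
    return result


if __name__ == '__main__':
    signal.signal(signal.SIGTERM, noop_handler)
    # Prints whether the forked child sees the default SIGTERM action.
    print(child_has_default_sigterm())
```

If the child reports that SIGTERM is not at its default action, a SIGTERM sent to it is handled by the inherited no-op function instead of ending the process.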
msg289249 - Author: Michael (mapozyan) Date: 2017-03-08 18:17
This patch kind of solves the issue. Not a nice one, but perhaps the safest one.

https://github.com/michael-a-cliqz/cpython/commit/1536c8c8cfc5a87ad4ab84d1248cb50fefe166ae
msg297215 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-28 19:50
This is tricky to reproduce but can definitely happen. It seems it is enough to release the lock when done. Why did you have to move the `task_handler._state = TERMINATE` line in your patch?
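
For illustration only, "release the lock when done" could look roughly like the sketch below. This is not the patch under discussion and not the Pool implementation: `drain` and `keep_going` are hypothetical names, and a plain `multiprocessing.Pipe` and `multiprocessing.Lock` stand in for `inqueue._reader` and `inqueue._rlock`.

```python
import multiprocessing
import time


def drain(reader, rlock, keep_going):
    """Discard pending items from reader, holding rlock only while draining."""
    drained = 0
    # The with-statement releases the lock even if recv() raises,
    # so other readers are not blocked forever afterwards.
    with rlock:
        while keep_going() and reader.poll():
            reader.recv()
            drained += 1
            time.sleep(0)
    return drained


if __name__ == '__main__':
    reader, writer = multiprocessing.Pipe(duplex=False)
    rlock = multiprocessing.Lock()
    for item in range(3):
        writer.send(item)
    print(drain(reader, rlock, lambda: True))
    # A non-blocking acquire succeeds only if drain() released the lock.
    print(rlock.acquire(False))
```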
msg297733 - Author: Michael (mapozyan) Date: 2017-07-05 12:41
If `task_handler._state = TERMINATE` is set before the call to _help_stuff_finish(), then the loop `while task_handler.is_alive() and inqueue._reader.poll()` in that function won't run, as `is_alive()` will obviously return False.
msg297735 - Author: Michael (mapozyan) Date: 2017-07-05 13:08
I found a couple of other cases where a deadlock still occurs.

1. _help_stuff_finish may remove sentinels from the queue. Some of the workers will then never get a signal to terminate.

2. The worker handler thread may be terminated too late, so it may spawn new workers while termination is in progress.

I tried to fix these two issues too in the following commit: https://github.com/michael-a-cliqz/cpython/commit/3a767ee7b33a194c193e39e0f614796130568630
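
To make case 1 concrete, here is a simplified model, not the code from the commit above: a plain in-process `queue.Queue` stands in for the pool's inqueue, `None` plays the role of the sentinel, and `drain_keeping_sentinels` is a hypothetical helper. It shows one way a drain step could avoid swallowing sentinels, by counting them and putting them back.

```python
import queue


def drain_keeping_sentinels(q):
    """Discard pending tasks from q, but put every None sentinel back."""
    sentinels = 0
    discarded = 0
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break
        if item is None:
            sentinels += 1   # remember it instead of dropping it
        else:
            discarded += 1   # an ordinary task, safe to throw away
    # Re-queue the sentinels so each worker can still see its exit signal.
    for _ in range(sentinels):
        q.put(None)
    return discarded, sentinels


if __name__ == '__main__':
    q = queue.Queue()
    for item in ('task-a', None, 'task-b', None):
        q.put(item)
    print(drain_keeping_sentinels(q))
    print(q.qsize())
```

A drain that simply discards every item would leave the queue empty here, which is the situation in which some workers never receive a signal to terminate.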

NB: This updated snippet has a higher chance of deadlocking:

"""
import logging
import multiprocessing.pool
import signal
import time

def foo(num):
    return num * num

def signal_handler(signum, frame):
    pass

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, signal_handler)

    logger = multiprocessing.log_to_stderr()
    logger.setLevel(logging.DEBUG)

    pool = multiprocessing.pool.Pool(processes=16)
    time.sleep(0.5)
    pool.map_async(foo, range(16))
    pool.terminate()
"""

(I am running it in an infinite loop from a shell script.)

History
Date  User  Action  Args
2022-04-11 14:58:44  admin  set  github: 73945
2020-05-06 13:48:01  ionelmc  set  nosy: + ionelmc
2017-07-05 13:08:41  mapozyan  set  messages: + msg297735
2017-07-05 12:41:11  mapozyan  set  messages: + msg297733
2017-06-28 19:50:46  pitrou  set  nosy: + pitrou; messages: + msg297215
2017-03-09 07:02:24  xiang.zhang  set  nosy: + davin; versions: - Python 3.3, Python 3.4
2017-03-09 00:08:28  mapozyan  set  type: behavior; versions: + Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6
2017-03-08 18:17:17  mapozyan  set  messages: + msg289249
2017-03-08 17:50:46  mapozyan  create