
multiprocessing cannot recover from crashed worker #82265

Closed

zooba opened this issue Sep 10, 2019 · 10 comments
Labels
3.7 (EOL: end of life), 3.8 (only security fixes), 3.9 (only security fixes), type-bug (an unexpected behavior, bug, or error)

Comments

@zooba
Member

zooba commented Sep 10, 2019

BPO 38084
Nosy @pitrou, @vstinner, @zooba, @pppery, @applio, @pablogsal
Superseder
  • bpo-22393: multiprocessing.Pool shouldn't hang forever if a worker process dies unexpectedly
Files
  • mp_exit.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-09-10.22:49:35.737>
    created_at = <Date 2019-09-10.08:58:28.996>
    labels = ['3.8', 'type-bug', '3.7', '3.9']
    title = 'multiprocessing cannot recover from crashed worker'
    updated_at = <Date 2019-09-10.22:49:35.736>
    user = 'https://github.com/zooba'

    bugs.python.org fields:

    activity = <Date 2019-09-10.22:49:35.736>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-09-10.22:49:35.737>
    closer = 'vstinner'
    components = []
    creation = <Date 2019-09-10.08:58:28.996>
    creator = 'steve.dower'
    dependencies = []
    files = ['48603']
    hgrepos = []
    issue_num = 38084
    keywords = []
    message_count = 10.0
    messages = ['351594', '351690', '351691', '351702', '351703', '351705', '351710', '351746', '351752', '351753']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'vstinner', 'steve.dower', 'ppperry', 'davin', 'pablogsal']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '22393'
    type = 'behavior'
    url = 'https://bugs.python.org/issue38084'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @zooba
    Member Author

    zooba commented Sep 10, 2019

    Imitation repro:

    import os
    from multiprocessing import Pool

    def f(x):
        os._exit(0)        # simulate the worker process crashing before it can send a result
        return "success"   # never reached

    if __name__ == '__main__':
        with Pool(1) as p:
            print(p.map(f, [1]))   # hangs forever; the pool never recovers

    Obviously a process may crash for various other reasons besides os._exit().
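
    For example, a POSIX-only variant of the repro (a sketch, not from the original report) in which the worker dies from a hard SIGKILL rather than os._exit() hangs the same way:

    import os
    import signal
    from multiprocessing import Pool

    def f(x):
        os.kill(os.getpid(), signal.SIGKILL)   # hard crash: no Python-level cleanup runs
        return "success"

    if __name__ == '__main__':
        with Pool(1) as p:
            print(p.map(f, [1]))   # hangs just like the os._exit() version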

    I believe this is the cause of bpo-37245.

    @zooba added the 3.7 (EOL), 3.8, 3.9, and type-bug labels on Sep 10, 2019
    @vstinner
    Member

    vstinner commented Sep 10, 2019

    > multiprocessing cannot recover from crashed worker

    This issue has been seen on the macOS job of the Azure Pipeline: bpo-37245. I don't know if other platforms are affected.

    @zooba
    Member Author

    zooba commented Sep 10, 2019

    Windows is definitely affected, and you can run the repro in my first post to check other platforms.

    @vstinner
    Member

    vstinner commented Sep 10, 2019

    I converted the example into attached file mp_exit.py and I added a call to faulthandler to see what is going on.
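
    For reference, a plausible reconstruction of mp_exit.py (the attached file itself is not reproduced here; the exact faulthandler call is an assumption):

    import faulthandler
    import os
    from multiprocessing import Pool

    def f(x):
        os._exit(0)   # worker exits abruptly and never sends a result back
        return "success"

    if __name__ == '__main__':
        # Dump the traceback of every thread and exit if we are still running after 5 seconds.
        faulthandler.dump_traceback_later(timeout=5, exit=True)
        with Pool(1) as p:
            print(p.map(f, [1]))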

    Output with the master branch of Python:

    vstinner@apu$ ~/python/master/python ~/mp_exit.py
    Timeout (0:00:05)!
    Thread 0x00007ff40139a700 (most recent call first):
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 379 in _recv
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 414 in _recv_bytes
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 250 in recv
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 576 in _handle_results
    File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
    File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
    File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007ff401b9b700 (most recent call first):
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 528 in _handle_tasks
    File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
    File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
    File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007ff40239c700 (most recent call first):
    File "/home/vstinner/python/master/Lib/selectors.py", line 415 in select
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 930 in wait
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 499 in _wait_for_updates
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 519 in _handle_workers
    File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
    File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
    File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007ff4102cf740 (most recent call first):
    File "/home/vstinner/python/master/Lib/threading.py", line 303 in wait
    File "/home/vstinner/python/master/Lib/threading.py", line 565 in wait
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 759 in wait
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 762 in get
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 364 in map
    File "/home/vstinner/mp_exit.py", line 12 in <module>

    In the main process, the Pool._handle_results() thread is blocked on os.read(), which never completes even though the child process died and the other end of the pipe should therefore be closed.
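
    A plausible explanation (not spelled out in the thread, so treat it as an assumption): the parent process keeps its own copy of the write end of the result pipe, and a pipe read only reaches EOF once every write end is closed, so a dead worker alone is not enough to unblock the reader. A minimal POSIX-only sketch of that pipe semantics:

    import os

    r, w = os.pipe()
    pid = os.fork()                  # POSIX-only illustration
    if pid == 0:                     # "worker": inherits both fds, writes nothing, dies
        os._exit(0)
    os.waitpid(pid, 0)               # the child is definitely gone now...
    # ...yet os.read(r, 1) would still block here, because the parent's own
    # write fd `w` is open and the pipe has not reached EOF. Closing it is
    # what finally unblocks reads:
    os.close(w)
    print(os.read(r, 1))             # b'' -> EOF is returned immediately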

    @vstinner
    Member

    vstinner commented Sep 10, 2019

    > Windows is definitely affected, and you can run the repro in my first post to check other platforms.

    Oh right, I can also reproduce the issue on Linux.

    But I don't understand why test_multiprocessing_spawn works everywhere else and only fails on macOS when run on Azure Pipelines.

    Aaaaah, multiprocessing mysteries...

    @applio
    Member

    applio commented Sep 10, 2019

    Sharing for the sake of documenting a few things going on in this particular example:

    • When a PoolWorker process exits in this way (os._exit(anything)), the PoolWorker never gets the chance to send a signal of failure (normally sent via the outqueue) to the MainProcess.
    • In the current logic of the MainProcess, Pool._maintain_pool() detects the termination of that PoolWorker process and starts a new PoolWorker process to replace it, maintaining the desired size of Pool.
    • The infinite hang observed in this example comes from the original p.map() call waiting, with no timeout, for a result to appear on the outqueue. This wait happens in MapResult.get(), which does expose a timeout parameter, but that parameter cannot be controlled through Pool.map(). It is not at all a correct, general solution, but exposing control over this timeout and setting it to 1.0 seconds lets Steve's repro snippet run to completion: there is no infinite hang, and a multiprocessing.context.TimeoutError is raised instead (see the sketch below).
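
    A minimal sketch of that kind of bounded wait (illustrative only, not a proposed fix): Pool.map_async() already returns a result object whose get() accepts a timeout, so the hang becomes a TimeoutError instead:

    import os
    import multiprocessing
    from multiprocessing import Pool

    def f(x):
        os._exit(0)        # worker dies before it can send any result back
        return "success"

    if __name__ == '__main__':
        with Pool(1) as p:
            result = p.map_async(f, [1])          # returns immediately
            try:
                print(result.get(timeout=1.0))    # bounded wait instead of map()'s unlimited wait
            except multiprocessing.TimeoutError:
                print("worker died; no result arrived within the timeout")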

    @applio
    Member

    applio commented Sep 10, 2019

    Thanks to Pablo's good work implementing the use of multiprocessing's Process.sentinel, the logic for handling PoolWorkers that die has been centralized into Pool._maintain_pool(). If _maintain_pool() can also identify which job died with the dead PoolWorker, then it should be possible to put a corresponding message on the outqueue to indicate that an exception occurred but that the pool can otherwise continue its work.
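
    As an illustration of that mechanism (a sketch of the building blocks, not of Pool's internals): multiprocessing.connection.wait() on a worker's Process.sentinel returns as soon as that worker dies, which is what lets a supervising loop notice the death and react:

    import os
    import time
    from multiprocessing import Process
    from multiprocessing.connection import wait

    def worker():
        time.sleep(0.1)
        os._exit(1)          # die abruptly, without any cleanup or result

    if __name__ == '__main__':
        p = Process(target=worker)
        p.start()
        # wait() returns the sentinels that became ready, i.e. processes that have ended.
        ready = wait([p.sentinel], timeout=5.0)
        if ready:
            p.join()
            print("worker died, exitcode:", p.exitcode)   # exitcode: 1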

    The question of whether Pool.map() should expose a timeout parameter deserves a separate discussion and should not be considered a path forward on this issue, as it would require users to always specify, and somehow know beforehand, how long it should take for results to be returned from workers. Exposing the timeout control may have practical benefits elsewhere, but not here.

    @pppery
    Mannequin

    pppery mannequin commented Sep 10, 2019

    Is this not a duplicate of bpo-22393?

    @applio
    Member

    applio commented Sep 10, 2019

    Agreed with @PPPerry that this is a duplicate of bpo-22393.

    The proposed patch in bpo-22393 is, for the moment, out of sync with more recent changes. That patch's approach would result in the loss of all partial results from a Pool.map, but it may be faster to update and review.

    @vstinner
    Member

    vstinner commented Sep 10, 2019

    > Agreed with @PPPerry that this is a duplicate of bpo-22393.

    Ok, in that case I close this issue as a duplicate of bpo-22393. There is no need to duplicate the discussion here :-)
