classification
Title: multiprocessing cannot recover from crashed worker
Type: behavior
Stage: resolved
Components:
Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: issue22393 (multiprocessing.Pool shouldn't hang forever if a worker process dies unexpectedly)
Assigned To:
Nosy List: davin, pablogsal, pitrou, ppperry, steve.dower, vstinner
Priority: normal
Keywords:

Created on 2019-09-10 08:58 by steve.dower, last changed 2019-09-10 22:49 by vstinner. This issue is now closed.

Files
File name Uploaded Description
mp_exit.py vstinner, 2019-09-10 15:30
Messages (10)
msg351594 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-09-10 08:58
Imitation repro:

import os
from multiprocessing import Pool

def f(x):
    os._exit(0)
    return "success"

if __name__ == '__main__':
    with Pool(1) as p:
        print(p.map(f, [1]))


Obviously a process may crash for various other reasons besides os._exit().

I believe this is the cause of issue37245.
msg351690 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-10 14:52
> multiprocessing cannot recover from crashed worker

This issue has been seen on the macOS job of the Azure Pipeline: bpo-37245. I don't know if other platforms are affected.
msg351691 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-09-10 14:54
Windows is definitely affected, and you can run the repro in my first post to check other platforms.
msg351702 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-10 15:30
I converted the example into the attached file mp_exit.py and added a call to faulthandler to see what is going on.

Output with the master branch of Python:

vstinner@apu$ ~/python/master/python ~/mp_exit.py 
Timeout (0:00:05)!
Thread 0x00007ff40139a700 (most recent call first):
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 379 in _recv
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 414 in _recv_bytes
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 250 in recv
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 576 in _handle_results
  File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
  File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
  File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007ff401b9b700 (most recent call first):
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 528 in _handle_tasks
  File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
  File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
  File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007ff40239c700 (most recent call first):
  File "/home/vstinner/python/master/Lib/selectors.py", line 415 in select
  File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 930 in wait
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 499 in _wait_for_updates
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 519 in _handle_workers
  File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
  File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
  File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

Thread 0x00007ff4102cf740 (most recent call first):
  File "/home/vstinner/python/master/Lib/threading.py", line 303 in wait
  File "/home/vstinner/python/master/Lib/threading.py", line 565 in wait
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 759 in wait
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 762 in get
  File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 364 in map
  File "/home/vstinner/mp_exit.py", line 12 in <module>


In the main process, the Pool._handle_results() thread is blocked in os.read(), which never completes even though the child process has died and the other end of the pipe should therefore be closed.
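
One possible explanation, not confirmed in this thread, is that a read on a pipe only reports end-of-file once every copy of the write end has been closed, and the parent process may still hold one. The sketch below shows only that generic pipe behavior with a standalone multiprocessing.Pipe, not Pool internals. The names child and demo are hypothetical, and preferring the "fork" start method is an assumption made for the sketch.

```python
import multiprocessing as mp
import os

# Assumption: use the "fork" start method where available (the Linux default
# for the versions listed on this issue), else the platform default.
ctx = mp.get_context("fork" if "fork" in mp.get_all_start_methods() else None)


def child(writer):
    # Hypothetical worker: dies abruptly without ever writing to the pipe.
    os._exit(0)


def demo():
    reader, writer = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=child, args=(writer,))
    proc.start()
    proc.join()  # the child is gone at this point

    # The parent still holds its own copy of the write end, so the reader
    # sees neither data nor end-of-file yet.
    ready_while_writer_open = reader.poll(0.5)

    writer.close()  # drop the last remaining write end
    try:
        reader.recv()
        saw_eof = False
    except EOFError:
        saw_eof = True
    return ready_while_writer_open, saw_eof


if __name__ == "__main__":
    print(demo())
```

On Linux this is expected to report that the reader is not ready while the parent keeps the write end open, and that EOFError is raised only after the parent closes it.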
msg351703 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-10 15:32
> Windows is definitely affected, and you can run the repro in my first post to check other platforms.

Oh right, I can also reproduce the issue on Linux.

But I don't understand why test_multiprocessing_spawn works on all platforms, but only fails on macOS when run on Azure Pipelines.

Aaaaah, multiprocessing mysteries...
msg351705 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2019-09-10 15:38
Sharing for the sake of documenting a few things going on in this particular example:
* When a PoolWorker process exits in this way (os._exit(anything)), the PoolWorker never gets the chance to send a signal of failure (normally sent via the outqueue) to the MainProcess.
* In the current logic of the MainProcess, Pool._maintain_pool() detects the termination of that PoolWorker process and starts a new PoolWorker process to replace it, maintaining the desired size of Pool.
* The infinite hang observed in this example comes from the original p.map() call waiting with no timeout for a result to appear on the outqueue.  This wait is performed in MapResult.get(), which does expose a timeout parameter, although it cannot be controlled through Pool.map().  It is not at all a correct, general solution, but exposing this timeout and setting it to 1.0 seconds lets Steve's repro code snippet run to completion (no infinite hang; it raises a multiprocessing.context.TimeoutError).
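
A minimal sketch of the timeout behavior described in the last point, using the existing Pool.map_async() API instead of patching Pool.map(); the result object it returns has a get() that accepts a timeout. This only illustrates the workaround, which is not a general fix. The name map_with_timeout is hypothetical, and preferring the "fork" start method is an assumption made for the sketch.

```python
import multiprocessing as mp
import os

# Assumption: use the "fork" start method where available (the Linux default
# for the versions listed on this issue), else the platform default.
ctx = mp.get_context("fork" if "fork" in mp.get_all_start_methods() else None)


def f(x):
    os._exit(0)  # simulate a crashed worker, as in the repro
    return "success"  # never reached


def map_with_timeout(timeout=1.0):
    """Return the map result, or None if no result arrives within timeout."""
    with ctx.Pool(1) as p:
        try:
            # map_async() returns immediately; get() waits at most `timeout`
            # seconds instead of forever.
            return p.map_async(f, [1]).get(timeout=timeout)
        except mp.TimeoutError:
            return None


if __name__ == "__main__":
    print(map_with_timeout(1.0))
```

The caller still has to guess a suitable timeout, and the job that died with the worker is simply lost.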
msg351710 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2019-09-10 15:50
Thanks to Pablo's good work implementing the use of multiprocessing's Process.sentinel, the logic for handling PoolWorkers that die has been centralized into Pool._maintain_pool().  If _maintain_pool() can also identify which job died with the dead PoolWorker, then it should be possible to put a corresponding message on the outqueue to indicate that an exception occurred, while the pool otherwise continues its work.
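
For reference, a minimal sketch of how Process.sentinel can be used to notice that a process has died. It uses a bare Process rather than Pool internals, so it does not show how _maintain_pool() itself works. The names crashing_worker and detect_dead_worker are hypothetical, and preferring the "fork" start method is an assumption made for the sketch.

```python
import multiprocessing as mp
import os
from multiprocessing.connection import wait

# Assumption: use the "fork" start method where available (the Linux default
# for the versions listed on this issue), else the platform default.
ctx = mp.get_context("fork" if "fork" in mp.get_all_start_methods() else None)


def crashing_worker():
    # Hypothetical worker: exits abruptly with a recognizable exit code.
    os._exit(3)


def detect_dead_worker(timeout=5.0):
    proc = ctx.Process(target=crashing_worker)
    proc.start()
    # The sentinel becomes "ready" once the process has ended, so it can be
    # waited on together with other connections or sentinels.
    ready = wait([proc.sentinel], timeout=timeout)
    died = proc.sentinel in ready
    proc.join()  # reap the process so that exitcode is populated
    return died, proc.exitcode


if __name__ == "__main__":
    print(detect_dead_worker())
```

The sentinel only says that a process ended; mapping the dead process back to the job it was running is the part that would still need to be added.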


The question of whether Pool.map() should expose a timeout parameter deserves a separate discussion.  It should not be considered a path forward on this issue, because it would require users to always specify, and somehow know beforehand, how long workers should take to return results.  Exposing the timeout control may have other practical benefits elsewhere, but not here.
msg351746 - (view) Author: (ppperry) Date: 2019-09-10 22:24
Is this not a duplicate of issue22393?
msg351752 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2019-09-10 22:45
Agreed with @ppperry that this is a duplicate of issue22393.

The proposed patch in issue22393 is, for the moment, out of sync with more recent changes.  That patch's approach would result in the loss of all partial results from a Pool.map, but it may be faster to update and review.
msg351753 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-10 22:49
> Agreed with @ppperry that this is a duplicate of issue22393.

Ok, in that case I close this issue as a duplicate of bpo-22393. There is no need to duplicate the discussion here :-)
History
Date User Action Args
2019-09-10 22:49:35 vstinner set status: open -> closed; superseder: multiprocessing.Pool shouldn't hang forever if a worker process dies unexpectedly; messages: + msg351753; resolution: duplicate; stage: resolved
2019-09-10 22:45:53 davin set messages: + msg351752
2019-09-10 22:24:25 ppperry set nosy: + ppperry; messages: + msg351746
2019-09-10 15:50:24 davin set messages: + msg351710
2019-09-10 15:38:20 davin set messages: + msg351705
2019-09-10 15:32:12 vstinner set messages: + msg351703
2019-09-10 15:30:58 vstinner set files: + mp_exit.py; messages: + msg351702
2019-09-10 14:54:30 steve.dower set messages: + msg351691
2019-09-10 14:52:19 vstinner set messages: + msg351690
2019-09-10 14:51:36 vstinner set nosy: + vstinner, pablogsal
2019-09-10 08:58:29 steve.dower create