
multiprocessing cannot recover from crashed worker #82265

Closed

zooba opened this issue Sep 10, 2019 · 10 comments
Labels
3.7 (EOL: end of life), 3.8 (only security fixes), 3.9 (only security fixes), type-bug (an unexpected behavior, bug, or error)

Comments

@zooba
Member

zooba commented Sep 10, 2019

BPO 38084
Nosy @pitrou, @vstinner, @zooba, @pppery, @applio, @pablogsal
Superseder
  • bpo-22393: multiprocessing.Pool shouldn't hang forever if a worker process dies unexpectedly
Files
  • mp_exit.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-09-10.22:49:35.737>
    created_at = <Date 2019-09-10.08:58:28.996>
    labels = ['3.8', 'type-bug', '3.7', '3.9']
    title = 'multiprocessing cannot recover from crashed worker'
    updated_at = <Date 2019-09-10.22:49:35.736>
    user = 'https://github.com/zooba'

    bugs.python.org fields:

    activity = <Date 2019-09-10.22:49:35.736>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-09-10.22:49:35.737>
    closer = 'vstinner'
    components = []
    creation = <Date 2019-09-10.08:58:28.996>
    creator = 'steve.dower'
    dependencies = []
    files = ['48603']
    hgrepos = []
    issue_num = 38084
    keywords = []
    message_count = 10.0
    messages = ['351594', '351690', '351691', '351702', '351703', '351705', '351710', '351746', '351752', '351753']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'vstinner', 'steve.dower', 'ppperry', 'davin', 'pablogsal']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '22393'
    type = 'behavior'
    url = 'https://bugs.python.org/issue38084'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @zooba
    Member Author

    zooba commented Sep 10, 2019

    Imitation repro:

    import os
    from multiprocessing import Pool

    def f(x):
        os._exit(0)        # simulate the worker process crashing before it can send a result
        return "success"   # never reached

    if __name__ == '__main__':
        with Pool(1) as p:
            print(p.map(f, [1]))   # hangs forever; the pool never recovers

    Obviously a process may crash for various other reasons besides os._exit().
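
    For example, a POSIX-only variant of the repro (a sketch, not from the original report) in which the worker dies from a hard SIGKILL rather than os._exit() hangs the same way:

    import os
    import signal
    from multiprocessing import Pool

    def f(x):
        os.kill(os.getpid(), signal.SIGKILL)   # hard crash: no Python-level cleanup runs
        return "success"

    if __name__ == '__main__':
        with Pool(1) as p:
            print(p.map(f, [1]))   # hangs just like the os._exit() version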

    I believe this is the cause of bpo-37245.

    @zooba added the 3.7 (EOL), 3.8, 3.9, and type-bug labels on Sep 10, 2019
    @vstinner
    Member

    vstinner commented Sep 10, 2019

    > multiprocessing cannot recover from crashed worker

    This issue has been seen on the macOS job of the Azure Pipeline: bpo-37245. I don't know if other platforms are affected.

    @zooba
    Member Author

    zooba commented Sep 10, 2019

    Windows is definitely affected, and you can run the repro in my first post to check other platforms.

    @vstinner
    Member

    vstinner commented Sep 10, 2019

    I converted the example into attached file mp_exit.py and I added a call to faulthandler to see what is going on.
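
    For reference, a plausible reconstruction of mp_exit.py (the attached file itself is not reproduced here; the exact faulthandler call is an assumption):

    import faulthandler
    import os
    from multiprocessing import Pool

    def f(x):
        os._exit(0)   # worker exits abruptly and never sends a result back
        return "success"

    if __name__ == '__main__':
        # Dump the traceback of every thread and exit if we are still running after 5 seconds.
        faulthandler.dump_traceback_later(timeout=5, exit=True)
        with Pool(1) as p:
            print(p.map(f, [1]))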

    Output with the master branch of Python:

    vstinner@apu$ ~/python/master/python ~/mp_exit.py
    Timeout (0:00:05)!
    Thread 0x00007ff40139a700 (most recent call first):
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 379 in _recv
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 414 in _recv_bytes
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 250 in recv
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 576 in _handle_results
    File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
    File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
    File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007ff401b9b700 (most recent call first):
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 528 in _handle_tasks
    File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
    File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
    File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007ff40239c700 (most recent call first):
    File "/home/vstinner/python/master/Lib/selectors.py", line 415 in select
    File "/home/vstinner/python/master/Lib/multiprocessing/connection.py", line 930 in wait
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 499 in _wait_for_updates
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 519 in _handle_workers
    File "/home/vstinner/python/master/Lib/threading.py", line 882 in run
    File "/home/vstinner/python/master/Lib/threading.py", line 944 in _bootstrap_inner
    File "/home/vstinner/python/master/Lib/threading.py", line 902 in _bootstrap

    Thread 0x00007ff4102cf740 (most recent call first):
    File "/home/vstinner/python/master/Lib/threading.py", line 303 in wait
    File "/home/vstinner/python/master/Lib/threading.py", line 565 in wait
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 759 in wait
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 762 in get
    File "/home/vstinner/python/master/Lib/multiprocessing/pool.py", line 364 in map
    File "/home/vstinner/mp_exit.py", line 12 in <module>

    In the main process, the Pool._handle_results() thread is blocked on os.read(), which never completes even though the child process died and the other end of the pipe should therefore be closed.
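
    A plausible explanation (not spelled out in the thread, so treat it as an assumption): the parent process keeps its own copy of the write end of the result pipe, and a pipe read only reaches EOF once every write end is closed, so a dead worker alone is not enough to unblock the reader. A minimal POSIX-only sketch of that pipe semantics:

    import os

    r, w = os.pipe()
    pid = os.fork()                  # POSIX-only illustration
    if pid == 0:                     # "worker": inherits both fds, writes nothing, dies
        os._exit(0)
    os.waitpid(pid, 0)               # the child is definitely gone now...
    # ...yet os.read(r, 1) would still block here, because the parent's own
    # write fd `w` is open and the pipe has not reached EOF. Closing it is
    # what finally unblocks reads:
    os.close(w)
    print(os.read(r, 1))             # b'' -> EOF is returned immediately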

    @vstinner
    Member

    vstinner commented Sep 10, 2019

    > Windows is definitely affected, and you can run the repro in my first post to check other platforms.

    Oh right, I can also reproduce the issue on Linux.

    But I don't understand why test_multiprocessing_spawn works everywhere else and only fails on macOS when run on Azure Pipelines.

    Aaaaah, multiprocessing mysteries...

    @applio
    Member

    applio commented Sep 10, 2019

    Sharing for the sake of documenting a few things going on in this particular example:

    • When a PoolWorker process exits in this way (os._exit(anything)), the PoolWorker never gets the chance to send a signal of failure (normally sent via the outqueue) to the MainProcess.
    • In the current logic of the MainProcess, Pool._maintain_pool() detects the termination of that PoolWorker process and starts a new PoolWorker process to replace it, maintaining the desired size of Pool.
    • The infinite hang observed in this example comes from the original p.map() call waiting, with no timeout, for a result to appear on the outqueue. This wait happens in MapResult.get(), which does expose a timeout parameter, but that parameter cannot be controlled through Pool.map(). It is not at all a correct, general solution, but exposing control over this timeout and setting it to 1.0 seconds lets Steve's repro snippet run to completion: there is no infinite hang, and a multiprocessing.context.TimeoutError is raised instead (see the sketch below).
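
    A minimal sketch of that kind of bounded wait (illustrative only, not a proposed fix): Pool.map_async() already returns a result object whose get() accepts a timeout, so the hang becomes a TimeoutError instead:

    import os
    import multiprocessing
    from multiprocessing import Pool

    def f(x):
        os._exit(0)        # worker dies before it can send any result back
        return "success"

    if __name__ == '__main__':
        with Pool(1) as p:
            result = p.map_async(f, [1])          # returns immediately
            try:
                print(result.get(timeout=1.0))    # bounded wait instead of map()'s unlimited wait
            except multiprocessing.TimeoutError:
                print("worker died; no result arrived within the timeout")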

    @applio
    Member

    applio commented Sep 10, 2019

    Thanks to Pablo's good work implementing the use of multiprocessing's Process.sentinel, the logic for handling PoolWorkers that die has been centralized into Pool._maintain_pool(). If _maintain_pool() can also identify which job died with the dead PoolWorker, then it should be possible to put a corresponding message on the outqueue to indicate that an exception occurred but that the pool can otherwise continue its work.
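
    As an illustration of that mechanism (a sketch of the building blocks, not of Pool's internals): multiprocessing.connection.wait() on a worker's Process.sentinel returns as soon as that worker dies, which is what lets a supervising loop notice the death and react:

    import os
    import time
    from multiprocessing import Process
    from multiprocessing.connection import wait

    def worker():
        time.sleep(0.1)
        os._exit(1)          # die abruptly, without any cleanup or result

    if __name__ == '__main__':
        p = Process(target=worker)
        p.start()
        # wait() returns the sentinels that became ready, i.e. processes that have ended.
        ready = wait([p.sentinel], timeout=5.0)
        if ready:
            p.join()
            print("worker died, exitcode:", p.exitcode)   # exitcode: 1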

    The question of whether Pool.map() should expose a timeout parameter deserves a separate discussion and should not be considered a path forward on this issue, as it would require users to always specify, and somehow know beforehand, how long it should take for results to be returned from workers. Exposing the timeout control may have practical benefits elsewhere, but not here.

    @pppery
    Mannequin

    pppery mannequin commented Sep 10, 2019

    Is this not a duplicate of bpo-22393?

    @applio
    Member

    applio commented Sep 10, 2019

    Agreed with @PPPerry that this is a duplicate of bpo-22393.

    The proposed patch in bpo-22393 is, for the moment, out of sync with more recent changes. That patch's approach would result in the loss of all partial results from a Pool.map, but it may be faster to update and review.

    @vstinner
    Member

    vstinner commented Sep 10, 2019

    > Agreed with @PPPerry that this is a duplicate of bpo-22393.

    Ok, in that case I close this issue as a duplicate of bpo-22393. There is no need to duplicate the discussion here :-)
