multiprocessing cannot recover from crashed worker #82265
Comments
Minimal repro:

    import os
    from multiprocessing import Pool

    def f(x):
        os._exit(0)  # simulate a worker crash
        return "success"  # never reached

    if __name__ == '__main__':
        with Pool(1) as p:
            print(p.map(f, [1]))

Obviously a process may crash for various other reasons besides os._exit(). I believe this is the cause of bpo-37245. |
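To observe the problem without blocking a terminal indefinitely, one option is to go through `Pool.map_async()` and give `AsyncResult.get()` a timeout. The following is only a minimal sketch: the helper name `map_with_deadline` and its return convention are made up for illustration and are not part of multiprocessing.

```python
import multiprocessing


def map_with_deadline(pool, func, items, timeout):
    """Hypothetical helper: run pool.map_async() and wait at most `timeout` seconds.

    Returns (results, None) on success, or (None, message) if no result
    arrived before the deadline.
    """
    async_result = pool.map_async(func, items)
    try:
        return async_result.get(timeout=timeout), None
    except multiprocessing.TimeoutError:
        # get() raises multiprocessing.TimeoutError when the deadline passes.
        return None, f"no result after {timeout} s"
```

Called with a `Pool(1)` and the crashing `f` from the repro, this would be expected to return the error message rather than hang, since the result for the lost task never arrives. It only bounds the wait; it does not tell a crashed worker apart from a slow one.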
This issue has been seen on the macOS job of the Azure Pipeline: bpo-37245. I don't know if other platforms are affected. |
Windows is definitely affected, and you can run the repro in my first post to check other platforms. |
I converted the example into attached file mp_exit.py and I added a call to faulthandler to see what is going on. Output with the master branch of Python:

    vstinner@apu$ ~/python/master/python ~/mp_exit.py
    Thread 0x00007ff401b9b700 (most recent call first):
    Thread 0x00007ff40239c700 (most recent call first):
    Thread 0x00007ff4102cf740 (most recent call first):

In the main process, Pool._handle_results() thread is blocked on os.read() which never completes, even if the child process died and so the other end of the pipe should be closed. |
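One possible explanation, offered here as an assumption rather than something confirmed in this thread: a read on a pipe reports end-of-file only once every copy of the write end has been closed. If any process still holds the write end, the death of one writer is not enough to wake the reader. The sketch below demonstrates only that general pipe behaviour with `os.pipe()` on a POSIX system. It does not use Pool, and the names `is_readable` and `demo` are illustrative.

```python
import os
import select


def is_readable(fd, timeout=0.0):
    """Return True if a read on fd would not block (data available or EOF)."""
    ready, _, _ = select.select([fd], [], [], timeout)
    return bool(ready)


def demo():
    read_end, write_end = os.pipe()
    # A duplicate of the write end stands in for a second holder of the pipe.
    extra_writer = os.dup(write_end)

    os.close(write_end)  # one writer goes away (think: the crashed worker)
    blocked = not is_readable(read_end)  # a read would still block here

    os.close(extra_writer)  # the last writer goes away
    eof_visible = is_readable(read_end)  # now the reader can see EOF
    data = os.read(read_end, 1)  # an empty bytes object means end-of-file
    os.close(read_end)
    return blocked, eof_visible, data


if __name__ == "__main__":
    print(demo())
```

Whether this is what keeps `Pool._handle_results()` blocked would need to be checked against the pool's own queue setup.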
Oh right, I can also reproduce the issue on Linux. But I don't understand why test_multiprocessing_spawn works on all platforms, but only fails on macOS when run on Azure Pipelines. Aaaaah, multiprocessing mysteries... |
Sharing for the sake of documenting a few things going on in this particular example:
|
Thanks to Pablo's good work with implementing the use of multiprocessing's Process.sentinel, the logic for handling PoolWorkers that die has been centralized into Pool._maintain_pool(). If _maintain_pool() can also identify which job died with the dead PoolWorker, then it should be possible to put a corresponding message on the outqueue to indicate that an exception occurred, while the pool otherwise continues its work.

The question of whether Pool.map() should expose a timeout parameter deserves a separate discussion and should not be considered a path forward on this issue, as it would require that users always specify, and somehow know beforehand, how long it should take for results to be returned from workers. Exposing the timeout control may have other practical benefits elsewhere but not here. |
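As a rough illustration of the sentinel mechanism mentioned above (not the actual `Pool._maintain_pool()` code), `multiprocessing.connection.wait()` can be given a list of `Process.sentinel` values and returns the ones that have become ready, meaning those processes have exited. The helper name `find_dead_workers` is hypothetical, and the sketch does not attempt the harder part of the proposal, which is working out which job the dead worker was running.

```python
from multiprocessing.connection import wait


def find_dead_workers(workers, timeout=0.0):
    """Hypothetical helper: return the workers whose sentinel is ready.

    `workers` is any iterable of objects with a `sentinel` attribute, such as
    multiprocessing.Process instances that have been started.
    """
    by_sentinel = {w.sentinel: w for w in workers}
    # wait() returns the subset of sentinels that are ready, i.e. whose
    # process has ended; with timeout=0.0 it only polls.
    ready = wait(list(by_sentinel), timeout=timeout)
    return [by_sentinel[s] for s in ready]
```

With real `Process` objects, the `exitcode` attribute of each returned worker then shows how it ended, which a pool could use to decide what to report for the lost job.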
Is this not a duplicate of bpo-22393? |