classification
Title: reproducible deadlock with multiprocessing.Pool
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jonathan.Gossage, Windson Yang, davin, dzhu, nconway, pablogsal, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2018-11-17 01:06 by dzhu, last changed 2019-09-11 13:12 by davin.

Files
File name Uploaded Description Edit
lock.py dzhu, 2018-11-17 01:06 reproducer script
lock1.py Jonathan.Gossage, 2018-12-07 01:44
lock1.result.txt Jonathan.Gossage, 2018-12-07 01:45
Pull Requests
URL Status Linked Edit
PR 11143 closed Windson Yang, 2018-12-13 13:53
Messages (8)
msg330017 - (view) Author: (dzhu) Date: 2018-11-17 01:06
The attached snippet causes a deadlock just about every time it's run (tested with 3.6.7/Ubuntu, 3.7.1/Arch, 3.6.7/OSX, and 3.7.1/OSX -- deadlock seems to be less frequent on the last, but still common). The issue appears to be something like the following sequence of events:

1. The main thread calls pool.__exit__, eventually entering Pool._terminate_pool.
2. result_handler's state is set to TERMINATE, causing it to stop reading from outqueue.
3. The main thread, in _terminate_pool, joins on worker_handler, which is (usually) in the middle of sleeping for 0.1 seconds, opening a window for the next two steps to occur.
4. The worker process finishes its task and acquires the shared outqueue._wlock.
5. The worker attempts to put the result into outqueue, but its pickled form is too big to fit into the buffer of os.pipe, and it blocks here with the lock held.
6. worker_handler wakes up and exits, freeing _terminate_pool to continue.
7. _terminate_pool terminates the worker.
8. task_handler tries to put None into outqueue, but blocks, since the lock was acquired by the terminated worker.
9. _terminate_pool joins on task_handler, and everything is deadlocked.
msg330018 - (view) Author: Windson Yang (Windson Yang) * Date: 2018-11-17 03:44
Hello, dzhu. I can reproduce this on my OSX machine. Since you have already dug into the code, do you have any idea how to fix or improve it?
msg330202 - (view) Author: Windson Yang (Windson Yang) * Date: 2018-11-21 13:24
I will work on it next week if no one else wants to create a PR for this.
msg330968 - (view) Author: (dzhu) Date: 2018-12-03 18:50
Given the hairiness of the deadlock, I think I would rather let someone who has more experience with the codebase in general handle it, but I can come back to it if it doesn't get addressed.
msg331260 - (view) Author: Jonathan Gossage (Jonathan.Gossage) * Date: 2018-12-07 01:44
This is a great example of abusing the multiprocessing API and thus
creating timing errors that leave locks unreleased. What is happening is
that the example attempts to transmit data that is too big for the
underlying pipe: the pipe that returns the result of apply_async cannot
transmit the worker's result in a single operation, so the worker is
terminated prematurely while still holding the pipe lock. This can be seen
by testing the ready status of the result returned by apply_async, which
shows whether the complete result of the worker process has been received.
  
The solution to this situation is simple: invoke get() on the
asynchronous result returned by apply_async. This call correctly
synchronizes the pipe used by the low-level queues of the Pool, so no
lock is left held when it shouldn't be. The script lock1.py implements
this fix and verifies that everything works properly; its output is in
the file lock1.result.txt.

Because there is an API-based solution, and because the behavior of
apply_async makes sense (transferring multi-buffer data is very CPU
intensive and should be delegated to the worker process rather than the
main-line process), I do not recommend making any changes to the
multiprocessing code in Python; rather, the solution should use the
available multiprocessing API correctly.
msg331401 - (view) Author: Windson Yang (Windson Yang) * Date: 2018-12-09 00:56
As Jonathan Gossage said, fixing this issue might break some existing code; maybe we could just add a warning to the documentation?
msg331403 - (view) Author: Jonathan Gossage (Jonathan.Gossage) * Date: 2018-12-09 01:37
I think documentation is sufficient, but I would like it to state the pitfalls that arise when apply_async is not synchronized correctly, which happens whenever the output does not fit in the pipe buffer.
msg351856 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2019-09-11 13:12
I second what @vstinner already said in the comments for PR11143, that this should not merely be documented.
History
Date User Action Args
2019-09-11 13:12:25  davin  set  nosy: + davin
messages: + msg351856
2018-12-13 13:53:04  Windson Yang  set  keywords: + patch
stage: patch review
pull_requests: + pull_request10374
2018-12-12 16:29:27  vstinner  set  nosy: + pablogsal
2018-12-12 00:25:13  vstinner  set  nosy: + vstinner
2018-12-09 01:37:47  Jonathan.Gossage  set  messages: + msg331403
2018-12-09 00:56:54  Windson Yang  set  messages: + msg331401
2018-12-07 01:45:58  Jonathan.Gossage  set  files: + lock1.result.txt
2018-12-07 01:44:10  Jonathan.Gossage  set  files: + lock1.py
nosy: + Jonathan.Gossage
messages: + msg331260

2018-12-03 18:50:54  dzhu  set  messages: + msg330968
2018-11-21 13:24:21  Windson Yang  set  messages: + msg330202
2018-11-19 17:45:30  nconway  set  nosy: + nconway
2018-11-17 17:13:58  ned.deily  set  nosy: + pitrou
2018-11-17 03:44:41  Windson Yang  set  nosy: + Windson Yang
messages: + msg330018
2018-11-17 01:06:39  dzhu  create