This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: [multiprocessing] Calling pool.terminate() from an error_callback causes deadlock
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: sjelin
Priority: normal Keywords:

Created on 2021-01-19 01:08 by sjelin, last changed 2022-04-11 14:59 by admin.

Messages (1)
msg385240 - Author: Sammy Jelin (sjelin) Date: 2021-01-19 01:08
As the title says, calling `pool.terminate()` inside an `error_callback` handler causes a deadlock.  The deadlock always happens, so it's not a race condition or thread-safety issue.

Simple repro:

```
from multiprocessing import Pool
p = Pool()

def error_callback(x):
    print(f'error: {x!r}')
    p.terminate()
    print('this message is never seen, because p.terminate() deadlocks')

p.apply_async(lambda: None, error_callback=error_callback)

# The following lines technically aren't thread-safe,
# but I manually verified that that wasn't the problem.
p.close()
p.join()
print('this is also never seen, because the task handler is stuck in the deadlock')
```

This will print the following line and then hang:
```
error: PicklingError("Can't pickle <function <lambda> at 0x112c55e18>: attribute lookup <lambda> on __main__ failed")
```

The hang occurs inside `_help_stuff_finish`, at the call to `inqueue._rlock.acquire()`.  As far as I can tell, `_handle_tasks` is already holding the lock when it calls the `error_callback`, so a deadlock occurs when `_terminate_pool` calls `_help_stuff_finish`.
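
If that reading is right, the hang belongs to a familiar class of bug: a thread blocking on a non-reentrant lock that cannot be released until the same thread makes progress. The sketch below shows that pattern in isolation. It uses a plain `threading.Lock` as a stand-in, not the pool's internal `_rlock`, and `try_reacquire` is a made-up name for illustration. It does not touch `multiprocessing` internals and does not confirm the diagnosis above.

```python
import threading


def try_reacquire(timeout=0.2):
    """Try to take a non-reentrant lock twice from the same thread.

    Returns whether the second acquire succeeded. A timeout is used so the
    demo gives up instead of hanging forever.
    """
    lock = threading.Lock()  # stand-in for the pool's internal lock (assumption)
    lock.acquire()
    try:
        # Second acquire from the thread that already holds the lock.
        # Without the timeout this call would block indefinitely.
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return got_it
    finally:
        lock.release()


if __name__ == '__main__':
    print(f'second acquire succeeded: {try_reacquire()}')
```

A `threading.RLock` would allow the second acquire, but whether a re-entrant lock would be appropriate inside `Pool` is a separate question that this sketch does not address.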

Calling `p.terminate()` from a success callback doesn't appear to be an issue, likely because success callbacks are called via `_handle_results` instead of via `_handle_tasks`.

If calling `p.terminate()` from an `error_callback` isn't supported, the documentation should say so in one of those big red warning boxes.
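
In the meantime, one possible workaround is to keep the callback trivial and call `terminate()` from the thread that owns the pool. This is a minimal sketch, assuming it is acceptable for the main thread to wait for the failure signal. `FailureFlag` and `main` are hypothetical names introduced here, the 10-second timeout is arbitrary, and the approach has not been verified against every affected Python version.

```python
import threading
from multiprocessing import Pool


class FailureFlag:
    """Hypothetical helper: records the first error and signals an event."""

    def __init__(self):
        self.event = threading.Event()
        self.error = None

    def __call__(self, exc):
        # Runs on one of the pool's internal threads: only record and
        # signal here, never call pool.terminate() from this callback.
        if self.error is None:
            self.error = exc
        self.event.set()


def main():
    flag = FailureFlag()
    pool = Pool(2)
    # Same failing submission as the repro: the lambda cannot be pickled.
    pool.apply_async(lambda: None, error_callback=flag)
    # Wait in the main thread, then terminate from here instead.
    flag.event.wait(timeout=10)
    pool.terminate()
    pool.join()
    print(f'terminated after error: {flag.error!r}')


if __name__ == '__main__':
    main()
```

The callback does no pool management at all, so it returns immediately and the internal thread that invoked it is free to continue while the main thread shuts the pool down.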

I verified that this bug happens on 3.6 and 3.7, and skimmed the code in the GitHub project to verify that it is likely still an issue.

History
Date                 User    Action  Args
2022-04-11 14:59:40  admin   set     github: 87129
2021-01-19 01:10:42  sjelin  set     type: behavior
2021-01-19 01:08:51  sjelin  create