classification
Title: Interpreter exit blocks waiting for futures of shut-down ThreadPoolExecutors
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: aeros, asvetlov, bquinlan, hniksic, pitrou, torsten
Priority: normal Keywords: patch

Created on 2019-05-03 09:28 by hniksic, last changed 2020-04-15 02:39 by aeros.

Files
File name Uploaded Description Edit
pool-shutdown hniksic, 2019-05-03 09:28
Pull Requests
URL Status Linked Edit
PR 13250 open hniksic, 2019-05-11 17:41
Messages (19)
msg341329 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2019-05-03 09:28
At interpreter shutdown, Python waits for all pending futures of all executors to finish. There seems to be no way to disable the wait for pools that have been explicitly shut down with pool.shutdown(wait=False). The attached script demonstrates the issue.

In our code the futures are running blocking network calls that can be canceled by the user. The cancel action automatically cancels the pending futures and informs the running ones that they should exit. The remaining futures are those whose callables are "stuck" in network calls with long or infinite timeouts, such as reading from a non-responding network filesystem. Since those can't be interrupted, we use pool.shutdown(wait=False) to disown the whole pool. This works nicely, until the application exit, at which point it blocks trying to wait for the pending futures to finish. This can take an arbitrary amount of time, possibly never finishing.

A similar question has come up on StackOverflow, with the only answer recommending to unregister concurrent.futures.thread._python_exit: https://stackoverflow.com/q/48350257/1600898 . We are ourselves using a similar hack, but we would like to contribute a proper way to disown an executor.

The current behavior is explicitly documented, so presumably it can't be (easily) changed, but we could add a "disown" keyword argument to shutdown(), or a new disown() method, which would serve to explicitly disable the waiting. If this is considered desirable, I will create a pull request.
msg341515 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2019-05-06 14:48
>> The current behavior is explicitly documented, so presumably
>> it can't be (easily) changed

And it isn't clear that it would be desirable to change this even if it were possible - doing structured resource clean-up seems consistent with the rest of Python.

That being said, I can see the user case for this.

If you added this as an argument to shutdown() then you'd probably also have to add it as an option to the constructors (for people using Executors as context managers). But, if you have to add it to the constructor anyway, you may as well *only* add it to the constructor.

I'll let an active maintainer weigh in on whether, on balance, they think that this functionality is worth complicating the API.
msg341573 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2019-05-06 17:38
OK, I completely disagree with my statement:

"""If you added this as an argument to shutdown() then you'd probably also have to add it as an option to the constructors (for people using Executors as context managers). But, if you have to add it to the constructor anyway, you may as well *only* add it to the constructor."""

This functionality wouldn't apply to context manager use (since the point of the context manager is to ensure resource clean-up). So maybe another argument to shutdown() would make sense
msg341630 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2019-05-06 19:53
I agree with the last comment. The disowning functionality is only used in specific circumstances, so it's perfectly fine to keep the functionality as a shutdown flag. I also agree that the change cannot be *unconditional*, for backward compatibility if not for other reasons.

The StackOverflow question, and more importantly the existence of shutdown(wait=False), suggest that there are legitimate cases when one doesn't want to wait for all running futures to finish. If a flag to shutdown() is considered to complicate the API, then perhaps we could add an opt-out by subclassing the executor and overriding a semi-private method.


Currently there seems to be no way to just abandon the thread pool. Since user threads don't and never will support cancellation, the options are:

1. add the disown option to shutdown, as suggested
2. monkey-patch concurrent.futures to not block at shutdown
3. make functions executed by the pool externally cancellable
4. roll our own thread pool

#1 is suggested by this issue, and #2 is what we do now, but it's obviously unacceptable in the long term.

#3 is infeasible because the functions we submit to the pool heavily rely on http (through "requests") and database communication, which don't support user-driven cancellation.

#4 would be technically possible, but it would require reimplementing half of concurrent.futures (badly), just for the purpose of being able to get rid of the mandatory wait at interpreter exit.
msg341734 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2019-05-07 14:29
Hey Hrvoje,

I agree that #1 is the correct approach. `disown` might not be the best name - maybe `allow_shutdown` or something. But we can bike shed about that later.

Are you interested in writing a patch?
msg341737 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2019-05-07 14:41
> Are you interested in writing a patch?

Yes - I wanted to check if there is interest in the feature before I commit time to write the patch, documentation, tests, etc. (And also I wanted to check if there's a better way to do it.)

In any case, thanks for picking up on this.

> `disown` might not be the best name - maybe `allow_shutdown` or something.

Disown is an admittedly obscure reference to the shell built-in of the same name (https://tinyurl.com/qfn8ao7). The idea is to get the point across that the pool is truly abandoned, and its running futures are left to their own devices. Maybe wait_at_exit would be a clearer name:

    pool.shutdown(wait=False, wait_at_exit=True)
msg341992 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2019-05-09 18:23
We can bike shed over the name in the PR ;-)
msg342259 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-05-12 14:40
Should wait_on_exit be True only on application exit?
Do application exit means interpreter shutdown (sys.is_finalizing() == True)?
msg342260 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2019-05-12 14:52
@asvetlov The idea of the new flag is to disable any subsequent waiting for futures after ThreadPoolExecutor.shutdown(wait=False) returns.

Currently the additional waiting is implemented using "atexit", so I assumed it referred to process exit. (The documentation at https://docs.python.org/3/library/atexit.html doesn't seem to specify precisely when the callbacks are executed.) But looking at the implementation of atexit, it seems that the atexit callbacks are actually called from Py_FinalizeEx.

I think wait_at_exit is descriptive because in most cases the process exit and interpreter shutdown will correlate, but I can still update the docs to make it clearer what "exit" refers to. We should just avoid the word "shutdown" in the flag name to avoid confusion with executor shutdown.
msg342261 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-05-12 15:05
As documentation reader I have a mental problem to figure out when use non-default `wait_at_exit=False` flag and when don't.

The rule like: "if you tried to call `executor.shutdown(wait=False)` and it still hangs try `executor.shutdown(wait=False, wait_at_exit=False)`, maybe it can help" doesn't sound too robust.
msg366372 - (view) Author: Torsten Landschoff (torsten) * Date: 2020-04-14 07:51
I intended to file a bug because of the shutdown problem but found out there is this ticket already.

In our application we sometimes run into this problem that you can't exit because of a hanging TCP connection used inside a ThreadPoolExecutor task. 

executor.shutdown(wait=False) does not help - the process still hangs.

Another problem was inside our tests were a bug caused a hanging task inside a thread pool executor and the test runner (pytest) would not terminate because of this.

I actually went so far to drop the thread reference inside the concurrent.futures.thread module. :-(

A documented way to drop a hanging executor thread would be very much appreciated.
msg366415 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-14 18:02
I don't think there's much ThreadPoolExecutor can do.  If you drop the references to the threads, they still exist and they still be waited upon at interpreter exit.

The solution is for you to avoid having hanging threads.  In the particular case of TCP connections, I'd recommend using a dedicated framework such as asyncio (or Twisted, Tornado, etc.) instead of home-baked networking code.

Also, note that Python sockets have a feature called *timeouts*.
msg366419 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2020-04-14 18:15
> I don't think there's much ThreadPoolExecutor can do.  If you drop the references to the threads, they still exist and they still be waited upon at interpreter exit.

ThreadPoolExecutor introduces additional waiting of its own, and it is this wait the PR adds an option to disable.

> The solution is for you to avoid having hanging threads.

In some cases that is not possible. For example, the file IO on network file systems can take an arbitrary amount of time, and there is no way to implement a timeout. DNS lookups have been notorious for not supporting timeouts. Many third-party libraries, such as database drivers, still don't support timeouts.

This is a real issue that bit many users in production systems, it's not about a toy program using "home-baked networking code".
msg366421 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-14 18:19
I think there's a misunderstanding: "wait_at_exit" will make the *executor* forget about the threads, but Python itself still knows about them, and it waits for them to end at interpreter shutdown.

These threads were daemon threads in 3.8, so your patch indeed works there, but we've made them non-daemon in 3.9, for two reasons:
1) daemon threads are fragile and can crash the interpreter at shutdown
2) they are not supported on subinterpreters
msg366424 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2020-04-14 18:30
Thanks for the clarification, I didn't know about the change to non-daemon threads.

I still think this patch is useful, and won't harm general use because it is opt-in, just like daemon threads themselves. I suggest to update the PR to specify non-waiting pool at the point of creation rather than in shutdown(). (That will also make it more acceptable for the same option not being supported for process pools - it is ok for constructor signatures to differ.)

If there is interest, someone should take over the PR, as I have changed jobs and no longer have time to actively pursue this issue.
msg366431 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-14 19:01
> I don't think there's much ThreadPoolExecutor can do.  If you drop the references to the threads, they still exist and they still be waited upon at interpreter exit.
> 
> The solution is for you to avoid having hanging threads.  In the particular case of TCP connections, I'd recommend using a dedicated framework such as asyncio (or Twisted, Tornado, etc.) instead of home-baked networking code.

As far as I'm aware, I don't think there's a _safe_ way to force a hanging thread to immediately exit without causing resource-related problems (an unsafe way would be something like releasing the internal thread state lock). But in general, I think that hanging threads indicate a fundamental issue in the implementation of the function the thread is targeting. For any operation that can potentially stall or run indefinitely, there should be some form of fail safe or timeout in place to break out of it. So I agree with Antoine that it's ultimately the responsibility of the thread itself to not hang. There's not much of anything that ThreadPoolExecutor can safely do to resolve a hanging thread.

> Also, note that Python sockets have a feature called *timeouts*.

There's also of course timeouts implemented at a higher level in many networking frameworks, if the developer doesn't want to do so at the socket level. For example, see asyncio.wait_for(): https://docs.python.org/3/library/asyncio-task.html#asyncio.wait_for.

> These threads were daemon threads in 3.8, so your patch indeed works there, but we've made them non-daemon in 3.9, for two reasons:
1) daemon threads are fragile and can crash the interpreter at shutdown
2) they are not supported on subinterpreters

Just to briefly clarify on (2), Victor recently opened a PR to revert daemon threads being entirely unsupported in subinterpreters: PR-19456. From reading over the attached bpo issue, it looks like the plan is to deprecate daemon threads in subinterpreters, but to not immediately drop support because users of mod_wsgi and various monitoring tools were reliant upon it (through the C-API).

It seems like the current plan is for it to undergo a deprecation process. Either way though, I still think the change to make executors no longer reliant upon daemon threads was a significant stability improvement and will mean we don't have to make the change later when they are fully unsupported in subinterpreters.
msg366437 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-14 19:51
> ThreadPoolExecutor introduces additional waiting of its own, and it is this wait the PR adds an option to disable.

[next post]

> Thanks for the clarification, I didn't know about the change to non-daemon threads.

> I still think this patch is useful, and won't harm general use because it is opt-in, just like daemon threads themselves. I suggest to update the PR to specify non-waiting pool at the point of creation rather than in shutdown(). (That will also make it more acceptable for the same option not being supported for process pools - it is ok for constructor signatures to differ.)

Yes, but if you simply make the ThreadPoolExecutor forget about the thread, then all that it does is divert the wait to occur at a later time. In this case, it would get stuck at `threading._shutdown()` (where all non-daemon threads are joined) instead of `executor.shutdown()` since they no longer use daemon threads.

The only way I could see this to work as intended without making any changes to threading would be to optionally use daemon threads and avoid joining the threads in `executor.shutdown()` if `wait_at_exit` is set to False in the constructor for the executor. But at the least, users would have to be made aware in the documentation that this could lead to significant resource finalization issues, and potential incompatibility with subinterpreters. 

Otherwise, they may end up accidentally shooting themselves in the foot (which we certainly want to avoid). I'm also somewhat concerned that it may end up getting used to cover up actual bugs. 

IMO, it also would require some fairly extensive testing to make sure it works as intended, which the patch currently lacks. So in order to implement this properly, it would require a significant amount of additional time investment beyond what has been put into the existing patch.
msg366439 - (view) Author: Hrvoje Nikšić (hniksic) * Date: 2020-04-14 20:01
> The only way I could see this to work as intended without making any changes to threading would be to optionally use daemon threads and avoid joining the threads in `executor.shutdown()` if `wait_at_exit` is set to False in the constructor for the executor.

Agreed, and that is precisely what I suggested in my previous comment. The attached PR already deals with the executor-specific part of the wait, but it would be straightforward to have it also affect the executor to create daemon threads, and the flag moved to the constructor.

> IMO, it also would require some fairly extensive testing to make sure it works as intended, which the patch currently lacks.

It is quite easy to check that a hanging thread (emulated by a simple sleep) is not joined by the executor with the appropriate flag set.
msg366479 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2020-04-15 02:39
> Agreed, and that is precisely what I suggested in my previous comment. The attached PR already deals with the executor-specific part of the wait, but it would be straightforward to have it also affect the executor to create daemon threads, and the flag moved to the constructor.

Ah, I see. That should be fine then, but I would like to wait on additional feedback to make sure that re-adding daemon threads (even as a opt-in flag) is something that we want to do. I think you do have a good point regarding the practical utility in situations where it may not be feasible to programmatically prevent a thread from hanging indefinitely, but the main questions are: 

Is it something that ought to be provided by ThreadPoolExecutor? 
If so, is this the best way to implement it?

> It is quite easy to check that a hanging thread (emulated by a simple sleep) is not joined by the executor with the appropriate flag set.

That would certainly be part of it, but we also have to verify a few other things (this is off the top of my head, so I'm likely missing some points):

1) All of the threads are created as daemon threads, instead of regular threads when the flag is set (easy)
2) The hanging thread itself was able to finalize properly (without any leaked references or resource issues)
3) The rest of the executor resources were able to finalize properly, despite the hanging thread
4) The interpreter was able to finalize properly, despite the hanging thread

It's not so much writing the tests that will be especially involved, it's more so the time investment required to get them all to pass. Which could potentially be on the first try, or it could take many different implementations until it works across supported platforms. In my experience so far from working with the executors, it's more likely to be the latter.
History
Date User Action Args
2020-04-15 02:39:49aerossetmessages: + msg366479
2020-04-14 20:01:55hniksicsetmessages: + msg366439
2020-04-14 19:52:00aerossetmessages: + msg366437
2020-04-14 19:01:29aerossetmessages: + msg366431
2020-04-14 18:30:12hniksicsetmessages: + msg366424
2020-04-14 18:19:20pitrousetmessages: + msg366421
2020-04-14 18:15:17hniksicsetmessages: + msg366419
2020-04-14 18:02:43pitrousetmessages: + msg366415
2020-04-14 17:59:22pitrousetnosy: + aeros
2020-04-14 07:51:12torstensetnosy: + torsten
messages: + msg366372
2019-05-12 15:05:12asvetlovsetmessages: + msg342261
2019-05-12 14:52:59hniksicsetmessages: + msg342260
2019-05-12 14:40:20asvetlovsetnosy: + asvetlov
messages: + msg342259
2019-05-11 17:41:14hniksicsetkeywords: + patch
stage: patch review
pull_requests: + pull_request13161
2019-05-09 18:23:01bquinlansetmessages: + msg341992
2019-05-07 14:41:40hniksicsetmessages: + msg341737
2019-05-07 14:29:00bquinlansetmessages: + msg341734
2019-05-06 19:53:14hniksicsetmessages: + msg341630
2019-05-06 17:38:19bquinlansetmessages: + msg341573
2019-05-06 14:48:37bquinlansetmessages: + msg341515
2019-05-03 09:51:44SilentGhostsetnosy: + bquinlan, pitrou

type: behavior
versions: - Python 3.9
2019-05-03 09:28:03hniksiccreate