classification
Title: asyncio: BaseEventLoop.close() shutdowns the executor without waiting causing leak of dangling threads
Type: Stage: resolved
Components: asyncio, Tests Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: aeros, asvetlov, cjrh, lukasz.langa, vstinner, yselivanov
Priority: high Keywords: patch

Created on 2018-07-03 22:57 by vstinner, last changed 2019-09-20 02:08 by aeros. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 13802 merged lukasz.langa, 2019-06-04 10:43
PR 15735 merged aeros, 2019-09-09 04:26
PR 16284 merged vstinner, 2019-09-19 14:12
Messages (25)
msg321006 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-03 22:57
AMD64 FreeBSD 10.x Shared 3.7:
http://buildbot.python.org/all/#/builders/124/builds/410

...
test_remove_fds_after_closing (test.test_asyncio.test_events.KqueueEventLoopTests) ... ok
test_run_in_executor (test.test_asyncio.test_events.KqueueEventLoopTests) ... ok
test_run_in_executor_cancel (test.test_asyncio.test_events.KqueueEventLoopTests) ...

  Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
  Dangling thread: <Thread(ThreadPoolExecutor-13_0, started daemon 34471091200)>
  Dangling thread: <_MainThread(MainThread, started 34393318400)>

ok
test_run_until_complete (test.test_asyncio.test_events.KqueueEventLoopTests) ... ok
test_run_until_complete_nesting (test.test_asyncio.test_events.KqueueEventLoopTests) ... ok
...
1 test altered the execution environment:
    test_asyncio
msg321534 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-12 10:20
It's easy to reproduce the issue on Linux:

diff --git a/Lib/test/test_asyncio/test_events.py b/Lib/test/test_asyncio/test_events.py
index 11cd950df1..df4c2b9849 100644
--- a/Lib/test/test_asyncio/test_events.py
+++ b/Lib/test/test_asyncio/test_events.py
@@ -359,7 +359,7 @@ class EventLoopTestsMixin:
             called = True
 
         def run():
-            time.sleep(0.05)
+            time.sleep(1.0)
 
         f2 = self.loop.run_in_executor(None, run)
         f2.cancel()


The problem is that BaseEventLoop.close() shutdowns its default executor  without waiting:

    def close(self):
        ...
        executor = self._default_executor
        if executor is not None:
            self._default_executor = None
            executor.shutdown(wait=True)

I fixed a similar issue in socketserver:

* bpo-31233: for socketserver.ThreadingMixIn
* bpo-31151: for socketserver.ForkingMixIn 
* bpo-33540: add block_on_close attr to socketserver

I suggest to wait by default, but maybe also add a block_on_close attribute to BaseEventLoop (default: False) just for backward compatibility.

What do you think Yury, Andrew, and Guido?
msg326050 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-21 22:19
Yury, Andrew: Do you know if the executor doesn't wait on purpose? Would it be possible to change that in Python 3.8?
msg326052 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2018-09-21 22:29
> Yury, Andrew: Do you know if the executor doesn't wait on purpose? Would it be possible to change that in Python 3.8?

Maybe.  At least we need to add a "timeout" argument to asyncio.run() to let it wait for executor jobs.

I'm also thinking about making OS threads cancellable/interruptable in Python 3.8 (if they run pure Python code).
msg326259 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-24 15:58
> Maybe.  At least we need to add a "timeout" argument to asyncio.run() to let it wait for executor jobs.

The shutdown() method of concurrent.futures.Executor API doesn't accept a timeout. It waits for multiple things.

I added "block_on_close = True" class attribute to socketserver.ForkingMixIn and socketserver.ThreadingMixIn. By default, server_close() waits until all children complete, but the wait is non-blocking if block_on_close is false.
msg344475 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-03 21:00
I wrote PR 13786 to fix this issue.
msg344476 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-03 21:09
"executor.shutdown(wait=False)" was added by this change:

commit 4ca7355901049ff9a873cf236091e6780d7a8126
Author: Antoine Pitrou <solipsis@pitrou.net>
Date:   Sun Oct 20 00:54:10 2013 +0200

    Issue #19299: fix refleak test failures in test_asyncio

Guido van Rossum proposed to wait for the executor. Guido proposed " executor.shutdown(wait=False)", but he didn't comment why wait=False:

https://bugs.python.org/issue19299#msg200508
msg344477 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-03 21:31
New changeset 0f0a30f4da4b529e0f7df857b9f575b231b32758 by Victor Stinner in branch 'master':
bpo-34037, asyncio: add BaseEventLoop.wait_executor_on_close (GH-13786)
https://github.com/python/cpython/commit/0f0a30f4da4b529e0f7df857b9f575b231b32758
msg344545 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-06-04 11:03
New changeset 7f9a2ae78051877f4d966119e2fcd27ec77eda1d by Łukasz Langa in branch 'master':
Revert "bpo-34037, asyncio: add BaseEventLoop.wait_executor_on_close (GH-13786)" (#13802)
https://github.com/python/cpython/commit/7f9a2ae78051877f4d966119e2fcd27ec77eda1d
msg344547 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-04 11:50
Łukasz Langa reverted my change (PR 13786). I pushed my change without any review to try to fix https://bugs.python.org/issue37137 which blocked the Python 3.8 beta1. Sadly, my change didn't fix this issue. The root cause was differently (and it's now fixed).

Andrew Svetlov asked to make some changes on my API. Yury Selivanov added "I don't like this last minute api decision. Please don't merge this.". See PR 13786 for the discussion. In short, my change was reverted because of the API.

But I still consider that calling executor.shutdown(wait=False) in loop.close() in wrong: we must wait until it finishes. Otherwise, the code "leaks" threads which continue to run and so can cause troubles. I would prefer to wait by default.

Andrew, Yury: Would you be interested to rewrite my change in the "asyncio way"?

IMHO this change can wait Python 3.9, it's not a critical bug.
msg344548 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-06-04 11:58
I still think that this is a valuable change but we need to discuss public API.

In fact, in my job we used to override default executor and wait for its shutdown with explicit waiting to make the system robust.

The problem is that loop.close() can "hang" for minutes if there are pending background jobs or slow DNS resolution requests.

We controlled our tasks running in the thread pool carefully and used aiodns for DNS resolving.
msg344552 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-04 12:08
> The problem is that loop.close() can "hang" for minutes if there are pending background jobs or slow DNS resolution requests.

Maybe concurrent.futures should be enhanced for that?

Or from asyncio, is it possible to poll the executor and emit a warning if some threads take longer than <threshold in esconds>? It would be great if we could display repr() of the slow job in that case.

Yury also asked for a timeout, but it's not supported by concurrent.futures yet, and I'm not sure what should be done in case of timeout? Retry? Log a warning?
msg344686 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-06-05 08:58
I prefer postponing to think about it until all required 3.8 tasks are done :)
msg351231 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-06 07:04
This bug still occurs randomly on the CI. Example on AppVeyor:
https://ci.appveyor.com/project/python/cpython/builds/27209898

test_reader_callback (test.test_asyncio.test_events.SelectEventLoopTests) ... ok
test_remove_fds_after_closing (test.test_asyncio.test_events.SelectEventLoopTests) ... ok
test_run_in_executor (test.test_asyncio.test_events.SelectEventLoopTests) ... ok
test_run_in_executor_cancel (test.test_asyncio.test_events.SelectEventLoopTests) ... Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Dangling thread: <Thread(ThreadPoolExecutor-19_0, started daemon 2728)>
Dangling thread: <_MainThread(MainThread, started 1916)>
ok

Any plan to reapply my change, or fix it?
msg351232 - (view) Author: Kyle Stanley (aeros) * (Python triager) Date: 2019-09-06 07:19
> Any plan to reapply my change, or fix it?

I can try to help with this. I'm not the most familiar with the internals of asyncio, but I think it would provide a good learning experience.
msg351349 - (view) Author: Kyle Stanley (aeros) * (Python triager) Date: 2019-09-09 05:50
I've opened PR-15735 which applies the same functionality as Victor's PR-13786, but adds the public getter and setter methods (for both AbstractEventLoop and BaseEventLoop) as requested by Andrew. 

Since this is still causing intermittent CI failures from test_asyncio, I think it's important to implement these changes in some form in the near future.
msg351359 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-09 08:34
I change the version to Python 3.9: the 3.8 branch don't accept new features anymore.
msg351384 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-09-09 09:43
Here's the API I propose to solve this problem: https://github.com/python/cpython/pull/15735#pullrequestreview-285389412

Summary:

* Add a new loop.shutdown_threadpool() method. Just like with shutdown_asyncgens() -- it would be invalid to call loop.run_in_executer() after loop.shutdown_threadpool() is called.

* Make asyncio.run() to call the new loop.shutdown_threadpool().
msg351394 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-09 09:56
"shutdown_threadpool()" name

What do you think of the "shutdown_default_executor()" name?

The default executor can be overriden by set_default_executor():

    def set_default_executor(self, executor):
        if not isinstance(executor, concurrent.futures.ThreadPoolExecutor):
            warnings.warn(
                'Using the default executor that is not an instance of '
                'ThreadPoolExecutor is deprecated and will be prohibited '
                'in Python 3.9',
                DeprecationWarning, 2)
        self._default_executor = executor

The default executor should always be a thread pool, so "shutdown_threadpool()" name is also correct. I have no strong preference.
msg351396 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-09-09 09:57
> What do you think of the "shutdown_default_executor()" name?

Yeah, I think it's a better name!
msg352788 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-09-19 12:47
New changeset 9fdc64cf1266b6d5bf0503847b5c38e5edc53a14 by Andrew Svetlov (Kyle Stanley) in branch 'master':
bpo-34037: Fix test_asyncio failure and add loop.shutdown_default_executor() (GH-15735)
https://github.com/python/cpython/commit/9fdc64cf1266b6d5bf0503847b5c38e5edc53a14
msg352789 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-09-19 12:48
Thnks, Kyle!
msg352797 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-19 14:45
New changeset 079931d12223ec98cbf53185b90db48efa61f93f by Victor Stinner in branch 'master':
bpo-34037: test_asyncio uses shutdown_default_executor() (GH-16284)
https://github.com/python/cpython/commit/079931d12223ec98cbf53185b90db48efa61f93f
msg352798 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-19 14:47
asyncio.run() now shutdowns the default executor: nice, thanks! That was my request. IMHO it will make asyncio more reliable, especially for tests on the CI.

If it becomes an issue in Python 3.9 (executor hangs forever), we can add new parameters (to run()?) to put a timeout for example.
msg352822 - (view) Author: Kyle Stanley (aeros) * (Python triager) Date: 2019-09-20 02:08
> Thanks, Kyle!

No problem, and thanks for all of the help from Andrew, Yury, and Victor!

> IMHO it will make asyncio more reliable, especially for tests on the CI.

Awesome, that was my primary intention. (:

> If it becomes an issue in Python 3.9 (executor hangs forever),

Feel free to add me to the nosy list if you have to open an issue for it, I'd be glad to help out with this if it becomes an issue.
History
Date User Action Args
2019-09-20 02:08:48aerossetmessages: + msg352822
2019-09-19 14:47:52vstinnersetmessages: + msg352798
2019-09-19 14:45:09vstinnersetmessages: + msg352797
2019-09-19 14:12:39vstinnersetpull_requests: + pull_request15869
2019-09-19 12:48:29asvetlovsetstatus: open -> closed
resolution: fixed
messages: + msg352789

stage: patch review -> resolved
2019-09-19 12:47:29asvetlovsetmessages: + msg352788
2019-09-11 13:56:06cjrhsetnosy: + cjrh
2019-09-09 09:57:20yselivanovsetmessages: + msg351396
2019-09-09 09:56:33vstinnersetmessages: + msg351394
2019-09-09 09:43:07yselivanovsetmessages: + msg351384
2019-09-09 08:34:39vstinnersetmessages: + msg351359
versions: + Python 3.9, - Python 3.8
2019-09-09 05:50:36aerossetpriority: normal -> high

messages: + msg351349
2019-09-09 04:26:15aerossetpull_requests: + pull_request15390
2019-09-06 07:19:54aerossetnosy: + aeros
messages: + msg351232
2019-09-06 07:04:51vstinnersetmessages: + msg351231
2019-06-05 08:58:01asvetlovsetmessages: + msg344686
2019-06-04 12:08:53vstinnersetmessages: + msg344552
2019-06-04 11:58:42asvetlovsetmessages: + msg344548
2019-06-04 11:50:37vstinnersetmessages: + msg344547
2019-06-04 11:03:36lukasz.langasetnosy: + lukasz.langa
messages: + msg344545
2019-06-04 10:43:42lukasz.langasetkeywords: + patch
stage: patch review
pull_requests: + pull_request13688
2019-06-03 21:31:27vstinnersetmessages: + msg344477
2019-06-03 21:09:09vstinnersetmessages: + msg344476
2019-06-03 21:00:25vstinnersetmessages: + msg344475
versions: + Python 3.8, - Python 3.7
2018-09-24 15:58:45vstinnersetmessages: + msg326259
2018-09-21 22:29:20yselivanovsetmessages: + msg326052
2018-09-21 22:19:51vstinnersetmessages: + msg326050
2018-07-13 21:46:20gvanrossumsetnosy: - gvanrossum
2018-07-12 10:21:26vstinnersettitle: test_asyncio: test_run_in_executor_cancel() leaked a dangling thread on AMD64 FreeBSD 10.x Shared 3.7 -> asyncio: BaseEventLoop.close() shutdowns the executor without waiting causing leak of dangling threads
2018-07-12 10:20:53vstinnersetnosy: + gvanrossum
messages: + msg321534
2018-07-03 22:57:26vstinnercreate