classification
Title: Deprecate Process Child Watchers
Type: Stage:
Components: asyncio Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: aeros Nosy List: aeros, asvetlov, benjamin.peterson, vstinner, yselivanov
Priority: normal Keywords:

Created on 2019-10-25 19:23 by aeros, last changed 2019-11-15 13:21 by aeros.

Messages (22)
msg355372 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-10-25 19:23
Proposal: 
Deprecate the alternative process watcher implementation to ThreadedChildWatcher, MultiLoopChildWatcher.

Motivation:
The idea for this proposal came from a comment from Andrew Svetlov in GH-16552:
"I believe after polishing ThreadedChildWatcher we can start deprecation of watchers subsystem at all, it generates more problems than solves."

Although Andrew suggested the idea of deprecating the process watchers subsystem entirely, I think that it would be adequate to start with just deprecating MultiLoopChildWatcher and proceeding from there. This is because the other three watcher implementations are significantly more stable in comparison (based on number of issues), have less complex implementations, and have far more widespread usage across public repositories. I would not be opposed to deprecating the other alternative watchers as well, but MultiLoopChildWatcher likely has the most justification for removal.

Details:
The class MultiLoopChildWatcher has a fairly complex implementation, causes a number of issues, is fairly unstable, is rarely utilized in comparison with the rest of the watchers API. It seems to have become a significant burden on maintenance for asyncio development without providing a significant benefit to the vast majority of users. 

Current unresolved MultiLoopChildWatcher issues:
https://bugs.python.org/issue38323
https://bugs.python.org/issue38182
https://bugs.python.org/issue37573

(3 out of the 5 open process watcher issues are caused by MultiLoopChildWatcher)

GitHub code usage comparison:
MultiLoopChildWatcher: https://github.com/search?l=Python&q=MultiLoopChildWatcher&type=Code (20 results)
ThreadedChildWatcher: https://github.com/search?l=Python&q=ThreadedChildWatcher&type=Code (77 results) 
FastChildWatcher: https://github.com/search?l=Python&q=FastChildWatcher&type=Code (4,426 results)
SafeChildWatcher: https://github.com/search?l=Python&q=SafeChildWatcher&type=Code (7,007 results)
All of asyncio usage: https://github.com/search?l=Python&q=asyncio&type=Code (599,131 results)

Note that for the above results, it also includes matches in the CPython repository and for repositories that have exact copies of Lib/asyncio/unix_events.py. Also, ThreadedChildWatcher likely has significantly less results since it's already the default watcher returned by asyncio.get_child_watcher(), so there's less need to explicitly declare it compared to the others. There are of course private repositories and non-GitHub repositories this doesn't include, but it should provide a general idea of overall usage.

My experience in interacting with asyncio users is similar to the results. The process watchers subsystem seems to be minimally used compared to the rest of asyncio, with ThreadedChildWatcher likely being the most used as the default returned from `asyncio.get_child_watcher()`. I have not personally seen a realistic example of usage of MultiLoopChildWatcher outside of python/cpython, and all of the results returned on GitHub were copies of Lib/asyncio/unix_events.py or Lib/test/test_asyncio/test_subprocess.py.

I would be interested in working on this issue if it is approved, as I think it would provide a significant long term reduction in maintenance for asyncio; allowing us to focus on the improvement and development of other features that benefit a far larger audience.
msg355373 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-10-25 19:28
Didn't we just add MultiLoopChildWatcher in 3.8?
msg355375 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-10-25 19:37
> Didn't we just add MultiLoopChildWatcher in 3.8?

Yep, that's correct: https://docs.python.org/3/library/asyncio-policy.html#asyncio.MultiLoopChildWatcher

I wasn't aware of that, but it would explain the current lack of usage. I'm generally in favor of Andrew's idea to deprecate the watchers subsystem, but perhaps if we go through with that we should remove MultiLoopChildWatcher from 3.8 (if it's not too late to do so) instead of deprecating it.
msg355376 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-10-25 19:43
> but perhaps if we go through with that we should remove MultiLoopChildWatcher from 3.8 (if it's not too late to do so) instead of deprecating it.

I'll leave that up to Andrew to decide, but I'd be +1 to drop it asap, especially if we want to eventually deprecate watchers.  

Speaking of watchers -- big +1 from me to drop them all at some point. I would start as early as 3.9.

Linux has pidfd now, freebsd/macos has kqueue, windows has its own apis for watching processes. Threads can be the backup method for OSes that lack proper APIs for watching multiple processes (without using SIGCHLD etc).
msg355379 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-10-25 19:54
> Speaking of watchers -- big +1 from me to drop them all at some point. I would start as early as 3.9.

Yeah that was my initial plan, to start the deprecation in 3.9 and finalize the removal in 3.11. We might be able to make an exception for MultiLoopChildWatcher and just remove it from 3.8 though. 

I think it will be necessary to fix the issues in the near future if we want to keep MultiLoopChildWatcher in 3.8 (as they apply to 3.8 and 3.9). I would certainly not be ecstatic about the idea of removing something that was just recently added (similar to how we had to revert the new streaming changes due to API design), but I'm even less in favor of keeping something around that's not stable in a final release. 

Based on my investigation of https://bugs.python.org/issue38323, it seems like it will be a rather complex issue to solve.
msg355380 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2019-10-25 19:58
Kyle, why are you resetting the status to "Pending"?

> I think it will be necessary to fix the issues in the near future if we want to keep MultiLoopChildWatcher in 3.8 (as they apply to 3.8 and 3.9). I would certainly not be ecstatic about the idea of removing something that was just recently added (similar to how we had to revert the new streaming changes due to API design), but I'm even less in favor of keeping something around that's not stable in a final release. 

Your opinion on this is duly noted.  

It would be great to hear what Andrew and Victor think about this.
msg355381 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-10-25 20:01
> Kyle, why are you resetting the status to "Pending"?

That was an accident, I think I had that set already on the page and submitted my comment just after you did yours.

I'm going to change the title of the issue to "Deprecate Process Child Watchers". I started the proposal bit more conservatively because I thought that it might be easier to just deprecate MultiLoopChildWatcher at first. But seeing as it was just added in 3.8, I don't think it would make as much sense to deprecate that one by itself.
msg355390 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-10-25 22:05
ThreadedChildWatcher starts a thread per process but has O(1) complexity.

MultiLoopChildWatcher doesn't spawn threads, it can be used safely with asyncio loops spawn in multiple threads. The complexity is O(N) plus no other code should contest for SIG_CHLD subscription.

FastChildWatcher has O(1), this is the good news. All others are bad: the watcher conflicts even with blocking `subprocess.wait()` call, even if the call is performed from another thread.

SafeChildWatcher is safer than FastChildWatcher but working with asyncio subprocess API is still super complicated if asyncio code is running from multiple threads. SafeChildWatcher works well only if asyncio is run from the main thread only. Complexity is O(N).

I think FastChildWatcher and SafeChildWatcher should go, ThreadedChildWatcher should be kept default and MultiLoopChildWatcher is an option where ThreadedChildWatcher is not satisfactory.

MultiLoopChildWatcher problems can and should be fixed; there is nothing bad in the idea but slightly imperfect implementation.

Regarding pidfd and kqueue -- I love to see pull requests with proposals. Now nothing exists.
The pidfd is available starting from the latest released Linux 5.3;  we need to wait for a decade before all Linux distros adopt it and we can drop all other implementations.
msg355394 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-10-25 23:36
> I think FastChildWatcher and SafeChildWatcher should go, ThreadedChildWatcher should be kept default and MultiLoopChildWatcher is an option where ThreadedChildWatcher is not satisfactory.

Okay, I think I can understand the reasoning here. Do you think that FastChildWatcher and SafeChildWatcher could be deprecated starting in 3.9 and removed in 3.11? If so, I'd be glad to start working on adding the deprecation warnings and the 3.9 Whats New entries.

> MultiLoopChildWatcher problems can and should be fixed; there is nothing bad in the idea but slightly imperfect implementation.

Yeah my largest concern was just that the current issues seem especially complex to fix and I interpreted your previous comment as "I'd like to deprecate all the process watchers in the near future". Thus, it seemed to make sense to me that we could start removing them rather than sinking time into fixing something that might be removed soon. 

By "slightly imperfect implementation", do you have any ideas for particularly imperfect parts of it that could use improvement? 

I feel that I've developed a decent understanding of the implementation for ThreadedChildWatcher, but after looking at the race conditions for MultiLoopChildWatcher in https://bugs.python.org/issue38323, I'll admit that I felt a bit lost for where to find a solution. Primarily because my understanding of the signal module is quite limited in comparison to others; it's not an area that I've strongly focused on.
msg355395 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-26 00:36
> ThreadedChildWatcher starts a thread per process but has O(1) complexity.

But it spawns a new Python thread per process which can be a blocker issue if a server memory is limited. What if you want to spawn 100 processes? Or 1000 processes? What is the memory usage?

I like FastChildWatcher!

> ... but working with asyncio subprocess API is still super complicated if asyncio code is running from multiple threads

Well, I like the ability to choose the child watcher implementation depending on my use case. If asyncio is only run from the main thread, FastChildWatcher is safe, fast and has low memory footprint, no?
msg355396 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-26 00:40
> Regarding pidfd and kqueue -- I love to see pull requests with proposals. Now nothing exists. The pidfd is available starting from the latest released Linux 5.3;  we need to wait for a decade before all Linux distros adopt it and we can drop all other implementations.

I hope that it will take less time to expose pidfd_open() in Python! It is *already* available in the Linux kernel 5.3. My laptop is already running Linux 5.3! (Thanks Fedora 30.)

I would prefer to keep an API to choose the child watcher since it seems like even in 2019, there are still new APIs (pidfd) to wait for a process completion. I expect that each implementation will have advantages and drawbacks.
msg355409 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-10-26 08:45
My non-LTS Ubuntu also has 5.3 kernel but I'm talking about the oldest supported RHEL/CentOS.

That's why pidfd_open() cannot be a single implementation. It's so new; my local man-pages system has not a record about the API yet (but the web has: http://man7.org/linux/man-pages/man2/pidfd_open.2.html).

> If asyncio is only run from the main thread, FastChildWatcher is safe, fast and has low memory footprint, no?

Unfortunately, no. FastChildWatcher is safe if you can guarantee that no code executed in asyncio main thread AND thread pools spawn subprocesses. Otherwise, the whole Python process becomes broken by the race condition between FastChildWatcher and any other wait()/waitpid()/waitid() call.
msg355419 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-10-26 16:52
> > If asyncio is only run from the main thread, FastChildWatcher is safe, fast and has low memory footprint, no?

> Unfortunately, no. FastChildWatcher is safe if you can guarantee that no code executed in asyncio main thread AND thread pools spawn subprocesses

Am I misunderstanding something here or is this supposed to be "FastChildWatcher is safe if you can guarantee that no code executed *outside of* the asyncio main thread AND ..."? Alternatively, "FastChildWatcher is safe if you can guarantee that code *only* executed in the asyncio main thread". Both of the above have the same functional meaning. 

I think it was a typo, but I just wanted to make sure because the distinction from the original makes a functional difference in this case. 

Also, regarding the second part "thread pools spawn subprocesses", is that to say that subprocesses can only spawn within the thread pools? As in, FastChildWatcher becomes unsafe if subprocesses are spawned from anywhere else?

These answers may be fairly obvious to someone familiar working within the internals of FastChildWatcher, but it may not be overly clear to someone such as myself who has mostly just read through the documentation and looked over the implementation briefly. I'm only familiar with the internals of ThreadedChildWatcher.
msg355421 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-10-26 17:12
> But it spawns a new Python thread per process which can be a blocker issue if a server memory is limited.

I understand that there's *some* overhead associated with spawning a new thread, but from my impression it's not substantial enough to make a significant impact in most cases. Each individual instance of threading.Thread is only 64 bytes. Have you seen any recent cases where the server memory is limited enough for the memory cost associated with having to spawn an additional thread per subprocess becomes the limiting factor?
msg356002 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-11-05 05:43
FWIW, I started implementing a pidfd-based child process watcher over on #38692.
msg356586 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-11-14 08:30
> I understand that there's *some* overhead associated with spawning a new thread, but from my impression it's not substantial enough to make a significant impact in most cases.

Although I think this still stands to some degree, I will have to rescind the following:

> Each individual instance of threading.Thread is only 64 bytes.

The 64 bytes was measured by `sys.getsizeof(threading.Thread())`, which only provides a surface level assessment. I believe this only includes the size of the reference to the thread object.

In order to get a better estimate, I implemented a custom get_size() function, that recursively adds the size of the object and all unique objects from gc.get_referents()  (ignoring several redundant and/or unnecessary types). For more details, see https://gist.github.com/aeros/632bd035b6f95e89cdf4bb29df970a2a. Feel free to critique it if there are any apparent issues (for the purpose of measuring the size of threads). 

Then, I used this function on three different threads, to figure how much memory was needed for each one:

Python 3.8.0+ (heads/3.8:1d2862a323, Nov  4 2019, 06:59:53) 
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import threading
>>> from get_size import get_size
>>> a = threading.Thread()
>>> b = threading.Thread()
>>> c = threading.Thread()
>>> get_size(a)
3995
>>> get_size(b)
1469
>>> get_size(c)
1469

1469 bytes seems to be roughly the amount of additional memory required for each new thread, at least on Linux kernel 5.3.8 and Python 3.8. I don't know if this is 100% accurate, but it at least provides an improved estimate over sys.getsizeof().

> But it spawns a new Python thread per process which can be a blocker issue if a server memory is limited. What if you want to spawn 100 processes? Or 1000 processes? What is the memory usage?

From my understanding, ~1.5KB/thread seems to be quite negligible for most modern equipment. The server's memory would have to be very limited for spawning an additional 1000 threads to be a bottleneck/blocker issue:

Python 3.8.0+ (heads/3.8:1d2862a323, Nov  4 2019, 06:59:53) 
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import threading
>>> from get_size import get_size
>>> threads = []
>>> for _ in range(1000):
...     th = threading.Thread()
...     threads.append(th)
... 
>>> get_size(threads)
1482435

(~1.5MB)

Victor (or anyone else), in your experience, would the additional ~1.5KB per process be an issue for 99% of production servers? If not, it seems to me like the additional maintenance cost of keeping SafeChildWatcher and FastChildWatcher in asyncio's API wouldn't be worthwhile.
msg356594 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-11-14 11:11
> The 64 bytes was measured by `sys.getsizeof(threading.Thread())`, which only provides a surface level assessment. I believe this only includes the size of the reference to the thread object.

You have to account also for the thread stack size. I suggest to look at RSS memory instead.
msg356596 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-11-14 11:17
I measured the RSS memory per thread: it's around 13.2 kB/thread. Oh, that's way lower than what I expected.

Script:

# 10: 9636 kB => 9756 kB: +12 kB/thread
# 100: 9476 kB = 10796 kB: +13.2 kB/thread
# 1000: 9552 kB = 22776 kB: +13.2 kB/thread

import os
import threading

def display_rss():
    os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")

def wait(event):
    event.wait()

class Thread(threading.Thread):
    def __init__(self):
        super().__init__()
        self.stop_event = threading.Event()
        self.started_event = threading.Event()

    def run(self):
        self.started_event.set()
        self.stop_event.wait()

    def stop(self):
        self.stop_event.set()
        self.join()

def main():
    nthread = 1000
    display_rss()
    threads = [Thread() for i in range(nthread)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.started_event.wait()
    display_rss()
    for thread in threads:
        thread.stop()

main()
msg356599 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-11-14 11:39
To be clear: I mean that FastChildWatcher is safe only if all process's code spaws subprocesses by FastChildWatcher.
If ProcessPoolExecutor or direct subprocess calls are used the watcher is unsafe.
If some C extension spawns new processes on its own (e.g. in a separate thread) -- the watcher is unsafe.

I just think that this particular watcher is too dangerous.
msg356639 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-11-15 00:41
> You have to account also for the thread stack size. I suggest to look at RSS memory instead.

Ah, good point. I believe get_size() only accounts for the memory usage of the thread object, not the amount allocated in physical memory from the thread stack. Thanks for the clarification. 

> I measured the RSS memory per thread: it's around 13.2 kB/thread. Oh, that's way lower than what I expected.

On Python 3.8 and Linux kernel 5.3.8, I received the following result:

# Starting mem
VmRSS:      8408 kB
# After initializing and starting 1k threads:
VmRSS:     21632 kB

~13224kB for 1k threads, which reflects the ~13.2kB/thread estimate. 

Also, as a sidenote, I think you could remove the "for thread in threads: thread.started_event.wait()" portion for our purposes. IIUC, waiting on the threading.Event objects wouldn't affect memory usage.

> To be clear: I mean that FastChildWatcher is safe only if all process's code spaws subprocesses by FastChildWatcher.
> If ProcessPoolExecutor or direct subprocess calls are used the watcher is unsafe.
> If some C extension spawns new processes on its own (e.g. in a separate thread) -- the watcher is unsafe.

> I just think that this particular watcher is too dangerous.

So are we at least in agreement for starting with deprecating FastChildWatcher? If a server is incredibly tight on memory and it can't spare ~13.2kB/thread, SafeChildWatcher would be an alternative to ThreadedChildWatcher.

Personally, I still think this amount is negligible for most production servers, and that we can reasonably deprecate SafeChildWatcher as well. But I can start with FastChildWatcher.
msg356669 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-11-15 10:26
> So are we at least in agreement for starting with deprecating FastChildWatcher?

I think so. It will take a long before we remove it though.
msg356672 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-11-15 13:21
> I think so. It will take a long before we remove it though.

In that case, it could be a long term deprecation notice, where we start the deprecation process without having a definitive removal version. This will at least encourage users to look towards using the other watchers instead of FastChildWatcher. I can start working on a PR.
History
Date User Action Args
2019-11-15 13:21:36aerossetmessages: + msg356672
2019-11-15 10:26:16asvetlovsetmessages: + msg356669
2019-11-15 00:41:11aerossetmessages: + msg356639
2019-11-14 11:39:03asvetlovsetmessages: + msg356599
2019-11-14 11:17:35vstinnersetmessages: + msg356596
2019-11-14 11:11:46vstinnersetmessages: + msg356594
2019-11-14 08:30:02aerossetmessages: + msg356586
2019-11-05 05:43:45benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg356002
2019-10-26 17:12:34aerossetmessages: + msg355421
2019-10-26 16:52:02aerossetmessages: + msg355419
2019-10-26 08:45:41asvetlovsetmessages: + msg355409
2019-10-26 00:40:39vstinnersetmessages: + msg355396
2019-10-26 00:36:55vstinnersetmessages: + msg355395
2019-10-25 23:36:40aerossetmessages: + msg355394
2019-10-25 22:05:39asvetlovsetmessages: + msg355390
2019-10-25 20:02:23aerossettitle: Deprecating MultiLoopChildWatcher -> Deprecate Process Child Watchers
2019-10-25 20:01:29aerossetmessages: + msg355381
2019-10-25 19:58:23yselivanovsetstatus: pending -> open

messages: + msg355380
2019-10-25 19:54:38aerossetstatus: open -> pending

messages: + msg355379
2019-10-25 19:43:31yselivanovsetmessages: + msg355376
2019-10-25 19:37:03aerossetmessages: + msg355375
2019-10-25 19:28:00yselivanovsetstatus: pending -> open

messages: + msg355373
2019-10-25 19:24:15aerossetstatus: open -> pending
assignee: aeros

nosy: + vstinner
2019-10-25 19:23:29aeroscreate