Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers #83388

Closed
yus2047889 mannequin opened this issue Jan 3, 2020 · 10 comments
Assignees
Labels
3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@yus2047889
Copy link
Mannequin

yus2047889 mannequin commented Jan 3, 2020

BPO 39207
Nosy @gpshead, @brianquinlan, @pitrou, @methane, @miss-islington, @aeros, @yus2047889
PRs
  • bpo-39207: Spawn workers on demand in ProcessPoolExecutor #19453
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/aeros'
    closed_at = <Date 2020-04-19.14:01:20.866>
    created_at = <Date 2020-01-03.21:44:04.541>
    labels = ['type-bug', 'library', '3.9']
    title = 'concurrent.futures.ProcessPoolExecutor does not properly reap jobs and spawns too many workers'
    updated_at = <Date 2022-01-21.23:35:45.178>
    user = 'https://github.com/yus2047889'

    bugs.python.org fields:

    activity = <Date 2022-01-21.23:35:45.178>
    actor = 'gregory.p.smith'
    assignee = 'aeros'
    closed = True
    closed_date = <Date 2020-04-19.14:01:20.866>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2020-01-03.21:44:04.541>
    creator = 'yus2047889'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39207
    keywords = ['patch']
    message_count = 10.0
    messages = ['359260', '359270', '359778', '359958', '359960', '359962', '360023', '365291', '366779', '411209']
    nosy_count = 7.0
    nosy_names = ['gregory.p.smith', 'bquinlan', 'pitrou', 'methane', 'miss-islington', 'aeros', 'yus2047889']
    pr_nums = ['19453']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue39207'
    versions = ['Python 3.9']

    @yus2047889
    Copy link
    Mannequin Author

    yus2047889 mannequin commented Jan 3, 2020

    This came up from a supporting library but the actual issue is within concurrent.futures.ProcessPool.

    Discussion can be found at agronholm/apscheduler#414

    ProcessPoolExecutor does not properly spin down and spin up new processes. Instead, it simply re-claims existing processes to re-purpose them for new jobs. Is there no option or way to make it so that instead of re-claiming existing processes, it spins down the process and then spins up another one. This behavior is a lot better for garbage collection and will help to prevent memory leaks.

    ProcessPoolExecutor also spins up too many processes and ignores the max_workers argument. An example is my setting max_workers=10, but I am only utilizing 3 processes. One would expect given the documentation that I would have at most 4 processes, the main process, and the 3 worker processes. Instead, ProcessPoolExecutor spawns all 10 max_workers and lets the other 7 just sit there, even though they are not necessary.

    @yus2047889 yus2047889 mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 3, 2020
    @methane
    Copy link
    Member

    methane commented Jan 4, 2020

    ProcessPoolExecutor does not properly spin down and spin up new processes.

    It is because Process "Pool" is for reusing processes.
    If you don't want to reuse process, you can use the Process.
    https://docs.python.org/3/library/multiprocessing.html#the-process-class

    Or you can create ProcessPoolExecutor before starting bunch of jobs and shutdown it after complete the jobs.

    ProcessPoolExecutor also spins up too many processes and ignores the max_workers argument. An example is my setting max_workers=10,
    [snip]
    Instead, ProcessPoolExecutor spawns all 10 max_workers

    What "ignores the max_workers argument" means?

    And would you create a simple reproducible example?

    Old ThreadPoolExecutor had the behavior (bpo-24882). But ProcessPoolExecutor starts worker processes on demand from old.

    @aeros
    Copy link
    Contributor

    aeros commented Jan 11, 2020

    What "ignores the max_workers argument" means?

    From my understanding, their argument was that the parameter name "max_workers" and documentation implies that it will spawn processes as needed up to *max_workers* based on the number of jobs scheduled.

    And would you create a simple reproducible example?

    I can't speak directly for the OP, but this simple example may demonstrate what they're talking about:

    Linux 5.4.8
    Python 3.8.1

    import concurrent.futures as cf
    import os
    import random
    
    def get_rand_nums(ls, n):
        return [random.randint(1, 100) for i in range(n)]
        
    def show_processes():
        print("All python processes:")
        os.system("ps -C python")
    
    def main():
        nums = []
        with cf.ProcessPoolExecutor(max_workers=6) as executor:
            futs = []
            show_processes()
            for _ in range(3):
                fut = executor.submit(get_rand_nums, nums, 10_000_000)
                futs.append(fut)
            show_processes()
            for fut in cf.as_completed(futs):
                nums.extend(fut.result())
            show_processes()
    
        assert len(nums) == 30_000_000
    
    if __name__ == '__main__':
        main()
    

    Output:

    [aeros:~/programming/python]$ python ppe_max_workers.py
    All python processes: # Main python process
        PID TTY          TIME CMD
      23683 pts/1    00:00:00 python
    All python processes: # Main python process + 6 unused subprocesses
        PID TTY          TIME CMD
      23683 pts/1    00:00:00 python
      23685 pts/1    00:00:00 python
      23686 pts/1    00:00:00 python
      23687 pts/1    00:00:00 python
      23688 pts/1    00:00:00 python
      23689 pts/1    00:00:00 python
      23690 pts/1    00:00:00 python
    All python processes: # Main python process + 3 used subprocesses + 3 unused subprocesses
        PID TTY          TIME CMD
      23683 pts/1    00:00:00 python
      23685 pts/1    00:00:07 python
      23686 pts/1    00:00:07 python
      23687 pts/1    00:00:07 python
      23688 pts/1    00:00:00 python
      23689 pts/1    00:00:00 python
      23690 pts/1    00:00:00 python
    

    As seen above, all processes up to *max_workers* were spawned immediately after the jobs were submitted to ProcessPoolExecutor, regardless of the actual number of jobs (3). It is also apparent that only three of those spawned processes were utilized by the CPU, as indicated by the values in the TIME field. The other three processes were not used.

    If it wasn't for this behavior, I think there would be a significant performance loss, as the executor would have to continuously calculate how many processes are needed and spawn them throughout it's lifespan. AFAIK, it _seems_ more efficient to spawn *max_workers* processes when the jobs are scheduled, and then use them as needed; rather than spawning the processes as needed.

    As a result, I think the current behavior should remain the same; unless someone can come up with a backwards-compatible alternative version and demonstrate its advantages over the current one.

    However, I do think the current documentation could do a better at explaining how max_workers actually behaves. See the current explanation: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor.

    The current version does not address any of the above points. In fact, the first line seems like it might imply the opposite of what it's actually doing (at least based on my above example):

    "An Executor subclass that executes calls asynchronously *using a pool of at most max_workers processes*." (asterisks added for emphasis)

    "using a pool of at most max_workers processes" could imply to users that *max_workers* sets an upper bound limit on the number of processes in the pool, but that *max_workers* is only reached if all of those processes are _needed_. Unless I'm misunderstanding something, that's not the case.

    I would suggest converting this into a documentation issue, assuming that the experts for the concurrent.futures confirm that the present behavior is intentional and that I'm correctly understanding the OP.

    @methane
    Copy link
    Member

    methane commented Jan 14, 2020

    Uh, my understanding "But ProcessPoolExecutor starts worker processes on demand from old." was wrong.

    I think this should be fixed like ThreadPoolExecutor.

    @aeros
    Copy link
    Contributor

    aeros commented Jan 14, 2020

    I think this should be fixed like ThreadPoolExecutor.

    Are there are any downsides or complications with changing this behavior for ProcessPoolExecutor to consider, such as what I mentioned above? From my understanding, there would be a performance penalty associated with spawning the processes on-demand as opposed to the current behavior of spawning *max_workers* processes at the same time, and using each of them as needed.

    Also, I think it's worth considering the following: do the same arguments for changing the behavior for ThreadPoolExecutor also apply to ProcessPoolExecutor? Although they share the same executor API, they have rather different use cases.

    That being said, if it's decided that we do want to make this behavior consistent with ThreadPoolExecutor, I would be willing to look into implementing it. I just want to make sure that it's carefully considered first.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 14, 2020

    It would certainly be better to start the worker processes on demand. It probably also requires careful thought about how to detect that more workers are required.

    @aeros
    Copy link
    Contributor

    aeros commented Jan 15, 2020

    It would certainly be better to start the worker processes on demand. It probably also requires careful thought about how to detect that more workers are required.

    Alright. In that case, I'll do some experimentation when I get the chance and see if I can come up with an effective way to spawn the worker processes as needed, without incurring a significant performance penalty.

    Note: I have a few other projects and obligations in the meantime, so I'm not 100% certain when I'll have the time to work on this. If anyone else is interested in working on this as well, certainly feel free to do so.

    @aeros
    Copy link
    Contributor

    aeros commented Mar 30, 2020

    So, I think a potential approach to this issue with ProcessPoolExecutor would be making it a bit more similar to ThreadPoolExecutor: instead of spawning max_workers processes on startup (see _adjust_process_count()), spawn each new process in submit() (see _adjust_thread_count()), and only do so when there are no idle processes (and the current count is below max_workers, of course).

    This wouldn't reap the idle processes, but I think re-using them is a behavior we want to retain. It seems like it would be quite costly to constantly close and start new processes each time a work item is completed.

    From my perspective, the main issue is that the processes are being spawned all at once instead of being spawned as needed. This can result in a substantial amount of extra cumulative idle time throughout the lifespan of the ProcessPoolExecutor.

    I should have some time to work on this in the next month or sooner, so I'll assign myself to this issue.

    (I also changed the version to just Python 3.9, as this seems like too significant of a behavioral change to backport to 3.8. Let me know if anyone disagrees.)

    @aeros aeros added 3.9 only security fixes and removed 3.8 only security fixes labels Mar 30, 2020
    @aeros aeros self-assigned this Mar 30, 2020
    @aeros aeros added the 3.9 only security fixes label Mar 30, 2020
    @miss-islington
    Copy link
    Contributor

    New changeset 1ac6e37 by Kyle Stanley in branch 'master':
    bpo-39207: Spawn workers on demand in ProcessPoolExecutor (GH-19453)
    1ac6e37

    @pitrou pitrou closed this as completed Apr 19, 2020
    @gpshead
    Copy link
    Member

    gpshead commented Jan 21, 2022

    fyi - this appears to have caused https://bugs.python.org/issue46464

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants