
Author aeros
Recipients aeros, bquinlan, methane, pitrou, yus2047889
Date 2020-01-11.04:30:24
Content
> What "ignores the max_workers argument" means?

From my understanding, their argument is that the parameter name "max_workers" and the documentation imply that processes are spawned as needed, up to *max_workers*, based on the number of jobs scheduled.

> And would you create a simple reproducible example?

I can't speak directly for the OP, but this simple example may demonstrate what they're talking about:

Linux 5.4.8
Python 3.8.1

```
import concurrent.futures as cf
import os
import random

def get_rand_nums(n):
    # CPU-bound job: build a list of n random integers.
    return [random.randint(1, 100) for _ in range(n)]

def show_processes():
    # List all running python processes (Linux-only; uses procps).
    print("All python processes:")
    os.system("ps -C python")

def main():
    nums = []
    with cf.ProcessPoolExecutor(max_workers=6) as executor:
        futs = []
        show_processes()
        for _ in range(3):
            fut = executor.submit(get_rand_nums, 10_000_000)
            futs.append(fut)
        show_processes()
        for fut in cf.as_completed(futs):
            nums.extend(fut.result())
        show_processes()

    assert len(nums) == 30_000_000

if __name__ == '__main__':
    main()
```

Output:

```
[aeros:~/programming/python]$ python ppe_max_workers.py
All python processes: # Main python process
    PID TTY          TIME CMD
  23683 pts/1    00:00:00 python
All python processes: # Main python process + 6 unused subprocesses
    PID TTY          TIME CMD
  23683 pts/1    00:00:00 python
  23685 pts/1    00:00:00 python
  23686 pts/1    00:00:00 python
  23687 pts/1    00:00:00 python
  23688 pts/1    00:00:00 python
  23689 pts/1    00:00:00 python
  23690 pts/1    00:00:00 python
All python processes: # Main python process + 3 used subprocesses + 3 unused subprocesses
    PID TTY          TIME CMD
  23683 pts/1    00:00:00 python
  23685 pts/1    00:00:07 python
  23686 pts/1    00:00:07 python
  23687 pts/1    00:00:07 python
  23688 pts/1    00:00:00 python
  23689 pts/1    00:00:00 python
  23690 pts/1    00:00:00 python
```

As seen above, all six processes allowed by *max_workers* were spawned immediately after the jobs were submitted to the ProcessPoolExecutor, even though only three jobs were scheduled. The TIME field also shows that only three of those processes did any work; the other three sat idle.
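For what it's worth, the same thing can be observed from within the script instead of with `ps`, e.g. by counting the live child processes with `multiprocessing.active_children()`. A rough sketch (the exact count assumes no other child processes are running, and that 3.8's eager spawning behavior applies):

```
import concurrent.futures as cf
import multiprocessing

def square(n):
    return n * n

if __name__ == '__main__':
    with cf.ProcessPoolExecutor(max_workers=6) as executor:
        futs = [executor.submit(square, i) for i in range(3)]
        for fut in cf.as_completed(futs):
            fut.result()
        # On 3.8 this is expected to print 6: the full pool is created on
        # the first submit, even though only 3 jobs were scheduled.
        print("worker processes:", len(multiprocessing.active_children()))
```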

If it weren't for this behavior, I think there would be a significant performance loss, as the executor would have to continuously calculate how many processes are needed and spawn them throughout its lifespan. AFAIK, it _seems_ more efficient to spawn *max_workers* processes when the jobs are scheduled and then use them as needed, rather than spawning processes on demand.
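That said, if the idle workers are a concern for a particular workload, a caller can already avoid them by capping max_workers at the number of jobs. A hypothetical workaround sketch (not something the documentation currently suggests):

```
import concurrent.futures as cf
import random

def get_rand_nums(n):
    return [random.randint(1, 100) for _ in range(n)]

if __name__ == '__main__':
    n_jobs = 3
    # Cap the pool at the number of scheduled jobs so no idle workers spawn.
    with cf.ProcessPoolExecutor(max_workers=min(6, n_jobs)) as executor:
        futs = [executor.submit(get_rand_nums, 1_000) for _ in range(n_jobs)]
        results = [fut.result() for fut in cf.as_completed(futs)]
    assert len(results) == n_jobs
```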

As a result, I think the current behavior should remain the same, unless someone can come up with a backwards-compatible alternative and demonstrate its advantages over the current one.

However, I do think the current documentation could do a better job of explaining how max_workers actually behaves. See the current explanation: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor.

The current version does not address any of the above points. In fact, the first sentence seems to imply the opposite of what actually happens (at least based on my example above):

"An Executor subclass that executes calls asynchronously *using a pool of at most max_workers processes*." (asterisks added for emphasis)

"using a pool of at most max_workers processes" could imply to users that *max_workers* sets an upper bound limit on the number of processes in the pool, but that *max_workers* is only reached if all of those processes are _needed_. Unless I'm misunderstanding something, that's not the case.

I would suggest converting this into a documentation issue, assuming the concurrent.futures experts confirm that the present behavior is intentional and that I'm understanding the OP correctly.