Message 293677 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	max
Recipients	bquinlan, ezio.melotti, josh.r, max
Date	2017-05-15.05:24:55
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1494825898.08.0.434363250193.issue29842@psf.upfronthosting.co.za>
In-reply-to

Content
I'm also concerned about this (undocumented) inconsistency between map and Executor.map. I think you would want to make your PR limited to `ThreadPoolExecutor`. The `ProcessPoolExecutor` already does everything you want with its `chunksize` paramater, and adding `prefetch` to it will jeopardize the optimization for which `chunksize` is intended. Actually, I was even thinking whether it might be worth merging `chunksize` and `prefetch` arguments. The semantics of the two arguments is similar but not identical. Specifically, for `ProcessPoolExecutor`, there is pretty clear pressure to increase the value of `chunksize` to reduce amortized IPC costs; there is no IPC with threads, so the pressure to increase `prefetch` is much more situational (e.g., in the busy pool example I give below). For `ThreadPoolExecutor`, I prefer your implementation over the current one, but I want to point out that it is not strictly better, in the sense that with default arguments, there are situations where the current implementation behaves better. In many cases your implementation behaves much better. If the input is too large, it prevents out of memory condition. In addition, if the pool is not busy when `map` is called, your implementation will also be faster, since it will submit the first input for processing earlier. But consider the case where input is produced slower than it can be processed (`iterables` may fetch data from a database, but the callable `fn` may be a fast in-memory transformation). Now suppose the `Executor.map` is called when the pool is busy, so there'll be a delay before processing begins. In this case, the most efficient approach is to get as much input as possible while the pool is busy, since eventually (when the pool is freed up) it will become the bottleneck. This is exactly what the current implementation does. The implementation you propose will (by default) only prefetch a small number of input items. Then when the pool becomes available, it will quickly run out of prefetched input, and so it will be less efficient than the current implementation. This is especially unfortunate since the entire time the pool was busy, `Executor.map` is just blocking the main thread so it's literally doing nothing useful. Of course, the client can tweak `prefetch` argument to achieve better performance. Still, I wanted to make sure this issue is considered before the new implementation is adopted. From the performance perspective, an even more efficient implementation would be one that uses three background threads: - one to prefetch items from the input - one to sends items to the workers for processing - one to yield results as they become available It has a disadvantage of being slightly more complex, so I don't know if it really belongs in the standard library. Its advantage is that it will waste less time: it fetches inputs without pause, it submits them for processing without pause, and it makes results available to the client as soon as they are processed. (I have implemented and tried this approach, but not in productioon.) But even this implementation requires tuning. In the case with the busy pool that I described above, one would want to prefetch as much input as possible, but that may cause too much memory consumption and also possibly waste computation resources (if the most of input produced proves to be unneeded in the end).

I'm also concerned about this (undocumented) inconsistency between map and Executor.map.

I think you would want to make your PR limited to `ThreadPoolExecutor`. The `ProcessPoolExecutor` already does everything you want with its `chunksize` paramater, and adding `prefetch` to it will jeopardize the optimization for which `chunksize` is intended.

Actually, I was even thinking whether it might be worth merging `chunksize` and `prefetch` arguments. The semantics of the two arguments is similar but not identical. Specifically, for `ProcessPoolExecutor`, there is pretty clear pressure to increase the value of `chunksize` to reduce amortized IPC costs; there is no IPC with threads, so the pressure to increase `prefetch` is much more situational (e.g., in the busy pool example I give below).

For `ThreadPoolExecutor`, I prefer your implementation over the current one, but I want to point out that it is not strictly better, in the sense that *with default arguments*, there are situations where the current implementation behaves better.

In many cases your implementation behaves much better. If the input is too large, it prevents out of memory condition. In addition, if the pool is not busy when `map` is called, your implementation will also be faster, since it will submit the first input for processing earlier.

But consider the case where input is produced slower than it can be processed (`iterables` may fetch data from a database, but the callable `fn` may be a fast in-memory transformation). Now suppose the `Executor.map` is called when the pool is busy, so there'll be a delay before processing begins. In this case, the most efficient approach is to get as much input as possible while the pool is busy, since eventually (when the pool is freed up) it will become the bottleneck. This is exactly what the current implementation does.

The implementation you propose will (by default) only prefetch a small number of input items. Then when the pool becomes available, it will quickly run out of prefetched input, and so it will be less efficient than the current implementation. This is especially unfortunate since the entire time the pool was busy, `Executor.map` is just blocking the main thread so it's literally doing nothing useful.

Of course, the client can tweak `prefetch` argument to achieve better performance. Still, I wanted to make sure this issue is considered before the new implementation is adopted.

From the performance perspective, an even more efficient implementation would be one that uses three background threads:

- one to prefetch items from the input
- one to sends items to the workers for processing
- one to yield results as they become available

It has a disadvantage of being slightly more complex, so I don't know if it really belongs in the standard library.

Its advantage is that it will waste less time: it fetches inputs without pause, it submits them for processing without pause, and it makes results available to the client as soon as they are processed. (I have implemented and tried this approach, but not in productioon.)

But even this implementation requires tuning. In the case with the busy pool that I described above, one would want to prefetch as much input as possible, but that may cause too much memory consumption and also possibly waste computation resources (if the most of input produced proves to be unneeded in the end).

History
Date	User	Action	Args
2017-05-15 05:24:58	max	set	recipients: + max, bquinlan, ezio.melotti, josh.r
2017-05-15 05:24:58	max	set	messageid: <1494825898.08.0.434363250193.issue29842@psf.upfronthosting.co.za>
2017-05-15 05:24:58	max	link	issue29842 messages
2017-05-15 05:24:55	max	create