Message 322390 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	josh.r
Recipients	Klamann, bquinlan, ezio.melotti, josh.r, max
Date	2018-07-25.19:48:42
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1532548122.91.0.56676864532.issue29842@psf.upfronthosting.co.za>
In-reply-to

Content
In response to Max's comments: >But consider the case where input is produced slower than it can be processed (`iterables` may fetch data from a database, but the callable `fn` may be a fast in-memory transformation). Now suppose the `Executor.map` is called when the pool is busy, so there'll be a delay before processing begins. In this case, the most efficient approach is to get as much input as possible while the pool is busy, since eventually (when the pool is freed up) it will become the bottleneck. This is exactly what the current implementation does. I'm not sure the "slow input iterable, fast task, competing tasks from other sources" case is all that interesting. Uses of Executor.map in the first place are usually a replacement for complex task submission; perhaps my viewpoint is blinkered, but I see the Executors used for either explicit use of submit or map, rather than mixing and matching (you might use it for both, but rarely interleave usages). Without a mix and match scenario (and importantly, a mix and match scenario where enough work is submitted before the map to occupy all workers, and very little work is submitted after the map begins to space out map tasks such that additional map input is requested while workers are idle), the smallish default prefetch is an improvement, simply by virtue of getting initial results more quickly. The solution of making a dedicated input thread would introduce quite a lot of additional complexity, well beyond what I think it justifiable for a relatively niche use case, especially one with many available workarounds, e.g. 1. Raising the prefetch count explicitly 2. Having the caller listify the iterable (similar to passing an arbitrarily huge prefetch value, with the large prefetch value having the advantage of sending work to the workers immediately, while listifying has the advantage of allowing you to handle any input exceptions up front rather than receiving them lazily during processing) 3. Use cheaper inputs (e.g. the query string, not the results of the DB query) and perform the expensive work as part of the task (after all, the whole point is to parallelize the most expensive work) 4. Using separate Executors so the manually submitted work doesn't interfere with the mapped work, and vice versa 5. Making a separate ThreadPoolExecutor to generate the expensive input values via its own map function (optionally with a larger prefetch count), e.g. instead of with SomeExecutor() as executor: for result in executor.map(func, (get_from_db(query) for query in queries)): do: with SomeExecutor() as executor, ThreadPoolExecutor() as inputexec: inputs = inputexec.map(get_from_db, queries) for result in executor.map(func, inputs): Point is, yes, there will still be niche cases where Executor.map isn't perfect, but this patch is intentionally a bit more minimal to keep the Python code base simple (no marshaling exceptions across thread boundaries) and avoid extreme behavioral changes; it has some smaller changes, e.g. it necessarily means input-iterator-triggered exceptions can be raised after some results are successfully produced, but it doesn't involve adding more implicit threading, marshaling exceptions across threads, etc. Your proposed alternative, with a thread for prefetching inputs, a thread for sending tasks, and a thread for returning results creates a number of problems: 1. As you mentioned, if no prefetch limit is imposed, memory usage remains unbounded; if the input is cheap to generate and slow to process, memory exhaustion is nearly guaranteed for infinite inputs, and more likely for "very large" inputs. I'd prefer the default arguments to be stable in (almost) all cases, rather than try to maximize performance for rare cases at the expense of stability in many cases. 2. When input generation is CPU bound, you've just introduced an additional source of unavoidable GIL contention; granted, after the GIL fixes in 3.2, GIL contention tends to hurt less (before those fixes, I could easily occupy 1.9 cores doing 0.5 cores worth of actual work with just two CPU bound threads). Particularly in the ProcessPoolExecutor case (where avoiding GIL contention is the goal), it's a little weird if you can end up with unavoidable GIL contention in the main process. 3. Exception handling from the input iterator just became a nightmare; in a "single thread performs input pulls and result yield" scenario, the exceptions from the input thread naturally bubble to the caller of Executor.map (possibly after several results have been produced, but eventually). If a separate thread is caching from the input iterator, we'd need to marshal the exception from that thread back to the thread running Executor.map so it's visible to the caller, and providing a traceback that is both accurate and useful is not obvious.

In response to Max's comments:

>But consider the case where input is produced slower than it can be processed (`iterables` may fetch data from a database, but the callable `fn` may be a fast in-memory transformation). Now suppose the `Executor.map` is called when the pool is busy, so there'll be a delay before processing begins. In this case, the most efficient approach is to get as much input as possible while the pool is busy, since eventually (when the pool is freed up) it will become the bottleneck. This is exactly what the current implementation does.

I'm not sure the "slow input iterable, fast task, competing tasks from other sources" case is all that interesting. Uses of Executor.map in the first place are usually a replacement for complex task submission; perhaps my viewpoint is blinkered, but I see the Executors used for *either* explicit use of submit *or* map, rather than mixing and matching (you might use it for both, but rarely interleave usages). Without a mix and match scenario (and importantly, a mix and match scenario where enough work is submitted before the map to occupy all workers, and very little work is submitted after the map begins to space out map tasks such that additional map input is requested while workers are idle), the smallish default prefetch is an improvement, simply by virtue of getting initial results more quickly.

The solution of making a dedicated input thread would introduce quite a lot of additional complexity, well beyond what I think it justifiable for a relatively niche use case, especially one with many available workarounds, e.g.

1. Raising the prefetch count explicitly

2. Having the caller listify the iterable (similar to passing an arbitrarily huge prefetch value, with the large prefetch value having the advantage of sending work to the workers immediately, while listifying has the advantage of allowing you to handle any input exceptions up front rather than receiving them lazily during processing)

3. Use cheaper inputs (e.g. the query string, not the results of the DB query) and perform the expensive work as part of the task (after all, the whole point is to parallelize the most expensive work)

4. Using separate Executors so the manually submitted work doesn't interfere with the mapped work, and vice versa

5. Making a separate ThreadPoolExecutor to generate the expensive input values via its own map function (optionally with a larger prefetch count), e.g. instead of

with SomeExecutor() as executor:
for result in executor.map(func, (get_from_db(query) for query in queries)):

do:

with SomeExecutor() as executor, ThreadPoolExecutor() as inputexec:
inputs = inputexec.map(get_from_db, queries)
for result in executor.map(func, inputs):

Point is, yes, there will still be niche cases where Executor.map isn't perfect, but this patch is intentionally a bit more minimal to keep the Python code base simple (no marshaling exceptions across thread boundaries) and avoid extreme behavioral changes; it has some smaller changes, e.g. it necessarily means input-iterator-triggered exceptions can be raised after some results are successfully produced, but it doesn't involve adding more implicit threading, marshaling exceptions across threads, etc.

Your proposed alternative, with a thread for prefetching inputs, a thread for sending tasks, and a thread for returning results creates a number of problems:

1. As you mentioned, if no prefetch limit is imposed, memory usage remains unbounded; if the input is cheap to generate and slow to process, memory exhaustion is nearly guaranteed for infinite inputs, and more likely for "very large" inputs. I'd prefer the default arguments to be stable in (almost) all cases, rather than try to maximize performance for rare cases at the expense of stability in many cases.

2. When input generation is CPU bound, you've just introduced an additional source of unavoidable GIL contention; granted, after the GIL fixes in 3.2, GIL contention tends to hurt less (before those fixes, I could easily occupy 1.9 cores doing 0.5 cores worth of actual work with just two CPU bound threads). Particularly in the ProcessPoolExecutor case (where avoiding GIL contention is the goal), it's a little weird if you can end up with unavoidable GIL contention in the main process.

3. Exception handling from the input iterator just became a nightmare; in a "single thread performs input pulls and result yield" scenario, the exceptions from the input thread naturally bubble to the caller of Executor.map (possibly after several results have been produced, but eventually). If a separate thread is caching from the input iterator, we'd need to marshal the exception from that thread back to the thread running Executor.map so it's visible to the caller, and providing a traceback that is both accurate and useful is not obvious.

History
Date	User	Action	Args
2018-07-25 19:48:43	josh.r	set	recipients: + josh.r, bquinlan, ezio.melotti, max, Klamann
2018-07-25 19:48:42	josh.r	set	messageid: <1532548122.91.0.56676864532.issue29842@psf.upfronthosting.co.za>
2018-07-25 19:48:42	josh.r	link	issue29842 messages
2018-07-25 19:48:42	josh.r	create