Author josh.r
Recipients Klamann, bquinlan, ezio.melotti, josh.r, max
Date 2018-07-25.19:48:42
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
In response to Max's comments:

>But consider the case where input is produced slower than it can be processed (`iterables` may fetch data from a database, but the callable `fn` may be a fast in-memory transformation). Now suppose the `` is called when the pool is busy, so there'll be a delay before processing begins. In this case, the most efficient approach is to get as much input as possible while the pool is busy, since eventually (when the pool is freed up) it will become the bottleneck. This is exactly what the current implementation does.

I'm not sure the "slow input iterable, fast task, competing tasks from other sources" case is all that interesting. Uses of in the first place are usually a replacement for complex task submission; perhaps my viewpoint is blinkered, but I see the Executors used for *either* explicit use of submit *or* map, rather than mixing and matching (you might use it for both, but rarely interleave usages). Without a mix and match scenario (and importantly, a mix and match scenario where enough work is submitted before the map to occupy all workers, and very little work is submitted after the map begins to space out map tasks such that additional map input is requested while workers are idle), the smallish default prefetch is an improvement, simply by virtue of getting initial results more quickly.

The solution of making a dedicated input thread would introduce quite a lot of additional complexity, well beyond what I think it justifiable for a relatively niche use case, especially one with many available workarounds, e.g.

1. Raising the prefetch count explicitly

2. Having the caller listify the iterable (similar to passing an arbitrarily huge prefetch value, with the large prefetch value having the advantage of sending work to the workers immediately, while listifying has the advantage of allowing you to handle any input exceptions up front rather than receiving them lazily during processing)

3. Use cheaper inputs (e.g. the query string, not the results of the DB query) and perform the expensive work as part of the task (after all, the whole point is to parallelize the most expensive work)

4. Using separate Executors so the manually submitted work doesn't interfere with the mapped work, and vice versa

5. Making a separate ThreadPoolExecutor to generate the expensive input values via its own map function (optionally with a larger prefetch count), e.g. instead of

with SomeExecutor() as executor:
    for result in, (get_from_db(query) for query in queries)):


with SomeExecutor() as executor, ThreadPoolExecutor() as inputexec:
    inputs =, queries)
    for result in, inputs):

Point is, yes, there will still be niche cases where isn't perfect, but this patch is intentionally a bit more minimal to keep the Python code base simple (no marshaling exceptions across thread boundaries) and avoid extreme behavioral changes; it has some smaller changes, e.g. it necessarily means input-iterator-triggered exceptions can be raised after some results are successfully produced, but it doesn't involve adding more implicit threading, marshaling exceptions across threads, etc.

Your proposed alternative, with a thread for prefetching inputs, a thread for sending tasks, and a thread for returning results creates a number of problems:

1. As you mentioned, if no prefetch limit is imposed, memory usage remains unbounded; if the input is cheap to generate and slow to process, memory exhaustion is nearly guaranteed for infinite inputs, and more likely for "very large" inputs. I'd prefer the default arguments to be stable in (almost) all cases, rather than try to maximize performance for rare cases at the expense of stability in many cases.

2. When input generation is CPU bound, you've just introduced an additional source of unavoidable GIL contention; granted, after the GIL fixes in 3.2, GIL contention tends to hurt less (before those fixes, I could easily occupy 1.9 cores doing 0.5 cores worth of actual work with just two CPU bound threads). Particularly in the ProcessPoolExecutor case (where avoiding GIL contention is the goal), it's a little weird if you can end up with unavoidable GIL contention in the main process.

3. Exception handling from the input iterator just became a nightmare; in a "single thread performs input pulls and result yield" scenario, the exceptions from the input thread naturally bubble to the caller of (possibly after several results have been produced, but eventually). If a separate thread is caching from the input iterator, we'd need to marshal the exception from that thread back to the thread running so it's visible to the caller, and providing a traceback that is both accurate and useful is not obvious.
Date User Action Args
2018-07-25 19:48:43josh.rsetrecipients: + josh.r, bquinlan, ezio.melotti, max, Klamann
2018-07-25 19:48:42josh.rsetmessageid: <>
2018-07-25 19:48:42josh.rlinkissue29842 messages
2018-07-25 19:48:42josh.rcreate