classification
Title: Clarify map API in concurrent.futures
Type: behavior Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: David Lukeš, docs@python, pitrou
Priority: normal Keywords: patch

Created on 2017-12-13 17:15 by David Lukeš, last changed 2017-12-21 08:44 by David Lukeš. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 4947 merged pitrou, 2017-12-20 17:53
PR 4948 merged python-dev, 2017-12-20 18:06
Messages (9)
msg308221 - (view) Author: David Lukeš (David Lukeš) Date: 2017-12-13 17:15
The docstring for `concurrent.futures.Executor.map` starts by stating that it is "Equivalent to map(func, *iterables)". In the case of Python 3, I would argue this is true only superficially: with `map`, the user expects memory-efficient processing, i.e. that the entire resulting collection will not be held in memory at once unless s/he requests so e.g. with `list(map(...))`. (In Python 2, the expectations are different of course.) On the other hand, while `Executor.map` superficially returns a generator, which seems to align with this expectation, what happens behind the scenes is that the call blocks until all results are computed and only then starts yielding them. In other words, they have to be stored in memory all at once at some point.

The lower-level multiprocessing module also describes `multiprocessing.pool.Pool.map` as "A parallel equivalent of the map() built-in function", but at least it immediately goes on to add that "It blocks until the result is ready.", which is a clear indication that all of the results will have to be stored somewhere before being yielded.

I can think of several ways the situation could be improved, listed here from most conservative to most progressive:

1. Add "It blocks until the result is ready." to the docstring of `Executor.map` as well, preferably somewhere at the beginning.
2. Reword the docstrings of both `Executor.map` and `Pool.map` so that they don't describe the functions as "equivalent" to built-in `map`, which raises the wrong expectations. ("Similar to map(...), but blocks until all results are collected and only then yields them.")
3. I would argue that the function that can be described as semantically equivalent to `map` is actually `Pool.imap`, which yields results as they're being computed. It would be really nice if this could be added to the higher-level `futures` API, along with `Pool.imap_unordered`. `Executor.map` simply doesn't work for very long streams of data.
4. Maybe instead of adding `imap` and `imap_unordered` methods to `Executor`, it would be a good idea to change the signature of `Executor.map(func, *iterables, timeout=None, chunksize=1)` to `Executor.map(func, *iterables, timeout=None, chunksize=1, block=True, ordered=True)`, in order to keep the API simple with good defaults while providing flexibility via keyword arguments.
5. I would go so far as to say that for me personally, the `block` argument to the version of `Executor.map` proposed in #4 above should be `False` by default, because that would make it behave most like built-in `map`, which is the least suprising behavior. But I've observed that for smaller work loads, `imap` tends to be slower than `map`, so I understand it might be a tradeoff between performance and semantics. Still, in a higher-level API meant for non-experts, I think semantics should be emphasized.

If the latter options seem much too radical, please consider at least something along the lines of #1 above, I think it would help people correct their expectations when they first encounter the API :)
msg308646 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-19 14:37
Hi David,

> what happens behind the scenes is that the call blocks until all results are computed and only then starts yielding them.

The current implementation of the Executor.map() generator is:

        def result_iterator():
            try:
                # reverse to keep finishing order
                fs.reverse()
                while fs:
                    # Careful not to keep a reference to the popped future
                    if timeout is None:
                        yield fs.pop().result()
                    else:
                        yield fs.pop().result(end_time - time.time())
            finally:
                for future in fs:
                    future.cancel()


So it seems to me that results are yielded as soon as they arrive (provided they arrive in the right order).
msg308720 - (view) Author: David Lukeš (David Lukeš) Date: 2017-12-20 11:21
Hi Antoine,

Thanks for the response! :) I think the problem lies in the line immediately preceding the code you've posted:

```
fs = [self.submit(fn, *args) for args in zip(*iterables)]
```

In other words, all the jobs are first submitted and their futures stored in a list, which is then iterated over. This approach obviously breaks down when there is a great number of jobs, or when it's part of a pipeline meant for processing jobs continuously as they come.
msg308726 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 11:36
I see.  So the problem you are pointing out is that the tasks *arguments* are consumed eagerly.  I agree that may be a problem in some cases, though I think in most cases people are concerned with the *results*.

(note that multiprocessing.Pool() has an imap() method which does what you would like)
msg308728 - (view) Author: David Lukeš (David Lukeš) Date: 2017-12-20 12:06
Yes, sorry for not being quite clear the first time around :)

I eventually found out about Pool.imap (see item 3 on list in OP) and indeed it fits my use case very nicely, but my point was that the documentation is somewhat misleading with respect to the semantics of built-in `map()` in Python 3.

Specifically, I would argue that it is unexpected for a function which claims to be "Equivalent to map(func, *iterables)" to require allocating a list the length of the shortest iterable.

Maybe a code example will make this clearer for potential newcomers to the discussion -- this is what I would expect to happen (= the behavior of built-in `map()` itself), yielding values from the iterable is interleaved with calls to the mapped function:

```
>>> def gen():
...     for i in range(3):
...         print("yielding", i)
...         yield i
...         
>>> def add1(i):
...     print("adding 1 to", i)
...     return i + 1
... 
>>> list(map(add1, gen()))
yielding 0
adding 1 to 0
yielding 1
adding 1 to 1
yielding 2
adding 1 to 2
[1, 2, 3]
```

This is what happens instead with `concurrent.futures.Executor.map()`:

```
>>> def my_map(fn, iterable):
...     lst = list(iterable)
...     for i in lst:
...         yield fn(i)
...         
>>> list(my_map(add1, gen()))
yielding 0
yielding 1
yielding 2
adding 1 to 0
adding 1 to 1
adding 1 to 2
[1, 2, 3]
```
msg308760 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 18:06
New changeset a7a751dd7b08a5bb6cb399c1b2a6ca7b24aba51d by Antoine Pitrou in branch 'master':
bpo-32306: Clarify c.f.Executor.map() documentation (#4947)
https://github.com/python/cpython/commit/a7a751dd7b08a5bb6cb399c1b2a6ca7b24aba51d
msg308762 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 18:19
New changeset 4aa84e728565a15a82727b9b971126e355f47e9d by Antoine Pitrou (Miss Islington (bot)) in branch '3.6':
bpo-32306: Clarify c.f.Executor.map() documentation (GH-4947) (#4948)
https://github.com/python/cpython/commit/4aa84e728565a15a82727b9b971126e355f47e9d
msg308763 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-20 18:19
I think the documentation is now clearer.  Closing!
msg308858 - (view) Author: David Lukeš (David Lukeš) Date: 2017-12-21 08:44
Perfect, thanks!
History
Date User Action Args
2017-12-21 08:44:55David Lukešsetmessages: + msg308858
2017-12-20 18:19:42pitrousetstatus: open -> closed
resolution: fixed
messages: + msg308763

stage: patch review -> resolved
2017-12-20 18:19:20pitrousetmessages: + msg308762
2017-12-20 18:06:39python-devsetpull_requests: + pull_request4840
2017-12-20 18:06:23pitrousetmessages: + msg308760
2017-12-20 17:53:22pitrousetkeywords: + patch
stage: patch review
pull_requests: + pull_request4839
2017-12-20 12:06:40David Lukešsetmessages: + msg308728
2017-12-20 11:36:38pitrousetmessages: + msg308726
2017-12-20 11:21:24David Lukešsetmessages: + msg308720
2017-12-19 14:37:22pitrousetassignee: docs@python
type: enhancement -> behavior
components: + Documentation
versions: + Python 3.6, Python 3.7, - Python 3.8
nosy: + docs@python, pitrou

messages: + msg308646
2017-12-13 17:15:53David Lukešcreate