Message 308221 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dlukes
Recipients	dlukes
Date	2017-12-13.17:15:53
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1513185353.85.0.213398074469.issue32306@psf.upfronthosting.co.za>
In-reply-to

Content
The docstring for `concurrent.futures.Executor.map` starts by stating that it is "Equivalent to map(func, iterables)". In the case of Python 3, I would argue this is true only superficially: with `map`, the user expects memory-efficient processing, i.e. that the entire resulting collection will not be held in memory at once unless s/he requests so e.g. with `list(map(...))`. (In Python 2, the expectations are different of course.) On the other hand, while `Executor.map` superficially returns a generator, which seems to align with this expectation, what happens behind the scenes is that the call blocks until all results are computed and only then starts yielding them. In other words, they have to be stored in memory all at once at some point. The lower-level multiprocessing module also describes `multiprocessing.pool.Pool.map` as "A parallel equivalent of the map() built-in function", but at least it immediately goes on to add that "It blocks until the result is ready.", which is a clear indication that all of the results will have to be stored somewhere before being yielded. I can think of several ways the situation could be improved, listed here from most conservative to most progressive: 1. Add "It blocks until the result is ready." to the docstring of `Executor.map` as well, preferably somewhere at the beginning. 2. Reword the docstrings of both `Executor.map` and `Pool.map` so that they don't describe the functions as "equivalent" to built-in `map`, which raises the wrong expectations. ("Similar to map(...), but blocks until all results are collected and only then yields them.") 3. I would argue that the function that can be described as semantically equivalent to `map` is actually `Pool.imap`, which yields results as they're being computed. It would be really nice if this could be added to the higher-level `futures` API, along with `Pool.imap_unordered`. `Executor.map` simply doesn't work for very long streams of data. 4. Maybe instead of adding `imap` and `imap_unordered` methods to `Executor`, it would be a good idea to change the signature of `Executor.map(func, iterables, timeout=None, chunksize=1)` to `Executor.map(func, *iterables, timeout=None, chunksize=1, block=True, ordered=True)`, in order to keep the API simple with good defaults while providing flexibility via keyword arguments. 5. I would go so far as to say that for me personally, the `block` argument to the version of `Executor.map` proposed in #4 above should be `False` by default, because that would make it behave most like built-in `map`, which is the least suprising behavior. But I've observed that for smaller work loads, `imap` tends to be slower than `map`, so I understand it might be a tradeoff between performance and semantics. Still, in a higher-level API meant for non-experts, I think semantics should be emphasized. If the latter options seem much too radical, please consider at least something along the lines of #1 above, I think it would help people correct their expectations when they first encounter the API :)

The docstring for `concurrent.futures.Executor.map` starts by stating that it is "Equivalent to map(func, *iterables)". In the case of Python 3, I would argue this is true only superficially: with `map`, the user expects memory-efficient processing, i.e. that the entire resulting collection will not be held in memory at once unless s/he requests so e.g. with `list(map(...))`. (In Python 2, the expectations are different of course.) On the other hand, while `Executor.map` superficially returns a generator, which seems to align with this expectation, what happens behind the scenes is that the call blocks until all results are computed and only then starts yielding them. In other words, they have to be stored in memory all at once at some point.

The lower-level multiprocessing module also describes `multiprocessing.pool.Pool.map` as "A parallel equivalent of the map() built-in function", but at least it immediately goes on to add that "It blocks until the result is ready.", which is a clear indication that all of the results will have to be stored somewhere before being yielded.

I can think of several ways the situation could be improved, listed here from most conservative to most progressive:

1. Add "It blocks until the result is ready." to the docstring of `Executor.map` as well, preferably somewhere at the beginning.
2. Reword the docstrings of both `Executor.map` and `Pool.map` so that they don't describe the functions as "equivalent" to built-in `map`, which raises the wrong expectations. ("Similar to map(...), but blocks until all results are collected and only then yields them.")
3. I would argue that the function that can be described as semantically equivalent to `map` is actually `Pool.imap`, which yields results as they're being computed. It would be really nice if this could be added to the higher-level `futures` API, along with `Pool.imap_unordered`. `Executor.map` simply doesn't work for very long streams of data.
4. Maybe instead of adding `imap` and `imap_unordered` methods to `Executor`, it would be a good idea to change the signature of `Executor.map(func, *iterables, timeout=None, chunksize=1)` to `Executor.map(func, *iterables, timeout=None, chunksize=1, block=True, ordered=True)`, in order to keep the API simple with good defaults while providing flexibility via keyword arguments.
5. I would go so far as to say that for me personally, the `block` argument to the version of `Executor.map` proposed in #4 above should be `False` by default, because that would make it behave most like built-in `map`, which is the least suprising behavior. But I've observed that for smaller work loads, `imap` tends to be slower than `map`, so I understand it might be a tradeoff between performance and semantics. Still, in a higher-level API meant for non-experts, I think semantics should be emphasized.

If the latter options seem much too radical, please consider at least something along the lines of #1 above, I think it would help people correct their expectations when they first encounter the API :)

History
Date	User	Action	Args
2017-12-13 17:15:53	dlukes	set	recipients: + dlukes
2017-12-13 17:15:53	dlukes	set	messageid: <1513185353.85.0.213398074469.issue32306@psf.upfronthosting.co.za>
2017-12-13 17:15:53	dlukes	link	issue32306 messages
2017-12-13 17:15:53	dlukes	create