This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Implement __length_hint__() on map() and filter() to optimize list(map) and list(filter)
Type: performance Stage: resolved
Components: Versions: Python 3.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: MSeifert, alex, nmusolino, rhettinger, serhiy.storchaka, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2016-04-22 12:12 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1077 closed MSeifert, 2017-04-11 04:19
PR 14432 open nmusolino, 2019-06-27 21:14
Messages (16)
msg264007 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 12:12
When I compared the performance of filter() and map() between Python 2.7 and 3.4, I noticed a huge performance drop in Python 3!
http://bugs.python.org/issue26814#msg264003

I didn't analyze yet exactly why Python 3 is so much slower (almost 100% slower for the case of fitler!). Maybe it's because filter() returns a list on Python 2, whereas filter() returns an iterator on Python 3.

In Python 2, filter() and map() use _PyObject_LengthHint(seq, 8) to create the result list. Why don't we do the same in Python 3?

filter.__length_hint__() and map.__length_hint__() would return seq.__length_hint__() of the input sequence, or return 8. It's a weak estimation, but it can help a lot of reduce the number of realloc() when the list is slowly growing.

See also the PEP 424 -- A method for exposing a length hint.

Note: the issue #26814 (fastcall) does make filter() and map() faster on Python 3.6 compared to Python 2.7, but it's not directly related to this issue. IMHO using length hint would make list(filter) and list(map) even faster.
msg264008 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-22 12:49
See also issue14126.

It makes sense to implement map.__length_hint__() and  zip.__length_hint__(). But note that map() and zip() take several iterables, and we should call __length_hint__() for every of them (unless found a one with not implemented __length_hint__()). This can slow down the execution for short sequences.

It is impossible to implement reasonable filter.__length_hint__(), because the length of resulting sequence can be from 0 to the length of the input sequence, and returning the maximal value would be not correct.
msg264010 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 12:57
> But note that map() and zip() take several iterables, and we should call __length_hint__() for every of them (unless found a one with not implemented __length_hint__()). This can slow down the execution for short sequences.

Oh, there is no slot for __length_hint__(). Maybe we should also try to add a new slot for it?

Maybe we can use a fast-path for the most common cases like list(map(func, list))?


> It is impossible to implement reasonable filter.__length_hint__(), because the length of resulting sequence can be from 0 to the length of the input sequence, and returning the maximal value would be not correct.

What I see is that Python 2 is much faster and Python 2 reallocates N items if the input sequence contains N items. But yeah, we have to benchmark such change on Python 3 ;-)
msg264040 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-04-23 02:54
Victor, are you suggesting the following?  "If map has a single iterable, call and return as hint, otherwise give up."  Or something else?
msg264052 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-23 07:18
I checked Python 2 for map(): it gets the length hint of all iterables and
use the maximum, with a default of 8. I'm not sure of what you mean by
errors, the function to get the length hint already catchs and ignores
errors no?
msg264053 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-23 07:40
map and zip should use the minimum, zip_longest should use the maximum, and filter shouldn't have __length_hint__().
msg291461 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-11 04:45
Length hints were intentionally omitted from the Python 3 map() and filter() which used to be in the itertools module.   I explored that notion of iterator length transparency years ago.  While I don't remember all the details, I did record some notes at the top of Lib/test/test_iterlen.py.

BTW, I also don't think the list(map(...)) or list(filter(...)) case is worth optimizing.  For big inputs, manifesting the whole input into a list is usually (but not always) the wrong thing to do if you care about performance (throwing away the iterator's superb L1 and L2 cache performance and throwing away the memory efficiency).
msg291768 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-16 22:16
[STINNER Victor]
> Oh, there is no slot for __length_hint__(). 
> Maybe we should also try to add a new slot for it?

+1
msg292287 - (view) Author: Michael Seifert (MSeifert) * Date: 2017-04-25 23:31
> I explored that notion of iterator length transparency years ago.  While I don't remember all the details, I did record some notes at the top of Lib/test/test_iterlen.py.

But isn't that the point of the length_hint? To provide an *estimate* for the length? So generally I would expect the length_hint to be accurate (at least accurate enough to avoid reallocations) for well-behaved iterators. And I think that's possible for "zip" and "map" (but also for several itertools: product, permutations, combinations, islice, accumulate, starmap, zip_longest). At least in theory.

Also it would allow to prohibit infinite iterators to be passed to "length_hint"-aware functions. For example `list(count())` could raise an exception when `count.length_hint` would return math.inf or sys.maxsize + 1.

The pull request was just because I was curious how it performed and wanted to share the code. Feel free to reject it. The performance improvement wasn't amazing in the micro-benchmarks.
msg292288 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-26 00:39
> But isn't that the point of the length_hint?

Where it could be used reliably, it was used and used broadly.  However, there are circumstances (iterator vs iterable) where it can't know how to make an estimate and is perhaps wildly off.

I would like to mark this tracker item as closed.  IMO it is a dead-end. 

Like you, I share enthusiasm for length_hint and its potential (that I why I created the concept many years ago and worked very hard to apply it everywhere it would fit).  But trust me, it doesn't fit everywhere.  I left it out of imap() for a reason (it just didn't make sense).
msg292289 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-26 00:41
Also, the ifilter() suggestion is a complete non-starter.  There is no way to know in advance whether filter will return all the elements of the input or none of them.  In the absence of other knowledge, the best strategy is what list.append() already does (a strategy of mild over-allocations, not exceeding 12.5% for large lists).
msg292311 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-26 07:06
zip.__length_hint__() must return NotImplemented or raise TypeError if any of iterators don't implement __length_hint__ or its __length_hint__() returns NotImplemented or raises TypeError.

And what should return zip(range(3), range(2**1000)).__length_hint__()? I expect 3, not OverflowError.
msg292357 - (view) Author: Michael Seifert (MSeifert) * Date: 2017-04-26 14:50
> I would like to mark this tracker item as closed.  IMO it is a dead-end. 

Yes, even some (not very uncommon cases) give incorrect results:

it = iter([1,2,3])
zip(it, it, it)

has length_hint 3 but will only yield one item.
msg292358 - (view) Author: Michael Seifert (MSeifert) * Date: 2017-04-26 14:54
> zip.__length_hint__() must return NotImplemented or raise TypeError if any of iterators don't implement __length_hint__ or its __length_hint__() returns NotImplemented or raises TypeError.

> And what should return zip(range(3), range(2**1000)).__length_hint__()? I expect 3, not OverflowError.

That's actually non-trivial because PyObject_LengthHint just returns a Py_ssize_t. To recover NotImplemented will be complicated and there's no way to discriminate if the OverflowError happened in PyObject_LengthHint or in the called __length_hint__. 

But TypeError is correctly re-raised in the tests I made.
msg346779 - (view) Author: Nicholas Musolino (nmusolino) * Date: 2019-06-27 21:55
Before seeing this issue and its closed status, I created PR 14432, which adds `__length_hint__()` to the iterator returned by builtin `map`.

This PR differs from the original 2017 PR by MSeifert in that the code can distinguish between the cases where a length hint (or length) function is not available, versus a case in which the `__length_hint__()` method of the user-supplied iterable raises an exception.

That is, the new PR propagates exceptions (other than TypeError, per PEP 424) raised in the `__length_hint__()` functions of the user-supplied iterators.

I've read the comments in this issue, and the notes in `test_iterlen.py`:

> [Other] iterators [...], such as enumerate and the other itertools,
> are not length transparent because they have no way to distinguish between
> iterables that report static length and iterators whose length changes with
> each call (i.e. the difference between enumerate('abc') and
> enumerate(iter('abc')).

Can anyone provide a concrete example that illustrates the inherent difficulties of computing a length hint for `map`?  

In ad-hoc testing, the current PR handles situations listed there, and I haven't come across some fundamental show-stopper problem.

There are two limitations to the current PR:

1.  The case where a single iterator is supplied as multiple arguments to `map`, as pointed out by MSeifert:  
    it = iter([1,2,3,4])
    map_it = map(f, it, it)  
2.  When a user-supplied `__length_hint__()` method returns a non-integer, it results in a TypeError in an *inner* evaluation of `PyObject_LengthHint()` (per PEP 424).  When this exception reaches an *outer* evaluation of `PyObject_LengthHint()`, it is cleared (per PEP 424), and leads to operator.length_hint's default value being returned.

If we consider issue 2 to be incorrect,  I think I could fix it by raising RuntimeError from the original TypeError, or by somewhat invasive refactoring of `PyObject_LengthHint()` (which I do not recommend).
msg346788 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-27 23:33
The main issue with using length hint is that calling a method is Python is not free. Calling the method may add more overhead than the speedup provided by smarter memory allocations.

The worst case for this optimization should be measured on very small data set, of less than 10 items. We don't want to make Python smaller for operations on 5 items for example.

The idea which remains open is the idea of adding a "slot" for length hint in PyTypeObject. But I would suggest to open a separated issue to explore this idea.

Slots allows to use C functions at raw speed and so reduce the cost of getting the length hint.
History
Date User Action Args
2022-04-11 14:58:30adminsetgithub: 71015
2019-06-27 23:33:52vstinnersetmessages: + msg346788
2019-06-27 21:55:09nmusolinosetnosy: + nmusolino
messages: + msg346779
2019-06-27 21:14:48nmusolinosetpull_requests: + pull_request14249
2017-04-26 15:11:22rhettingersetstatus: open -> closed
resolution: rejected
stage: resolved
2017-04-26 14:54:44MSeifertsetmessages: + msg292358
2017-04-26 14:50:55MSeifertsetmessages: + msg292357
2017-04-26 07:06:56serhiy.storchakasetmessages: + msg292311
2017-04-26 00:41:28rhettingersetmessages: + msg292289
2017-04-26 00:39:08rhettingersetmessages: + msg292288
2017-04-25 23:31:29MSeifertsetnosy: + MSeifert
messages: + msg292287
2017-04-16 22:16:16rhettingersetmessages: + msg291768
2017-04-11 04:45:13rhettingersetmessages: + msg291461
2017-04-11 04:33:01rhettingersetassignee: rhettinger
2017-04-11 04:19:25MSeifertsetpull_requests: + pull_request1220
2016-04-23 07:40:43serhiy.storchakasetmessages: + msg264053
2016-04-23 07:18:30vstinnersetmessages: + msg264052
2016-04-23 02:54:24terry.reedysetnosy: + terry.reedy
messages: + msg264040
2016-04-22 12:57:55vstinnersetmessages: + msg264010
2016-04-22 12:49:57serhiy.storchakasetnosy: + rhettinger
messages: + msg264008
2016-04-22 12:12:40vstinnercreate