Issue26828
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2016-04-22 12:12 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 1077 | closed | MSeifert, 2017-04-11 04:19 | |
PR 14432 | open | nmusolino, 2019-06-27 21:14 |
Messages (16) | |||
---|---|---|---|
msg264007 - (view) | Author: STINNER Victor (vstinner) * | Date: 2016-04-22 12:12 | |
When I compared the performance of filter() and map() between Python 2.7 and 3.4, I noticed a huge performance drop in Python 3! http://bugs.python.org/issue26814#msg264003 I didn't analyze yet exactly why Python 3 is so much slower (almost 100% slower for the case of fitler!). Maybe it's because filter() returns a list on Python 2, whereas filter() returns an iterator on Python 3. In Python 2, filter() and map() use _PyObject_LengthHint(seq, 8) to create the result list. Why don't we do the same in Python 3? filter.__length_hint__() and map.__length_hint__() would return seq.__length_hint__() of the input sequence, or return 8. It's a weak estimation, but it can help a lot of reduce the number of realloc() when the list is slowly growing. See also the PEP 424 -- A method for exposing a length hint. Note: the issue #26814 (fastcall) does make filter() and map() faster on Python 3.6 compared to Python 2.7, but it's not directly related to this issue. IMHO using length hint would make list(filter) and list(map) even faster. |
|||
msg264008 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2016-04-22 12:49 | |
See also issue14126. It makes sense to implement map.__length_hint__() and zip.__length_hint__(). But note that map() and zip() take several iterables, and we should call __length_hint__() for every of them (unless found a one with not implemented __length_hint__()). This can slow down the execution for short sequences. It is impossible to implement reasonable filter.__length_hint__(), because the length of resulting sequence can be from 0 to the length of the input sequence, and returning the maximal value would be not correct. |
|||
msg264010 - (view) | Author: STINNER Victor (vstinner) * | Date: 2016-04-22 12:57 | |
> But note that map() and zip() take several iterables, and we should call __length_hint__() for every of them (unless found a one with not implemented __length_hint__()). This can slow down the execution for short sequences. Oh, there is no slot for __length_hint__(). Maybe we should also try to add a new slot for it? Maybe we can use a fast-path for the most common cases like list(map(func, list))? > It is impossible to implement reasonable filter.__length_hint__(), because the length of resulting sequence can be from 0 to the length of the input sequence, and returning the maximal value would be not correct. What I see is that Python 2 is much faster and Python 2 reallocates N items if the input sequence contains N items. But yeah, we have to benchmark such change on Python 3 ;-) |
|||
msg264040 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2016-04-23 02:54 | |
Victor, are you suggesting the following? "If map has a single iterable, call and return as hint, otherwise give up." Or something else? |
|||
msg264052 - (view) | Author: STINNER Victor (vstinner) * | Date: 2016-04-23 07:18 | |
I checked Python 2 for map(): it gets the length hint of all iterables and use the maximum, with a default of 8. I'm not sure of what you mean by errors, the function to get the length hint already catchs and ignores errors no? |
|||
msg264053 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2016-04-23 07:40 | |
map and zip should use the minimum, zip_longest should use the maximum, and filter shouldn't have __length_hint__(). |
|||
msg291461 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2017-04-11 04:45 | |
Length hints were intentionally omitted from the Python 3 map() and filter() which used to be in the itertools module. I explored that notion of iterator length transparency years ago. While I don't remember all the details, I did record some notes at the top of Lib/test/test_iterlen.py. BTW, I also don't think the list(map(...)) or list(filter(...)) case is worth optimizing. For big inputs, manifesting the whole input into a list is usually (but not always) the wrong thing to do if you care about performance (throwing away the iterator's superb L1 and L2 cache performance and throwing away the memory efficiency). |
|||
msg291768 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2017-04-16 22:16 | |
[STINNER Victor] > Oh, there is no slot for __length_hint__(). > Maybe we should also try to add a new slot for it? +1 |
|||
msg292287 - (view) | Author: Michael Seifert (MSeifert) * | Date: 2017-04-25 23:31 | |
> I explored that notion of iterator length transparency years ago. While I don't remember all the details, I did record some notes at the top of Lib/test/test_iterlen.py. But isn't that the point of the length_hint? To provide an *estimate* for the length? So generally I would expect the length_hint to be accurate (at least accurate enough to avoid reallocations) for well-behaved iterators. And I think that's possible for "zip" and "map" (but also for several itertools: product, permutations, combinations, islice, accumulate, starmap, zip_longest). At least in theory. Also it would allow to prohibit infinite iterators to be passed to "length_hint"-aware functions. For example `list(count())` could raise an exception when `count.length_hint` would return math.inf or sys.maxsize + 1. The pull request was just because I was curious how it performed and wanted to share the code. Feel free to reject it. The performance improvement wasn't amazing in the micro-benchmarks. |
|||
msg292288 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2017-04-26 00:39 | |
> But isn't that the point of the length_hint? Where it could be used reliably, it was used and used broadly. However, there are circumstances (iterator vs iterable) where it can't know how to make an estimate and is perhaps wildly off. I would like to mark this tracker item as closed. IMO it is a dead-end. Like you, I share enthusiasm for length_hint and its potential (that I why I created the concept many years ago and worked very hard to apply it everywhere it would fit). But trust me, it doesn't fit everywhere. I left it out of imap() for a reason (it just didn't make sense). |
|||
msg292289 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2017-04-26 00:41 | |
Also, the ifilter() suggestion is a complete non-starter. There is no way to know in advance whether filter will return all the elements of the input or none of them. In the absence of other knowledge, the best strategy is what list.append() already does (a strategy of mild over-allocations, not exceeding 12.5% for large lists). |
|||
msg292311 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2017-04-26 07:06 | |
zip.__length_hint__() must return NotImplemented or raise TypeError if any of iterators don't implement __length_hint__ or its __length_hint__() returns NotImplemented or raises TypeError. And what should return zip(range(3), range(2**1000)).__length_hint__()? I expect 3, not OverflowError. |
|||
msg292357 - (view) | Author: Michael Seifert (MSeifert) * | Date: 2017-04-26 14:50 | |
> I would like to mark this tracker item as closed. IMO it is a dead-end. Yes, even some (not very uncommon cases) give incorrect results: it = iter([1,2,3]) zip(it, it, it) has length_hint 3 but will only yield one item. |
|||
msg292358 - (view) | Author: Michael Seifert (MSeifert) * | Date: 2017-04-26 14:54 | |
> zip.__length_hint__() must return NotImplemented or raise TypeError if any of iterators don't implement __length_hint__ or its __length_hint__() returns NotImplemented or raises TypeError. > And what should return zip(range(3), range(2**1000)).__length_hint__()? I expect 3, not OverflowError. That's actually non-trivial because PyObject_LengthHint just returns a Py_ssize_t. To recover NotImplemented will be complicated and there's no way to discriminate if the OverflowError happened in PyObject_LengthHint or in the called __length_hint__. But TypeError is correctly re-raised in the tests I made. |
|||
msg346779 - (view) | Author: Nicholas Musolino (nmusolino) * | Date: 2019-06-27 21:55 | |
Before seeing this issue and its closed status, I created PR 14432, which adds `__length_hint__()` to the iterator returned by builtin `map`. This PR differs from the original 2017 PR by MSeifert in that the code can distinguish between the cases where a length hint (or length) function is not available, versus a case in which the `__length_hint__()` method of the user-supplied iterable raises an exception. That is, the new PR propagates exceptions (other than TypeError, per PEP 424) raised in the `__length_hint__()` functions of the user-supplied iterators. I've read the comments in this issue, and the notes in `test_iterlen.py`: > [Other] iterators [...], such as enumerate and the other itertools, > are not length transparent because they have no way to distinguish between > iterables that report static length and iterators whose length changes with > each call (i.e. the difference between enumerate('abc') and > enumerate(iter('abc')). Can anyone provide a concrete example that illustrates the inherent difficulties of computing a length hint for `map`? In ad-hoc testing, the current PR handles situations listed there, and I haven't come across some fundamental show-stopper problem. There are two limitations to the current PR: 1. The case where a single iterator is supplied as multiple arguments to `map`, as pointed out by MSeifert: it = iter([1,2,3,4]) map_it = map(f, it, it) 2. When a user-supplied `__length_hint__()` method returns a non-integer, it results in a TypeError in an *inner* evaluation of `PyObject_LengthHint()` (per PEP 424). When this exception reaches an *outer* evaluation of `PyObject_LengthHint()`, it is cleared (per PEP 424), and leads to operator.length_hint's default value being returned. If we consider issue 2 to be incorrect, I think I could fix it by raising RuntimeError from the original TypeError, or by somewhat invasive refactoring of `PyObject_LengthHint()` (which I do not recommend). |
|||
msg346788 - (view) | Author: STINNER Victor (vstinner) * | Date: 2019-06-27 23:33 | |
The main issue with using length hint is that calling a method is Python is not free. Calling the method may add more overhead than the speedup provided by smarter memory allocations. The worst case for this optimization should be measured on very small data set, of less than 10 items. We don't want to make Python smaller for operations on 5 items for example. The idea which remains open is the idea of adding a "slot" for length hint in PyTypeObject. But I would suggest to open a separated issue to explore this idea. Slots allows to use C functions at raw speed and so reduce the cost of getting the length hint. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:30 | admin | set | github: 71015 |
2019-06-27 23:33:52 | vstinner | set | messages: + msg346788 |
2019-06-27 21:55:09 | nmusolino | set | nosy:
+ nmusolino messages: + msg346779 |
2019-06-27 21:14:48 | nmusolino | set | pull_requests: + pull_request14249 |
2017-04-26 15:11:22 | rhettinger | set | status: open -> closed resolution: rejected stage: resolved |
2017-04-26 14:54:44 | MSeifert | set | messages: + msg292358 |
2017-04-26 14:50:55 | MSeifert | set | messages: + msg292357 |
2017-04-26 07:06:56 | serhiy.storchaka | set | messages: + msg292311 |
2017-04-26 00:41:28 | rhettinger | set | messages: + msg292289 |
2017-04-26 00:39:08 | rhettinger | set | messages: + msg292288 |
2017-04-25 23:31:29 | MSeifert | set | nosy:
+ MSeifert messages: + msg292287 |
2017-04-16 22:16:16 | rhettinger | set | messages: + msg291768 |
2017-04-11 04:45:13 | rhettinger | set | messages: + msg291461 |
2017-04-11 04:33:01 | rhettinger | set | assignee: rhettinger |
2017-04-11 04:19:25 | MSeifert | set | pull_requests: + pull_request1220 |
2016-04-23 07:40:43 | serhiy.storchaka | set | messages: + msg264053 |
2016-04-23 07:18:30 | vstinner | set | messages: + msg264052 |
2016-04-23 02:54:24 | terry.reedy | set | nosy:
+ terry.reedy messages: + msg264040 |
2016-04-22 12:57:55 | vstinner | set | messages: + msg264010 |
2016-04-22 12:49:57 | serhiy.storchaka | set | nosy:
+ rhettinger messages: + msg264008 |
2016-04-22 12:12:40 | vstinner | create |