Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement __length_hint__() on map() and filter() to optimize list(map) and list(filter) #71015

Closed
vstinner opened this issue Apr 22, 2016 · 16 comments
Assignees
Labels
performance Performance or resource usage

Comments

@vstinner
Copy link
Member

BPO 26828
Nosy @rhettinger, @terryjreedy, @vstinner, @alex, @serhiy-storchaka, @MSeifert04, @nmusolino
PRs
  • bpo-26828: __length_hint__ method for map and zip #1077
  • bpo-26828: Add __length_hint__() to builtin map iterator #14432
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2017-04-26.15:11:22.749>
    created_at = <Date 2016-04-22.12:12:40.220>
    labels = ['performance']
    title = 'Implement __length_hint__() on map() and filter() to optimize list(map) and list(filter)'
    updated_at = <Date 2019-06-27.23:33:52.101>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-06-27.23:33:52.101>
    actor = 'vstinner'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2017-04-26.15:11:22.749>
    closer = 'rhettinger'
    components = []
    creation = <Date 2016-04-22.12:12:40.220>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 26828
    keywords = []
    message_count = 16.0
    messages = ['264007', '264008', '264010', '264040', '264052', '264053', '291461', '291768', '292287', '292288', '292289', '292311', '292357', '292358', '346779', '346788']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'terry.reedy', 'vstinner', 'alex', 'serhiy.storchaka', 'MSeifert', 'nmusolino']
    pr_nums = ['1077', '14432']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26828'
    versions = ['Python 3.6']

    @vstinner
    Copy link
    Member Author

    When I compared the performance of filter() and map() between Python 2.7 and 3.4, I noticed a huge performance drop in Python 3!
    http://bugs.python.org/issue26814#msg264003

    I didn't analyze yet exactly why Python 3 is so much slower (almost 100% slower for the case of fitler!). Maybe it's because filter() returns a list on Python 2, whereas filter() returns an iterator on Python 3.

    In Python 2, filter() and map() use _PyObject_LengthHint(seq, 8) to create the result list. Why don't we do the same in Python 3?

    filter.__length_hint__() and map.__length_hint__() would return seq.__length_hint__() of the input sequence, or return 8. It's a weak estimation, but it can help a lot of reduce the number of realloc() when the list is slowly growing.

    See also the PEP-424 -- A method for exposing a length hint.

    Note: the issue bpo-26814 (fastcall) does make filter() and map() faster on Python 3.6 compared to Python 2.7, but it's not directly related to this issue. IMHO using length hint would make list(filter) and list(map) even faster.

    @vstinner vstinner added the performance Performance or resource usage label Apr 22, 2016
    @serhiy-storchaka
    Copy link
    Member

    See also bpo-14126.

    It makes sense to implement map.__length_hint__() and zip.__length_hint__(). But note that map() and zip() take several iterables, and we should call __length_hint__() for every of them (unless found a one with not implemented __length_hint__()). This can slow down the execution for short sequences.

    It is impossible to implement reasonable filter.__length_hint__(), because the length of resulting sequence can be from 0 to the length of the input sequence, and returning the maximal value would be not correct.

    @vstinner
    Copy link
    Member Author

    But note that map() and zip() take several iterables, and we should call __length_hint__() for every of them (unless found a one with not implemented __length_hint__()). This can slow down the execution for short sequences.

    Oh, there is no slot for __length_hint__(). Maybe we should also try to add a new slot for it?

    Maybe we can use a fast-path for the most common cases like list(map(func, list))?

    It is impossible to implement reasonable filter.__length_hint__(), because the length of resulting sequence can be from 0 to the length of the input sequence, and returning the maximal value would be not correct.

    What I see is that Python 2 is much faster and Python 2 reallocates N items if the input sequence contains N items. But yeah, we have to benchmark such change on Python 3 ;-)

    @terryjreedy
    Copy link
    Member

    Victor, are you suggesting the following? "If map has a single iterable, call and return as hint, otherwise give up." Or something else?

    @vstinner
    Copy link
    Member Author

    I checked Python 2 for map(): it gets the length hint of all iterables and
    use the maximum, with a default of 8. I'm not sure of what you mean by
    errors, the function to get the length hint already catchs and ignores
    errors no?

    @serhiy-storchaka
    Copy link
    Member

    map and zip should use the minimum, zip_longest should use the maximum, and filter shouldn't have __length_hint__().

    @rhettinger rhettinger self-assigned this Apr 11, 2017
    @rhettinger
    Copy link
    Contributor

    Length hints were intentionally omitted from the Python 3 map() and filter() which used to be in the itertools module. I explored that notion of iterator length transparency years ago. While I don't remember all the details, I did record some notes at the top of Lib/test/test_iterlen.py.

    BTW, I also don't think the list(map(...)) or list(filter(...)) case is worth optimizing. For big inputs, manifesting the whole input into a list is usually (but not always) the wrong thing to do if you care about performance (throwing away the iterator's superb L1 and L2 cache performance and throwing away the memory efficiency).

    @rhettinger
    Copy link
    Contributor

    [STINNER Victor]

    Oh, there is no slot for __length_hint__().
    Maybe we should also try to add a new slot for it?

    +1

    @MSeifert04
    Copy link
    Mannequin

    MSeifert04 mannequin commented Apr 25, 2017

    I explored that notion of iterator length transparency years ago. While I don't remember all the details, I did record some notes at the top of Lib/test/test_iterlen.py.

    But isn't that the point of the length_hint? To provide an *estimate* for the length? So generally I would expect the length_hint to be accurate (at least accurate enough to avoid reallocations) for well-behaved iterators. And I think that's possible for "zip" and "map" (but also for several itertools: product, permutations, combinations, islice, accumulate, starmap, zip_longest). At least in theory.

    Also it would allow to prohibit infinite iterators to be passed to "length_hint"-aware functions. For example list(count()) could raise an exception when count.length_hint would return math.inf or sys.maxsize + 1.

    The pull request was just because I was curious how it performed and wanted to share the code. Feel free to reject it. The performance improvement wasn't amazing in the micro-benchmarks.

    @rhettinger
    Copy link
    Contributor

    But isn't that the point of the length_hint?

    Where it could be used reliably, it was used and used broadly. However, there are circumstances (iterator vs iterable) where it can't know how to make an estimate and is perhaps wildly off.

    I would like to mark this tracker item as closed. IMO it is a dead-end.

    Like you, I share enthusiasm for length_hint and its potential (that I why I created the concept many years ago and worked very hard to apply it everywhere it would fit). But trust me, it doesn't fit everywhere. I left it out of imap() for a reason (it just didn't make sense).

    @rhettinger
    Copy link
    Contributor

    Also, the ifilter() suggestion is a complete non-starter. There is no way to know in advance whether filter will return all the elements of the input or none of them. In the absence of other knowledge, the best strategy is what list.append() already does (a strategy of mild over-allocations, not exceeding 12.5% for large lists).

    @serhiy-storchaka
    Copy link
    Member

    zip.__length_hint__() must return NotImplemented or raise TypeError if any of iterators don't implement __length_hint__ or its __length_hint__() returns NotImplemented or raises TypeError.

    And what should return zip(range(3), range(2**1000)).__length_hint__()? I expect 3, not OverflowError.

    @MSeifert04
    Copy link
    Mannequin

    MSeifert04 mannequin commented Apr 26, 2017

    I would like to mark this tracker item as closed. IMO it is a dead-end.

    Yes, even some (not very uncommon cases) give incorrect results:

    it = iter([1,2,3])
    zip(it, it, it)

    has length_hint 3 but will only yield one item.

    @MSeifert04
    Copy link
    Mannequin

    MSeifert04 mannequin commented Apr 26, 2017

    zip.__length_hint__() must return NotImplemented or raise TypeError if any of iterators don't implement __length_hint__ or its __length_hint__() returns NotImplemented or raises TypeError.

    And what should return zip(range(3), range(2**1000)).__length_hint__()? I expect 3, not OverflowError.

    That's actually non-trivial because PyObject_LengthHint just returns a Py_ssize_t. To recover NotImplemented will be complicated and there's no way to discriminate if the OverflowError happened in PyObject_LengthHint or in the called __length_hint__.

    But TypeError is correctly re-raised in the tests I made.

    @nmusolino
    Copy link
    Mannequin

    nmusolino mannequin commented Jun 27, 2019

    Before seeing this issue and its closed status, I created PR 14432, which adds __length_hint__() to the iterator returned by builtin map.

    This PR differs from the original 2017 PR by MSeifert in that the code can distinguish between the cases where a length hint (or length) function is not available, versus a case in which the __length_hint__() method of the user-supplied iterable raises an exception.

    That is, the new PR propagates exceptions (other than TypeError, per PEP-424) raised in the __length_hint__() functions of the user-supplied iterators.

    I've read the comments in this issue, and the notes in test_iterlen.py:

    [Other] iterators [...], such as enumerate and the other itertools,
    are not length transparent because they have no way to distinguish between
    iterables that report static length and iterators whose length changes with
    each call (i.e. the difference between enumerate('abc') and
    enumerate(iter('abc')).

    Can anyone provide a concrete example that illustrates the inherent difficulties of computing a length hint for map?

    In ad-hoc testing, the current PR handles situations listed there, and I haven't come across some fundamental show-stopper problem.

    There are two limitations to the current PR:

    1. The case where a single iterator is supplied as multiple arguments to map, as pointed out by MSeifert:
      it = iter([1,2,3,4])
      map_it = map(f, it, it)
    2. When a user-supplied __length_hint__() method returns a non-integer, it results in a TypeError in an inner evaluation of PyObject_LengthHint() (per PEP-424). When this exception reaches an outer evaluation of PyObject_LengthHint(), it is cleared (per PEP-424), and leads to operator.length_hint's default value being returned.

    If we consider issue 2 to be incorrect, I think I could fix it by raising RuntimeError from the original TypeError, or by somewhat invasive refactoring of PyObject_LengthHint() (which I do not recommend).

    @vstinner
    Copy link
    Member Author

    The main issue with using length hint is that calling a method is Python is not free. Calling the method may add more overhead than the speedup provided by smarter memory allocations.

    The worst case for this optimization should be measured on very small data set, of less than 10 items. We don't want to make Python smaller for operations on 5 items for example.

    The idea which remains open is the idea of adding a "slot" for length hint in PyTypeObject. But I would suggest to open a separated issue to explore this idea.

    Slots allows to use C functions at raw speed and so reduce the cost of getting the length hint.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants