classification
Title: More efficient nlargest()/nsmallest()
Type: performance
Components: Library (Lib)
Versions: Python 3.3

process
Status: closed
Resolution: rejected
Assigned To: rhettinger
Nosy List: benjamin.peterson, mark.dickinson, newacct, rhettinger
Priority: normal

Created on 2011-02-11 01:15 by newacct, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name: heapq_benchmark.py
Uploaded by: rhettinger, 2011-02-12 08:28
Description: Benchmarking code for 4 variants.
Messages (6)
msg128357 - Author: (newacct) Date: 2011-02-11 01:15
heapq.nlargest()/nsmallest() currently use a size-k heap to get the k largest/smallest items. This takes O(n log k): heapifying the first k elements is O(k); each of the remaining (n - k) elements costs at most one O(log k) insertion/deletion, for O((n - k) log k); and sorting the k survivors at the end is O(k log k).
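For reference, that size-k heap strategy looks roughly like the following sketch (an illustration only, not the actual Lib/heapq.py implementation; it assumes numeric data and 0 < k <= n, simulating a max-heap by negating values):

import heapq
from itertools import islice

def nsmallest_sketch(k, iterable):
    it = iter(iterable)
    # Max-heap of the k smallest items seen so far, simulated by
    # storing negated values in Python's min-heap.
    heap = [-x for x in islice(it, k)]
    heapq.heapify(heap)                  # O(k)
    for x in it:                         # the remaining n - k items
        if -x > heap[0]:                 # x beats the current k-th smallest
            heapq.heapreplace(heap, -x)  # O(log k)
    return sorted(-x for x in heap)      # O(k log k)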

There are several more efficient ways to do this:

1) O(n + k log n): Put the entire list into a heap at once; heapifying it takes O(n). Then take the k largest/smallest out of the heap in sorted order, each extraction costing O(log n), for a total of O(n + k log n), which is much better than O(n log k) when k is small and n is big. Unfortunately, this takes O(n) memory (see the first sketch after this list).

2) O(n + k log k): Use the linear-time O(n) selection algorithm to find the k-th largest/smallest element (http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm). Then partition the elements, a la Quicksort, into those smaller and larger than it; this is also O(n), and the partitioning could even happen as part of the selection algorithm. At the end, sort the k selected elements for O(k log k). If we drop the requirement that the result be sorted, the whole algorithm is just O(n) (see the second sketch after this list).
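A sketch of the first proposal, assuming 0 < k <= len(data) (the function names here mirror the heapifying_smallest and selecting_smallest variants in the benchmark attached later in this thread, though the attachment itself is not reproduced here):

import heapq

def heapifying_smallest(k, data):
    heap = list(data)    # O(n) extra memory
    heapq.heapify(heap)  # O(n)
    # k pops of O(log n) each; the results come out already sorted.
    return [heapq.heappop(heap) for _ in range(k)]

And a sketch of the second proposal. This version uses random pivots, so its partitioning phase is O(n) only in expectation rather than the guaranteed O(n) of median-of-medians, and it partitions a copy of the input:

import random

def selecting_smallest(k, data):
    a = list(data)
    lo, hi = 0, len(a)  # the boundary of the k smallest lies inside a[lo:hi]
    while hi - lo > 1:
        pivot = a[random.randrange(lo, hi)]
        less = [x for x in a[lo:hi] if x < pivot]
        equal = [x for x in a[lo:hi] if x == pivot]
        greater = [x for x in a[lo:hi] if x > pivot]
        a[lo:hi] = less + equal + greater
        if k <= lo + len(less):
            hi = lo + len(less)           # boundary is inside 'less'
        elif k <= lo + len(less) + len(equal):
            break                         # a[:k] already holds the k smallest
        else:
            lo += len(less) + len(equal)  # boundary is inside 'greater'
    return sorted(a[:k])  # final O(k log k) sort; skip it for O(n) total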
msg128358 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2011-02-11 01:50
I think you should implement and benchmark one of them. I'm dubious that the better asymptotic values will really translate into better performance. For example, the O(n) median selection algorithm has a large constant factor associated with it.
msg128368 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2011-02-11 08:51
The current routines are designed to consume only O(k) memory, and that is the way it should remain.  Manifesting the entire input iterable into memory is unnecessary and is not cache-friendly.

Also, I question your assertions about running time.  In the worst case, where the entire input is reverse sorted, it does take O(log k) for each insertion.  For other cases, like random orderings, it takes much less, because many of the inputs only need to be compared to the very top element of the heap.  In practice, for small k and large n, running time close to O(n) is not uncommon.

-----------------
from random import shuffle
from heapq import nsmallest

class Int(int):
    # int subclass that counts comparisons via the global 'cmps'.
    def __lt__(self, other):
        global cmps
        cmps += 1
        return int.__lt__(self, other)

n = 10000
k = 16
data = list(map(Int, range(n)))

for i in range(10):
    shuffle(data)
    cmps = 0                     # reset the counter for this run
    result = nsmallest(k, data)
    print(cmps, result)          # ~1.05 * n comparisons, far below n log k

-- test run ------------------

10485 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10560 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10584 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10518 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10582 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10650 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10409 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10577 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10499 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10603 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
msg128378 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2011-02-11 12:38
[Raymond]
> Also, I question your assertions about running time. [...]

Interesting observation.  I wonder what the average number of comparisons actually is, for random inputs.  A quick back-of-the-envelope calculation suggests that O(max(k*log(k)*log(n), n)) might be a tighter upper bound.
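
That estimate is easy to sanity-check empirically. For a uniformly random permutation, the i-th input (1-indexed, i > k) ranks among the k smallest seen so far with probability k/i, and only those inputs trigger an O(log k) heap update; every other input costs a single comparison against the heap's top. A back-of-the-envelope calculation (illustrative code, not part of the attached benchmark):

from math import log

n, k = 10000, 16
# Expected number of heap updates: sum of k/i for i = k+1 .. n,
# i.e. k * (H(n) - H(k)) ~= k * ln(n/k).
expected_updates = sum(k / i for i in range(k + 1, n + 1))
print(expected_updates)  # ~102.5
print(k * log(n / k))    # ~103.0, the closed-form approximation

With only ~100 updates at a few comparisons each, the ~10000 top-of-heap comparisons dominate, which matches the ~10500 totals in the test run above.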

Anyway, I agree with Benjamin that you (newacct) would need to implement the algorithm and demonstrate an obvious improvement.  Even then, it would be hard to swallow changing the algorithms to require O(n) space.
msg128430 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2011-02-11 20:59
FWIW, I've attached benchmarking code for all three approaches (the current approach and the two proposals).
msg128442 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2011-02-12 08:28
Updated the benchmarking code to include a 4th variant that accumulates sorted sublists during the partitioning phase.
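
The attachment itself is not reproduced here, but purely as an illustration, a partitioning variant that accumulates sorted sublists might look something like the following (a hypothetical sketch; the actual heapq_benchmark.py code may differ):

import random

def partitioning_smallest(k, data):
    result, a = [], list(data)
    while k:
        if len(a) <= k:
            result += sorted(a)  # everything left belongs to the answer
            break
        pivot = a[random.randrange(len(a))]
        less = [x for x in a if x < pivot]
        if len(less) >= k:
            a = less             # the answer lies wholly inside 'less'
            continue
        result += sorted(less)   # all of 'less' belongs to the answer
        k -= len(less)
        equal = [x for x in a if x == pivot]
        take = min(k, len(equal))
        result += equal[:take]   # pivot copies fill part of the quota
        k -= take
        a = [x for x in a if x > pivot]
    return result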

Results from one run:

n: 100000	k: 100
(comparison counts for five runs of each variant, sorted ascending)
[105856, 105917, 105951, 105977, 106366] nsmallest
[166465, 166478, 166507, 166639, 166748] heapifying_smallest
[253705, 260410, 264225, 289600, 333626] selecting_smallest
[111048, 117112, 144633, 168109, 402758] partitioning_smallest
History
Date                 User               Action  Args
2022-04-11 14:57:12  admin              set     github: 55389
2011-02-12 08:28:41  rhettinger         set     files: - heapq_benchmark.py
2011-02-12 08:28:33  rhettinger         set     files: + heapq_benchmark.py; messages: + msg128442
2011-02-11 20:59:54  rhettinger         set     files: + heapq_benchmark.py; messages: + msg128430
2011-02-11 12:38:24  mark.dickinson     set     nosy: + mark.dickinson; messages: + msg128378
2011-02-11 08:51:34  rhettinger         set     status: open -> closed; versions: + Python 3.3; messages: + msg128368; assignee: rhettinger; components: + Library (Lib); resolution: rejected
2011-02-11 01:50:58  benjamin.peterson  set     nosy: + benjamin.peterson; messages: + msg128358
2011-02-11 01:29:38  r.david.murray    set     nosy: + rhettinger
2011-02-11 01:15:55  newacct            create