classification
Title: More efficient nlargest()/nsmallest()
Type: performance
Components: Library (Lib)
Versions: Python 3.3

process
Status: closed
Resolution: rejected
Assigned To: rhettinger
Nosy List: benjamin.peterson, mark.dickinson, newacct, rhettinger
Priority: normal

Created on 2011-02-11 01:15 by newacct, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name: heapq_benchmark.py
Uploaded by: rhettinger, 2011-02-12 08:28
Description: Benchmarking code for 4 variants.
Messages (6)
msg128357 - Author: (newacct) Date: 2011-02-11 01:15
heapq.nlargest()/nsmallest() currently use a size-k heap to get the k largest/smallest items. This takes O(n log k): heapifying the first k elements is O(k); each of the remaining (n - k) elements costs at most one O(log k) insertion/deletion, for O((n - k) log k); and sorting the k survivors at the end is O(k log k).
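For reference, that size-k heap strategy looks roughly like the following sketch (an illustration only, not the actual Lib/heapq.py implementation; it assumes numeric data and 0 < k <= n, simulating a max-heap by negating values):

import heapq
from itertools import islice

def nsmallest_sketch(k, iterable):
    it = iter(iterable)
    # Max-heap of the k smallest items seen so far, simulated by
    # storing negated values in Python's min-heap.
    heap = [-x for x in islice(it, k)]
    heapq.heapify(heap)                  # O(k)
    for x in it:                         # the remaining n - k items
        if -x > heap[0]:                 # x beats the current k-th smallest
            heapq.heapreplace(heap, -x)  # O(log k)
    return sorted(-x for x in heap)      # O(k log k)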

There are several more efficient ways to do this:

1) O(n + k log n): Put the entire list into a heap at once; heapifying it takes O(n). Then take the k largest/smallest out of the heap in sorted order, each extraction costing O(log n), for a total of O(n + k log n), which is much better than O(n log k) when k is small and n is big. Unfortunately, this takes O(n) memory (see the first sketch after this list).

2) O(n + k log k): Use the linear-time O(n) selection algorithm to find the k-th largest/smallest element (http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm). Then partition the elements, a la Quicksort, into those smaller and larger than it; this is also O(n), and the partitioning could even happen as part of the selection algorithm. At the end, sort the k selected elements for O(k log k). If we drop the requirement that the result be sorted, the whole algorithm is just O(n) (see the second sketch after this list).
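A sketch of the first proposal, assuming 0 < k <= len(data) (the function names here mirror the heapifying_smallest and selecting_smallest variants in the benchmark attached later in this thread, though the attachment itself is not reproduced here):

import heapq

def heapifying_smallest(k, data):
    heap = list(data)    # O(n) extra memory
    heapq.heapify(heap)  # O(n)
    # k pops of O(log n) each; the results come out already sorted.
    return [heapq.heappop(heap) for _ in range(k)]

And a sketch of the second proposal. This version uses random pivots, so its partitioning phase is O(n) only in expectation rather than the guaranteed O(n) of median-of-medians, and it partitions a copy of the input:

import random

def selecting_smallest(k, data):
    a = list(data)
    lo, hi = 0, len(a)  # the boundary of the k smallest lies inside a[lo:hi]
    while hi - lo > 1:
        pivot = a[random.randrange(lo, hi)]
        less = [x for x in a[lo:hi] if x < pivot]
        equal = [x for x in a[lo:hi] if x == pivot]
        greater = [x for x in a[lo:hi] if x > pivot]
        a[lo:hi] = less + equal + greater
        if k <= lo + len(less):
            hi = lo + len(less)           # boundary is inside 'less'
        elif k <= lo + len(less) + len(equal):
            break                         # a[:k] already holds the k smallest
        else:
            lo += len(less) + len(equal)  # boundary is inside 'greater'
    return sorted(a[:k])  # final O(k log k) sort; skip it for O(n) total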
msg128358 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2011-02-11 01:50
I think you should implement and benchmark one of them. I'm dubious that the better asymptotic values will really translate into better performance. For example, the O(n) median selection algorithm has a large constant factor associated with it.
msg128368 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2011-02-11 08:51
The current routines are designed to consume only O(k) memory, and that is the way it should remain.  Manifesting the entire input iterable into memory is unnecessary and is not cache-friendly.

Also, I question your assertions about running time.  In the worst case, where the entire input is reverse sorted, it does take O(log k) for each insertion.  For other cases, like random orderings, it takes much less, because many of the inputs only need to be compared to the very top element of the heap.  In practice, for small k and large n, running time close to O(n) is not uncommon.

-----------------
from random import shuffle
from heapq import nsmallest

class Int(int):
    # int subclass that counts comparisons via the global 'cmps'.
    def __lt__(self, other):
        global cmps
        cmps += 1
        return int.__lt__(self, other)

n = 10000
k = 16
data = list(map(Int, range(n)))

for i in range(10):
    shuffle(data)
    cmps = 0                     # reset the counter for this run
    result = nsmallest(k, data)
    print(cmps, result)          # ~1.05 * n comparisons, far below n log k

-- test run ------------------

10485 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10560 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10584 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10518 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10582 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10650 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10409 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10577 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10499 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
10603 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
msg128378 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2011-02-11 12:38
[Raymond]
> Also, I question your assertions about running time. [...]

Interesting observation.  I wonder what the average number of comparisons actually is, for random inputs.  A quick back-of-the-envelope calculation suggests that O(max(k*log(k)*log(n), n)) might be a tighter upper bound.
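
That estimate is easy to sanity-check empirically. For a uniformly random permutation, the i-th input (1-indexed, i > k) ranks among the k smallest seen so far with probability k/i, and only those inputs trigger an O(log k) heap update; every other input costs a single comparison against the heap's top. A back-of-the-envelope calculation (illustrative code, not part of the attached benchmark):

from math import log

n, k = 10000, 16
# Expected number of heap updates: sum of k/i for i = k+1 .. n,
# i.e. k * (H(n) - H(k)) ~= k * ln(n/k).
expected_updates = sum(k / i for i in range(k + 1, n + 1))
print(expected_updates)  # ~102.5
print(k * log(n / k))    # ~103.0, the closed-form approximation

With only ~100 updates at a few comparisons each, the ~10000 top-of-heap comparisons dominate, which matches the ~10500 totals in the test run above.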

Anyway, I agree with Benjamin that you (newacct) would need to implement the algorithm and demonstrate an obvious improvement.  Even then, it would be hard to swallow changing the algorithms to require O(n) space.
msg128430 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2011-02-11 20:59
FWIW, I've attached benchmarking code for all three approaches (the current approach and the two proposals).
msg128442 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2011-02-12 08:28
Updated the benchmarking code to include a 4th variant that accumulates sorted sublists during the partitioning phase.
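
The attachment itself is not reproduced here, but purely as an illustration, a partitioning variant that accumulates sorted sublists might look something like the following (a hypothetical sketch; the actual heapq_benchmark.py code may differ):

import random

def partitioning_smallest(k, data):
    result, a = [], list(data)
    while k:
        if len(a) <= k:
            result += sorted(a)  # everything left belongs to the answer
            break
        pivot = a[random.randrange(len(a))]
        less = [x for x in a if x < pivot]
        if len(less) >= k:
            a = less             # the answer lies wholly inside 'less'
            continue
        result += sorted(less)   # all of 'less' belongs to the answer
        k -= len(less)
        equal = [x for x in a if x == pivot]
        take = min(k, len(equal))
        result += equal[:take]   # pivot copies fill part of the quota
        k -= take
        a = [x for x in a if x > pivot]
    return result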

Results from one run:

n: 100000	k: 100
(comparison counts for five runs of each variant, sorted ascending)
[105856, 105917, 105951, 105977, 106366] nsmallest
[166465, 166478, 166507, 166639, 166748] heapifying_smallest
[253705, 260410, 264225, 289600, 333626] selecting_smallest
[111048, 117112, 144633, 168109, 402758] partitioning_smallest
History
Date                 User               Action  Args
2022-04-11 14:57:12  admin              set     github: 55389
2011-02-12 08:28:41  rhettinger         set     files: - heapq_benchmark.py
2011-02-12 08:28:33  rhettinger         set     files: + heapq_benchmark.py; messages: + msg128442
2011-02-11 20:59:54  rhettinger         set     files: + heapq_benchmark.py; messages: + msg128430
2011-02-11 12:38:24  mark.dickinson     set     nosy: + mark.dickinson; messages: + msg128378
2011-02-11 08:51:34  rhettinger         set     status: open -> closed; versions: + Python 3.3; messages: + msg128368; assignee: rhettinger; components: + Library (Lib); resolution: rejected
2011-02-11 01:50:58  benjamin.peterson  set     nosy: + benjamin.peterson; messages: + msg128358
2011-02-11 01:29:38  r.david.murray    set     nosy: + rhettinger
2011-02-11 01:15:55  newacct            create