
Author newacct
Recipients newacct
Date 2011-02-11.01:15:55
Message-id <1297386956.44.0.637022416635.issue11180@psf.upfronthosting.co.za>
Content
heapq.nlargest()/nsmallest() currently use a size-k heap to get the k largest/smallest items. This takes O(n log k): heapifying the first k elements is O(k); each of the remaining (n-k) elements costs an O(log k) heap replacement, for O((n-k) log k); and sorting the k surviving elements at the end is O(k log k).
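A minimal sketch of the size-k heap approach described above (a simplified version of what heapq.nlargest does internally; the function name is mine):

```python
import heapq
import itertools

def nlargest_sketch(k, iterable):
    """Size-k min-heap approach, roughly how heapq.nlargest works (simplified)."""
    it = iter(iterable)
    # Heapify the first k elements: O(k).
    heap = list(itertools.islice(it, k))
    heapq.heapify(heap)  # min-heap holding the k largest items seen so far
    if not heap:
        return []
    for item in it:
        # Each replacement costs O(log k); done for up to (n - k) items.
        if item > heap[0]:
            heapq.heapreplace(heap, item)
    # Sort the k survivors at the end: O(k log k).
    heap.sort(reverse=True)
    return heap
```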

There are several more efficient ways to do this:

1) O(n + k log n): Put the entire list into the heap at once. This takes O(n). Then we can take out the k largest/smallest from the heap in sorted order, each pop taking O(log n), for a total of O(n + k log n), which is much better than O(n log k) when k is small and n is big. Unfortunately, this needs O(n) extra memory for the heap.
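Approach 1 can be sketched as follows (the function name is mine; heapq.heapify is documented to run in linear time):

```python
import heapq

def nsmallest_heapify_all(k, data):
    """Approach 1: heapify everything (O(n)), then pop k times (O(k log n)).
    Costs O(n) extra memory for the heap copy."""
    heap = list(data)      # O(n) extra memory
    heapq.heapify(heap)    # O(n)
    # Each pop is O(log n); the k results come out already sorted.
    return [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
```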

2) O(n + k log k): Use the linear-time (O(n)) selection algorithm to find the k-th largest/smallest element. (http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm) Then partition (a la Quicksort) the elements into those smaller and larger than it; this is also O(n). Maybe the partitioning can happen as part of the selection algorithm. At the end, sort the k elements, for O(k log k). If we remove the requirement that the result be sorted, this algorithm is just O(n).
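A sketch of approach 2. For brevity it uses a random-pivot quickselect, which is expected (not worst-case) linear time; substituting median-of-medians pivot selection would give the worst-case O(n) bound the message cites. The function name is mine:

```python
import random

def nsmallest_select(k, data):
    """Approach 2 (sketch): move the k smallest items to the front by
    selection/partitioning (expected O(n)), then sort only those k."""
    items = list(data)
    if k <= 0:
        return []
    if k >= len(items):
        return sorted(items)
    lo, hi = 0, len(items)
    # Quickselect: after the loop, items[:k] are the k smallest (unsorted).
    while True:
        pivot = items[random.randrange(lo, hi)]
        less = [x for x in items[lo:hi] if x < pivot]
        equal = [x for x in items[lo:hi] if x == pivot]
        greater = [x for x in items[lo:hi] if x > pivot]
        items[lo:hi] = less + equal + greater  # partition, a la Quicksort
        if k - lo < len(less):
            hi = lo + len(less)          # boundary falls inside 'less'
        elif k - lo <= len(less) + len(equal):
            break                        # boundary falls on the pivot run: done
        else:
            lo += len(less) + len(equal) # boundary falls inside 'greater'
    return sorted(items[:k])  # final O(k log k) sort
```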