Issue 6614: heapq.nsmallest and nlargest should be smarter/more usable/more consistent

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50863

classification

Title:	heapq.nsmallest and nlargest should be smarter/more usable/more consistent
Type:	performance	Stage:
Components:		Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	jab, rhettinger
Priority:	normal	Keywords:

Created on 2009-07-31 16:56 by jab, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg91134 - (view)	Author: Joshua Bronson (jab) *	Date: 2009-07-31 16:55
From http://docs.python.org/library/heapq.html: > The latter two functions (nlargest and nsmallest) perform best for > smaller values of n. For larger values, it is more efficient to use > the sorted() function. Also, when n==1, it is more efficient to use > the builtin min() and max() functions. A quick usability win for the heapq module: Be more specific than "smaller" and "larger", e.g. "when n is O(...(len(iterable)))". From http://groups.google.com/group/comp.lang.python/msg/9556f56ae25ecf3b: > I find it interesting that the heapq functions tell you in the > documentation that they aren't suitable for use where n==1 or where > n is near the total size of the sequence whereas random.sample() > chooses what it thinks is the best algorithm based on the input. At > the very least I would have thought the heapq functions could check > for n==1 and call min/max when it is. +1. It looks like the pure Python implementation of nsmallest is actually already choosing either an insort algorithm or a minheap algorithm based on whether n * 10 <= len(iterable), but the C implementation of nsmallest unconditionally uses a minheap algorithm. Neither the pure Python nor the C implementation of nlargest chooses its algorithm conditionally. For the sake of consistency and usability, all implementations of nsmallest and nlargest should choose the most efficient algorithm from among min/max, insort, and minheap, and the docs should be updated to describe this behavior. Also, in looking through the heapq.py that came with my Python 2.6.2 distribution, I noticed that line 196 seems to have no reason to be there: _heappushpop = heappushpop This is the only occurrence of "_heappushpop" in the file. I made a patch for heapq.py, but since I'm not a CPython hacker, I thought it might be better to leave the changes up to someone who could do both the pure Python and the C implementations in tandem. I'd be happy to contribute documentation when the time comes if that would help.
msg91135 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-07-31 17:18
FWIW, 2.7 and 3.1 already have automatic selection of sort()/min()/max() alternatives. They use pure python to dispatch to the underlying C functions: http://svn.python.org/view/python/branches/release31-maint/Lib/heapq.py?revision=73579&view=markup
msg91137 - (view)	Author: Joshua Bronson (jab) *	Date: 2009-07-31 18:18
Oh, that's great! (I also noticed that the previously inutile line "_heappushpop = heappushpop" is now doing something in the heapq.py you linked to, nice.) It looks like the docs haven't been updated yet though. For instance, http://docs.python.org/3.1/library/heapq.html still says "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted() function. Also, when n==1, it is more efficient to use the builtin min() and max() functions." Also, I notice the pure Python implementation of nsmallest still does that check to see if n * 10 <= len(iterable), and if so uses an insort-based algorithm. It looks like this is still undocumented, inconsistent with the C implementation, and asymmetrical to the implementations of nlargest. If this branch is remaining there on purpose, I'd love see its existence mentioned and its theoretical basis explained in the docs, along with any similar branches implemented elsewhere in the code, if they should be.
msg91138 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-07-31 18:28
I prefer the docs the way they are. They help the reader understand the relationship between min, max, nsmallest, nlargest, and sorted. I'm not sure where you got the n * 10 <= len(iterable) switch-over point. That is arbitrary. The correct switchover point depends on the cost of the comparison function, whether the length of the input is known, whether the input data is partially ordered, memory constraints, whether a key function is used, and on other factors. FWIW, I also wrote the logic for random.sample(). The switchover logic was straight-forward because performance depended on factors that were fully known (length of input, sample size, memory utilization, and average number of probes for each strategy).
msg91139 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-07-31 18:33
One other thought: When memory is tight, the programmer needs to be able to select the heap algorithm in favor of sorted() even for relatively large values of n. I do not want an automatic switchover point that takes away a programmer's choice between speed and space efficiency.
msg91142 - (view)	Author: Joshua Bronson (jab) *	Date: 2009-07-31 19:22
> I prefer the docs the way they are. They help the reader understand > the relationship between min, max, nsmallest, nlargest, and sorted. Except that it's no longer true that "when n==1, it is more efficient to use the builtin min() and max() functions." Shouldn't this be updated to say "when n==1, it is equivalent to using the builtin min() and max() functions"? > I'm not sure where you got the n * 10 <= len(iterable) switch-over > point. It's right in the file you linked to. Search for "n * 10" in http://svn.python.org/view/python/branches/release31-maint/Lib/heapq.py? revision=73579&view=markup. > That is arbitrary. The correct switchover point depends on the > cost of the comparison function, whether the length of the input is > known, whether the input data is partially ordered, memory constraints, > whether a key function is used, and on other factors. So should that be removed, then? > FWIW, I also wrote the logic for random.sample(). The switchover logic > was straight-forward because performance depended on factors that were > fully known (length of input, sample size, memory utilization, and > average number of probes for each strategy). > One other thought: When memory is tight, the programmer needs to be > able to select the heap algorithm in favor of sorted() even for > relatively large values of n. I do not want an automatic switchover > point that takes away a programmer's choice between speed and space > efficiency. Agreed, and thanks for the additional info.
msg91146 - (view)	Author: Joshua Bronson (jab) *	Date: 2009-07-31 19:36
One more thing: > I prefer the docs the way they are. They help the reader understand > the relationship between min, max, nsmallest, nlargest, and sorted. The docs still use the unspecific language "for smaller values of n" and "for larger values". I think careful readers would appreciate an addition along the lines of what you wrote earlier -- the optimal switchover point depends on the cost of the comparison function, the ordering of the input data, etc.
msg91147 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-07-31 19:39
> Except that it's no longer true that "when n==1, it is > more efficient to use the builtin min() and max() functions." There's still the dispatch overhead. If someone needs a n==1 case, they should use min/max for both speed and clarity. Also, it is important the the docs communicate the relationship between min/max, nlargest/nsmallest, and sorted. > It's right in the file you linked to. Search for "n * 10" in ... That is in the pure python version of nsmallest() and that code is not used (it is overriden by the C version). The actual C implementation works differently -- it uses an underlying maxheap so it can use a cleaner algorithm that doesn't need switchover tricks. If you feel like "kicking the ball around", please continue the discussion on comp.lang.python -- I think we're done here.
msg91152 - (view)	Author: Joshua Bronson (jab) *	Date: 2009-07-31 20:42
> That is in the pure python version of nsmallest() and that > code is not used (it is overriden by the C version). So just because it isn't used by CPython it should remain in there even though as you said yourself it's completely without basis? What about non CPython? I'm not trying to "kick the ball around" here, we're both on the same team just trying to make Python better.
msg91153 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-07-31 21:23
There is a basis for the pure python version to switch to bisect. There is not a basis for having the final wrapped C function switch to using sorted(). That is a programmer decision. The pure python version is there to show how it could be done using a minheap and it is a bit of hack to get it to work at all. The C version uses a maxheap and does not need logic for switching to bisect. It is clean and fast. I'm sorry, but I've lost interest in continuing this discussion.

History
Date	User	Action	Args
2022-04-11 14:56:51	admin	set	github: 50863
2009-07-31 21:23:28	rhettinger	set	messages: + msg91153
2009-07-31 20:42:06	jab	set	messages: + msg91152
2009-07-31 19:39:37	rhettinger	set	messages: + msg91147
2009-07-31 19:36:02	jab	set	messages: + msg91146
2009-07-31 19:22:32	jab	set	messages: + msg91142
2009-07-31 18:33:05	rhettinger	set	messages: + msg91139
2009-07-31 18:28:13	rhettinger	set	messages: + msg91138
2009-07-31 18:18:44	jab	set	messages: + msg91137
2009-07-31 17:18:07	rhettinger	set	status: open -> closed type: behavior -> performance assignee: rhettinger versions: + Python 3.2, - Python 2.6, Python 2.5, Python 3.0, Python 3.1 nosy: + rhettinger messages: + msg91135 resolution: out of date
2009-07-31 16:56:01	jab	create