classification
Title: heapq.nsmallest and nlargest should be smarter/more usable/more consistent
Type: performance Stage:
Components: Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: jab, rhettinger
Priority: normal Keywords:

Created on 2009-07-31 16:56 by jab, last changed 2009-07-31 21:23 by rhettinger. This issue is now closed.

Messages (10)
msg91134 - (view) Author: Joshua Bronson (jab) * Date: 2009-07-31 16:55
From http://docs.python.org/library/heapq.html:
> The latter two functions (nlargest and nsmallest) perform best for
> smaller values of n. For larger values, it is more efficient to use
> the sorted() function. Also, when n==1, it is more efficient to use
> the builtin min() and max() functions.

A quick usability win for the heapq module: Be more specific than "smaller" and 
"larger", e.g. "when n is O(...(len(iterable)))".

From http://groups.google.com/group/comp.lang.python/msg/9556f56ae25ecf3b:
> I find it interesting that the heapq functions tell you in the
> documentation that they aren't suitable for use where n==1 or where
> n is near the total size of the sequence whereas random.sample()
> chooses what it thinks is the best algorithm based on the input. At
> the very least I would have thought the heapq functions could check
> for n==1 and call min/max when it is.

+1. It looks like the pure Python implementation of nsmallest is actually already 
choosing either an insort algorithm or a minheap algorithm based on whether n * 10 <= 
len(iterable), but the C implementation of nsmallest unconditionally uses a minheap 
algorithm. Neither the pure Python nor the C implementation of nlargest chooses its 
algorithm conditionally. For the sake of consistency and usability, all 
implementations of nsmallest and nlargest should choose the most efficient algorithm 
from among min/max, insort, and minheap, and the docs should be updated to describe 
this behavior.

Also, in looking through the heapq.py that came with my Python 2.6.2 distribution, I 
noticed that line 196 seems to have no reason to be there:
    _heappushpop = heappushpop                                      
This is the only occurrence of "_heappushpop" in the file.

I made a patch for heapq.py, but since I'm not a CPython hacker, I thought it might 
be better to leave the changes up to someone who could do both the pure Python and 
the C implementations in tandem. I'd be happy to contribute documentation when the 
time comes if that would help.
msg91135 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-07-31 17:18
FWIW, 2.7 and 3.1 already have automatic selection of sort()/min()/max()
alternatives.  They use pure python to dispatch to the underlying C
functions:

http://svn.python.org/view/python/branches/release31-maint/Lib/heapq.py?revision=73579&view=markup
msg91137 - (view) Author: Joshua Bronson (jab) * Date: 2009-07-31 18:18
Oh, that's great!

(I also noticed that the previously inutile line "_heappushpop = heappushpop" 
is now doing something in the heapq.py you linked to, nice.)

It looks like the docs haven't been updated yet though. For instance, 
http://docs.python.org/3.1/library/heapq.html still says "The latter two 
functions perform best for smaller values of n. For larger values, it is more 
efficient to use the sorted() function. Also, when n==1, it is more efficient 
to use the builtin min() and max() functions."

Also, I notice the pure Python implementation of nsmallest still does that 
check to see if n * 10 <= len(iterable), and if so uses an insort-based 
algorithm. It looks like this is still undocumented, inconsistent with the C 
implementation, and asymmetrical to the implementations of nlargest. If this 
branch is remaining there on purpose, I'd love see its existence mentioned and 
its theoretical basis explained in the docs, along with any similar branches 
implemented elsewhere in the code, if they should be.
msg91138 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-07-31 18:28
I prefer the docs the way they are.  They help the reader understand the
relationship between min, max, nsmallest, nlargest, and sorted.

I'm not sure where you got the n * 10 <= len(iterable) switch-over
point.  That is arbitrary.  The correct switchover point depends on the
cost of the comparison function, whether the length of the input is
known, whether the input data is partially ordered, memory constraints,
whether a key function is used, and on other factors.   

FWIW, I also wrote the logic for random.sample().  The switchover logic
was straight-forward because performance depended on factors that were
fully known (length of input, sample size, memory utilization, and
average number of probes for each strategy).
msg91139 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-07-31 18:33
One other thought:  When memory is tight, the programmer needs to be
able to select the heap algorithm in favor of sorted() even for
relatively large values of n.  I do not want an automatic switchover
point that takes away a programmer's choice between speed and space
efficiency.
msg91142 - (view) Author: Joshua Bronson (jab) * Date: 2009-07-31 19:22
> I prefer the docs the way they are.  They help the reader understand
> the relationship between min, max, nsmallest, nlargest, and sorted.

Except that it's no longer true that "when n==1, it is more efficient to use the 
builtin min() and max() functions." Shouldn't this be updated to say "when n==1, 
it is equivalent to using the builtin min() and max() functions"?

> I'm not sure where you got the n * 10 <= len(iterable) switch-over
> point.

It's right in the file you linked to. Search for "n * 10" in 
http://svn.python.org/view/python/branches/release31-maint/Lib/heapq.py?
revision=73579&view=markup.

> That is arbitrary.  The correct switchover point depends on the
> cost of the comparison function, whether the length of the input is
> known, whether the input data is partially ordered, memory constraints,
> whether a key function is used, and on other factors.   

So should that be removed, then?

> FWIW, I also wrote the logic for random.sample().  The switchover logic
> was straight-forward because performance depended on factors that were
> fully known (length of input, sample size, memory utilization, and
> average number of probes for each strategy).
> One other thought:  When memory is tight, the programmer needs to be
> able to select the heap algorithm in favor of sorted() even for
> relatively large values of n.  I do not want an automatic switchover
> point that takes away a programmer's choice between speed and space
> efficiency.

Agreed, and thanks for the additional info.
msg91146 - (view) Author: Joshua Bronson (jab) * Date: 2009-07-31 19:36
One more thing:

> I prefer the docs the way they are.  They help the reader understand
> the relationship between min, max, nsmallest, nlargest, and sorted.

The docs still use the unspecific language "for smaller values of n" and 
"for larger values". I think careful readers would appreciate an addition 
along the lines of what you wrote earlier -- the optimal switchover point 
depends on the cost of the comparison function, the ordering of the input 
data, etc.
msg91147 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-07-31 19:39
> Except that it's no longer true that "when n==1, it is
> more efficient to use the builtin min() and max() functions."

There's still the dispatch overhead.
If someone needs a n==1 case, they
*should* use min/max for both speed
and clarity.

Also, it is important the the docs
communicate the relationship between
min/max, nlargest/nsmallest, and sorted.


> It's right in the file you linked to. Search for "n * 10" in ...

That is in the pure python version of nsmallest() and that
code is not used (it is overriden by the C version).  The
actual C implementation works differently -- it uses an
underlying maxheap so it can use a cleaner algorithm that
doesn't need switchover tricks.

If you feel like "kicking the ball around", please continue the
discussion on comp.lang.python -- I think we're done here.
msg91152 - (view) Author: Joshua Bronson (jab) * Date: 2009-07-31 20:42
> That is in the pure python version of nsmallest() and that
> code is not used (it is overriden by the C version).

So just because it isn't used by CPython it should remain in there even 
though as you said yourself it's completely without basis? What about non 
CPython? I'm not trying to "kick the ball around" here, we're both on the 
same team just trying to make Python better.
msg91153 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-07-31 21:23
There is a basis for the pure python version to switch to bisect.

There is not a basis for having the final wrapped C function switch to
using sorted().  That is a programmer decision.

The pure python version is there to show how it could be done using a
minheap and it is a bit of hack to get it to work at all.  The C version
uses a maxheap and does not need logic for switching to bisect.  It is
clean and fast.

I'm sorry, but I've lost interest in continuing this discussion.
History
Date User Action Args
2009-07-31 21:23:28rhettingersetmessages: + msg91153
2009-07-31 20:42:06jabsetmessages: + msg91152
2009-07-31 19:39:37rhettingersetmessages: + msg91147
2009-07-31 19:36:02jabsetmessages: + msg91146
2009-07-31 19:22:32jabsetmessages: + msg91142
2009-07-31 18:33:05rhettingersetmessages: + msg91139
2009-07-31 18:28:13rhettingersetmessages: + msg91138
2009-07-31 18:18:44jabsetmessages: + msg91137
2009-07-31 17:18:07rhettingersetstatus: open -> closed

type: behavior -> performance
assignee: rhettinger
versions: + Python 3.2, - Python 2.6, Python 2.5, Python 3.0, Python 3.1
nosy: + rhettinger

messages: + msg91135
resolution: out of date
2009-07-31 16:56:01jabcreate