This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Extra heapq nlargest/nsmallest option for including ties
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: gsakkis, rhettinger
Priority: normal Keywords:

Created on 2009-04-02 16:43 by gsakkis, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (13)
msg85222 - (view) Author: George Sakkis (gsakkis) Date: 2009-04-02 16:43
It would be useful in many cases if heapq.nlargest and heapq.nsmallest
grew an optional boolean parameter, say `ties` (defaulting to False)
that when True, it would return more than `n` items if there are ties.
To illustrate:

>>> s = [4,3,5,7,4,7,4,3]
>>> for i in xrange(1,len(s)+1): print i,heapq.nsmallest(i,s)
...
1 [3]
2 [3, 3]
3 [3, 3, 4]
4 [3, 3, 4, 4]
5 [3, 3, 4, 4, 4]
6 [3, 3, 4, 4, 4, 5]
7 [3, 3, 4, 4, 4, 5, 7]
8 [3, 3, 4, 4, 4, 5, 7, 7]

>>> for i in xrange(1,len(s)+1): print i,heapq.nsmallest(i,s)
...
1 [3, 3]
2 [3, 3]
3 [3, 3, 4, 4, 4]
4 [3, 3, 4, 4, 4]
5 [3, 3, 4, 4, 4]
6 [3, 3, 4, 4, 4, 5]
7 [3, 3, 4, 4, 4, 5, 7, 7]
8 [3, 3, 4, 4, 4, 5, 7, 7]
msg85223 - (view) Author: George Sakkis (gsakkis) Date: 2009-04-02 16:45
The second call should of course be:

>>> for i in xrange(1,len(s)+1): print i,heapq.nsmallest(i,s,ties=True)
msg85224 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-02 17:15
I haven't seen this come up before.  Am curious about your use cases.
msg85230 - (view) Author: George Sakkis (gsakkis) Date: 2009-04-02 17:56
There's nothing special about my use cases; I'd even go as far as to
suggest that this is more often than not the desired behavior in general.

Say that you have to select the top 3 chess players and there are two
players with equal Elo rating at positions 3 and 4. Whom do you select?
Without a tie-breaking method, it's only fair to select both and return
4 players in total instead of exactly 3. 

The current method selects "arbitrarily" (at least with respect to the
key function) which of the equally-keyed items to return. This is
necessary in some cases by external constraints (say, you can hire only
1 person) but there are quite a few cases that "fairness" is more
important than a hard constraint on the number of returned objects.
msg85232 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-02 18:23
In that case, I think it best to leave nsmallest/nlargest as-is.   By
appending ties to the result, it becomes harder to implement policy
decisions on what to do with those ties (perhaps listing them separately
or splitting their prizes n-ways).  A preferred approach is to use
existing tools such as:  last=result[1]; [last]*s.count(last). 
Alternatively, result=sorted(s) works well with bisect to find the cut
points.  Also, my instincts say that it is weird ask for n-items and get
a result with potentially more than n.  

Looking at the source code for nsmallest/nlargest, I think this
build-out would complicate the code quite a bit, so the use cases would
need to be more compelling.  The way the nsmallest/nlargest work is that
they maintain an n-length buffer and loop over the entire input while
remembering only the best-n-so-far.  That approach doesn't easily extend
to tracking best-n-so-far-and-all-ties.
msg85235 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-02 18:30
That should have been:  last = result[-1]; [last]*s.count(last).
msg85241 - (view) Author: George Sakkis (gsakkis) Date: 2009-04-02 19:01
> In that case, I think it best to leave nsmallest/nlargest as-is.   By
> appending ties to the result, it becomes harder to implement policy
> decisions on what to do with those ties (perhaps listing them separately
> or splitting their prizes n-ways).  

I wouldn't worry about the further post-processing too much; it's
optional and up to the client to implement if necessary (in my current
use case it's not). Regardless, that's orthogonal to having a general,
reliable and efficient function in the standard library that reduces an
initial sequence of thousands/millions down to a dozen or two,
preserving the ties.

> A preferred approach is to use
> existing tools such as:  last=result[1]; [last]*s.count(last).
> Alternatively, result=sorted(s) works well with bisect to find the cut
> points.  Also, my instincts say that it is weird ask for n-items and get
> a result with potentially more than n.

That's a documentation issue, it would be explicitly addressed in the
description of `ties`. I'm not proposing to change the current default
behavior; the whole change would be fully backwards compatible.

> Looking at the source code for nsmallest/nlargest, I think this
> build-out would complicate the code quite a bit, so the use cases would
> need to be more compelling.  The way the nsmallest/nlargest work is that
> they maintain an n-length buffer and loop over the entire input while
> remembering only the best-n-so-far.  That approach doesn't easily extend
> to tracking best-n-so-far-and-all-ties.

I haven't thought about the implementation yet but before I do, would a
patch help reopen this ticket ? If so, I'm willing to take a stab at it.
msg85244 - (view) Author: George Sakkis (gsakkis) Date: 2009-04-02 19:07
> That should have been:  last = result[-1]; [last]*s.count(last).

Nice, though that's not equivalent if the objects' identity is
significant for what happens next (which typically is the case when ties
are preserved). The sorted/bisect solution doesn't have this problem.
msg85246 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-02 19:15
I recommend posting an ASPN recipe.  That's what I do with a lot of
ideas that are under development or that don't clear the bar for going
into the standard library.
msg85255 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-02 19:40
FWIW, an easy way to pare down millions of entries to dozens is to use a
bigger heap:

from heapq import nsmallest

def nsmallest_with_ties(n, iterable, scale=2):
    ext = n * scale
    s = nsmallest(ext, iterable)
    lastplace = s[n-1]
    if s[-1] == lastplace:
        raise ValueError('may not have found all ties')
    for i in range(n, ext):
        if s[i] != lastplace:
            break
    return s[:i]
    

n = 3
s = [4,3,5,7,4,7,4,3]
print nsmallest_with_ties(3, s)
msg85260 - (view) Author: George Sakkis (gsakkis) Date: 2009-04-02 20:09
> I recommend posting an ASPN recipe.  That's what I do with a lot of
> ideas that are under development or that don't clear the bar for going
> into the standard library.

Will do. Thanks for the quick turnaround!
msg85261 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-04-02 20:14
Here's a test case for your recipe

dataset = [9] * 1000000 + [0,1,2,3,4,5,6,8]
for i in range(1, 20):
    print nsmallest(i, dataset, ties=True)
msg85523 - (view) Author: George Sakkis (gsakkis) Date: 2009-04-05 16:06
Posted recipe at http://code.activestate.com/recipes/576712/. You were
right, the implementation gets significantly hairier but I think it's
worth having this option. It's also faster than using sorted/bisect as
len(seq)/N increases and there is no pathologically high number of ties
at the last position (as in your test case).
History
Date User Action Args
2022-04-11 14:56:47adminsetgithub: 49919
2009-04-05 16:06:19gsakkissetmessages: + msg85523
2009-04-02 20:14:13rhettingersetmessages: + msg85261
2009-04-02 20:09:46gsakkissetmessages: + msg85260
2009-04-02 19:40:59rhettingersetmessages: + msg85255
2009-04-02 19:15:31rhettingersetmessages: + msg85246
2009-04-02 19:10:12gsakkissetstatus: open -> closed
2009-04-02 19:07:53gsakkissetmessages: + msg85244
2009-04-02 19:01:26gsakkissetstatus: closed -> open

messages: + msg85241
2009-04-02 18:30:42rhettingersetmessages: + msg85235
2009-04-02 18:23:56rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg85232
2009-04-02 17:56:57gsakkissetmessages: + msg85230
2009-04-02 17:15:35rhettingersetmessages: + msg85224
2009-04-02 16:49:53rhettingersetassignee: rhettinger

nosy: + rhettinger
2009-04-02 16:48:59gsakkissettitle: Extra heap nlargest/nsmallest option for including ties -> Extra heapq nlargest/nsmallest option for including ties
2009-04-02 16:45:02gsakkissetmessages: + msg85223
2009-04-02 16:43:16gsakkiscreate