classification
Title: Add "key" argument to "bisect" module functions
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Dima.Tisnek, alex, bls, jafo, josh.rosenberg, mark.dickinson, milko.krachounov, rhettinger, tebeka, terry.reedy, vadmium
Priority: normal Keywords: patch

Created on 2008-11-19 17:17 by tebeka, last changed 2014-03-06 22:42 by josh.rosenberg.

Files
File name Uploaded Description Edit
bisect-trunk.patch milko.krachounov, 2009-11-02 17:42 bisect for trunk
bisect-py3k.patch milko.krachounov, 2009-11-02 17:43 patch for py3k
bench_bisect_key.py milko.krachounov, 2009-11-02 17:43 benchmark
SortedCollection.py rhettinger, 2010-04-16 03:39 Draft for a Sorted Collection class
Messages (23)
msg76060 - (view) Author: Miki Tebeka (tebeka) * Date: 2008-11-19 17:17
It'd be helpful of the functions in the "bisect" modules will have a
"key" argument just like "sort".
msg76073 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-19 21:39
This request has come up repeatedly (and been rejected) in the past.  See issues 2954, 3374, 1185383, 1462228, 1451588, 1619060.

Could you perhaps explain your particular use case for this?  A few truly 
convincing use-cases might increase the chances of this getting accepted.
msg76084 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-11-20 00:05
Miki, the issue is that bisect calls tend to be made repeatedly, so the
key function can be called over and over again for the same argument. 
It is almost always a better design to simply decorate the list so the
key function never gets called more than once per element in the list. 
If we added key= to bisect, it would encourage bad design and steer
people after from better solutions.
msg76092 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-20 10:11
What about cases where performance is unimportant, or where the key 
function is fast (e.g. an attribute access)?  Then something like

bisect(a, x, key=attrgetter('size'))

is easy to write and read.  Mightn't this be considered good design,
from some perspectives?

Another thought:  if your list is a list of user-defined objects then a 
natural way to do the 'decorate' step of DSU might be to add a 'key' 
attribute to each object, rather than the usual method of constructing 
pairs.  (This has the advantage that you might not have to bother with the 
'undecorate' step.)  With a key argument, bisect could make use of this 
technique too.

Disclaimer: I haven't personally had any need for a key argument on 
bisect, so all this is hypothetical.  That's why I'm asking for real use-
cases.
msg76098 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-11-20 12:44
I had said "almost always".  Sure, if you don't care about performance
or scalability, a key= argument would be a net win.

We're responsible for creating an API that steers most programmers in
the right direction (Tim sez "we read Knuth so you don't have to"). 
Algorithmically, the bisect functions are at the wrong level of
granularity for applying a key function.  

For user-defined objects, there is no need for a key-attribute since can
just supply a custom comparison method:

class UserDefined:
  . . .
  def cmp(self, other):
      return cmp(self.key, other.key)
msg76099 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-20 12:58
One case I've been thinking about is that of maintaining a list of Decimal 
objects that are sorted by absolute value.  For this, having to create a 
list of (abs(x), x) pairs just seems clumsy compared to using a key 
argument to bisect.

Perhaps this is a contrived use case, but it doesn't seem totally 
implausible.
msg76203 - (view) Author: Miki Tebeka (tebeka) * Date: 2008-11-21 20:45
I agree you can get around this with defining __cmp__, however same goes
to "sort" and it was added anyway.

My take on it is that sometimes I need to find something in a big list
of objects, and I don't like to do DSU and not add __cmp__ to the
objects (since some of the lists might be sorted by different attributes
- say one list for time and one line for price).

I'd prefer if we do implement "key" and add a warning in the docs it
might slow you down. Which what will happen in the case of __cmp__ anyway.

I don't see why the "key" function should be called all the time on
inserted item, it's very easy to cache this value

def bisect(a, x, lo=0, hi=None, key=lambda x: x):
    assert low >= 0, "low must be non-negative"
    hi = hi or len(a)

    x_key = key(x) 
    while lo < hi:
        mid = (lo+hi)//2
        if x_key < key(a[mid]): hi = mid
        else: lo = mid+1
    return lo

(I'd also wish for "identity" built in, however this is another subject :)
msg76230 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-11-22 02:30
Just a reminder that __cmp__ is gone in 3.0.
I presume bisect, like sort, only requires __lt__ and perhaps __eq__,
though I can find no doc of either.
msg94836 - (view) Author: Milko Krachounov (milko.krachounov) Date: 2009-11-02 17:42
I've been bugged by the lack of key= argument for bisect for some time
now, and today I got to read this and the previous issues about the
matter. I still fail to understand the reasons for the rejections. It
might encourage bad design in which expensive key functions will get
called more than once for the same element. However:

1. Not all key= functions are computationally expensive, in fact a quick
search shows that most key= uses come are with itemgetter, attrgetter or
some cousin.
2. Even for a computationally expensive function precached key isn't
necessarily the right answer. And if the keys are huge, a key= function
can actually be useful in implementing intelligent caching.
3. The reason for rejection is merely a performance issue which *can* be
fixed without any significant changes to the design of the app should it
prove to be a real issue.


I was writing a module that keeps a sorted list of items in which items
can be added, removed, or changed (they might get reordered). I used
bisect to avoid useless slow linear searches, which would be one
obstacle for the app to scale well. However, the sort key can be changed
at any time, so implementing __lt__ wasn't an option. Keeping a second
list with the keys is both memory and computationally expensive as
inserting into a list is slow. It's not a shiny example of a bisect
use-case as the only thing I've used my module so far is to display the
files in a directory, which means the lists are small and the changes
are seldom, so any implementation is good enough.  But bisect with key=
would have been best if it was available. What's worse, I have a buggy
unreadable ad hoc implementation of bisect in my code right now. ;)

I see the following as reasons why key= would provide benefit:
1. If you have a sorted list, bisect is a net win. Having a key= would
enable you to utilize it without refactoring anything. The lack of key
may as well encourage you to continue using linear searches, or other
sub-optimal solutions.
2. Using key= is more readable and less error-prone than keeping two
lists in simple cases where defining a class with __le__ is an overkill.
Two examples I had where this was the case:
a) A class I implemented to pick numbers from a certain discrete random
distribution with bisect. Oh, but yeah, the implementation without key=
is twice faster.
b) I gave a hand to someone who was implementing a dictionary looked up
the words as you typed them with bisect.
3. Insort. Insort is already slow enough as it is, a key= wouldn't slow
it down much if at all. In fact, if you are keeping two lists, key=
would speed it up. Insort with key= is a net win.

I've implemented key= and quickly hacked a benchmark to show what
performance hit it might have, and to put things into perspective. The
results:

1. Cheap key function, integers and abs(), 1000000 items, 5000 bisects
or insorts:
a) Bisect with a key: 0.02205014s
b) Bisect with a second list: 0.01328707s
c) Insort with a key: 5.83688211s
d) Bisect with a second list, and two inserts: 16.17302299s

2. Cheap key function, ~4000 char bytestrings and len(), 100000 items,
5000 bisects or insorts:
a) Bisect with a key: 0.01829195s
b) Bisect with a second list: 0.00945401s
c) Insort with a key: 0.25511408s
d) Bisect with a second list, and two inserts: 0.49303603s

3. Expensive key function, ~4000 char bytestrings and str.lower(),
100000 (500 MB) items, 5000 bisects or insorts:
a) Bisect with a key: 1.26837015s
b) Bisect with a second list: 0.08390594s
c) Insort with a key: 1.50406289s
d) Bisect with a second list, and two inserts: 0.57737398s

4. Expensive key function, ~4000 char bytestrings and str.lower(),
500000 (2.5 GB) items, 5000 bisects or insorts:
a) Bisect with a key: 1.46136308s
b) Bisect with a second list: 0.08768606s
c) Insort with a key: 3.05218720s
d) Bisect with a second list, and two inserts: 6.43227386s

5. Expensive key function, ~4000 char bytestrings and str.strip(),
100000 (500 MB) items, 5000 bisects or insorts:
a) Bisect with a key: 0.03311396s
b) Bisect with a second list: 0.01374602s
c) Insort with a key: 0.27423000s
d) Bisect with a second list, and two inserts: 0.49585080s

6. Expensive key function, ~4000 char bytestrings and str.strip(),
500000 (2.5 GB) items, 5000 bisects or insorts:
a) Bisect with a key: 0.04530501
b) Bisect with a second list: 0.01912594
c) Insort with a key: 1.62209797
d) Bisect with a second list, and two inserts: 5.91734695

Also, I tried to bench linear searches, but as they had to run in Python
code they aren't representative of anything. In the integer test they
went for about 250 seconds without recalculating the key, and for about
500 with. Also, replacing them with .index() gave about 60 seconds if I
ensured there's high probability that the number is in the list, and for
500 if I didn't.

In short, key= for bisect would be convenient and neat, really useful in
rare cases, leading to more readable code in the most common cases, and
I'm unconvinced of the perceived harm that it would cause.

I attach a patch implementing it for trunk, py3k, and the benchmark
script I used to test it (on trunk).
msg103143 - (view) Author: Brian Scearce (bls) Date: 2010-04-14 20:09
This was closed over a year ago, but since mark.dickinson was asking for convincing use-cases: I'm breaking up a file into line-delimited chunks.  These chunks are non-overlapping, contiguous, and tend to be fairly large, so I'm just recording the start line of each chunk in a 2-ple:

mapping = [
  (10, 'first chunk'),
  (50, 'second chunk'),
  (60, 'third chunk')
]

Lines 10-49 are in the first chunk, lines 50-59 are in the second, lines 60+ are in the third.  So:

def CategorizeLine(line, mapping):
   loc = bisect.bisect([m[0] for m in mapping], line)
   if loc == 0:
      return None # before first chunk
   return mapping[loc-1][1]

It Would Be Nice if I could write the second line as:

   loc = bisect.bisect(mapping, line, key=lambda m:m[0])

The bisect documentation suggests pre-computing the key list, but it seems messy and error-prone to keep a redundant data structure in sync with its source.  I could also rewrite my "mapping" data structure to be two parallel lists instead of one list of 2-ples, but this data structure is more readable and extensible and handles empty lists more naturally.
msg103145 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-04-14 20:40
I'll take another look at this one.
msg103155 - (view) Author: Brian Scearce (bls) Date: 2010-04-14 22:01
For what it's worth, after I posted my comment, I realized I could use tuple comparison semantics:

   loc = bisect.bisect(mapping, (line,))

since my key happens to be at index 0.  "key=" would still be nice.
msg103158 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-04-15 00:30
What would you guys think about adding a class that uses bisect internally so that the keyfunc() never gets called more than once per key?  This could be added to the docs as a cut-and-pastable recipe or it could be added to the module directly:

sd = SortedCollection(iterable, key=keyfunc)
s[i]                  --> getitem at position i
len(s)                --> length
sd.insert_left(item)  --> None
sd.insert_right(item) --> None
sd.find_left(key)     --> item
sd.find_right(key)    --> item
sd.keyfunc            --> the key function 
list(sd)              --> sorted list of items
list(reversed(sd))    --> reverse sorted list of items
s.clear()


The implementation would follow this pattern:

 def insert_left(self, item):
   key = self._keyfunc(item)
   i = bisect_left(self._keys, key)
   self._keys.insert(i, key)
   self._items.insert(i, item)

 def find_left(self, key):
   i = bisect_left(self._keys, key)
   return self._items[i]

 def __iter__(self):
   return iter(self._items)

 def __getitem__(self, i):
   return self._items[i]
msg103180 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-04-15 03:09
Attaching a draft of a sorted collection class.

Any comments?
msg103294 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2010-04-16 06:05
Looks nice to me, however I wonder if there isn't some overlap with the requests to add a key= kwarg to heapq methods (and the discussion about adding a Heap class there).
msg113222 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-08 01:03
Added a link to the a SortedCollections recipe, added example of how to do searches, and noted the O(n) performance of the insort() functions.

See r83786
msg115176 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2010-08-29 08:26
This issue came up on #python IRC, and that combined with the number of times this has been duplicated makes me think that maybe the mention of the SortedCollection recipe should be a little more prominent.

Perhaps either moved up by the method list, or something like: "Note: No key argument is available because of performance issues.  Please consider either the SortedCollection recipe or altering the objects comparison methods.

?  It just seems like it keeps coming up, and in this case came up after the recipe reference was added.
msg115189 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-29 19:36
> maybe the mention of the SortedCollection recipe should 
> be a little more prominent.

Thanks for the suggestion.  Will look at moving the note higher on the page.
msg115288 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-09-01 06:59
Fixed in r84383.
msg152967 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-02-09 17:42
On python-ideas, thread "Optional key to `bisect`'s functions?" Guido wrote "Bingo. That clinches it. We need to add key=." 'That' being the fact that values that have keys may not be comparable themselves (in py3), so that comparing (key,value) pairs may raise TypeError. This follows a reply by him yesterday saying that he has wanted this feature occasionally. I am re-opening this rather than a new issue because there is already a patch with tests.
msg152969 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-02-09 17:48
I'm +1 on adding this feature. See discussion in python-ideas.

PS. It should also be added to heapq.
msg152983 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-02-09 20:39
Already on tracker as
#13742 Add a key parameter (like sorted) to heapq.merge
msg208020 - (view) Author: Dima Tisnek (Dima.Tisnek) Date: 2014-01-13 11:15
I've worked around this in 2.6/2.7 like this:

class Arr:
    def __getitem__(self, i):
        return foo(i)  # your key function
    def __len__(self):
        return 10000000  # your max index value

bisect.bisect(Arr(), value, ...)
History
Date User Action Args
2014-03-06 22:42:10josh.rosenbergsetnosy: + josh.rosenberg
2014-02-10 15:59:50gvanrossumsetnosy: - gvanrossum
2014-02-10 11:11:14vadmiumsetnosy: + vadmium
2014-01-13 11:15:34Dima.Tisneksetnosy: + Dima.Tisnek
messages: + msg208020
2012-09-12 20:15:07terry.reedysetversions: + Python 3.4, - Python 3.3
2012-02-09 20:39:30terry.reedysetmessages: + msg152983
2012-02-09 17:48:58gvanrossumsetmessages: + msg152969
2012-02-09 17:42:24terry.reedysetstatus: closed -> open
priority: low -> normal

components: + Library (Lib), - Documentation
versions: + Python 3.3, - Python 3.0, Python 2.7
nosy: + gvanrossum

messages: + msg152967
resolution: fixed ->
stage: patch review
2010-09-01 06:59:16rhettingersetstatus: open -> closed
resolution: duplicate -> fixed
messages: + msg115288
2010-08-29 19:36:36rhettingersetstatus: closed -> open

messages: + msg115189
components: + Documentation, - Library (Lib)
nosy: rhettinger, terry.reedy, jafo, tebeka, mark.dickinson, alex, milko.krachounov, bls
2010-08-29 08:26:09jafosetnosy: + jafo
messages: + msg115176
2010-08-08 01:03:41rhettingersetstatus: open -> closed

messages: + msg113222
2010-04-16 06:05:05alexsetnosy: + alex
messages: + msg103294
2010-04-16 03:39:54rhettingersetfiles: + SortedCollection.py
2010-04-16 03:39:26rhettingersetfiles: - SortedCollection.py
2010-04-15 03:09:24rhettingersetfiles: + SortedCollection.py

messages: + msg103180
2010-04-15 00:30:21rhettingersetpriority: low

messages: + msg103158
2010-04-14 22:01:07blssetmessages: + msg103155
2010-04-14 20:40:41rhettingersetstatus: closed -> open

messages: + msg103145
2010-04-14 20:09:25blssetnosy: + bls
messages: + msg103143
2009-11-02 17:51:02rhettingersetassignee: rhettinger
2009-11-02 17:43:47milko.krachounovsetfiles: + bench_bisect_key.py
2009-11-02 17:43:13milko.krachounovsetfiles: + bisect-py3k.patch
2009-11-02 17:42:51milko.krachounovsetfiles: + bisect-trunk.patch

nosy: + milko.krachounov
messages: + msg94836

keywords: + patch
2008-11-22 02:30:02terry.reedysetnosy: + terry.reedy
messages: + msg76230
2008-11-21 20:45:19tebekasetmessages: + msg76203
2008-11-20 12:58:08mark.dickinsonsetmessages: + msg76099
2008-11-20 12:52:32rhettingersetstatus: open -> closed
resolution: duplicate
2008-11-20 12:44:07rhettingersetmessages: + msg76098
2008-11-20 10:11:09mark.dickinsonsetmessages: + msg76092
2008-11-20 00:05:09rhettingersetnosy: + rhettinger
messages: + msg76084
2008-11-19 21:39:57mark.dickinsonsetnosy: + mark.dickinson
messages: + msg76073
2008-11-19 17:17:47tebekasettype: enhancement
components: + Library (Lib)
versions: + Python 3.0, Python 2.7
2008-11-19 17:17:27tebekacreate