classification
Title: bisect on presorted list
Type: enhancement Stage:
Components: Extension Modules Versions: Python 2.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: gvanrossum, karlb, rhettinger, timehorse
Priority: normal Keywords:

Created on 2006-12-19 21:14 by timehorse, last changed 2008-03-24 08:31 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
bisect.py timehorse, 2006-12-19 21:14 Proposed change to bisect.py with C-Implementation disabled for testing
Messages (7)
msg54953 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2006-12-19 21:14
The python and c implementation of bisect do not support custom-sorted lists using the list.sort method.

In order to support an arbitrarily sorted list via sort(cmp, key, reverse), I have added 3 corresponding parameters to the bisect methods for bisection and insort (insert-sorted) corresponding to the parameters in sort.  This would be useful if a list is initially sorted by its sort method and then the client wishes to maintain the sort order (or reverse-sort order) while inserting an element.  In this case, being able to use the same, arbitrary binary function cmp, unary function key and boolean reverse flag to preserve the list order.

The change imposes 3 new branch conditions and potential no-op function calls for when key is None.  I have here implemented and partially tested the python implementation and if someone besides me would find this useful, I will update the _bisectmodule.c for this change as well.

The Heap functions may also find use of an arbitrary predicate function so I may look at that later, but because bisect goes hand in hand with sorting, I wanted to tackle that first.
msg54954 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2006-12-19 21:43
I'm -1 on this patch.  At first blush it would seem nice to progagate sort's notion of a key= function; however, sort() is an all at once operation that can guarantee the function gets called only once per key.  In contrast, bisect() is more granualar so consecutive calls may need to invoke the same key= function again and again.  This is almost always the the-wrong-way-to-do-it (the key function should be precomputed and either stored separately or follow a decorate-sort pattern).  By including custom sorting in bisect's API we would be diverting users away from better approaches.

A better idea would be to create a recipe for a SortedList class that performed pre-calculated custom keys upon insertion and maintained an internal, decorated list.
msg54955 - (view) Author: Karl Bartel (karlb) Date: 2007-08-17 15:09
+1 for adding a reverse parameter.
msg63489 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-03-12 22:40
Guido, what do you think about this one?

It is easy to do and has been requested several times in various forums.
The seeming reasonable basis for the request is that sort and bisect
should fit together like a nut and bolt -- it is somewhat odd to be able
to sort a list by a key function but not be able to search or insert
according to the same key.

My main reservation is that the existence of the API would tend to
encourage bad design.  Elsewhere, the intent of the key function is to
be cheaper than a cmp function (such as with sort() which calls the key
function once per element instead of many times for cmp).  But with
successive bisect/insort calls, a key function could be called multiple
times for the same value.  

The same issue also applies to heapq:
http://bugs.python.org/issue1158313 and
http://bugs.python.org/issue1162363 .
msg63491 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-03-12 23:14
Sorry, I will claim ignorance on this one. I don't recall the last time
I've used a bisection, but it was probably around the time bisect.py was
first added to the standard library.  I do recall using heap sort as a
way to compute the top N items.  I've always done that by explicitly
mapping the raw data to a list of (key, value) tuples.

The arguments worrying about bad design make some sense to me;
consistency isn't always the panacea that people want it to be...
msg63503 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-03-13 12:36
Thanks Raymond for checking on the status of this issue and Guido for 
thinking about it too.  I still support the basic concept for the 
reasons I specified in the original description, namely maintaining a 
sorted list (or random-access iterable).  Thus, although in some ways it 
seems like syntactic sugar to make bisect and heap match the parameter 
list of sort, there is good reason to do so.

That having been said, the re-computation of the /key/ function for 
typically the same values at each invocation of an insort_* method can 
potentially add a lot of otherwise unnecessary calls.

So, having thought about this further, it seems to me that, somehow, the 
bisect/insort routines functions need to cache the results of each 
evaluation of key in the same way that sort can maintain that 
information in a local variable.  Ideally, IMHO, the keys computed in 
sort could be cached and stored with the list upon which it was 
operated, but this is cumbersome, would not work for generic types and 
would be invalid if for some reason the sort key was different from the 
bisect key.  Typically the later case would not occur but since it might 
lead to unexpected behavior it should definitely be avoided.

Instead, I am now thinking that the only way we can safely maintain 
caching information is to implement a persistent object to contain it 
that can be shared between each invocation of bisect/insort so as to not 
force re-evaluation of the same key argument.  This clearly could be 
done with a class.

So I would propose that a bisect class be created that would hold the 
list, the current key function and the cached key values.  Of course, 
special care would need to be taken to take into account mutable 
sequences that could be changed between bisect calls.

The class could be implements in a number of ways.  For instance, it 
could create a callable instance that accepts a list and other 
parameters.  It will then look at its state, including a check of /is/ 
with the input and the last list operated upon, and if that test fails, 
the 'key cache' is invalidated.  Next, if a key method is provided, it 
is checked against the last key function used (/is/) and if it is not, 
the 'key cache' is invalidated.  Then, to prevent mutability problems, 
the 'key cache' would actually be implemented as a dictionary mapping 
item to key-value and each time a key needed to be computed, the input 
would first be checked against the cache and only if it was not there 
would the key function be called.  This should allow for preventing 
undefined behavior when handling invalid cases.  Of course, if the 
client wishes to use bisect on 2 different lists or 2 different keys, 
sie could always just create 2 different bisect objects.  There are of 
course other ways to accomplish this, such as binding the bisect object 
to a list and key at construction that cannot be reset (at least, should 
not be) once invoked.  I am also wary of implementing the 'key cache' as 
a dictionary as it adds a hash-table lookup which is already potentially 
expensive.  Ideally, if the bisect object could force its associated 
list to be immutable for the life of the bisect, this would get around 
the problem of external inserts and deletes that would invalidate the 
'key cache'.

Obviously, there may be another way to define such a class but I think 
the only way you can safely make sure the bisect functions don't unnecessarily compute key without modifying the list is to allow for the 
caching of values between invocations, and the only way you can do that 
safely is by putting the cache in some helper class.  Of course, a class 
is only needed for when key is used.  comp and reverse, AFAIK, are not 
cached when sorting.  So the existing, generic bisect algorithms would 
still be useful and in fact, you'd want the bisect class to be optional, 
IMHO.

Of course, Guido's suggestion of just mapping your list into a list of 
(key, value) tuples is a good workaround for just the key case (assuming 
key, comp and reverse flags are not all used at the same time for 
sorting).  It probably should be a cookbook item that could be 
documented with the bisect (and heap) libraries.  It can't completely 
solve the problem, but it will solve some conditions in the short term.  
Long term, I still think we should consider bisect, potentially with a 
class definition.
msg64402 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-03-24 08:31
Thanks for the thoughtful reply.  Am going to close for the reasons 
agreed by most of the respondants.  See additional explanation in the 
closing of issue 1451588 .

That all-at-once action of sort() allows it to take advantage of a 
single key-function call per element; in contrast, successive calls to 
bisect functions are more finely grained and the key-function doesn't 
make sense anymore.
History
Date User Action Args
2008-03-24 08:31:04rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg64402
2008-03-13 12:36:25timehorsesetmessages: + msg63503
2008-03-12 23:14:52gvanrossumsetassignee: gvanrossum -> rhettinger
messages: + msg63491
2008-03-12 22:40:55rhettingersetassignee: rhettinger -> gvanrossum
messages: + msg63489
nosy: + gvanrossum
2006-12-19 21:14:15timehorsecreate