Message 63503 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	timehorse
Recipients	gvanrossum, karlb, rhettinger, timehorse
Date	2008-03-13.12:36:22
SpamBayes Score	0.0024102472
Marked as misclassified	No
Message-id	<1205411785.77.0.785708243016.issue1619060@psf.upfronthosting.co.za>
In-reply-to

Content
Thanks Raymond for checking on the status of this issue and Guido for thinking about it too. I still support the basic concept for the reasons I specified in the original description, namely maintaining a sorted list (or random-access iterable). Thus, although in some ways it seems like syntactic sugar to make bisect and heap match the parameter list of sort, there is good reason to do so. That having been said, the re-computation of the /key/ function for typically the same values at each invocation of an insort_* method can potentially add a lot of otherwise unnecessary calls. So, having thought about this further, it seems to me that, somehow, the bisect/insort routines functions need to cache the results of each evaluation of key in the same way that sort can maintain that information in a local variable. Ideally, IMHO, the keys computed in sort could be cached and stored with the list upon which it was operated, but this is cumbersome, would not work for generic types and would be invalid if for some reason the sort key was different from the bisect key. Typically the later case would not occur but since it might lead to unexpected behavior it should definitely be avoided. Instead, I am now thinking that the only way we can safely maintain caching information is to implement a persistent object to contain it that can be shared between each invocation of bisect/insort so as to not force re-evaluation of the same key argument. This clearly could be done with a class. So I would propose that a bisect class be created that would hold the list, the current key function and the cached key values. Of course, special care would need to be taken to take into account mutable sequences that could be changed between bisect calls. The class could be implements in a number of ways. For instance, it could create a callable instance that accepts a list and other parameters. It will then look at its state, including a check of /is/ with the input and the last list operated upon, and if that test fails, the 'key cache' is invalidated. Next, if a key method is provided, it is checked against the last key function used (/is/) and if it is not, the 'key cache' is invalidated. Then, to prevent mutability problems, the 'key cache' would actually be implemented as a dictionary mapping item to key-value and each time a key needed to be computed, the input would first be checked against the cache and only if it was not there would the key function be called. This should allow for preventing undefined behavior when handling invalid cases. Of course, if the client wishes to use bisect on 2 different lists or 2 different keys, sie could always just create 2 different bisect objects. There are of course other ways to accomplish this, such as binding the bisect object to a list and key at construction that cannot be reset (at least, should not be) once invoked. I am also wary of implementing the 'key cache' as a dictionary as it adds a hash-table lookup which is already potentially expensive. Ideally, if the bisect object could force its associated list to be immutable for the life of the bisect, this would get around the problem of external inserts and deletes that would invalidate the 'key cache'. Obviously, there may be another way to define such a class but I think the only way you can safely make sure the bisect functions don't unnecessarily compute key without modifying the list is to allow for the caching of values between invocations, and the only way you can do that safely is by putting the cache in some helper class. Of course, a class is only needed for when key is used. comp and reverse, AFAIK, are not cached when sorting. So the existing, generic bisect algorithms would still be useful and in fact, you'd want the bisect class to be optional, IMHO. Of course, Guido's suggestion of just mapping your list into a list of (key, value) tuples is a good workaround for just the key case (assuming key, comp and reverse flags are not all used at the same time for sorting). It probably should be a cookbook item that could be documented with the bisect (and heap) libraries. It can't completely solve the problem, but it will solve some conditions in the short term. Long term, I still think we should consider bisect, potentially with a class definition.

Thanks Raymond for checking on the status of this issue and Guido for 
thinking about it too.  I still support the basic concept for the 
reasons I specified in the original description, namely maintaining a 
sorted list (or random-access iterable).  Thus, although in some ways it 
seems like syntactic sugar to make bisect and heap match the parameter 
list of sort, there is good reason to do so.

That having been said, the re-computation of the /key/ function for 
typically the same values at each invocation of an insort_* method can 
potentially add a lot of otherwise unnecessary calls.

So, having thought about this further, it seems to me that, somehow, the 
bisect/insort routines functions need to cache the results of each 
evaluation of key in the same way that sort can maintain that 
information in a local variable.  Ideally, IMHO, the keys computed in 
sort could be cached and stored with the list upon which it was 
operated, but this is cumbersome, would not work for generic types and 
would be invalid if for some reason the sort key was different from the 
bisect key.  Typically the later case would not occur but since it might 
lead to unexpected behavior it should definitely be avoided.

Instead, I am now thinking that the only way we can safely maintain 
caching information is to implement a persistent object to contain it 
that can be shared between each invocation of bisect/insort so as to not 
force re-evaluation of the same key argument.  This clearly could be 
done with a class.

So I would propose that a bisect class be created that would hold the 
list, the current key function and the cached key values.  Of course, 
special care would need to be taken to take into account mutable 
sequences that could be changed between bisect calls.

The class could be implements in a number of ways.  For instance, it 
could create a callable instance that accepts a list and other 
parameters.  It will then look at its state, including a check of /is/ 
with the input and the last list operated upon, and if that test fails, 
the 'key cache' is invalidated.  Next, if a key method is provided, it 
is checked against the last key function used (/is/) and if it is not, 
the 'key cache' is invalidated.  Then, to prevent mutability problems, 
the 'key cache' would actually be implemented as a dictionary mapping 
item to key-value and each time a key needed to be computed, the input 
would first be checked against the cache and only if it was not there 
would the key function be called.  This should allow for preventing 
undefined behavior when handling invalid cases.  Of course, if the 
client wishes to use bisect on 2 different lists or 2 different keys, 
sie could always just create 2 different bisect objects.  There are of 
course other ways to accomplish this, such as binding the bisect object 
to a list and key at construction that cannot be reset (at least, should 
not be) once invoked.  I am also wary of implementing the 'key cache' as 
a dictionary as it adds a hash-table lookup which is already potentially 
expensive.  Ideally, if the bisect object could force its associated 
list to be immutable for the life of the bisect, this would get around 
the problem of external inserts and deletes that would invalidate the 
'key cache'.

Obviously, there may be another way to define such a class but I think 
the only way you can safely make sure the bisect functions don't unnecessarily compute key without modifying the list is to allow for the 
caching of values between invocations, and the only way you can do that 
safely is by putting the cache in some helper class.  Of course, a class 
is only needed for when key is used.  comp and reverse, AFAIK, are not 
cached when sorting.  So the existing, generic bisect algorithms would 
still be useful and in fact, you'd want the bisect class to be optional, 
IMHO.

Of course, Guido's suggestion of just mapping your list into a list of 
(key, value) tuples is a good workaround for just the key case (assuming 
key, comp and reverse flags are not all used at the same time for 
sorting).  It probably should be a cookbook item that could be 
documented with the bisect (and heap) libraries.  It can't completely 
solve the problem, but it will solve some conditions in the short term.  
Long term, I still think we should consider bisect, potentially with a 
class definition.

History
Date	User	Action	Args
2008-03-13 12:36:26	timehorse	set	spambayes_score: 0.00241025 -> 0.0024102472 recipients: + timehorse, gvanrossum, rhettinger, karlb
2008-03-13 12:36:25	timehorse	set	spambayes_score: 0.00241025 -> 0.00241025 messageid: <1205411785.77.0.785708243016.issue1619060@psf.upfronthosting.co.za>
2008-03-13 12:36:25	timehorse	link	issue1619060 messages
2008-03-13 12:36:22	timehorse	create