Title: add optional make_key argument to lru_cache
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.10
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Itayazolay, Jim.Jewett, bar.harel, fbidu, rhettinger
Priority: normal Keywords: patch

Created on 2020-07-06 16:51 by Itayazolay, last changed 2020-07-18 02:11 by rhettinger. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 21353 closed Itayazolay, 2020-07-06 16:52
Messages (11)
msg373141 - (view) Author: Itay azolay (Itayazolay) * Date: 2020-07-06 16:51
I'd like to add optional argument to lru_cache.
This argument is a user given function that will replace the default behaviour of creating a key from the args/kwds of the function.

for example:

def my_make_key(my_list):
  return my_list[0]

@lru_cache(128, make_key=my_make_key)
def cached_func(my_list):
  return sum(my_list)

This will creating a cached function that accepts immutable.
Also, It will allow user to add custom functions from knowledge about the expected function input, without the need to create custom classes and/or overriding __hash__
msg373153 - (view) Author: Bar Harel (bar.harel) * Date: 2020-07-06 17:47
May I suggest calling it key? Will comply with other Python functions.
msg373197 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-07-06 23:47
>>> from functools import lru_cache
>>> def my_make_key(my_list):
...   return my_list[0]
>>> @lru_cache(128, make_key=my_make_key)
... def cached_func(my_list):
...   return sum(my_list)
>>> cached_func([10, 20, 30])
>>> cached_func([10, 11, 12]) # <-- Why would we want this to return 60?

This seems unsafe.
msg373209 - (view) Author: Itay azolay (Itayazolay) * Date: 2020-07-07 07:18
Yes, you're right.
It's a bad example, I tried to simplify it, and I ended up oversimplifying it.
Real-life cases are of course more complicated.
What I wanted to accomplish, is very similar to the `key` argument in sorted/min/max/etc..
let my try and give you an example.

assume we have a complex data type, with a timestamp-signed attribute.
our single item will look as follows:
SingleItem = (unique_timestamp, <mutable_data_structure> )
now assume we have a function "extensive_computation"

def extensive_computation(items: List[SingleItem]):
  # very hard work

As developer, I know that the every item has unique timestamp.
So for a list of N timestamp, when they are the same, the result of the computation will be the same.

def item_cache_key(items: List[SingleItem]):
  return (timestamp for timestamp, data in items)

I would like to then create:

@lru_cache(128, key=item_cache_key):
def cache_extensive_computation(items):

Does that makes more sense?
msg373212 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-07-07 09:23
Thanks, I see what you're trying to do now:

1) Given a slow function 
2) that takes a complex argument 
   2a)  that includes a hashable unique identifier 
   2b)  and some unhashable data
3) Cache the function result using only the unique identifier

The lru_cache() currently can't be used directly because
all the function arguments must be hashable.

The proposed solution:
1) Write a helper function
   1a) that hash the same signature as the original function
   1b) that returns only the hashable unique identifier
2) With a single @decorator application, connect
   2a) the original function
   2b) the helper function
   2c) and the lru_cache logic

A few areas of concern come to mind:

* People have come to expect cached calls to be very cheap, but it is easy to write input transformations that aren't cheap (i.e. looping over all the inputs as in your example or converting entire mutable structures to immutable structures).

* While key-functions are relatively well understood, when we use them elsewhere key-functions only get called once per element.  Here, the lru_cache() would call the key function every time even if the arguments are identical.  This will be surprising to some users.

* The helper function signature needs exactly match the wrapped function.  Changes would need to be made in both places.

* It would be hard to debug if the helper function return values ever stop being unique.  For example, if the timestamps start getting rounded to the nearest second, they will sporadically become non-unique.

* The lru_cache signature makes it awkward to add more arguments.  That is why your examples had to explicitly specify a maxsize of 128 even though 128 is the default. 

* API simplicity was an early design goal.  Already, I made a mistake by accepting the "typed" argument which is almost never used but regularly causes confusion and affects learnability.

* The use case is predicated on having a large unhashable dataset accompanied by a hashable identifier that is assumed to be unique.  This probably isn't common enough to warrant an API extension.  

Out of curiosity, what are you doing now without the proposed extension?  

As a first try, I would likely write a dataclass to be explicit about the types and about which fields are used in hashing and equality testing:

    class ItemsList:
        unique_id: float
        data: dict = field(hash=False, compare=False)

I expect that dataclasses like this will emerge as the standard solution whenever people need a mapping or dict to work with keys that have a mix of hashable and unhashable components.  This will work with the lru_cache(), dict(), defaultdict(), ChainMap(), set(), frozenset(), etc.
msg373285 - (view) Author: Itay azolay (Itayazolay) * Date: 2020-07-08 08:03
Thanks, you have some very good points.
Let my try to address them

* cache functions really are expected to be cheap, but what they really need to be is *cheaper*. If my computation is expensive enough, I might be okay with making a less, still somewhat expensive computation instead. I believe it's for the developer to decide.

* key is usually used in a sequence of elements contexts, but it is expected to run multiple times. I believe that this is expected(what else could someone expect to happen?). I believe this is solvable through good docs(or change the name of the parameter?) 

* I believe that a matching signature key-make function is a good thing. It will enforce the user to address the key-make function if he changes the behaviour of the cached function, and he would rethink the cache, otherwise it will not work.

* I can't argue about API simplicity, you probably have much more experience there. However, I believe that if we can agree that this is a useful feature, we can find a way to make the API clear and welcoming.
BTW, I agree with the problems with the typed argument, never quite understood when can this be useful.

I'd like to compare the key argument suggested here, to key argument through other python functions. let's take `sorted` as example.
sorted supports key to be able to sort other types of data structures,
even though I like your suggestion, to use dataclass, I believe that if it's applicable here, we can say the same thing for sorted.
we could require sorted to work the same way:

@total_ordering # If I'm not mistaken
class MyData:
   def __gt__(self, other):
     return self.field > other.field


I think we both see the reason why this wouldn't be optimal in some cases here.
Without the key function, the sorted function doesn't support a big part of python objects.
I think the same applies for LRU cache. Right now, we just can use it with all python objects. we have to change the API, the way we move data around, the way we keep our objects, just so that lru_cache would work.
And after all that, lru_cache will just break if someone send some data in a list instead of tuple. I think that cause a lot of developers to give up the default stdlib lru_cache.

In my case, I have a few list of lists, each list indicates an event that happened. In each event, there is a unique timestamp. 
I have an object, that have few different lists
class Myobj:
    events_1: List[list]
    events_2: List[list]

I have a small, esoteric function, that looks like that now:
def calc(list_of_events):
  # calculation

and is being called from multiple places in the code, which takes a lot of time, like that
calc(events_1) # multiple times
calc(events_2) # multiple times

I wanted to cache the function calc, but now I have to do something like that:
def calc_events_1(myobj):

def calc_events_2(myobj):

right now I can't change the API of the lists, because they are being used in multiple places, some of this least(I have multiple events-lists) are being converted to numpy, some doesn't.

Regarding API, we could make it simpler by either use must have kwargs, like lru_cache(maxsize, typed, *, key=None)
or, like the property setters/getters case
def function(args, ...):
@function.make_key # or key, whatever name is good
def _(args, ...):
  return new_key

However I like the second option less.

msg373505 - (view) Author: Felipe Rodrigues (fbidu) * Date: 2020-07-11 02:48
Hello all,

I love discussions about caching! While I do see the point of Itayazolay's proposal, I think that it should be on the _caller_ to solve the caching issue, and not on the _callee_ to provide a way to make it happen.

That is, I think that if one wants to use a cache, they should make sure their arguments are hashable and not that we should modify called functions to provide workarounds for those.
msg373506 - (view) Author: Felipe Rodrigues (fbidu) * Date: 2020-07-11 02:50
Correction: (...) on the _caller_ to solve the hashing* issue (...)
msg373522 - (view) Author: Itay azolay (Itayazolay) * Date: 2020-07-11 10:18
Hey Felipe! Thanks for stepping in!
I do get your argument. 
However, in my opinion, I try to argue the same thing for max or sorted.
"if one wants to use `sorted`, they should make sure their arguments are comparable".
However, it is not the case, since we do have the `key` argument for sorted or max. 
Also, I don't believe caching equals hashing. 
Maybe from the technical point view, it does, but in reality, One can(and probably will) cache unhashable object, whether we give the option to do so or not.
I think, embedding the key argument in lru_cache, we allow the caller(developer) to solve the caching issue, in a way that is right according to his implementation of the cached function and its arguments.

Unrelated, this is my first feature proposal for python. I want to thank you for taking the time to think and answer with some very good arguments and respect, I truly enjoy this little debate we have here :)
msg373535 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2020-07-11 18:54
Going back to Raymond's analysis, this is useful when at least some of the parameters either do not change the result, or are not hashable.

At a minimum, you need to figure out which parameters those are, and whether to drop them or transform them.

Is this already sufficiently rare or tricky that a subclass is justified, instead of trying to shoehorn things into a single key method?
msg373869 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-07-18 02:11
I've left this open for a week to collect comments.  I concur with Felipe that this should be the responsibility of the caller.  And I concur with Jim that the use cases are likely too rare to warrant shoehorning in the extra functionality.

For my part, I'm concerned that this would be bug bait.  The result of the function would need to be uniquely associated with a particular output, independent of the contents of the mutable arguments.  I don't think it is good design for function calls to be disregarding the part of the input that was originally used to compute the output — in a way that notion conflict with the core idea of the of the lru_cache giving the same outputs for the same inputs.

Likewise, debugging and exception handing would be complicated by having two underlying user functions.  Exceptions could arise from three sources: the make_key() function, the cached function, and the lru cache internals.

Further, if the make_key() function turns out not to be unique, the ensuring bugs would be hard to find.  In the OP's later example, if we get two distinct inputs which happen to have the same timestamp, the error would pass without warning and would be hard to reproduce.  Similarly, the make_key() function has to assume that the mutable portion of the data hasn't mutated, but a key feature of mutable data is that it can mutate.  That too would result in a hard to find bug.

Also, while the OP has a specific use case in mind, we can't know in advance how others will use make_key().  It could be misused to convert mutable data to immutable data, possibly resulting in a cached function that is slower than the original (caching mean() for example).  Or the make_key() function could had side-effects.  The OP's original "my_list[0]" example showed yet another way that bugs could arise. It is problematic that that particular bug might not have been caught by a test suite that didn't know to call the function twice with a common first argument but with differing subsequent arguments.

Another possible bug trap is that it would make it easier to accidentally cache mutable function results without getting a warning.  For example:
      @lru_cache(128, make_key = lambda target, size, uniq_id, data: uniq_id)
      def search(target, size, uniq_id, data):
           i = data.index(target)
           return data[i : i+size]    # <== bug because this is mutable if data is mutable
I think the suggestion was a neat idea (and thought provoking), but I think we should decline because 1) the use case is uncommon 2) because it makes the API harder to understand 3) because it is bug bait and 4) because the responsibility should probably be with the caller.

Thank you for the suggestion.  It was inspired.  Don't be discouraged — many of my proposals don't get accepted as well.  Please continue to submit suggestions.

P.S. make_key() shouldn't be called a key-function so that we don't make that term ambiguous.  Elsewhere key-functions are all about comparison logic rather than hashing.  They can be created by functools.cmp_to_key(), they tend to be used only once per data element, and the same function tends to be inoperable with all the tools that accept key-functions.  So the term make_key() was likely the better terminology.  Perhaps get_unique_id() would have been even less ambiguous.
Date User Action Args
2020-07-18 02:11:25rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg373869

stage: patch review -> resolved
2020-07-11 18:54:42Jim.Jewettsetnosy: + Jim.Jewett
messages: + msg373535
2020-07-11 10:18:45Itayazolaysetmessages: + msg373522
2020-07-11 02:50:09fbidusetmessages: + msg373506
2020-07-11 02:48:06fbidusetnosy: + fbidu
messages: + msg373505
2020-07-08 08:03:26Itayazolaysetmessages: + msg373285
2020-07-07 09:23:27rhettingersetmessages: + msg373212
2020-07-07 07:18:54Itayazolaysetmessages: + msg373209
2020-07-06 23:53:59rhettingersetmessages: - msg373196
2020-07-06 23:47:44rhettingersetmessages: + msg373197
2020-07-06 23:40:31rhettingersetmessages: + msg373196
2020-07-06 23:39:09rhettingersetmessages: - msg373194
2020-07-06 23:34:45rhettingersetassignee: rhettinger
messages: + msg373194
2020-07-06 17:50:31xtreaksetnosy: + rhettinger
2020-07-06 17:47:50bar.harelsetnosy: + bar.harel
messages: + msg373153
2020-07-06 16:52:40Itayazolaysetkeywords: + patch
stage: patch review
pull_requests: + pull_request20499
2020-07-06 16:51:10Itayazolaycreate