
Author rhettinger
Recipients Itayazolay, bar.harel, rhettinger
Date 2020-07-07.09:23:26
Content
Thanks, I see what you're trying to do now:

1) Given a slow function 
2) that takes a complex argument 
   2a)  that includes a hashable unique identifier 
   2b)  and some unhashable data
3) Cache the function result using only the unique identifier

The lru_cache() currently can't be used directly because
all the function arguments must be hashable.
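
For concreteness, here is a minimal sketch of the situation (the function
and argument names below are made up for illustration):

    from functools import lru_cache

    @lru_cache(maxsize=128)
    def analyze(measurements):
        # measurements is a dict such as:
        #    {'timestamp': 1594113807.27, 'readings': [1, 2, 3]}
        ...

    # Fails at call time because the dict argument can't be hashed:
    # TypeError: unhashable type: 'dict'
    analyze({'timestamp': 1594113807.27, 'readings': [1, 2, 3]})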

The proposed solution:
1) Write a helper function
   1a) that has the same signature as the original function
   1b) that returns only the hashable unique identifier
2) With a single @decorator application, connect
   2a) the original function
   2b) the helper function
   2c) and the lru_cache logic
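
In other words, the usage would look something like this (just a sketch;
the key= parameter name and the function names are placeholders, not
necessarily the proposed spelling):

    from functools import lru_cache

    def get_id(measurements):
        # Same signature as the wrapped function, but returns only
        # the hashable unique identifier.
        return measurements['timestamp']

    @lru_cache(128, key=get_id)       # hypothetical extension
    def analyze(measurements):
        ...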


A few areas of concern come to mind:

* People have come to expect cached calls to be very cheap, but it is easy to write input transformations that aren't cheap (e.g. looping over all the inputs as in your example, or converting entire mutable structures to immutable ones).

* While key-functions are relatively well understood, everywhere else we use them (e.g. in sorted(), min(), and max()) the key-function gets called only once per element.  Here, the lru_cache() would call the key-function on every invocation, even when the arguments are identical.  This will be surprising to some users.

* The helper function's signature needs to exactly match that of the wrapped function.  Any change to the signature would need to be made in both places.

* It would be hard to debug if the helper function's return values ever stop being unique.  For example, if the timestamps start getting rounded to the nearest second, they will sporadically become non-unique and the cache will silently return results computed for a different input (see the sketch after this list of concerns).

* The lru_cache signature makes it awkward to add more arguments.  That is why your examples had to explicitly specify a maxsize of 128 even though 128 is the default. 

* API simplicity was an early design goal.  Already, I made a mistake by accepting the "typed" argument which is almost never used but regularly causes confusion and affects learnability.

* The use case is predicated on having a large unhashable dataset accompanied by a hashable identifier that is assumed to be unique.  This probably isn't common enough to warrant an API extension.  
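
To illustrate the debugging hazard mentioned above (the names are made up),
a seemingly harmless change to the helper function can silently merge cache
entries:

    def get_id(measurements):
        # Intended to be unique, but rounding collapses distinct timestamps.
        return round(measurements['timestamp'])

    a = {'timestamp': 1594113807.27, 'readings': [1, 2, 3]}
    b = {'timestamp': 1594113807.41, 'readings': [9, 9, 9]}

    # get_id(a) == get_id(b) == 1594113807, so both inputs would map to
    # the same cache entry and the second call would silently get the
    # result computed for the first.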

Out of curiosity, what are you doing now without the proposed extension?  

As a first try, I would likely write a dataclass to be explicit about the types and about which fields are used in hashing and equality testing:

    from dataclasses import dataclass, field

    @dataclass(unsafe_hash=True)
    class ItemsList:
        unique_id: float                               # used for hashing and equality
        data: dict = field(hash=False, compare=False)  # excluded from both

I expect that dataclasses like this will emerge as the standard solution whenever people need a mapping, cache, or set to work with keys that have a mix of hashable and unhashable components.  This approach works with lru_cache(), dict(), defaultdict(), ChainMap(), set(), frozenset(), etc.
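
A cached function could then take the dataclass instance directly (again,
just a sketch with made-up names):

    from functools import lru_cache

    @lru_cache(maxsize=128)
    def analyze(items: ItemsList):
        # Hashing and equality use only unique_id, so the unhashable
        # data dict never needs to be hashed.
        ...

    analyze(ItemsList(unique_id=1594113807.27, data={'readings': [1, 2, 3]}))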