filecmp.cmp needs a documented way to clear cache #56011
Comments
I have a program which calls filecmp.cmp a lot. Without periodically clearing filecmp's internal cache by hand, the cache uses up all the memory in the computer. There needs to be a documented interface to clear the cache; I suggest a function for that purpose. Without a documented interface, there is no standard way to clear the cache. Alternatively, one might disable the caching code. One shouldn't have to look at the source code of a library function to work around its memory use.
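A sketch of the workaround being described: reaching into filecmp's internal, undocumented `_cache` dictionary. Note that `_cache` is a CPython implementation detail, which is exactly the reporter's complaint; a documented `filecmp.clear_cache()` was eventually added in Python 3.4.

```python
import filecmp
import os
import tempfile

# Create two small files with identical contents.
d = tempfile.mkdtemp()
f1 = os.path.join(d, "a.txt")
f2 = os.path.join(d, "b.txt")
for path in (f1, f2):
    with open(path, "w") as fh:
        fh.write("same contents\n")

# A deep comparison (shallow=False) populates filecmp's internal cache.
assert filecmp.cmp(f1, f2, shallow=False)

# Workaround at the time: clear the undocumented module-level cache.
filecmp._cache.clear()
assert len(filecmp._cache) == 0
```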
I've looked at the code for Python 3, and there isn't anything there that addresses this. An alternative approach would be to limit the size of the cache, so that it cannot grow without bound.
Putting in a size limit is reasonable. We did this for fnmatch not that long ago (bpo-7846). That was in fact the inspiration for lru_cache.
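The fnmatch-style fix mentioned here amounts to bounding a cache with `functools.lru_cache(maxsize=...)`. A minimal illustration (the function `expensive` is a stand-in, not filecmp's real code):

```python
import functools

# Bound the cache: once maxsize entries are held, the least recently
# used entry is evicted on each new insertion.
@functools.lru_cache(maxsize=100)
def expensive(key):
    return key * 2  # placeholder for an expensive computation

# Hammer the cache with many distinct keys; memory stays bounded.
for i in range(1000):
    expensive(i % 250)

info = expensive.cache_info()
assert info.maxsize == 100
assert info.currsize <= 100
```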
Patch for 3.3 and 3.2.
Patch for 2.7.
Oops, there was a typo in the 2.7 patch ("import _thread" instead of "import thread" — the module is named thread in 2.7).
Why use an ordered dict instead of functools.lru_cache?
Because the lru_cache decorator doesn't provide any way to invalidate individual entries. Perhaps I should factor out the duplicated code into a separate class.
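The ordered-dict approach being defended can be sketched as a small size-bounded mapping with per-entry invalidation, which `lru_cache` does not offer (it only has an all-or-nothing `cache_clear()`). The class below is illustrative, not the patch itself:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal size-bounded mapping with per-entry invalidation."""

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key, default=None):
        try:
            self._data.move_to_end(key)   # mark as most recently used
        except KeyError:
            return default
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

    def invalidate(self, key):
        """Drop a single stale entry -- the ability lru_cache lacks."""
        self._data.pop(key, None)

cache = LRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)            # evicts "a", the least recently used
assert cache.get("a") is None
cache.invalidate("b")        # selectively drop one entry
assert cache.get("b") is None
assert cache.get("c") == 3
```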
I question whether this should be backported. Please discuss with the RM.
Will do. Are you referring specifically to 2.7, or to 3.1 and 3.2 as well?
Georg? Benjamin? Do you think this fix should be backported?
-1 on backporting.
OK. I'll try to put together something cleaner just for 3.3, then.
Nadeem, I want to review this but won't have a chance to do it right away. Offhand, it seems like we could use the existing functools.lru_cache() for this if the stats were included as part of the key: cache[f1, f2, s1, s2] = outcome. Also, I want to take a fresh look at the cache strategy (saving diffs of two files vs saving file contents individually) and think about whether that makes any sense at all for real-world use cases (is there a common need to compare the same file pairs over and over again, or is the typical use the comparison of many different file pairs?). There may even be a better way to approach the underlying problem using hashes of entire files (md5, sha1, etc.).
There are many possible solutions to this problem. However, there have been several suggestions for simple fixes that don't change the API, all of which fix the resource leak. Doing nothing will not fix the resource leak. How about a simple fix right now, using an LRU cache, fixing all versions of Python, and perhaps coming up with a superior solution at a later date?
We will do something. The next release isn't for a while, so there is time to put thought into it rather than making an immediate check-in.
I like that idea. A hash-based approach could speed up the detection of identical files.
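The hash-based idea floated above can be sketched like this: digest each file once, then compare digests, so every file is read only once even across many pairwise checks. (Equal digests are not absolute proof of equal contents — collisions are possible — so a production version would fall back to a byte comparison on a digest match.)

```python
import hashlib
import os
import tempfile

def file_digest(path, chunksize=1 << 16):
    """Read a file in chunks and return its SHA-1 digest."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunksize), b""):
            h.update(chunk)
    return h.digest()

# Three files: two identical, one different.
d = tempfile.mkdtemp()
paths = []
for name, data in [("a", b"spam"), ("b", b"spam"), ("c", b"eggs")]:
    p = os.path.join(d, name)
    with open(p, "wb") as fh:
        fh.write(data)
    paths.append(p)

# One read per file, then any number of cheap pairwise comparisons.
digests = {p: file_digest(p) for p in paths}
assert digests[paths[0]] == digests[paths[1]]
assert digests[paths[0]] != digests[paths[2]]
```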
New changeset 11568c59d9d4 by Raymond Hettinger in branch '2.7':
New changeset 2bacaf6a80c4 by Raymond Hettinger in branch '3.2':
New changeset 8f4466619e1c by Raymond Hettinger in branch 'default':
Made a simple fix to keep the cache from growing without bound.
After more thought, will just close this report. If a new project emerges to improve the design of filecmp, it can be done in a separate tracker entry.