classification
Title: filecmp.cmp needs a documented way to clear cache
Type: resource usage Stage: needs patch
Components: Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: benjamin.peterson, eric.araujo, georg.brandl, lopgok, nadeem.vawda, python-dev, r.david.murray, rhettinger
Priority: normal Keywords: patch

Created on 2011-04-08 03:37 by lopgok, last changed 2011-06-26 09:07 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
filecmp-lru-cache-3.3.diff nadeem.vawda, 2011-04-08 16:54 Limit the size of filecmp._cache using LRU replacement review
filecmp-lru-cache-2.7.diff nadeem.vawda, 2011-04-08 17:09 Corrected version of 2.7 patch
Messages (21)
msg133290 - (view) Author: jeff deifik (lopgok) Date: 2011-04-08 03:37
I have a program which calls filecmp.cmp a lot.
It runs out of memory.
I read the source to filecmp, and then I periodically set
filecmp._cache = {}

Without doing this, filecmp's cache uses up all the memory in the computer.

There needs to be a documented interface to clear the cache.

I suggest a function
def clear_cache:
    _cache = {}

Without a documented interface, there is no standard way to clear the
cache. It is possible different versions of python will require
different methods to clear the cache, which will reduce python code
portability and is a bad idea.

Alternatively, one might disable the caching code.

One shouldn't have to look at the source code of a library function
to see why it is consuming memory.
msg133300 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-08 09:16
I've looked at the code for Python 3, and there isn't anything there that
prevents this from happening there, either. So the fix should be applied
to 3.2 and 3.3 as well.

An alternative approach would be to limit the size of the cache, so that
the caller doesn't need to explicitly clear the cache. Something along
the lines of functools.lru_cache() should do the trick. I don't think
it'll be possible to use lru_cache() itself, though - it doesn't provide
a mechanism to invalidate cache entries when they become stale (and in
any case, it doesn't exist in 2.7).
msg133304 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-08 11:09
Putting in a size limit is reasonable.  We did this for fnmatch not that long ago (issue 7846).  That was in fact the inspiration for lru_cache.
msg133326 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-08 16:54
Patch for 3.3 and 3.2
msg133327 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-08 16:55
Patch for 2.7.
msg133328 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-08 17:09
Oops, there was a typo in the 2.7 patch ("import _thread" instead of
"import thread"). Corrected patch attached.
msg133386 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-04-09 13:49
Why use an ordered dict instead of functools.lru_cache?
msg133394 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-09 14:40
Because the lru_cache decorator doesn't provide any way to invalidate
stale cache entries.

Perhaps I should factor out the duplicated code into a separate class
that can then also be exposed to users of the stdlib. But that would only
apply to 3.3, so the uglier fix is still necessary for older versions.
msg133398 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-04-09 15:57
I question whether this should be backported.  Please discuss with the RM.
msg133401 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-09 16:46
> I question whether this should be backported.  Please discuss with the RM.

Will do. Are you referring specifically to 2.7, or to 3.1 and 3.2 as well?
msg133470 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-04-10 15:33
Georg? Benjamin? Do you think this fix should be backported?
msg136009 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-05-15 06:37
-1 on backporting.
msg136029 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-05-15 12:31
OK. I'll try to put together something cleaner just for 3.3, then.
msg137773 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-06-06 19:59
Nadeem, I want to review this but won't have a chance to do it right away.  Offhand, it seems like we could use the existing functools.lru_cache() for this if the stats were included as part of the key:  cache[f1, f2, s1, s2]=outcome.

Also, I want to take a fresh look at the cache strategy (saving diffs of two files vs saving file contents individually) and think about whether than makes any sense at all for real world use cases (is there a common need to compare the same file pairs over and over again or is the typical use the comparison of many different file pairs).   There may even be a better way to approach the underlying problem using hashes of entire files (md5, sha1, etc).
msg137774 - (view) Author: jeff deifik (lopgok) Date: 2011-06-06 20:17
There are many possible solutions to this problem.
Personally, I think mine is the simplest, though it changes the API.

However, there have been several suggestions on simple fixes that don't change the API, all of which fix the resource leak.

Doing nothing will not fix the resource leak.

How about a simple fix right now, using a lru cache, fixing all versions of Python, and perhaps come up with a superior solution at a later date?
msg137775 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-06-06 20:28
We will do something.  The next release isn't for a while, so there is time to put thought into it rather than making an immediate check-in.
msg137782 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2011-06-07 00:30
> Also, I want to take a fresh look at the cache strategy (saving diffs
> of two files vs saving file contents individually) and think about
> whether than makes any sense at all for real world use cases
> (is there a common need to compare the same file pairs over and over
> again or is the typical use the comparison of many different file
> pairs).   There may even be a better way to approach the underlying
> problem using hashes of entire files (md5, sha1, etc).

I like that idea. A hash-based approach could speed up the detection of
non-equal files quite a bit.
msg139083 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-06-25 15:14
New changeset 11568c59d9d4 by Raymond Hettinger in branch '2.7':
Issue 11802:  filecmp cache was growing without bound.
http://hg.python.org/cpython/rev/11568c59d9d4
msg139085 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-06-25 15:21
New changeset 2bacaf6a80c4 by Raymond Hettinger in branch '3.2':
Issue 11802:  filecmp cache was growing without bound.
http://hg.python.org/cpython/rev/2bacaf6a80c4

New changeset 8f4466619e1c by Raymond Hettinger in branch 'default':
Issue 11802:  filecmp cache was growing without bound.
http://hg.python.org/cpython/rev/8f4466619e1c
msg139086 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-06-25 15:24
Made a simple fix to keep the cache from growing without bound.
Leaving this open for 3.3 as a feature request to implement a more sophisticated strategy using file hashes or somesuch.
msg139155 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-06-26 09:07
After more thought, will just close this report.  If a new project emerges to improve the design of filecmp, it can be done in a separate tracker entry.
History
Date User Action Args
2011-06-26 09:07:25rhettingersetstatus: open -> closed
resolution: later -> fixed
messages: + msg139155
2011-06-25 15:24:50rhettingersetmessages: + msg139086
2011-06-25 15:21:29python-devsetmessages: + msg139085
2011-06-25 15:15:00python-devsetnosy: + python-dev
messages: + msg139083
2011-06-07 00:30:18nadeem.vawdasetmessages: + msg137782
2011-06-06 20:28:43rhettingersetmessages: + msg137775
2011-06-06 20:17:36lopgoksetmessages: + msg137774
2011-06-06 19:59:26rhettingersetassignee: nadeem.vawda -> rhettinger
resolution: later
messages: + msg137773
2011-05-15 12:31:07nadeem.vawdasetassignee: nadeem.vawda
stage: patch review -> needs patch
messages: + msg136029
versions: - Python 3.1, Python 2.7, Python 3.2
2011-05-15 06:37:36georg.brandlsetmessages: + msg136009
2011-04-10 15:33:48nadeem.vawdasetnosy: + georg.brandl, benjamin.peterson
messages: + msg133470
2011-04-09 16:46:20nadeem.vawdasetmessages: + msg133401
2011-04-09 15:57:16rhettingersetmessages: + msg133398
2011-04-09 14:40:49nadeem.vawdasetmessages: + msg133394
2011-04-09 13:49:57eric.araujosetmessages: + msg133386
versions: + Python 3.1
2011-04-09 13:49:09eric.araujosetfiles: - filecmp-lru-cache-2.7.diff
2011-04-08 17:09:59nadeem.vawdasetfiles: + filecmp-lru-cache-2.7.diff

messages: + msg133328
2011-04-08 16:56:51nadeem.vawdasetstage: needs patch -> patch review
2011-04-08 16:55:59nadeem.vawdasetfiles: + filecmp-lru-cache-2.7.diff

messages: + msg133327
2011-04-08 16:54:38nadeem.vawdasetfiles: + filecmp-lru-cache-3.3.diff
keywords: + patch
messages: + msg133326
2011-04-08 15:55:46eric.araujosetnosy: + rhettinger, eric.araujo
2011-04-08 11:09:17r.david.murraysetnosy: + r.david.murray
messages: + msg133304
2011-04-08 09:16:42nadeem.vawdasetversions: + Python 3.2, Python 3.3
nosy: + nadeem.vawda

messages: + msg133300

type: resource usage
stage: needs patch
2011-04-08 03:37:23lopgokcreate