
filecmp.cmp needs a documented way to clear cache #56011

Closed
lopgok mannequin opened this issue Apr 8, 2011 · 21 comments
Assignees
Labels
performance Performance or resource usage

Comments

@lopgok
Mannequin

lopgok mannequin commented Apr 8, 2011

BPO 11802
Nosy @birkenfeld, @rhettinger, @benjaminp, @merwok, @bitdancer
Files
  • filecmp-lru-cache-3.3.diff: Limit the size of filecmp._cache using LRU replacement
  • filecmp-lru-cache-2.7.diff: Corrected version of 2.7 patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2011-06-26.09:07:25.764>
    created_at = <Date 2011-04-08.03:37:23.217>
    labels = ['performance']
    title = 'filecmp.cmp needs a documented way to clear cache'
    updated_at = <Date 2011-06-26.09:07:25.763>
    user = 'https://bugs.python.org/lopgok'

    bugs.python.org fields:

    activity = <Date 2011-06-26.09:07:25.763>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2011-06-26.09:07:25.764>
    closer = 'rhettinger'
    components = []
    creation = <Date 2011-04-08.03:37:23.217>
    creator = 'lopgok'
    dependencies = []
    files = ['21584', '21586']
    hgrepos = []
    issue_num = 11802
    keywords = ['patch']
    message_count = 21.0
    messages = ['133290', '133300', '133304', '133326', '133327', '133328', '133386', '133394', '133398', '133401', '133470', '136009', '136029', '137773', '137774', '137775', '137782', '139083', '139085', '139086', '139155']
    nosy_count = 8.0
    nosy_names = ['georg.brandl', 'rhettinger', 'nadeem.vawda', 'benjamin.peterson', 'eric.araujo', 'r.david.murray', 'lopgok', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue11802'
    versions = ['Python 3.3']

    @lopgok
    Mannequin Author

    lopgok mannequin commented Apr 8, 2011

    I have a program which calls filecmp.cmp a lot.
    It runs out of memory.
    I read the source to filecmp, and then I periodically set
    filecmp._cache = {}

    Without doing this, filecmp's cache uses up all the memory in the computer.

    There needs to be a documented interface to clear the cache.

    I suggest a function

    def clear_cache():
        global _cache
        _cache = {}

    Without a documented interface, there is no standard way to clear the
    cache. Different versions of Python may require different methods to
    clear it, which reduces the portability of Python code and is a bad
    idea.

    Alternatively, one might disable the caching code.

    One shouldn't have to look at the source code of a library function
    to see why it is consuming memory.
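    The workaround described above (resetting the private `filecmp._cache`
    attribute between batches of comparisons) can be sketched as follows.
    `cmp_many` and `flush_every` are illustrative names, not part of
    filecmp, and `_cache` is an undocumented implementation detail that may
    change between Python versions:

    ```python
    import filecmp

    # Illustrative workaround only: filecmp._cache is an undocumented,
    # private implementation detail and may change between Python versions.
    def cmp_many(pairs, flush_every=1000):
        """Compare many file pairs, periodically dropping filecmp's cache."""
        results = []
        for i, (a, b) in enumerate(pairs, 1):
            results.append(filecmp.cmp(a, b, shallow=False))
            if i % flush_every == 0:
                filecmp._cache = {}  # reset the private cache to bound memory
        return results
    ```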

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Apr 8, 2011

    I've looked at the code for Python 3, and nothing in it prevents this
    from happening either. So the fix should be applied to 3.2 and 3.3 as
    well.

    An alternative approach would be to limit the size of the cache, so that
    the caller doesn't need to explicitly clear the cache. Something along
    the lines of functools.lru_cache() should do the trick. I don't think
    it'll be possible to use lru_cache() itself, though - it doesn't provide
    a mechanism to invalidate cache entries when they become stale (and in
    any case, it doesn't exist in 2.7).
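    The bounded-cache idea could look roughly like this. This is a sketch
    with illustrative names (`cached_cmp`, `_sig`, `_MAXSIZE`), not the code
    in the patches attached to this issue: an OrderedDict keyed on the file
    pair, storing the stat signatures so stale entries can be detected, with
    least-recently-used eviction once the cache passes a size limit:

    ```python
    import os
    from collections import OrderedDict

    # Sketch only; names are illustrative and may differ from the patch.
    _MAXSIZE = 100
    _cache = OrderedDict()

    def _sig(st):
        # Signature used to detect stale entries when a file changes.
        return (st.st_mode, st.st_size, st.st_mtime)

    def cached_cmp(f1, f2, compare):
        s1, s2 = _sig(os.stat(f1)), _sig(os.stat(f2))
        key = (f1, f2)
        entry = _cache.get(key)
        if entry is not None and entry[0] == (s1, s2):
            _cache.move_to_end(key)      # mark as most recently used
            return entry[1]
        outcome = compare(f1, f2)        # expensive byte-by-byte comparison
        _cache[key] = ((s1, s2), outcome)
        _cache.move_to_end(key)
        if len(_cache) > _MAXSIZE:
            _cache.popitem(last=False)   # evict the least recently used entry
        return outcome
    ```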

    @nadeemvawda nadeemvawda mannequin added the performance Performance or resource usage label Apr 8, 2011
    @bitdancer
    Member

    Putting in a size limit is reasonable. We did this for fnmatch not that long ago (bpo-7846). That was in fact the inspiration for lru_cache.

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Apr 8, 2011

    Patch for 3.3 and 3.2

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Apr 8, 2011

    Patch for 2.7.

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Apr 8, 2011

    Oops, there was a typo in the 2.7 patch ("import _thread" instead of
    "import thread"). Corrected patch attached.

    @merwok
    Member

    merwok commented Apr 9, 2011

    Why use an ordered dict instead of functools.lru_cache?

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Apr 9, 2011

    Because the lru_cache decorator doesn't provide any way to invalidate
    stale cache entries.

    Perhaps I should factor out the duplicated code into a separate class
    that can then also be exposed to users of the stdlib. But that would only
    apply to 3.3, so the uglier fix is still necessary for older versions.

    @rhettinger
    Contributor

    I question whether this should be backported. Please discuss with the RM.

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Apr 9, 2011

    I question whether this should be backported. Please discuss with the RM.

    Will do. Are you referring specifically to 2.7, or to 3.1 and 3.2 as well?

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Apr 10, 2011

    Georg? Benjamin? Do you think this fix should be backported?

    @birkenfeld
    Member

    -1 on backporting.

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented May 15, 2011

    OK. I'll try to put together something cleaner just for 3.3, then.

    @nadeemvawda nadeemvawda mannequin self-assigned this May 15, 2011
    @rhettinger
    Contributor

    Nadeem, I want to review this but won't have a chance to do it right away. Offhand, it seems like we could use the existing functools.lru_cache() for this if the stats were included as part of the key: cache[f1, f2, s1, s2]=outcome.

    Also, I want to take a fresh look at the cache strategy (saving diffs of two files vs saving file contents individually) and think about whether that makes any sense at all for real world use cases (is there a common need to compare the same file pairs over and over again, or is the typical use the comparison of many different file pairs). There may even be a better way to approach the underlying problem using hashes of entire files (md5, sha1, etc).
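    The suggestion of folding the stat signatures into the key, so that
    functools.lru_cache needs no explicit invalidation (a stale entry simply
    misses because its key changed), might be sketched like this. The names
    `_sig`, `_do_cmp`, and `cmp` here are illustrative, not the code that
    was eventually committed:

    ```python
    import os
    from functools import lru_cache

    def _sig(path):
        # Stat signature; folded into the cache key so that a changed
        # file produces a different key and the stale entry is bypassed.
        st = os.stat(path)
        return (st.st_mode, st.st_size, st.st_mtime)

    @lru_cache(maxsize=100)
    def _do_cmp(f1, f2, s1, s2):
        # s1/s2 participate only in the key; the body reads the files.
        bufsize = 8 * 1024
        with open(f1, "rb") as fp1, open(f2, "rb") as fp2:
            while True:
                b1 = fp1.read(bufsize)
                b2 = fp2.read(bufsize)
                if b1 != b2:
                    return False
                if not b1:
                    return True

    def cmp(f1, f2):
        return _do_cmp(f1, f2, _sig(f1), _sig(f2))
    ```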

    @rhettinger rhettinger assigned rhettinger and unassigned nadeemvawda Jun 6, 2011
    @lopgok
    Mannequin Author

    lopgok mannequin commented Jun 6, 2011

    There are many possible solutions to this problem.
    Personally, I think mine is the simplest, though it changes the API.

    However, there have been several suggestions on simple fixes that don't change the API, all of which fix the resource leak.

    Doing nothing will not fix the resource leak.

    How about a simple fix right now, using a lru cache, fixing all versions of Python, and perhaps come up with a superior solution at a later date?

    @rhettinger
    Contributor

    We will do something. The next release isn't for a while, so there is time to put thought into it rather than making an immediate check-in.

    @nadeemvawda
    Mannequin

    nadeemvawda mannequin commented Jun 7, 2011

    Also, I want to take a fresh look at the cache strategy (saving diffs
    of two files vs saving file contents individually) and think about
    whether that makes any sense at all for real world use cases
    (is there a common need to compare the same file pairs over and over
    again or is the typical use the comparison of many different file
    pairs). There may even be a better way to approach the underlying
    problem using hashes of entire files (md5, sha1, etc).

    I like that idea. A hash-based approach could speed up the detection of
    non-equal files quite a bit.
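    A hash-based comparison could be sketched as below. The helper names
    (`file_digest`, `cmp_by_hash`) are hypothetical, and treating equal
    SHA-256 digests as equal files is an assumption about acceptable
    collision risk:

    ```python
    import hashlib

    def file_digest(path, algo="sha256", bufsize=64 * 1024):
        """Hash a whole file in fixed-size chunks."""
        h = hashlib.new(algo)
        with open(path, "rb") as fp:
            while chunk := fp.read(bufsize):
                h.update(chunk)
        return h.digest()

    def cmp_by_hash(f1, f2):
        # Digests could be cached per (path, stat signature), so a file
        # pair is never reread; a mismatch is detected without a pairwise
        # byte-by-byte comparison.
        return file_digest(f1) == file_digest(f2)
    ```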

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 25, 2011

    New changeset 11568c59d9d4 by Raymond Hettinger in branch '2.7':
    bpo-11802: filecmp cache was growing without bound.
    http://hg.python.org/cpython/rev/11568c59d9d4

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 25, 2011

    New changeset 2bacaf6a80c4 by Raymond Hettinger in branch '3.2':
    bpo-11802: filecmp cache was growing without bound.
    http://hg.python.org/cpython/rev/2bacaf6a80c4

    New changeset 8f4466619e1c by Raymond Hettinger in branch 'default':
    bpo-11802: filecmp cache was growing without bound.
    http://hg.python.org/cpython/rev/8f4466619e1c

    @rhettinger
    Contributor

    Made a simple fix to keep the cache from growing without bound.
    Leaving this open for 3.3 as a feature request to implement a more sophisticated strategy using file hashes or somesuch.
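    The simple strategy described here (emptying the cache once it passes a
    fixed size, with the stat signatures folded into the key so stale
    entries simply miss) can be paraphrased as follows. This is a sketch
    with illustrative names, not the exact committed code:

    ```python
    # Sketch of the "clear when too big" strategy; names are illustrative.
    _cache = {}

    def cmp_cached(f1, f2, s1, s2, do_cmp):
        outcome = _cache.get((f1, f2, s1, s2))
        if outcome is None:
            outcome = do_cmp(f1, f2)
            if len(_cache) > 100:  # crude bound: wipe everything past 100 entries
                _cache.clear()
            _cache[(f1, f2, s1, s2)] = outcome
        return outcome
    ```

    Wiping the whole cache is cruder than LRU eviction, but it needs no
    extra data structure and still bounds memory use.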

    @rhettinger
    Contributor

    After more thought, will just close this report. If a new project emerges to improve the design of filecmp, it can be done in a separate tracker entry.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022