This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Unsupported provider

classification
Title: Fnmatch cache is never cleared during usage
Type: resource usage Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: Tom.Rini, andrewclegg, eric.araujo, pitrou, r.david.murray, rhettinger
Priority: normal Keywords: needs review, patch

Created on 2010-02-03 15:02 by andrewclegg, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
py27-fnmatch.patch andrewclegg, 2010-07-09 09:57
py3k-fnmatch.patch andrewclegg, 2010-07-09 09:57
py3k-fnmatch.patch andrewclegg, 2010-07-09 10:55
Messages (21)
msg98786 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-02-03 15:02
The fnmatch module has a cache of translation between glob patterns and compiled regular expressions. However this cache is never emptied; only added to. I am writing a python program which as part of its execution checks millions of unique globs - this causes the fnmatch cache to grow enormous.

Attached is a patch to limit the size of the fnmatch cache to 100 (chosen to be the same size as the re module).
msg99385 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-02-16 05:02
Hello

The feature seems useful to me and the patch straightforward.

I wonder why you used a global name for the max length (“_MAXCACHE”) instead of inlining the value where it is used. Does it feel more constanty? ;) I fear that the leading underscore won’t discourage users from touching it, so I’d suggest making it public or completely hidden. On second thought, you could argue that Python is a matter of consenting adults, and that leading underscore is okay.

Regards
msg99392 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-02-16 09:38
Hi,

I used a global name for a couple of reasons: it is consistent with the cache itself and the size of the cache being defined in the same place; and because I wanted to allow the value to be modified by users.

I used the leading underscore to give a weak indication of internal use - for most users, it can be completely ignored. If someone wanted to modify the maximum size of the cache however, this would be a reasonable and safe thing for them to do.

Cheers
msg99429 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-02-16 18:04
Hi

Your rationale makes perfect sense. Perhaps add a comment suggesting 
it’s ok to change the value.

Cheers
msg99430 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-02-16 18:10
I haven't reviewed the patch, but if the intent is that the user can change the global, then it should not start with an _, and it should be a documented part of the API.  If you want to suggest that users should not normally change it, then you can do that in the docs.  

If you give something an _, then it is not considered part of the public API and it (the internal API, not the value) is subject to change, which means you should *not* suggest that users change it.  If they find it and want to change it anyway, that's their lookout.  That's the consenting adults part :)
msg99464 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-02-17 10:08
"If you give something an _, then it is not considered part of the public API and it (the internal API, not the value) is subject to change, which means you should *not* suggest that users change it.  If they find it and want to change it anyway, that's their lookout.  That's the consenting adults part :)"

This sums it up well - it should be considered a detail of implementation, thus the _. When I said I wanted to allow the value to be modified, I was thinking of the 'consenting adult' type of usage, not everyday usage. Of course there is no reasonable way to prevent the value being modified anyway.

So - _MAXCACHE can stay where it is, but should not be documented or be included in __all__. Does that seem reasonable?
msg109532 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 10:47
Hi,

This bug seems to have stalled - how can I get it moved forward? There don't seem to be any objections with the patch as-is, and the problem seems clear.

Cheers
msg109543 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-08 13:53
OK, I've looked at the patch now.  (1) it needs unit tests.  (2) a py3k port would be helpful, since the code is very different in py3k, and we are now applying patches first to the py3k branch and then backporting.

I think this can be considered a bug (the ever-expanding cache), so I'm OK with applying this in 2.7.1 once it has unit tests.

Note that using _MAXCACHE (leading underscore) and not documenting it follows the 're' model, so that is the best choice.
msg109545 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 14:50
Attached is the Py3K version of the patch, and a unit test (Py3K only).
msg109547 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 15:04
Sorry, messed up indentation on last patch, trying again...
msg109549 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 15:08
Attached is a patch and unit test against release27-maint.
msg109552 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-08 16:14
Thanks for working on this.  I tried the py3k patch but it doesn't apply cleanly.  Patch says "patch: **** malformed patch at line 57:" followed by an apparently blank line.

Also, could you please generate the patches from the top level of the checkout?  It makes it easier to apply patches if they all have the same starting point.
msg109704 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-09 09:57
OK, regenerated both patches from the top level of the checkout (paths within the patch are 'Lib/fnmatch.py' and 'Lib/test/test_fnmatch.py'.

Sorry about the corrupted patch before - I've tested this version by reverting my changes and reapplying it, and it seems to work cleanly.

Cheers
msg109708 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-09 10:41
Instead of writing `bytes('foo', encoding='ascii')`, you can just write `b'foo'`. Otherwise, patch looks ok.
msg109710 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-09 10:55
Updated py3k patch to use the more concise bytes syntax, cheers.
msg109734 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-09 13:37
Committed to py3k in r82730, 3.1 in r82733, 2.7 in r82732, and, since 2.6.6 is still to come, 2.6 in r82740.

Andrew: FYI I changed the assertTrue to assertLessEqual to take advantage of the better failure messages provided by the newer unittest methods.

Thanks for the patch.
msg113358 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-09 02:07
For Py3.2, applying the new functools.lru_cache().
See r83871.
msg139454 - (view) Author: Tom Rini (Tom.Rini) Date: 2011-06-29 22:59
Did a change later make this user-configurable?  I've got some code here that now runs so slow as to be unusable because nothing is ever cached anymore.  Thanks!
msg139486 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-30 15:07
Well, it's unnoficially configurable pre-3.2 (set _MAXCACHE on the module).  The lru_cache implementation doesn't have such an undocumented way of tweaking the limit.  In neither case, however, is it true that "nothing is ever cached".  You are saying that you are dealing with more than 100 patterns in a single application?  (256 if you are using 3.2, and in that case it is an LRU cache and so should perform well unless your more-than-256 patterns are used in a cyclical fashion.)
msg139494 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-30 15:38
Grr.  "it is *not* true that nothing is ever cached".
msg139623 - (view) Author: Tom Rini (Tom.Rini) Date: 2011-07-02 03:17
So yes, we have some code that was doing, roughly:
if any(fnmatchcase(key, pat) for pat in pattern):
   refs.add(key)
on 200-300 elements, a lot.  That said, many of them were not globs so we've worked around this to only use fnmatchcase on globs and just key in set otherwise.  Things are now usable on python 2.6.6+ (so Ubuntu 10.10) and noise otherwise.
History
Date User Action Args
2022-04-11 14:56:57adminsetgithub: 52094
2011-07-02 03:17:17Tom.Rinisetmessages: + msg139623
2011-06-30 15:38:58r.david.murraysetmessages: + msg139494
2011-06-30 15:07:59r.david.murraysetmessages: + msg139486
2011-06-29 22:59:15Tom.Rinisetnosy: + Tom.Rini
messages: + msg139454
2010-08-09 02:07:29rhettingersetnosy: + rhettinger
messages: + msg113358
2010-07-09 13:37:55r.david.murraysetstatus: open -> closed
versions: + Python 2.6
messages: + msg109734

resolution: accepted -> fixed
stage: test needed -> resolved
2010-07-09 12:23:48r.david.murraysetassignee: r.david.murray
2010-07-09 10:55:12andrewcleggsetfiles: + py3k-fnmatch.patch

messages: + msg109710
2010-07-09 10:41:30pitrousetnosy: + pitrou
messages: + msg109708
2010-07-09 09:57:36andrewcleggsetfiles: + py3k-fnmatch.patch
2010-07-09 09:57:28andrewcleggsetfiles: + py27-fnmatch.patch

messages: + msg109704
2010-07-09 09:54:29andrewcleggsetfiles: - py27-fnmatch.patch
2010-07-09 09:54:26andrewcleggsetfiles: - py3k-fnmatch.patch
2010-07-08 16:14:47r.david.murraysetmessages: + msg109552
2010-07-08 15:08:47andrewcleggsetfiles: - fnmatch.patch
2010-07-08 15:08:17andrewcleggsetfiles: + py27-fnmatch.patch

messages: + msg109549
2010-07-08 15:04:25andrewcleggsetfiles: + py3k-fnmatch.patch

messages: + msg109547
2010-07-08 15:04:07andrewcleggsetfiles: - py3k-fnmatch.patch
2010-07-08 14:50:57andrewcleggsetfiles: + py3k-fnmatch.patch

messages: + msg109545
2010-07-08 13:53:13r.david.murraysetresolution: accepted
messages: + msg109543
versions: - Python 2.6
2010-07-08 10:47:01andrewcleggsetmessages: + msg109532
2010-02-17 10:08:40andrewcleggsetmessages: + msg99464
2010-02-16 18:10:54r.david.murraysettype: behavior -> resource usage

messages: + msg99430
nosy: + r.david.murray
2010-02-16 18:04:06eric.araujosetmessages: + msg99429
2010-02-16 09:38:45andrewcleggsetmessages: + msg99392
2010-02-16 05:02:10eric.araujosetnosy: + eric.araujo
messages: + msg99385
2010-02-03 15:29:24brian.curtinsetpriority: normal
keywords: + needs review
stage: test needed
type: behavior
versions: + Python 2.6, Python 3.1, Python 2.7, Python 3.2, - Python 2.5
2010-02-03 15:02:13andrewcleggcreate