classification
Title: Don't completely dump the regex cache when full
Type: performance Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: SilentGhost, barry, ezio.melotti, inada.naoki, mrabarnett, rhettinger, serhiy.storchaka, vstinner, xiang.zhang
Priority: low Keywords: patch

Created on 2016-09-28 04:49 by rhettinger, last changed 2017-09-26 16:58 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
re_cache_pop.diff rhettinger, 2016-09-28 04:49 review
re_cache_del_first.patch serhiy.storchaka, 2016-09-28 05:30
re_cache_ordered_dict_del_first.patch serhiy.storchaka, 2016-09-28 06:14
re_cache_ordered_dict_popitem.patch serhiy.storchaka, 2016-09-28 06:30
Pull Requests
URL Status Linked Edit
PR 3768 merged serhiy.storchaka, 2017-09-26 14:54
Messages (20)
msg277580 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-09-28 04:49
When the regex cache gets filled, it is cleared in its entirety.  Instead, it only needs to evict one arbitrary entry.  This will save us from having to rebuild and recache frequently used regexes.
msg277582 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-28 05:00
LGTM.
msg277583 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-09-28 05:06
But with the compact dict implementation, popitem is not going to evict an arbitrary entry but always the last one. Will this cause two interchangeably used regexes to always need recompiling?
msg277585 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-28 05:30
Nice catch! Here is a patch that deletes the first key.
msg277586 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-09-28 05:46
Perhaps:  _cache.pop(next(iter(_cache)))

The for-loop version is indirect about what it is trying to do and relies on an obscure quirk of exactly when it is an error to mutate while iterating.

I do like that a side effect of the compact dict is that it lets us choose the oldest entry to evict.
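For reference, the two eviction variants under discussion look roughly like this (a sketch, not the actual patch text):

```python
def evict_first_loop(cache):
    # The for-loop variant: delete the first (insertion-oldest) key
    # and stop iterating immediately, before the mutation can raise.
    for key in cache:
        del cache[key]
        break

def evict_first_pop(cache):
    # The one-liner variant: pop the first key directly.  This can
    # raise KeyError if another thread clears the cache between the
    # next() call and the pop().
    cache.pop(next(iter(cache)))
```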
msg277587 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-28 06:01
> Perhaps:  _cache.pop(next(iter(_cache)))

This can raise KeyError if the cache is cleared in another thread. And it is a little slower.

I remember why I didn't propose this idea earlier. It depends on the ordering of dict. But that is an implementation detail, and in another Python implementation this can cause two interchangeably used regexes to be recompiled. No, this no longer LGTM.

Maybe use OrderedDict? But this adds a heavy dependency to the re module.

One more idea: use two caches. Look first in one cache, then in the other. When the first cache is full, clear the second cache and swap them.
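The two-cache idea, sketched with hypothetical names (a two-generation scheme; entries still in use get re-promoted into the new generation before the old one is discarded):

```python
_MAXCACHE = 3
_new, _old = {}, {}

def lookup(key, compute):
    global _new, _old
    try:
        return _new[key]           # hit in the current generation
    except KeyError:
        pass
    try:
        value = _old[key]          # hit in the previous generation
    except KeyError:
        value = compute(key)       # true miss: compile
    if len(_new) >= _MAXCACHE:
        _old, _new = _new, {}      # discard old generation, swap
    _new[key] = value
    return value
```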
msg277588 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-28 06:14
> Maybe use OrderedDict? But this adds heavy dependency to the re module.

We can use the same trick as in the enum module. Here is a patch.
msg277589 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-09-28 06:20
Why not OrderedDict.popitem(last=False)?
msg277590 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-09-28 06:25
This has gotten crazy.  I withdraw the suggestion.
msg277591 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-28 06:30
> Why not OrderedDict.popitem(last=False)?

Yes, this should work. Here is a patch.
msg277620 - (view) Author: SilentGhost (SilentGhost) * Date: 2016-09-28 14:04
Serhiy, your patch doesn't seem to apply cleanly. Also, it's perhaps worth reopening the issue if you'd like to continue working on it.
msg277667 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-28 21:59
It applies cleanly on the default branch. I don't think there is a need to open a new issue. The last patch implements Raymond's original intention while taking all the comments into account: it drops the oldest item, doesn't depend on implementation details of dict, and is thread-safe.
msg277677 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-09-29 01:22
I would rather not introduce an OrderedDict dependency into the re module while other implementations are using the pure Python version or while the OD implementation is in flux.  This is a barely-worth-it proposal, so we should either make a modest, minimalistic change or just drop it.   Really, it's too bad that the regex code got contorted in a way that precluded the use of lru_cache (DEBUG isn't a central feature of regex and is only minimally useful).
msg277682 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-29 04:56
The re module now depends on the enum module, and the latter already depends on OrderedDict. Thus the patch doesn't introduce a new dependency. The Python implementation of OrderedDict doesn't override __getitem__ and therefore doesn't slow down the common case. I considered all of this.

By contrast, the Python implementation of lru_cache slows down the code (that is why its use was dropped in the past). But I did not object to it because re.sub() is used less than re.match() or re.search() and is already much slower.
msg278885 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-18 14:42
Can't we redesign the function to reuse @functools.lru_cache(_MAXCACHE) but take care of the DEBUG flag? For example, split the function into smaller parts.
msg278920 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-18 18:43
This is unlikely to be possible with the current API of lru_cache. Testing flags before looking up in the cache adds overhead.
msg278937 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-18 21:12
re_cache_ordered_dict_popitem.patch: LGTM, it can be pushed to the default branch.

--

"This is unlikely to be possible with the current API of lru_cache. Testing flags before looking up in the cache adds overhead."

Oh wait, I looked at the code and this case is complex because the function depends on the current locale if the LOCALE flag is set... A lookup in the cache requires getting the current locale, but not if the LOCALE flag is not set... It seems too complex for @lru_cache().
msg283983 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2016-12-25 03:07
LGTM for re_cache_ordered_dict_popitem.patch
msg303054 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-26 16:47
New changeset 114454e9f6addbcb364e9a37102c8131ae2da1dd by Serhiy Storchaka in branch 'master':
bpo-28293: Don't completely dump the regex cache when full. (#3768)
https://github.com/python/cpython/commit/114454e9f6addbcb364e9a37102c8131ae2da1dd
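The merged change follows the same drop-the-oldest idea on a plain dict. A sketch of the eviction consistent with the discussion above (the exact code in Lib/re.py may differ), guarding against concurrent mutation from another thread:

```python
def evict_oldest(cache):
    # Drop the oldest (first-inserted) entry.  Tolerate a concurrent
    # clear() or delete from another thread, which can make the
    # next()/del pair fail in several ways.
    try:
        del cache[next(iter(cache))]
    except (StopIteration, RuntimeError, KeyError):
        pass
```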
msg303055 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-26 16:58
Thank you for your reviews.

Currently the caching logic is simpler: it no longer depends on the LOCALE flag or the current locale. But it is still too complex for lru_cache, which would need a way to bypass caching based on arguments or the result.
History
Date User Action Args
2017-09-26 16:58:08 serhiy.storchaka set status: open -> closed
resolution: fixed
messages: + msg303055
stage: patch review -> resolved
2017-09-26 16:47:38 serhiy.storchaka set messages: + msg303054
2017-09-26 14:56:00 serhiy.storchaka set nosy: + barry
2017-09-26 14:54:17 serhiy.storchaka set pull_requests: + pull_request3753
2016-12-25 03:07:28 inada.naoki set nosy: + inada.naoki
messages: + msg283983
2016-10-30 16:54:57 serhiy.storchaka set versions: - Python 3.6
2016-10-18 21:12:47 vstinner set messages: + msg278937
2016-10-18 18:43:57 serhiy.storchaka set messages: + msg278920
2016-10-18 14:42:55 vstinner set nosy: + vstinner
messages: + msg278885
2016-10-16 22:58:48 rhettinger set assignee: rhettinger -> serhiy.storchaka
2016-10-16 22:16:48 serhiy.storchaka set nosy: + ezio.melotti, mrabarnett
components: + Regular Expressions
2016-09-29 04:56:25 serhiy.storchaka set messages: + msg277682
2016-09-29 01:23:00 rhettinger set messages: + msg277677
2016-09-28 21:59:01 serhiy.storchaka set status: closed -> open
resolution: rejected -> (no value)
messages: + msg277667
2016-09-28 14:04:28 SilentGhost set nosy: + SilentGhost
messages: + msg277620
2016-09-28 06:30:39 serhiy.storchaka set files: + re_cache_ordered_dict_popitem.patch
messages: + msg277591
2016-09-28 06:25:49 rhettinger set status: open -> closed
resolution: rejected
messages: + msg277590
2016-09-28 06:20:00 xiang.zhang set messages: + msg277589
2016-09-28 06:14:01 serhiy.storchaka set files: + re_cache_ordered_dict_del_first.patch
messages: + msg277588
2016-09-28 06:01:35 serhiy.storchaka set messages: + msg277587
2016-09-28 05:46:46 rhettinger set messages: + msg277586
2016-09-28 05:30:28 serhiy.storchaka set files: + re_cache_del_first.patch
messages: + msg277585
stage: commit review -> patch review
2016-09-28 05:06:30 xiang.zhang set nosy: + xiang.zhang
messages: + msg277583
2016-09-28 05:00:00 serhiy.storchaka set assignee: serhiy.storchaka -> rhettinger
messages: + msg277582
stage: patch review -> commit review
2016-09-28 04:49:29 rhettinger create