Issue 40480: "fnmatch" exponential execution time

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84660

classification

Title:	"fnmatch" exponential execution time
Type:	performance	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.9

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	tim.peters	Nosy List:	Anthony Sottile, kleshni, nedbat, tim.peters
Priority:	normal	Keywords:	patch

Created on 2020-05-03 09:37 by kleshni, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 19908	merged	tim.peters, 2020-05-04 18:45
PR 20049	merged	tim.peters, 2020-05-12 01:13

Messages (13)
msg367963 - (view)	Author: (kleshni)	Date: 2020-05-03 09:37
Hello. The following code hangs: import fnmatch fnmatch.fnmatchcase("a" * 32, "" 16 + "b") Measurements show that its execution time rises exponentially with the number of asterisks. Bash and SQLite 3 process this glob instantly. This is because "fnmatchcase" generates a regular expression with repeated dots: (?s:................b)\\Z It's equivalent to: (?s:.*b)\\Z But works in exponential time. So the solution is to replace multiple asterisks with a single one, so the latter regexp is generated instead.
msg367982 - (view)	Author: (kleshni)	Date: 2020-05-03 16:42
>So the solution is to replace multiple asterisks with a single one No, this is not a solution. This glob also hangs: fnmatch.fnmatchcase("a" * 32, "a" 16 + "b") And again, Bash and SQLite 3 work perfectly.
msg367983 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-03 17:06
Note that doctest has the same kind of potential problem with matching ellipsis (0 or more characters) in expected output blocks. Backtracking isn't needed at all to resolve patterns of that limited kind, but I don't think Python's re module supports enough gimmicks to disable backtracking. So instead doctest has its own _ellipsis_match() function to do it in worst-case linear time.
msg368002 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-04 00:00
"The trick" with this kind of pattern is that a `` should consume as little as possible to allow the next fixed portion to match. Apply that at every stage, and it succeeds or it doesn't. Backtracking can't change that outcome. For, e.g., the shell pattern: abcd a regexp to do this without backtracking would be like this, IF Python's re engine supported "atomic groups" (but it doesn't): a(?>.?bc)(?>.?d).\Z The same effect can be gotten in a more convoluted way, via positive lookahead assertions and backreferencing (which Python's re engine does support): a(?=(.?bc))\1(?=(.?d))\2.\Z Assertions are "one and done": if the overall match fails, assertions don't try alternatives. So that's _a_ way to proceed. I'm unclear on that it's worth it, though. Stuff like "aaaaaa*" is just hard to swallow as a shell pattern that would occur in real life.
msg368075 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-04 18:40
Changed version to 3.9, because anything done would change the regexp generated, and fnmatch.translate()` makes that regexp visible.
msg368222 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-06 02:28
New changeset b9c46a2c2d7fc68457bff641f78932d66f5e5f59 by Tim Peters in branch 'master': bpo-40480 "fnmatch" exponential execution time (GH-19908) https://github.com/python/cpython/commit/b9c46a2c2d7fc68457bff641f78932d66f5e5f59
msg368645 - (view)	Author: Ned Batchelder (nedbat) *	Date: 2020-05-11 20:02
This change has caused a problem for coverage.py. The full details are here: https://github.com/nedbat/coveragepy/issues/988#issuecomment-626926513 Coverage.py combines fnmatch-produced regexes by joining them with pipes into one larger regex. The \<g1> group in the regexes now makes that larger regex invalid.
msg368655 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-11 21:24
Ned, would it be possible to rewrite code of the form: if giant pasted regexp matches: to: if any(p matches for p in patterns): That should work under any version of Python. There's no guarantee that regexps _can_ be pasted together and still work, so I can't call this change "a bug". That pasting regexps together "worked" before was an implementation accident. I'd be happy to change it anyway, except I know of no way to use Python's re engine without backreferences that can avoid exponential-time behavior in some cases. In some other regexp engines, yes (e.g., as the code comments note, in those that support "atomic grouping"), but not in Python's. Nor does Python's re engine support reusing backreference names or numbers. So I know of no way to restore the ability to paste regexps together that wouldn't reintroduce the possiblity of exponential time failure :-(
msg368683 - (view)	Author: Anthony Sottile (Anthony Sottile) *	Date: 2020-05-11 23:52
one way might be to give the groups "unique" names (perhaps hashing the input string?) ((this is what I attempted to do in a little bit of code which tried to "backport" (group)*+ and (group)++))
msg368686 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-12 00:12
I don't want something probabilistic. Fix it or don't ;-) One thing that would work, but at the cost of non-determinism: do the same as now, but obtain the number part of the group name by applying next() to a module-global private instance of itertools.count(). That will keep the numbers increasing "forever", and across calls. The point to using .count() is that it's atomic (i.e., won't repeat a number if multiple threads happen to be constructing regexps simultaneously). It's a darned silly amount of effort, though ;-)
msg368695 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-12 02:19
New changeset b1b4c790e7d3b5f4244450aefe3d8f01710c13f7 by Tim Peters in branch 'master': bpo-40480: restore ability to join fnmatch.translate() results (GH-20049) https://github.com/python/cpython/commit/b1b4c790e7d3b5f4244450aefe3d8f01710c13f7
msg368714 - (view)	Author: Ned Batchelder (nedbat) *	Date: 2020-05-12 09:46
Wow, thanks Tim. To be honest, I was coming around to your original point of view that it was too much to promise that these regexes could be combined the way I'm doing it. If we have to undo this latest change, I can live with it.
msg368734 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-05-12 14:54
Ned, I'm happy to do this. While the ability to join wasn't documented, it's not an unreasonable expectation. I'm not sure it's possible to fail _unless_ the regexps use named groups (and/or numbered backreferences) - and nobody in their right mind would expect regexps for such simple patterns to do such a thing ;-) So chances seem decent the regression your user stumbled into wouldn't be the only one to pop up. The fix was actually quite easy (and thanks to Anthony for nudging me in that direction!). The annoying part was writing a test given that the precise group names generated are no longer predictable :-(

History
Date	User	Action	Args
2022-04-11 14:59:30	admin	set	github: 84660
2020-05-12 14:54:46	tim.peters	set	messages: + msg368734
2020-05-12 09:46:13	nedbat	set	messages: + msg368714
2020-05-12 02:19:41	tim.peters	set	messages: + msg368695
2020-05-12 01:13:09	tim.peters	set	pull_requests: + pull_request19358
2020-05-12 00:12:46	tim.peters	set	messages: + msg368686
2020-05-11 23:52:18	Anthony Sottile	set	nosy: + Anthony Sottile messages: + msg368683
2020-05-11 21:24:20	tim.peters	set	messages: + msg368655
2020-05-11 20:02:25	nedbat	set	nosy: + nedbat messages: + msg368645
2020-05-06 02:31:54	tim.peters	set	status: open -> closed assignee: tim.peters resolution: fixed stage: patch review -> resolved
2020-05-06 02:28:31	tim.peters	set	messages: + msg368222
2020-05-04 18:45:49	tim.peters	set	keywords: + patch stage: needs patch -> patch review pull_requests: + pull_request19223
2020-05-04 18:40:10	tim.peters	set	stage: needs patch messages: + msg368075 versions: + Python 3.9, - Python 3.8
2020-05-04 00:00:38	tim.peters	set	messages: + msg368002
2020-05-03 17:06:41	tim.peters	set	nosy: + tim.peters messages: + msg367983
2020-05-03 16:42:59	kleshni	set	messages: + msg367982
2020-05-03 09:37:07	kleshni	create