Issue 16563: re.match loops forever on simple regexp

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/60767

classification

Title:	re.match loops forever on simple regexp
Type:	crash	Stage:
Components:	Regular Expressions	Versions:	Python 2.6

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	the re module can perform poorly: O(2n) versus O(n2) View: 1662581
Assigned To:		Nosy List:	ezio.melotti, lpd, mark.dickinson, mrabarnett, tim.peters
Priority:	normal	Keywords:

Created on 2012-11-27 00:07 by lpd, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (7)
msg176458 - (view)	Author: L. Peter Deutsch (lpd)	Date: 2012-11-27 00:07
I've read a number of reports of exponential-time regexp matching, but this regexp uses no unusual features, requires no backtracking, and only loops "forever" on certain input strings. I listed the Python version # as 2.6; I actually observed the behavior in 2.5.1 and 2.5.2, but I'm almost certain it's still there, because I saw the same behavior in a very recent build of Google's V8 interpreter, which I believe uses the same regexp engine. Here's the test case: import re re_utf8 = r'^([\x00-\x7f]+\|[\xc0-\xdf][\x80-\xbf]\|[\xe0-\xef][\x80-\xbf][\x80-\xbf]\|[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$' s = "\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x14\x00\x00\x00\x01\x00\x00,`\x00\x00\x004\x00\x01\x8d" print re.match(re_utf8, s) If you pass s[0:34] or s[34:35] as the argument of re.match, it returns the correct answer, but the code above loops apparently forever.
msg176459 - (view)	Author: Tim Peters (tim.peters) *	Date: 2012-11-27 00:42
There's actually enormous backtracking here. Try this much shorter regexp and you'll see much the same behavior: re_utf8 = r'^([\x00-\x7f]+)*$' That's the original re_utf8 with all but the first alternative removed. Looks like passing s[0:34] "works" because it eliminates the trailing \x8d that prevents the regexp from matching the whole string. Because the regexp cannot match the whole string, it takes a very long time to try all the futile combinations implied by the nested quantifiers. As the much simpler re_utf8 above shows, it's not the alternatives in the regexp that matter here, it's the nested quantifiers.
msg176460 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-11-27 00:43
I think the problem is the first '+', but I'm not sure if this is similar to other problems (I remember something similar to (x+)* causing problems). (On a side note: the regex matches non-valid utf-8, see table 3-7 on http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf).
msg176461 - (view)	Author: Tim Peters (tim.peters) *	Date: 2012-11-27 00:50
Yes, if you remove the first "+", the example quickly prints None (i.e., reports that the regexp cannot match the string).
msg176465 - (view)	Author: L. Peter Deutsch (lpd)	Date: 2012-11-27 07:56
It never occurred to me that the regexp package would be so poorly designed that a pattern that so clearly never requires backtracking could require exponential time. I'll change the pattern (taking out the + has no effect on what strings it matches) and leave it up to others to decide whether the performance issue is worth addressing. And thanks for the pointer to the table in the Unicode standard.
msg176466 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-11-27 08:45
Closing as a duplicate of issue 1662581.
msg176469 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-11-27 08:58
@lpd: you may want to look at the 'regex' module on PyPI [1] to see if it solves your problems. The idea is that some form of this should eventually go into the standard library, but we've been lacking volunteers for the huge code review task involved. [1] http://pypi.python.org/pypi/regex.

History
Date	User	Action	Args
2022-04-11 14:57:38	admin	set	github: 60767
2012-11-27 08:58:46	mark.dickinson	set	messages: + msg176469
2012-11-27 08:45:44	mark.dickinson	set	status: open -> closed nosy: + mark.dickinson messages: + msg176466 superseder: the re module can perform poorly: O(2n) versus O(n2) resolution: duplicate
2012-11-27 07:56:26	lpd	set	messages: + msg176465
2012-11-27 00:50:21	tim.peters	set	messages: + msg176461
2012-11-27 00:43:55	ezio.melotti	set	messages: + msg176460
2012-11-27 00:42:18	tim.peters	set	nosy: + tim.peters messages: + msg176459
2012-11-27 00:07:43	lpd	create