Issue 753711: Infinite Loop in RE

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/38637

classification

Title:	Infinite Loop in RE
Type:		Stage:
Components:	Regular Expressions	Versions:	Python 2.2

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	effbot	Nosy List:	brett.cannon, effbot, glchapman, mbogosian, sjones
Priority:	low	Keywords:

Created on 2003-06-13 03:27 by mbogosian, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Messages (5)
msg16368 - (view)	Author: Matt Bogosian (mbogosian)	Date: 2003-06-13 03:27
This may be similar to <http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=439997>, but I don't think the current behavior (infinite loop/unbound execution) is correct. Here's the python version: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #!/usr/bin/python import re pattern = '^UPDATE\s+\w+\s+SET\s+locked_until\s=\s(\S+\s)+,' # This won't match (missing ',' at end str = 'UPDATE arena_teams SET locked_until = CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)' re.search(pattern, str, re.I) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - When run, this script just pegs the CPU and hangs (I'm running RedHat 8.0). Note: if I change the pattern slightly it still doesn't work: pattern = '^UPDATE\s+\w+\s+SET\s+locked_until\s=\s([^ =,]+\s)+,' It's not until the pattern actually matches that things get better (with both the original and modified patterns): # Pattern now matches (added a ',' at the end) str = 'UPDATE arena_teams SET locked_until = CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP),' I tried doing the same thing in Perl: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #!/usr/bin/perl 'UPDATE arena_teams SET locked_until = CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)' =~ '/UPDATE\s+\w+\s+SET\s+locked_until\s=\s(\S+\s*)+,/i'; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - This of course runs just fine. Any ideas?
msg16369 - (view)	Author: Shannon Jones (sjones)	Date: 2003-06-13 12:15
Logged In: YES user_id=589306 If you change your regex and sample data to the following simpler forms, you get the same problem: pattern = '(\S+\s*)+,' str = 'CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)' If you remove some letters from str, the search does finish but it takes several seconds. I believe the problem is that your regex takes exponential time to execute based on the length of str. Here is a recent python-dev post that talks about the problem: http://mail.python.org/pipermail/python-dev/2003-May/035916.html The link you provided in the bug report talks about this as well. You can also do a web search for "regular expression exponential" to see more information. I believe that any regex engine (that has similar features to Python's engine) will have some regular expressions that take exponential time or space. Perl has some cases as well (search for "exponential" in perldoc perlre). This is just how regular expressions are and IMHO does not indicate a bug in Python. As far as how to fix the regular expression, I can't say. I'm sure there is a way to "rephrase" what you want to get it to work. Maybe try asking in the comp.lang.python newsgroup. Good luck!
msg16370 - (view)	Author: Shannon Jones (sjones)	Date: 2003-06-13 21:42
Logged In: YES user_id=589306 If you change your regex and sample data to the following simpler forms, you get the same problem: pattern = '(\S+\s*)+,' str = 'CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)' If you remove some letters from str, the search does finish but it takes several seconds. I believe the problem is that your regex takes exponential time to execute based on the length of str. Here is a recent python-dev post that talks about the problem: http://mail.python.org/pipermail/python-dev/2003-May/035916.html The link you provided in the bug report talks about this as well. You can also do a web search for "regular expression exponential" to see more information. I believe that any regex engine (that has similar features to Python's engine) will have some regular expressions that take exponential time or space. Perl has some cases as well (search for "exponential" in perldoc perlre). This is just how regular expressions are and IMHO does not indicate a bug in Python. As far as how to fix the regular expression, I can't say. I'm sure there is a way to "rephrase" what you want to get it to work. Maybe try asking in the comp.lang.python newsgroup. Good luck!
msg16371 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2003-06-15 04:28
Logged In: YES user_id=357491 sjones is right about it taking exponential time. A regex engine works with greedy quantifiers by sucking up all that it can and then trying the next part of the regex and continues until it fails or succeeds. If it fails, though, it backtracks one regex character, gives up a character it claimed, and tries again. It does this for a long time. How long did you wait for the regex to complete? I bet if you left it for a very long time it would complete.
msg16372 - (view)	Author: Greg Chapman (glchapman)	Date: 2003-06-16 18:44
Logged In: YES user_id=86307 FWIW, one way to efficiently match this sort of pattern (where you have literal text at the end of something complex) would be to use the match once and don't backtrack operator ('(?>pattern)'). SRE doesn't have the (?> operator, but it can be emulated by '(?=(pattern))\1'. So one way to avoid the exponential time would be to use something like this: rx = re.compile( r'''^UPDATE\s+\w+\s+SET\s+locked_until\s=\s (?=( ([^=,]+\s*)+ ))\1 , #trailing literal ''', re.I \| re.X ) By the way, it would be fairly easy to add the (?> operator to SRE; it's almost identical to a postive lookahead assertion. The only difference is that, upon successfully matching the subexpression inside the parentheses, the position in the string being matched is advanced to the end of the text matched by the subexpression. Also, in case anyone's interested in why Perl is able to fail so quickly here, it appears that the Perl engine keeps track of positions (in the target text) where something like '(pattern)+' has already been tried and has failed, so it can quickly fail if backtracking causes an attempt to match again at that position (at least that's my interpretation of the numerous "already tried at this position..." messages in the trace from running the supplied pattern and text with "use re 'debug';").

History
Date	User	Action	Args
2022-04-10 16:09:10	admin	set	github: 38637
2003-06-13 03:27:56	mbogosian	create