classification
Title: Exponential behavior in regular expression
Type: enhancement Stage:
Components: Regular Expressions Versions:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: edemaine, tim.peters
Priority: normal Keywords:

Created on 2006-07-02 12:26 by edemaine, last changed 2006-07-02 21:40 by tim.peters. This issue is now closed.

Files
File name Uploaded Description Edit
re_bug.py edemaine, 2006-07-02 12:26 Python test code that makes 're' very slow
Messages (3)
msg29009 - (view) Author: Erik Demaine (edemaine) Date: 2006-07-02 12:26
're' seems to have serious performance trouble with
nested Kleene stars in the regular expression, if the
matched text is fairly long.  Attached is an example,
naturally arising in some LaTeX parsing [admittedly not
the only way to do it], along with a text generator
parameterized by a repetition count n.  Here is simple
timing data on a Pentium 4 1.5GHz with 1.5GB RAM as a
function of n:

[...]
n=4: 0:00:00.015000
n=5: 0:00:00.032000
n=6: 0:00:00.140000
n=7: 0:00:00.594000
n=8: 0:00:02.203000
n=9: 0:00:08.859000
n=10: 0:00:39.641000
n=11: 0:02:44.172000
n=12: 0:10:23.500000

This seems far slower than it should be, but I don't
know the algorithm used by 're'.  Is this behavior
expected?  If so, should it be optimized away by
changing the algorithm?

The generated text consists of a few lines of preamble,
then a variable number n of copies of a partiuclar
line, followed by a few lines of postscript.  The first
line of the postscript causes the regular expression
*not* to match, and 're' spends a long time to find
that out.  Removing that line from the postscript, and
causing the regular expression to match, makes the
program run instantly.

I get the same behavior on Python 2.4 and 2.5b1, on
Windows and Linux, and with re.sub and re.search.
msg29010 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-07-02 13:13
Logged In: YES 
user_id=31435

Yes, it's easy to provoke exponential-time behavior.  For a
simple example, the regexp

^((a+)*)*$

takes O(3**n) time to fail to match strings of the form

"a"*n + "b"

Python's matcher is a backtracking engine, much like Perl's,
and most other languages' non-POSIX re facilities.  There's
nary a DFA in sight ;-)  Jeffrey Friedl's thick O'Reilly
book "Mastering Regular Expressions" is mostly about the
pragmatics of using such engines efficiently:

http://regex.info/

Note that there's no current hope that this will change: 
because of gimmicks like backreferences, these aren't
CompSci's idea of regular expressions, and no "thoroughly
efficient" implementation technique is known.  For example,
this teensy regexp:

^(aa+)\\1+$

matches strings of a's whose length isn't prime, and finds a
non-trivial factor when the length is composite.  For
harder-to-solve but messier examples:

http://perl.plover.com/NPC/
msg29011 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-07-02 21:40
Logged In: YES 
user_id=31435

Since this isn't going to change, I'm closing this.

I don't know exactly what your regexp is intended to match,
but I expect this will help speed it enormously:  inside a
group, one of the alternatives is the negated character
class (NCC):

[^{}%]

One of the other alternatives starts with "{" and another
with "%".  That's very good, because those three
alternatives are mutually exclusive based on just the
current character in the target string.

However, yet another alternative starts with a backslash,
and it's thus ambiguous whether the backslash should be
matched by that alternative or by the NCC.  Because this is
a backtracking engine, and the NCC is the first alternative,
it tries the NCC first and won't try the backslash
alternative unless it's impossible to find a match having
tried the NCC.  That can cause exponential-time
failing-match behavior all by itself.

If it's the case that a backslash in this context is
_always_ supposed to match the

\\.

alternative, then adding a backslash to the NCC removes the
ambiguity and greatly speeds (at least) failing matches:

[^{}%\\]

Then which alternative is supposed to match is entirely
determined by the current character in the target string, so
when backtracking on failure all other alternatives fail at
once, and backtracking continues with at worst insignificant
delay.

Adding a backslash to the "inner" NCC helps a little, but
adding one to the "outer" NCC too is very effective:

n=0: 0:00:00
n=1: 0:00:00
n=2: 0:00:00
...
n=97: 0:00:00
n=98: 0:00:00
n=99: 0:00:00

See Friedl's book for much more along these lines.
History
Date User Action Args
2006-07-02 12:26:58edemainecreate