
Author edemaine
Date 2006-07-02.12:26:58
Content
're' seems to have serious performance trouble with
nested Kleene stars in a regular expression when the
text being matched is fairly long.  Attached is an
example, arising naturally in some LaTeX parsing
[admittedly not the only way to do it], along with a
text generator parameterized by a repetition count n.
Here is simple timing data on a Pentium 4 1.5GHz with
1.5GB RAM as a function of n:

[...]
n=4: 0:00:00.015000
n=5: 0:00:00.032000
n=6: 0:00:00.140000
n=7: 0:00:00.594000
n=8: 0:00:02.203000
n=9: 0:00:08.859000
n=10: 0:00:39.641000
n=11: 0:02:44.172000
n=12: 0:10:23.500000

Each increment of n multiplies the running time by
roughly a factor of four, so the behavior appears to
be exponential in n.  This seems far slower than it
should be, but I don't know the algorithm used by
're'.  Is this behavior expected?  If so, should it be
optimized away by changing the algorithm?
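
For reference, here is a minimal self-contained sketch
of the failure mode.  The attached pattern and generator
are not reproduced here; the pattern below is a
hypothetical stand-in that has the same nested-quantifier
structure:

    import re
    import time

    # Hypothetical stand-in for the attached pattern: any repeated
    # group whose body can itself match repetitions shows the same
    # blow-up.  If 're' uses a backtracking engine, the trailing 'c'
    # forces the match to fail, and the engine then tries every way
    # of splitting the run of a's between the inner and outer
    # quantifiers -- roughly 2**n attempts.
    pattern = re.compile(r'(?:a+)+b')

    for n in range(16, 27, 2):
        text = 'a' * n + 'c'
        start = time.time()
        pattern.search(text)
        print('n=%d: %f seconds' % (n, time.time() - start))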

The generated text consists of a few lines of preamble,
then n copies of a particular line, followed by a few
lines of postscript.  The first line of the postscript
causes the regular expression *not* to match, and 're'
spends a long time discovering this.  Removing that
line from the postscript, so that the regular
expression matches, makes the program run instantly.
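
The same contrast shows up in the sketch above: when the
text ends in a way that lets the pattern match, the first
path the engine tries succeeds and the search returns
immediately; only the failing case forces exhaustive
backtracking:

    import re

    pattern = re.compile(r'(?:a+)+b')   # same hypothetical pattern as above

    print(pattern.search('a' * 50 + 'b'))   # matches: returns instantly
    # pattern.search('a' * 50 + 'c')        # fails: would take astronomically long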

I get the same behavior on Python 2.4 and 2.5b1, on
Windows and Linux, and with re.sub and re.search.