Message94068
Anyway, there are ways to speedup regexps, even without instructing the
regexps with anti-backtracking syntaxes.
See http://swtch.com/~rsc/regexp/regexp1.html
(article dated January 2007)
Which discusses how Perl, PCRE (and PHP), Python, Java, Ruby, .NET
library... are slow because they are using backtracking a single state
in the NFA, instead of using simultaneously active states (which
correctly emulates the DFA, without having to actually build the DFA
transition states table, which could grow combinatorially, as seen in
yacc and Bison).
Java uses now the Thomson approach in its latest releases, but I wonder
how Python works: does it use the DFA simulation? Do you use PCRE?
Note that I've been using the DFA simulation since more than 20 years in
1987, when I built my first regexp matcher (because the existing
implementation at that time were really damn slow), after I read the
Aho/Sethi/Ullman book which already demonstrated the algorithm.
This algorithm has been implemented in some tools replacing the old
yacc/Bison tools, because they generate huge DFA transition tables (and
this was the reason why Lex and Yacc were maintained as separate tools,
Lex using the single-state NFA approach with excessive backtracking, and
Yacc/Bison trying to generate the full DFA transition tables.) : the
first language to use this approach was the Purdue Univerity Compiler
Construction Tools Set (PCCTS) which was initially written in C and is
now fully written and supported in Java.
The Perl 6 extension for quantified capturing groups will have a slow
adoption, as long as Perl will continue the slow single-state NFA
approach with excessive backtracking, instead of the Aho/Sethi/Ullman
(ASU) approach (that some are attributing to Thomson due to the article
in 2007, but this is false) using simultaneously active states. But
anyway, it already exists (and Perl developers are already working on
rewriting the engine using the ASU approach).
But my suggstion is much more general, as it should not just apply to
quantified capturing groups, but to any capturing group that is part of
a subexpression which is quantified.
And the way I specified it, it does not depend on the way the engine is
written (whever it uses a single-state NFA or multi-state NFA or
generates and uses a DFA transition state with single-state like in
Yacc/Bison), because capturing groups are just used to store position
pairs, and regexp engines already have to count them for effectively
matching the greedy or non-greedy quantifiers, so this immediately
provides a usable index for storing at successive positions in a
numbered array for captured groups.
The simple test case is effectively to try to match /(aa?)*a+/ with
strings longer than a few dozens of 'a'. |
|
Date |
User |
Action |
Args |
2009-10-15 02:41:19 | verdy_p | set | recipients:
+ verdy_p, ezio.melotti, mrabarnett |
2009-10-15 02:41:19 | verdy_p | set | messageid: <1255574479.64.0.0316658450485.issue7132@psf.upfronthosting.co.za> |
2009-10-15 02:41:18 | verdy_p | link | issue7132 messages |
2009-10-15 02:41:16 | verdy_p | create | |
|