Author verdy_p
Recipients ezio.melotti, mrabarnett, verdy_p
Date 2009-10-15.02:41:16
SpamBayes Score 1.02869e-11
Marked as misclassified No
Message-id <1255574479.64.0.0316658450485.issue7132@psf.upfronthosting.co.za>
In-reply-to
Content
Anyway, there are ways to speedup regexps, even without instructing the 
regexps with anti-backtracking syntaxes.

See http://swtch.com/~rsc/regexp/regexp1.html
(article dated January 2007)
Which discusses how Perl, PCRE (and PHP), Python, Java, Ruby, .NET 
library... are slow because they are using backtracking a single state 
in the NFA, instead of using simultaneously active states (which 
correctly emulates the DFA, without having to actually build the DFA 
transition states table, which could grow combinatorially, as seen in 
yacc and Bison).

Java uses now the Thomson approach in its latest releases, but I wonder 
how Python works: does it use the DFA simulation? Do you use PCRE?

Note that I've been using the DFA simulation since more than 20 years in 
1987, when I built my first regexp matcher (because the existing 
implementation at that time were really damn slow), after I read the 
Aho/Sethi/Ullman book which already demonstrated the algorithm.

This algorithm has been implemented in some tools replacing the old 
yacc/Bison tools, because they generate huge DFA transition tables (and 
this was the reason why Lex and Yacc were maintained as separate tools, 
Lex using the single-state NFA approach with excessive backtracking, and 
Yacc/Bison trying to generate the full DFA transition tables.) : the 
first language to use this approach was the Purdue Univerity Compiler 
Construction Tools Set (PCCTS) which was initially written in C and is 
now fully written and supported in Java.

The Perl 6 extension for quantified capturing groups will have a slow 
adoption, as long as Perl will continue the slow single-state NFA 
approach with excessive backtracking, instead of the Aho/Sethi/Ullman 
(ASU) approach (that some are attributing to Thomson due to the article 
in 2007, but this is false) using simultaneously active states. But 
anyway, it already exists (and Perl developers are already working on 
rewriting the engine using the ASU approach).

But my suggstion is much more general, as it should not just apply to 
quantified capturing groups, but to any capturing group that is part of 
a subexpression which is quantified.

And the way I specified it, it does not depend on the way the engine is 
written (whever it uses a single-state NFA or multi-state NFA or 
generates and uses a DFA transition state with single-state like in 
Yacc/Bison), because capturing groups are just used to store position 
pairs, and regexp engines already have to count them for effectively 
matching the greedy or non-greedy quantifiers, so this immediately 
provides a usable index for storing at successive positions in a 
numbered array for captured groups.

The simple test case is effectively to try to match /(aa?)*a+/ with 
strings longer than a few dozens of 'a'.
History
Date User Action Args
2009-10-15 02:41:19verdy_psetrecipients: + verdy_p, ezio.melotti, mrabarnett
2009-10-15 02:41:19verdy_psetmessageid: <1255574479.64.0.0316658450485.issue7132@psf.upfronthosting.co.za>
2009-10-15 02:41:18verdy_plinkissue7132 messages
2009-10-15 02:41:16verdy_pcreate