Message 94068 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	verdy_p
Recipients	ezio.melotti, mrabarnett, verdy_p
Date	2009-10-15.02:41:16
SpamBayes Score	1.0286882e-11
Marked as misclassified	No
Message-id	<1255574479.64.0.0316658450485.issue7132@psf.upfronthosting.co.za>
In-reply-to

Content
Anyway, there are ways to speedup regexps, even without instructing the regexps with anti-backtracking syntaxes. See http://swtch.com/~rsc/regexp/regexp1.html (article dated January 2007) Which discusses how Perl, PCRE (and PHP), Python, Java, Ruby, .NET library... are slow because they are using backtracking a single state in the NFA, instead of using simultaneously active states (which correctly emulates the DFA, without having to actually build the DFA transition states table, which could grow combinatorially, as seen in yacc and Bison). Java uses now the Thomson approach in its latest releases, but I wonder how Python works: does it use the DFA simulation? Do you use PCRE? Note that I've been using the DFA simulation since more than 20 years in 1987, when I built my first regexp matcher (because the existing implementation at that time were really damn slow), after I read the Aho/Sethi/Ullman book which already demonstrated the algorithm. This algorithm has been implemented in some tools replacing the old yacc/Bison tools, because they generate huge DFA transition tables (and this was the reason why Lex and Yacc were maintained as separate tools, Lex using the single-state NFA approach with excessive backtracking, and Yacc/Bison trying to generate the full DFA transition tables.) : the first language to use this approach was the Purdue Univerity Compiler Construction Tools Set (PCCTS) which was initially written in C and is now fully written and supported in Java. The Perl 6 extension for quantified capturing groups will have a slow adoption, as long as Perl will continue the slow single-state NFA approach with excessive backtracking, instead of the Aho/Sethi/Ullman (ASU) approach (that some are attributing to Thomson due to the article in 2007, but this is false) using simultaneously active states. But anyway, it already exists (and Perl developers are already working on rewriting the engine using the ASU approach). But my suggstion is much more general, as it should not just apply to quantified capturing groups, but to any capturing group that is part of a subexpression which is quantified. And the way I specified it, it does not depend on the way the engine is written (whever it uses a single-state NFA or multi-state NFA or generates and uses a DFA transition state with single-state like in Yacc/Bison), because capturing groups are just used to store position pairs, and regexp engines already have to count them for effectively matching the greedy or non-greedy quantifiers, so this immediately provides a usable index for storing at successive positions in a numbered array for captured groups. The simple test case is effectively to try to match /(aa?)*a+/ with strings longer than a few dozens of 'a'.

Anyway, there are ways to speedup regexps, even without instructing the 
regexps with anti-backtracking syntaxes.

See http://swtch.com/~rsc/regexp/regexp1.html
(article dated January 2007)
Which discusses how Perl, PCRE (and PHP), Python, Java, Ruby, .NET 
library... are slow because they are using backtracking a single state 
in the NFA, instead of using simultaneously active states (which 
correctly emulates the DFA, without having to actually build the DFA 
transition states table, which could grow combinatorially, as seen in 
yacc and Bison).

Java uses now the Thomson approach in its latest releases, but I wonder 
how Python works: does it use the DFA simulation? Do you use PCRE?

Note that I've been using the DFA simulation since more than 20 years in 
1987, when I built my first regexp matcher (because the existing 
implementation at that time were really damn slow), after I read the 
Aho/Sethi/Ullman book which already demonstrated the algorithm.

This algorithm has been implemented in some tools replacing the old 
yacc/Bison tools, because they generate huge DFA transition tables (and 
this was the reason why Lex and Yacc were maintained as separate tools, 
Lex using the single-state NFA approach with excessive backtracking, and 
Yacc/Bison trying to generate the full DFA transition tables.) : the 
first language to use this approach was the Purdue Univerity Compiler 
Construction Tools Set (PCCTS) which was initially written in C and is 
now fully written and supported in Java.

The Perl 6 extension for quantified capturing groups will have a slow 
adoption, as long as Perl will continue the slow single-state NFA 
approach with excessive backtracking, instead of the Aho/Sethi/Ullman 
(ASU) approach (that some are attributing to Thomson due to the article 
in 2007, but this is false) using simultaneously active states. But 
anyway, it already exists (and Perl developers are already working on 
rewriting the engine using the ASU approach).

But my suggstion is much more general, as it should not just apply to 
quantified capturing groups, but to any capturing group that is part of 
a subexpression which is quantified.

And the way I specified it, it does not depend on the way the engine is 
written (whever it uses a single-state NFA or multi-state NFA or 
generates and uses a DFA transition state with single-state like in 
Yacc/Bison), because capturing groups are just used to store position 
pairs, and regexp engines already have to count them for effectively 
matching the greedy or non-greedy quantifiers, so this immediately 
provides a usable index for storing at successive positions in a 
numbered array for captured groups.

The simple test case is effectively to try to match /(aa?)*a+/ with 
strings longer than a few dozens of 'a'.

History
Date	User	Action	Args
2009-10-15 02:41:19	verdy_p	set	recipients: + verdy_p, ezio.melotti, mrabarnett
2009-10-15 02:41:19	verdy_p	set	messageid: <1255574479.64.0.0316658450485.issue7132@psf.upfronthosting.co.za>
2009-10-15 02:41:18	verdy_p	link	issue7132 messages
2009-10-15 02:41:16	verdy_p	create