Author verdy_p
Recipients ezio.melotti, mrabarnett, verdy_p
Date 2009-10-15.03:37:10
SpamBayes Score 0.0
Marked as misclassified No
Message-id <1255577835.82.0.0456978065129.issue7132@psf.upfronthosting.co.za>
In-reply-to
Content
Umm.... I saif that the attribution to Thompson was wrong, in fact it 
was correct. Thompson designed and documented the algorithm in 1968, 
long before the Aho/Seti/Ullman green book... so the algorithm is more 
than 40 years old, and still not in Python, Perl and PCRE (but it is 
present in GNU awk...)

The paper published in swtch.com is effectively written in 2007, but its 
conclusions are perfectly valid. The interesting aspect of this paper is 
that it demonstrates that the Thompson's multi-state NFA approach is 
still the best one, and better than what Perl, PCRE (and PHP), Python, 
Ruby and others are using, but that it can be also optimized further by 
caching the DFA construction "on the fly" (see the blue curve on the 
displayed diagram) while parsing the the already precompiled multi-state 
NFA.

The cache for DFA states will fill up while matching the regexp against 
actual strings, so it will be much faster (and much less memory-and-
time-greedy than generating the full DFA transition table at compilation 
time like in Bison/Yacc).

However the paper still does not discusses how to make the DFA states 
cache limited in size. Notably because the longer the input text will 
be, the more the DFA cache will contain DFA states. One simple rule is 
to limit the number of cached DFA states, and then to allow all stored 
transitions to go all multiple-states in the NFA, and optionally to a 
single DFA state in the cache.

Then the DFA cache can be used in a LIFO manner, to purge it 
automatically from unused states, in order to reuse them (for caching 
another new DFA state which is still not present in the cache, when the 
cache has reached its maximum size): if this occurs, the other existing 
DFA states that point to it must be cleaned (their DFA state pointer or 
reference, stored in their NFA or DFA transitions, must be cleared/set 
to null, so that they will only contain the list of pointers to outgoing 
NFA states). The problem is how to look for a multistate in the DFA 
cache: this requires some fast lookup, but this can be implemented in a 
fast way using hash tables (by hashing the list of NFA states 
represented in the DFA state).

Apparently, GNU awk does not use the cached DFA approach: it just uses 
the NFA directly when the input text is lower than two dozens of 
characters, then it builds the full DFA as soon as the input text 
becomes larger (this explains the sudden, but moderate increase of 
time). But I've seen that GNU awk has the defect of this unlimited DFA 
generation approach: its excessive use of memory when the input text 
increases, because the number of DFA states added to the cache is 
contantly growing with the input, and the time to match each characer 
from the input slowly increases too. At some point, it will crash and 
exit due to memory limits exhaustion, when no more DFA states can be 
stored. That's why the DFA cache MUST be maintained under some level.

I'll try to implement this newer approach first in Java (just because I 
better know this language than Python, and beacause I think it is more 
solid in terms of type-safety, so it can reduce the number of bugs to 
correct before getting something stable).

In Java, there's a clean way to automatically cleanup objects from 
collections, when they are no longer needed: you just need to use weak 
references (the garbage collector will automatically cleanup the older 
objects, when needed). But this approach is not easy to port, and in 
fact, if it can effectively solve some problems, it will still not 
forbif the cache to use up to the maximum VM size. for performance 
reasons, I see little interest in storing more than about 1 million DFA 
states in the cache (also because the hash table that would be used 
would be less efficient when looking up for the key of a NFA multi-state 
where the DFA state is stored). So I will probably not use weak 
references, but will first use a maximum size (even if weak references 
could help maintain the cache at even lower bounds than the allowed 
maximum, if VM memory size is more constrained: it is a good idea in all 
Java programs to allow caches introduced only for performance reasons to 
also collaborate with the garbage collector, in order to avoid the 
explosion of all caches used in various programs or libraries). I don't 
know if Python supports the concept of weak references for handling 
performance caches.

May be I'll port it later in Python, but don't expect that I'll port it 
to C/C++ (as a replacement of PCRE), because I now hate these unsafe 
languages despite I have practiced them for many years: others would do 
that for me, when I'll have published my Java implementation.
History
Date User Action Args
2009-10-15 03:37:15verdy_psetrecipients: + verdy_p, ezio.melotti, mrabarnett
2009-10-15 03:37:15verdy_psetmessageid: <1255577835.82.0.0456978065129.issue7132@psf.upfronthosting.co.za>
2009-10-15 03:37:14verdy_plinkissue7132 messages
2009-10-15 03:37:11verdy_pcreate