Message94071
Umm.... I said that the attribution to Thompson was wrong; in fact it
was correct. Thompson designed and documented the algorithm in 1968,
long before the Aho/Sethi/Ullman green book... so the algorithm is more
than 40 years old, and still not used in Python, Perl or PCRE (but it is
present in GNU awk...)
The paper published on swtch.com was indeed written in 2007, but its
conclusions remain perfectly valid. The interesting aspect of this paper
is that it demonstrates that Thompson's multi-state NFA approach is
still the best one, better than what Perl, PCRE (and PHP), Python,
Ruby and others are using, and that it can be optimized further by
caching the DFA construction "on the fly" (see the blue curve on the
displayed diagram) while running the already precompiled multi-state
NFA.
The cache of DFA states fills up while matching the regexp against
actual strings, so matching is much faster (and much less greedy in
memory and time) than generating the full DFA transition table at
compilation time as Bison/Yacc do.
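The idea above can be sketched in a few lines of Python. This is a minimal, illustrative toy (the NFA encoding, the function names `epsilon_closure` and `match`, and the dict-based cache are all my own choices, not anything from the paper): DFA transitions are built lazily, one per (multi-state, character) pair, only when matching actually needs them.

```python
# Hypothetical sketch: simulate a Thompson NFA while memoizing each set of
# NFA states (a "multi-state") as a lazily built DFA state.
# The NFA is a plain dict: state -> {symbol: {next states}}, with
# epsilon transitions stored under the key None.

def epsilon_closure(nfa, states):
    """Expand a set of NFA states through epsilon (None) transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(None, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def match(nfa, start, accept, text):
    """Return True if text is accepted; DFA states are cached on the fly."""
    dfa_cache = {}   # frozenset of NFA states -> {symbol: frozenset}
    current = epsilon_closure(nfa, {start})
    for ch in text:
        trans = dfa_cache.setdefault(current, {})
        if ch not in trans:          # transition not built yet: build it once
            nxt = set()
            for s in current:
                nxt |= nfa.get(s, {}).get(ch, set())
            trans[ch] = epsilon_closure(nfa, nxt)
        current = trans[ch]
    return accept in current

# Tiny hand-built NFA for the regexp "ab*": 0 -a-> 1, 1 -b-> 1
nfa = {0: {'a': {1}}, 1: {'b': {1}}}
print(match(nfa, 0, 1, "abbb"))   # True
print(match(nfa, 0, 1, "ba"))     # False
```

On a second pass over similar input, every transition is already in `dfa_cache`, so the loop does one dictionary lookup per character, which is where the speedup over re-running the NFA comes from.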
However, the paper still does not discuss how to keep the DFA state
cache limited in size, notably because the longer the input text, the
more DFA states the cache will contain. One simple rule is to limit the
number of cached DFA states, and then to let every stored transition
point to a multi-state in the NFA and, optionally, to a single DFA state
in the cache.
The DFA cache can then be used in a LIFO manner, purging it
automatically of unused states in order to reuse them (for caching
another new DFA state not yet present in the cache, once the cache has
reached its maximum size). When this occurs, the other existing DFA
states that point to the evicted one must be cleaned: their DFA state
pointer or reference, stored in their NFA or DFA transitions, must be
cleared (set to null), so that they only contain the list of pointers to
outgoing NFA states. The remaining problem is how to look up a
multi-state in the DFA cache: this requires a fast lookup, which can be
implemented with hash tables (by hashing the list of NFA states
represented by the DFA state).
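A size-bounded cache keyed by the hashed multi-state could look roughly like this. Everything here is illustrative: the class name, the `MAX_DFA_STATES` limit, and the use of recency-based eviction are my assumptions, not something specified in the message; the key is a `frozenset` of NFA states, which is hashable and so serves directly as the hash-table key.

```python
# Hypothetical sketch of a size-bounded DFA-state cache. The key is the
# (hashable) frozenset of NFA states that a DFA state represents; when
# the cache is full, the least recently used entry is evicted, mirroring
# the "purge unused states for reuse" idea. MAX_DFA_STATES is an
# arbitrary illustrative limit.
from collections import OrderedDict

MAX_DFA_STATES = 4

class BoundedDFACache:
    def __init__(self, limit=MAX_DFA_STATES):
        self.limit = limit
        self.states = OrderedDict()   # frozenset(NFA states) -> transitions

    def lookup(self, multistate):
        """Fetch, or create, the DFA state for a set of NFA states."""
        if multistate in self.states:
            self.states.move_to_end(multistate)   # mark as recently used
            return self.states[multistate]
        if len(self.states) >= self.limit:
            # Evict the oldest DFA state. In a full engine, transitions
            # pointing at the evicted state would be reset to null here,
            # so they fall back to the plain NFA multi-state.
            self.states.popitem(last=False)
        self.states[multistate] = {}
        return self.states[multistate]

cache = BoundedDFACache()
for i in range(6):                    # insert 6 multi-states, limit is 4
    cache.lookup(frozenset({i}))
print(len(cache.states))              # 4: the two oldest were evicted
```

Because eviction only forgets memoized work (the NFA multi-state is always still there to fall back on), a bounded cache caps memory without ever affecting correctness.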
Apparently, GNU awk does not use the cached DFA approach: it uses the
NFA directly when the input text is shorter than about two dozen
characters, then builds the full DFA as soon as the input text becomes
larger (this explains the sudden but moderate increase in time). But
I've seen that GNU awk suffers from the defect of this unlimited DFA
generation approach: excessive memory use as the input text grows,
because the number of DFA states added to the cache grows constantly
with the input, and the time to match each character of the input slowly
increases too. At some point it will crash and exit due to memory
exhaustion, when no more DFA states can be stored. That's why the DFA
cache MUST be kept under some limit.
I'll try to implement this newer approach first in Java (just because I
know this language better than Python, and because I think it is more
solid in terms of type safety, so it can reduce the number of bugs to
fix before getting something stable).
In Java, there's a clean way to automatically clean up objects from
collections when they are no longer needed: you just use weak references
(the garbage collector will automatically collect the older objects when
needed). But this approach is not easy to port, and in fact, while it
can effectively solve some problems, it still will not prevent the cache
from growing up to the maximum VM size. For performance reasons, I see
little interest in storing more than about 1 million DFA states in the
cache (also because the hash table used would become less efficient when
looking up the key of an NFA multi-state under which a DFA state is
stored). So I will probably not use weak references, but will first use
a maximum size (even though weak references could help keep the cache at
even lower bounds than the allowed maximum when VM memory is more
constrained: it is a good idea, in all Java programs, for caches
introduced purely for performance reasons to also collaborate with the
garbage collector, in order to avoid the combined explosion of all
caches used in various programs or libraries). I don't know if Python
supports the concept of weak references for handling performance caches.
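For what it's worth, Python does support weak references through its standard `weakref` module: `weakref.WeakValueDictionary` drops an entry automatically once no strong reference to the value remains, much like a Java `WeakReference`-backed cache. A minimal sketch (the `DFAState` class is a made-up stand-in):

```python
# Python's weakref module supports exactly this kind of performance
# cache: entries in a WeakValueDictionary vanish automatically when the
# last strong reference to the value goes away.
import weakref

class DFAState:
    """Stand-in for a cached DFA state (illustrative only)."""
    def __init__(self, name):
        self.name = name

cache = weakref.WeakValueDictionary()
s = DFAState("q0")
cache["q0"] = s
print("q0" in cache)    # True while a strong reference exists
del s                   # drop the only strong reference
print("q0" in cache)    # False (in CPython, collection is immediate)
```

Note that the immediate disappearance after `del` relies on CPython's reference counting; on other Python implementations the entry may linger until the next garbage collection pass.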
Maybe I'll port it later to Python, but don't expect me to port it to
C/C++ (as a replacement for PCRE), because I now hate these unsafe
languages even though I have practiced them for many years: others can
do that for me once I have published my Java implementation.
Date | User | Action | Args
2009-10-15 03:37:15 | verdy_p | set | recipients: + verdy_p, ezio.melotti, mrabarnett
2009-10-15 03:37:15 | verdy_p | set | messageid: <1255577835.82.0.0456978065129.issue7132@psf.upfronthosting.co.za>
2009-10-15 03:37:14 | verdy_p | link | issue7132 messages
2009-10-15 03:37:11 | verdy_p | create |