Message 94071 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	verdy_p
Recipients	ezio.melotti, mrabarnett, verdy_p
Date	2009-10-15.03:37:10
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<1255577835.82.0.0456978065129.issue7132@psf.upfronthosting.co.za>
In-reply-to

Content
Umm.... I saif that the attribution to Thompson was wrong, in fact it was correct. Thompson designed and documented the algorithm in 1968, long before the Aho/Seti/Ullman green book... so the algorithm is more than 40 years old, and still not in Python, Perl and PCRE (but it is present in GNU awk...) The paper published in swtch.com is effectively written in 2007, but its conclusions are perfectly valid. The interesting aspect of this paper is that it demonstrates that the Thompson's multi-state NFA approach is still the best one, and better than what Perl, PCRE (and PHP), Python, Ruby and others are using, but that it can be also optimized further by caching the DFA construction "on the fly" (see the blue curve on the displayed diagram) while parsing the the already precompiled multi-state NFA. The cache for DFA states will fill up while matching the regexp against actual strings, so it will be much faster (and much less memory-and- time-greedy than generating the full DFA transition table at compilation time like in Bison/Yacc). However the paper still does not discusses how to make the DFA states cache limited in size. Notably because the longer the input text will be, the more the DFA cache will contain DFA states. One simple rule is to limit the number of cached DFA states, and then to allow all stored transitions to go all multiple-states in the NFA, and optionally to a single DFA state in the cache. Then the DFA cache can be used in a LIFO manner, to purge it automatically from unused states, in order to reuse them (for caching another new DFA state which is still not present in the cache, when the cache has reached its maximum size): if this occurs, the other existing DFA states that point to it must be cleaned (their DFA state pointer or reference, stored in their NFA or DFA transitions, must be cleared/set to null, so that they will only contain the list of pointers to outgoing NFA states). The problem is how to look for a multistate in the DFA cache: this requires some fast lookup, but this can be implemented in a fast way using hash tables (by hashing the list of NFA states represented in the DFA state). Apparently, GNU awk does not use the cached DFA approach: it just uses the NFA directly when the input text is lower than two dozens of characters, then it builds the full DFA as soon as the input text becomes larger (this explains the sudden, but moderate increase of time). But I've seen that GNU awk has the defect of this unlimited DFA generation approach: its excessive use of memory when the input text increases, because the number of DFA states added to the cache is contantly growing with the input, and the time to match each characer from the input slowly increases too. At some point, it will crash and exit due to memory limits exhaustion, when no more DFA states can be stored. That's why the DFA cache MUST be maintained under some level. I'll try to implement this newer approach first in Java (just because I better know this language than Python, and beacause I think it is more solid in terms of type-safety, so it can reduce the number of bugs to correct before getting something stable). In Java, there's a clean way to automatically cleanup objects from collections, when they are no longer needed: you just need to use weak references (the garbage collector will automatically cleanup the older objects, when needed). But this approach is not easy to port, and in fact, if it can effectively solve some problems, it will still not forbif the cache to use up to the maximum VM size. for performance reasons, I see little interest in storing more than about 1 million DFA states in the cache (also because the hash table that would be used would be less efficient when looking up for the key of a NFA multi-state where the DFA state is stored). So I will probably not use weak references, but will first use a maximum size (even if weak references could help maintain the cache at even lower bounds than the allowed maximum, if VM memory size is more constrained: it is a good idea in all Java programs to allow caches introduced only for performance reasons to also collaborate with the garbage collector, in order to avoid the explosion of all caches used in various programs or libraries). I don't know if Python supports the concept of weak references for handling performance caches. May be I'll port it later in Python, but don't expect that I'll port it to C/C++ (as a replacement of PCRE), because I now hate these unsafe languages despite I have practiced them for many years: others would do that for me, when I'll have published my Java implementation.

Umm.... I saif that the attribution to Thompson was wrong, in fact it 
was correct. Thompson designed and documented the algorithm in 1968, 
long before the Aho/Seti/Ullman green book... so the algorithm is more 
than 40 years old, and still not in Python, Perl and PCRE (but it is 
present in GNU awk...)

The paper published in swtch.com is effectively written in 2007, but its 
conclusions are perfectly valid. The interesting aspect of this paper is 
that it demonstrates that the Thompson's multi-state NFA approach is 
still the best one, and better than what Perl, PCRE (and PHP), Python, 
Ruby and others are using, but that it can be also optimized further by 
caching the DFA construction "on the fly" (see the blue curve on the 
displayed diagram) while parsing the the already precompiled multi-state 
NFA.

The cache for DFA states will fill up while matching the regexp against 
actual strings, so it will be much faster (and much less memory-and-
time-greedy than generating the full DFA transition table at compilation 
time like in Bison/Yacc).

However the paper still does not discusses how to make the DFA states 
cache limited in size. Notably because the longer the input text will 
be, the more the DFA cache will contain DFA states. One simple rule is 
to limit the number of cached DFA states, and then to allow all stored 
transitions to go all multiple-states in the NFA, and optionally to a 
single DFA state in the cache.

Then the DFA cache can be used in a LIFO manner, to purge it 
automatically from unused states, in order to reuse them (for caching 
another new DFA state which is still not present in the cache, when the 
cache has reached its maximum size): if this occurs, the other existing 
DFA states that point to it must be cleaned (their DFA state pointer or 
reference, stored in their NFA or DFA transitions, must be cleared/set 
to null, so that they will only contain the list of pointers to outgoing 
NFA states). The problem is how to look for a multistate in the DFA 
cache: this requires some fast lookup, but this can be implemented in a 
fast way using hash tables (by hashing the list of NFA states 
represented in the DFA state).

Apparently, GNU awk does not use the cached DFA approach: it just uses 
the NFA directly when the input text is lower than two dozens of 
characters, then it builds the full DFA as soon as the input text 
becomes larger (this explains the sudden, but moderate increase of 
time). But I've seen that GNU awk has the defect of this unlimited DFA 
generation approach: its excessive use of memory when the input text 
increases, because the number of DFA states added to the cache is 
contantly growing with the input, and the time to match each characer 
from the input slowly increases too. At some point, it will crash and 
exit due to memory limits exhaustion, when no more DFA states can be 
stored. That's why the DFA cache MUST be maintained under some level.

I'll try to implement this newer approach first in Java (just because I 
better know this language than Python, and beacause I think it is more 
solid in terms of type-safety, so it can reduce the number of bugs to 
correct before getting something stable).

In Java, there's a clean way to automatically cleanup objects from 
collections, when they are no longer needed: you just need to use weak 
references (the garbage collector will automatically cleanup the older 
objects, when needed). But this approach is not easy to port, and in 
fact, if it can effectively solve some problems, it will still not 
forbif the cache to use up to the maximum VM size. for performance 
reasons, I see little interest in storing more than about 1 million DFA 
states in the cache (also because the hash table that would be used 
would be less efficient when looking up for the key of a NFA multi-state 
where the DFA state is stored). So I will probably not use weak 
references, but will first use a maximum size (even if weak references 
could help maintain the cache at even lower bounds than the allowed 
maximum, if VM memory size is more constrained: it is a good idea in all 
Java programs to allow caches introduced only for performance reasons to 
also collaborate with the garbage collector, in order to avoid the 
explosion of all caches used in various programs or libraries). I don't 
know if Python supports the concept of weak references for handling 
performance caches.

May be I'll port it later in Python, but don't expect that I'll port it 
to C/C++ (as a replacement of PCRE), because I now hate these unsafe 
languages despite I have practiced them for many years: others would do 
that for me, when I'll have published my Java implementation.

History
Date	User	Action	Args
2009-10-15 03:37:15	verdy_p	set	recipients: + verdy_p, ezio.melotti, mrabarnett
2009-10-15 03:37:15	verdy_p	set	messageid: <1255577835.82.0.0456978065129.issue7132@psf.upfronthosting.co.za>
2009-10-15 03:37:14	verdy_p	link	issue7132 messages
2009-10-15 03:37:11	verdy_p	create