This may be irrelevant at this point, but trying to understand the original reproducer, I wanted to add my 1c worth.
It seems Dennis' reproducer.py is roughly this:
(I'm renaming BIG//3 to K to simplify the math.)
aaaaaBBBBBaaaaaBBBBBBBBBBBBBBB (haystack, length K*6)
BBBBBBBBBBBBBBB (needle, length K*3)
The needle matches exactly once, at the end. (Dennis uses BIG==10**6, which leaves a remainder of 1 after dividing by 3, but that turns out to be irrelevant -- it works with BIG==999999 as well.)
The reproducer falls prey to the fact that it shifts the needle by 1 each time (for the reason Tim already explained). At each position probed, the sequence of comparisons is (regardless of the bloom filter or skip size, and stopping at the first mismatch):
- last byte of needle
- first, second, third, etc. byte of needle
As long as the needle's first character corresponds to an 'a' (i.e., K times) this is just two comparisons until failure, but once it hits the first run of 'B's it does K+1 comparisons, then shifts by 1, does another K+1 comparisons, and so on, for a total of K times. That's K**2 + K, the source of the slowdown. Then come K more quick misses, followed by the final success.
(Do we know how the OP found this reproducer? The specific length of their needle seems irrelevant, and I don't dare look in their data file.)
Anyway, thinking about this, for the current (unpatched) code, here's a somewhat simpler reproducer along the same lines:
BBBBBaaaaaBBBBB (haystack, length K*3)
BBBBBBBBBB (needle, length K*2)
This immediately starts doing K sets of K+1 comparisons, i.e. K**2 + K again, followed by failure.
I am confident this has no relevance to the Two-Way algorithm. |