This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author gvanrossum
Recipients Dennis Sweeney, Zeturic, ammar2, corona10, gregory.p.smith, gvanrossum, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date 2020-10-18.00:12:42
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1602979963.33.0.943364507556.issue41972@roundup.psfhosted.org>
In-reply-to
Content
This may be irrelevant at this point, but trying to understand the original reproducer, I wanted to add my 1c worth.

It seems Dennis' reproducer.py is roughly this:

(I'm renaming BIG//3 to K to simplify the math.)

aaaaaBBBBBaaaaaBBBBBBBBBBBBBBB (haystack, length K*6)
BBBBBBBBBBBBBBB                (needle, length K*3)

The needle matches exactly once, at the end.  (Dennis uses BIG==10**6, which leaves a remainder of 1 after dividing by 3, but that turns out to be irrelevant -- it works with BIG==999999 as well.)

The reproducer falls prey to the fact that it shifts the needle by 1 each time (for the reason Tim already explained).  At each position probed, the sequence of comparisons is (regardless of the bloom filter or skip size, and stopping at the first mismatch):

- last byte of needle
- first, second, third, etc. byte of needle

As long as the needle's first character corresponds to an 'a' (i.e., K times) this is just two comparisons until failure, but once it hits the first run of 'B's it does K+1 comparisons, then shifts by 1, does another K+1 comparisons, and so on, for a total of K times. That's K**2 + K, the source of the slowdown. Then come K more quick misses, followed by the final success.

(Do we know how the OP found this reproducer? The specific length of their needle seems irrelevant, and I don't dare look in their data file.)

Anyway, thinking about this, for the current (unpatched) code, here's a somewhat simpler reproducer along the same lines:

BBBBBaaaaaBBBBB (haystack, length K*3)
BBBBBBBBBB      (needle, length K*2)

This immediately starts doing K sets of K+1 comparisons, i.e. K**2 + K again, followed by failure.


I am confident this has no relevance to the Two-Way algorithm.
History
Date User Action Args
2020-10-18 00:12:43gvanrossumsetrecipients: + gvanrossum, tim.peters, gregory.p.smith, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Dennis Sweeney, Zeturic
2020-10-18 00:12:43gvanrossumsetmessageid: <1602979963.33.0.943364507556.issue41972@roundup.psfhosted.org>
2020-10-18 00:12:43gvanrossumlinkissue41972 messages
2020-10-18 00:12:42gvanrossumcreate