This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author Dennis Sweeney
Recipients Dennis Sweeney, Zeturic, ammar2, corona10, gregory.p.smith, gvanrossum, josh.r, pmpp, serhiy.storchaka, taleinat, tim.peters
Date 2021-01-17.22:56:56
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
PR 22904 now adds a text document explaining how the Two-Way algorithm works in much more detail.

I was looking at more benchmarking results, and I came to a couple of conclusions about cutoffs.

There's a consistent benefit to using the two-way algorithm whenever both strings are long enough and the needle is less than 20% the length of the haystack. If that percentage is higher, the preprocessing cost starts to dominate, and the Two-Way algorithm is slower than the status quo in average such cases. This doesn't change the fact that in cases like OP's original example, the Two-way algorithm can be thousands of times faster than the status quo, even though the needle is a significant percentage of the length of the haystack.

So to handle that, I tried an adaptive/introspective algorithm like Tim suggested: it counts the number of matched characters that don't lead to full matches, and once that total exceeds O(m), the Two-Way preprocessing cost will surely be worth it. The only way I could figure out how to not damage the speed of the usual small-string cases is to duplicate that code. See comparison.jpg for how much better the adaptive version is as opposed to always using two-way on this Zipf benchmark, while still guaranteeing O(m + n).
Date User Action Args
2021-01-17 22:56:56Dennis Sweeneysetrecipients: + Dennis Sweeney, gvanrossum, tim.peters, gregory.p.smith, taleinat, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Zeturic
2021-01-17 22:56:56Dennis Sweeneysetmessageid: <>
2021-01-17 22:56:56Dennis Sweeneylinkissue41972 messages
2021-01-17 22:56:56Dennis Sweeneycreate