Message385169
PR 22904 now adds a text document explaining how the Two-Way algorithm works in much more detail.
I was looking at more benchmarking results, and I came to a couple of conclusions about cutoffs.
There's a consistent benefit to using the Two-Way algorithm whenever both strings are long enough and the needle is less than 20% of the length of the haystack. Above that ratio, the preprocessing cost starts to dominate, and the Two-Way algorithm is slower than the status quo on average. This doesn't change the fact that in cases like OP's original example, the Two-Way algorithm can be thousands of times faster than the status quo, even though the needle is a significant fraction of the haystack's length.
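The length-based cutoff described above could be sketched roughly like this; the constants (the "long enough" threshold and the 20% ratio) are illustrative stand-ins, not CPython's actual tuned values:

```python
# Hypothetical constants for illustration only; the real cutoffs are
# tuned empirically in CPython's fastsearch code.
MIN_HAYSTACK = 2000       # assumed "long enough" haystack length
MAX_NEEDLE_RATIO = 0.20   # the ~20% ratio mentioned above

def choose_algorithm(needle: str, haystack: str) -> str:
    """Pick a search strategy based on the two lengths alone."""
    m, n = len(needle), len(haystack)
    if n >= MIN_HAYSTACK and m <= MAX_NEEDLE_RATIO * n:
        # The O(m) preprocessing is amortized over a long haystack:
        # Two-Way wins on average here.
        return "two-way"
    # Short strings, or a needle that is a large fraction of the
    # haystack: stick with the status-quo fast search.
    return "default"
```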
So to handle that, I tried an adaptive/introspective algorithm like Tim suggested: it counts the matched characters that don't lead to full matches, and once that total exceeds O(m), the Two-Way preprocessing cost will surely be worth it. The only way I found to avoid hurting the speed of the usual small-string cases was to duplicate that code. See comparison.jpg for how much better the adaptive version is than always using Two-Way on this Zipf benchmark, while still guaranteeing O(m + n).
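The adaptive idea above could be sketched as follows. This is a simplified model, not the CPython implementation: the budget constant is illustrative, and `str.find` stands in for the real Two-Way routine after the switch:

```python
def adaptive_find(needle: str, haystack: str) -> int:
    """Sketch of the adaptive strategy: start with a simple scan, but
    count characters compared that did not lead to a full match. Once
    that count exceeds a multiple of m, the O(m) Two-Way preprocessing
    has already paid for itself, so switch algorithms."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    wasted = 0
    budget = 4 * m  # illustrative constant; the real cutoff is tuned
    for i in range(n - m + 1):
        j = 0
        while j < m and haystack[i + j] == needle[j]:
            j += 1
        if j == m:
            return i          # full match at position i
        wasted += j + 1       # comparisons spent on a failed attempt
        if wasted > budget:
            # Too many partial matches: the simple scan is degrading
            # toward quadratic behavior, so pay the preprocessing cost
            # and finish with Two-Way (str.find stands in for it here).
            return haystack.find(needle, i + 1)
    return -1
```

On pathological inputs like a long run of `'a'` with the needle `"aaab"`, the wasted-character counter overflows the budget quickly and the sketch hands off to the linear-time search, which is the behavior that makes the worst case O(m + n).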
Date                | User           | Action | Args
2021-01-17 22:56:56 | Dennis Sweeney | set    | recipients: + Dennis Sweeney, gvanrossum, tim.peters, gregory.p.smith, taleinat, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Zeturic
2021-01-17 22:56:56 | Dennis Sweeney | set    | messageid: <1610924216.68.0.863724303611.issue41972@roundup.psfhosted.org>
2021-01-17 22:56:56 | Dennis Sweeney | link   | issue41972 messages
2021-01-17 22:56:56 | Dennis Sweeney | create |