Message385169
PR 22904 now adds a text document explaining how the Two-Way algorithm works in much more detail.
I was looking at more benchmarking results, and I came to a couple of conclusions about cutoffs.
There's a consistent benefit to using the Two-Way algorithm whenever both strings are long enough and the needle is less than 20% of the length of the haystack. Above that ratio, the preprocessing cost starts to dominate, and the Two-Way algorithm is slower than the status quo on average. This doesn't change the fact that in cases like OP's original example, the Two-Way algorithm can be thousands of times faster than the status quo, even though the needle is a significant fraction of the haystack's length.
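The length-based cutoff described above could be sketched roughly like this; the constants (the "long enough" threshold and the 20% ratio) are illustrative stand-ins, not CPython's actual tuned values:

```python
# Hypothetical constants for illustration only; the real cutoffs are
# tuned empirically in CPython's fastsearch code.
MIN_HAYSTACK = 2000       # assumed "long enough" haystack length
MAX_NEEDLE_RATIO = 0.20   # the ~20% ratio mentioned above

def choose_algorithm(needle: str, haystack: str) -> str:
    """Pick a search strategy based on the two lengths alone."""
    m, n = len(needle), len(haystack)
    if n >= MIN_HAYSTACK and m <= MAX_NEEDLE_RATIO * n:
        # The O(m) preprocessing is amortized over a long haystack:
        # Two-Way wins on average here.
        return "two-way"
    # Short strings, or a needle that is a large fraction of the
    # haystack: stick with the status-quo fast search.
    return "default"
```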
So to handle that, I tried an adaptive/introspective algorithm like Tim suggested: it counts the matched characters that don't lead to full matches, and once that total exceeds O(m), the Two-Way preprocessing cost will surely be worth it. The only way I found to avoid hurting the speed of the usual small-string cases was to duplicate that code. See comparison.jpg for how much better the adaptive version is than always using Two-Way on this Zipf benchmark, while still guaranteeing O(m + n).
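The adaptive idea above could be sketched as follows. This is a simplified model, not the CPython implementation: the budget constant is illustrative, and `str.find` stands in for the real Two-Way routine after the switch:

```python
def adaptive_find(needle: str, haystack: str) -> int:
    """Sketch of the adaptive strategy: start with a simple scan, but
    count characters compared that did not lead to a full match. Once
    that count exceeds a multiple of m, the O(m) Two-Way preprocessing
    has already paid for itself, so switch algorithms."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    wasted = 0
    budget = 4 * m  # illustrative constant; the real cutoff is tuned
    for i in range(n - m + 1):
        j = 0
        while j < m and haystack[i + j] == needle[j]:
            j += 1
        if j == m:
            return i          # full match at position i
        wasted += j + 1       # comparisons spent on a failed attempt
        if wasted > budget:
            # Too many partial matches: the simple scan is degrading
            # toward quadratic behavior, so pay the preprocessing cost
            # and finish with Two-Way (str.find stands in for it here).
            return haystack.find(needle, i + 1)
    return -1
```

On pathological inputs like a long run of `'a'` with the needle `"aaab"`, the wasted-character counter overflows the budget quickly and the sketch hands off to the linear-time search, which is the behavior that makes the worst case O(m + n).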
Date                | User           | Action | Args
2021-01-17 22:56:56 | Dennis Sweeney | set    | recipients: + Dennis Sweeney, gvanrossum, tim.peters, gregory.p.smith, taleinat, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Zeturic
2021-01-17 22:56:56 | Dennis Sweeney | set    | messageid: <1610924216.68.0.863724303611.issue41972@roundup.psfhosted.org>
2021-01-17 22:56:56 | Dennis Sweeney | link   | issue41972 messages
2021-01-17 22:56:56 | Dennis Sweeney | create |