Message 378420 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	Dennis Sweeney, Zeturic, ammar2, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date	2020-10-11.03:37:46
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1602387467.42.0.926321074584.issue41972@roundup.psfhosted.org>
In-reply-to

Content
Ya, the case for the diff is at best marginal. Note that while it may be theoretically provable that the extra test would make the worst cases slower, that's almost certainly not measurable. The extra test would almost never be executed in the worst cases: in those the last characters of the needle and haystack DO match, and we spend almost all the total time in the for (j = 0; j < mlast; j++) loop. The extra test is only in the branch where the last characters _don't_ match. In that branch, it's already certain that the last haystack character does not match the last needle character, so in a probabilistic sense, for "random" data that increases the odds that the last haystack character isn't in the needle at all. In which case the payoff is relatively large compared to the cost of the test (can skip len(needle) positions at once, instead of only 1). I don't believe this can be out-thought. It would require timing on "typical" real-life searches. Which are likely impossible to obtain ;-) Offhand do you know what the _best_ timing for two-way search is in a pattern-not-found case? The algorithm is too complicated for that to be apparent to me at a glance. As Fredrik said in the post of his I linked to, he was very keen to have an algorithm with best case sublinear time. For example, the one we have takes best-case not-found O(n/m) time (where n is the haystack length and m the needle length). For example, searching for 'B' * 1_000_000 in 'A' * 10_000_000 fails after just 9 character comparisons (well, not counting the million less 1 compares to initialize `skip` to the ultimately useless 0).

Ya, the case for the diff is at best marginal. Note that while it may be theoretically provable that the extra test would make the worst cases slower, that's almost certainly not measurable. The extra test would almost never be executed in the worst cases: in those the last characters of the needle and haystack DO match, and we spend almost all the total time in the

for (j = 0; j < mlast; j++)

loop. The extra test is only in the branch where the last characters _don't_ match.

In that branch, it's already certain that the last haystack character does not match the last needle character, so in a probabilistic sense, for "random" data that increases the odds that the last haystack character isn't in the needle at all. In which case the payoff is relatively large compared to the cost of the test (can skip len(needle) positions at once, instead of only 1).

I don't believe this can be out-thought. It would require timing on "typical" real-life searches. Which are likely impossible to obtain ;-)

Offhand do you know what the _best_ timing for two-way search is in a pattern-not-found case? The algorithm is too complicated for that to be apparent to me at a glance. As Fredrik said in the post of his I linked to, he was very keen to have an algorithm with best case sublinear time. For example, the one we have takes best-case not-found O(n/m) time (where n is the haystack length and m the needle length). For example, searching for 'B' * 1_000_000 in 'A' * 10_000_000 fails after just 9 character comparisons (well, not counting the million less 1 compares to initialize `skip` to the ultimately useless 0).

History
Date	User	Action	Args
2020-10-11 03:37:47	tim.peters	set	recipients: + tim.peters, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, Dennis Sweeney, Zeturic
2020-10-11 03:37:47	tim.peters	set	messageid: <1602387467.42.0.926321074584.issue41972@roundup.psfhosted.org>
2020-10-11 03:37:47	tim.peters	link	issue41972 messages
2020-10-11 03:37:46	tim.peters	create