Message 378420 - tim.peters, 2020-10-11 03:37
Ya, the case for the diff is at best marginal. Note that while it may be theoretically provable that the extra test would make the worst cases slower, that's almost certainly not measurable. The extra test would almost never be executed in the worst cases: in those, the last characters of the needle and haystack DO match, and we spend almost all of the total time in the

    for (j = 0; j < mlast; j++)

loop. The extra test is only in the branch where the last characters _don't_ match.
In that branch, it's already certain that the last haystack character does not match the last needle character, so in a probabilistic sense, for "random" data, that increases the odds that the last haystack character isn't in the needle at all, in which case the payoff is relatively large compared to the cost of the test (we can skip len(needle) positions at once, instead of only 1).
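
For concreteness, here is a minimal sketch of the shape of that code, loosely based on CPython's Objects/stringlib/fastsearch.h. The Bloom-filter details and the helper name are illustrative simplifications, not the real implementation:

    #include <stdio.h>
    #include <string.h>

    /* Simplified last-character search with a Bloom-style skip test.
       Returns the index of the first match of p (length m) in s
       (length n), or -1 if not found. */
    static long
    fastsearch_sketch(const char *s, size_t n, const char *p, size_t m)
    {
        size_t mlast, skip, i, j;
        unsigned long long mask = 0;

        if (m == 0 || m > n)
            return -1;
        mlast = m - 1;
        skip = mlast;

        /* Preprocessing: m-1 compares over the needle to build the Bloom
           mask and `skip` (the "million less 1" initialization cost
           mentioned below for a million-character needle). */
        for (i = 0; i < mlast; i++) {
            mask |= 1ULL << ((unsigned char)p[i] & 63);
            if (p[i] == p[mlast])
                skip = mlast - i - 1;
        }
        mask |= 1ULL << ((unsigned char)p[mlast] & 63);

        for (i = 0; i + m <= n; i++) {
            if (s[i + mlast] == p[mlast]) {
                /* Last characters DO match: the worst cases spend almost
                   all of their time in this inner loop. */
                for (j = 0; j < mlast; j++)
                    if (s[i + j] != p[j])
                        break;
                if (j == mlast)
                    return (long)i;
                if (i + m < n && !(mask & (1ULL << ((unsigned char)s[i + m] & 63))))
                    i += m;     /* next char not in needle: skip m */
                else
                    i += skip;  /* modest Horspool-style skip */
            }
            else {
                /* Last characters DON'T match.  For "random" data it's
                   then likelier that s[i + m] isn't in the needle at all,
                   and the test's payoff is a skip of m positions
                   instead of only 1. */
                if (i + m < n && !(mask & (1ULL << ((unsigned char)s[i + m] & 63))))
                    i += m;
            }
        }
        return -1;
    }

    int main(void)
    {
        const char *hay = "hello, world";
        printf("%ld\n", fastsearch_sketch(hay, strlen(hay), "world", 5));  /* 7 */
        return 0;
    }

The final `else` clause is the branch in question: it runs only when the last characters don't match, and its one cheap test can pay off with a skip of m positions.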
I don't believe this can be out-thought. It would require timing "typical" real-life searches, which are likely impossible to obtain ;-)
Offhand, do you know what the _best_ timing for two-way search is in a pattern-not-found case? The algorithm is too complicated for that to be apparent to me at a glance. As Fredrik said in the post of his I linked to, he was very keen to have an algorithm with best-case sublinear time. The one we have, for example, takes O(n/m) best-case time in the not-found case (where n is the haystack length and m the needle length): searching for 'B' * 1_000_000 in 'A' * 10_000_000 fails after just 9 character comparisons (well, not counting the million-less-1 compares needed to initialize `skip` to the ultimately useless 0).
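
To spell out where the 9 comes from (a back-of-the-envelope count, using the simplified skip logic sketched above): every window ends with a failed 'A'-vs-'B' compare, and since 'A' isn't in the needle at all, the Bloom test lets the search jump m positions, plus 1 more from the loop's own increment. The window start only needs to cover 0 through n - m = 9_000_000, so the number of last-character compares is floor(9_000_000 / 1_000_001) + 1 = 9.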