This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author Dennis Sweeney
Recipients Dennis Sweeney, Zeturic, ammar2, corona10, gregory.p.smith, gvanrossum, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date 2020-10-23.09:32:11
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1603445532.42.0.224876122073.issue41972@roundup.psfhosted.org>
In-reply-to
Content
This is the approach in the PR:

    # JUMP_BOTH
    while ...:
        if haystack[j + cut] != needle[cut]:
            # Sunday skip
            j += table[haystack[j + len(needle)]]
            continue
        j += rest_of_the_two_way_algorithm()

If I understand correctly, this is what you're suggesting:

    # JUMP_MAX
    while ...:
        shift = rest_of_the_two_way_algorithm()
        j += max(shift, table[haystack[j + len(needle)]])

I implemented this and on my Zipf benchmark, and JUMP_BOTH was faster on 410 cases (at most 1.86x faster), while JUMP_MAX was faster on 179 cases (at most 1.3x faster).

Some things to notice about the JUMP_BOTH approach:

* The if-statement is actually the first step of the rest_of_the_two_way_algorithm, so there's no extra comparison over the pure two-way algorithm.

* The if-block only executes in situations where the two-way algorithm would say to skip ahead by only 1. In all other situations, the two-way algorithm skips by at least 2.

The typical purpose of the Sunday skip (as far as I can tell) is that in Sunday's algorithm, we find a mismatch, then have no a priori idea of how far ahead to skip, other than knowing that it has to be at least 1, so we check the character 1 to the right of the window. Another way of thinking about this would be to increment the window by 1, look at the last character in the new window, and jump ahead to line it up.

However, with the hybrid method of the PR, we *do* have some a priori skip, bigger than 1, known before we check the table. So instead of doing the maximum of the two-way skip and the Sunday skip, why not do both? As in: do the two-way shift, then extend that with a Sunday shift in the next iteration.

I tried another version also with this in mind, something like:

    # JUMP_AFTER
    while ...:
        # Guaranteed to jump to a mismatched poisition
        j += rest_of_the_two_way_algorithm() - 1
        j += table[haystack[j + len(needle)]]

But this resulted in slightly-worse performance: JUMP_BOTH was faster in 344 cases (at most 1.9x faster), while JUMP_AFTER was faster in 265 cases (at most 1.32x faster). My guess for the explanation of this is that since most of the time is spent on Sunday shifts lining up the cut character, it's beneficial for the compiler to generate specialized code for that case.
History
Date User Action Args
2020-10-23 09:32:12Dennis Sweeneysetrecipients: + Dennis Sweeney, gvanrossum, tim.peters, gregory.p.smith, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Zeturic
2020-10-23 09:32:12Dennis Sweeneysetmessageid: <1603445532.42.0.224876122073.issue41972@roundup.psfhosted.org>
2020-10-23 09:32:12Dennis Sweeneylinkissue41972 messages
2020-10-23 09:32:11Dennis Sweeneycreate