Author Dennis Sweeney
Recipients Dennis Sweeney, Zeturic, ammar2, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date 2020-10-11.06:51:32
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1602399093.18.0.0798662657201.issue41972@roundup.psfhosted.org>
In-reply-to
Content
> Offhand do you know what the _best_ timing for two-way search is in a pattern-not-found case?

Looking at the glibc implementation, in the top-level "else" clause
(for when the string isn't completely periodic),
we have:

    period = MAX (suffix, needle_len - suffix) + 1;

so period > needle_len / 2.

Then during each iteration of the j-loop (where j is the index into the haystack),
j is sometimes incremented by `period`, which would probably give O(n/m) best case.

Here's a sublinear example for the two-way algorithm:

    N = 10**7
    haystack = "ABC" * N
    needle1 = "ABC" * (N // 10) + "DDD"
    needle2 = "ABC" * 10 + "DDD"
    
    === Results ===
    Pure Python Two-Way, shorter needle:     1.7142754
    Pure Python Two-Way. Longer needle:      0.0018845000000000667
    Python builtin, shorter needle:          0.031071400000000082
    Python builtin, longer needle:           0.030566099999999707

This case is surprisingly better than the builtin:

    N = 10**7
    haystack = "ABC" * N
    needle1 = "DDD" + "ABC" * (N // 10)
    needle2 = "DDD" + "ABC" * 10

    === Results ===
    Pure Python Two-Way, shorter needle:     0.0020895000000000774
    Pure Python Two-Way. Longer needle:      0.0017878999999999534
    Python builtin, shorter needle:          0.029998100000000028
    Python builtin, longer needle:           0.02963439999999995

This was measured using the attached fastsearch.py. There are some manipulations in there like string slicing that would make more sense as pointer math in C, so especially accounting for that, I'm guessing it would be pretty competitive in a lot of cases.

A lot of the implementations look like they use a complete Boyer-Moore table to go sublinear in even more cases, which it seems we want to avoid. I'm not certain, but it's conceivable that the existing techniques of using just one one delta-value or the "Bloom filter" could be thrown in there to get some of the same improvements. Although that could be redundant -- I'm not sure.
History
Date User Action Args
2020-10-11 06:51:33Dennis Sweeneysetrecipients: + Dennis Sweeney, tim.peters, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, Zeturic
2020-10-11 06:51:33Dennis Sweeneysetmessageid: <1602399093.18.0.0798662657201.issue41972@roundup.psfhosted.org>
2020-10-11 06:51:33Dennis Sweeneylinkissue41972 messages
2020-10-11 06:51:33Dennis Sweeneycreate