
bytes.find consistently hangs in a particular scenario #86138

Closed
Zeturic mannequin opened this issue Oct 7, 2020 · 77 comments
Labels
3.9 only security fixes 3.10 only security fixes performance Performance or resource usage

Comments

@Zeturic
Mannequin

Zeturic mannequin commented Oct 7, 2020

BPO 41972
Nosy @gvanrossum, @tim-one, @rhettinger, @gpshead, @cfbolz, @taleinat, @pmp-p, @ambv, @serhiy-storchaka, @MojoVampire, @ammaraskar, @corona10, @sweeneyde, @Zeturic
PRs
  • bpo-41972: Use the "Two-Way" algorithm when searching for long substrings #22679
  • bpo-41972: Use the two-way algorithm for string searching #22904
  • bpo-41972: Add whatsnew note for GH-22904 #24672
  • bpo-41972: Tweak fastsearch.h string search algorithms #27091
Files
  • reproducer.py
  • fastsearch.diff
  • fastsearch.py: Pure-python two-way string search algorithm translated from glibc
  • fastsearch.h: Drop-in replacement for Objects/stringlib/fastsearch.h using the two-way algorithm
  • bench_results.txt
  • random_bench.py: Test str in str on random words of a language with a Zipf distribution
  • bench_table.txt
  • twoway-benchmark-results-2020-10-28.txt
  • Table of benchmarks on lengths.jpg
  • length_benchmarks.txt
  • twoway_demo.py: Added Checkbox for using shift-table
  • comparison.jpg: Benchmarks of Two-Way algorithm versus adaptive algorithm on Zipf benchmarks
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-07-19.15:25:54.659>
    created_at = <Date 2020-10-07.23:31:59.517>
    labels = ['3.9', '3.10', 'performance']
    title = 'bytes.find consistently hangs in a particular scenario'
    updated_at = <Date 2021-07-19.15:25:54.658>
    user = 'https://github.com/Zeturic'

    bugs.python.org fields:

    activity = <Date 2021-07-19.15:25:54.658>
    actor = 'lukasz.langa'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-07-19.15:25:54.659>
    closer = 'lukasz.langa'
    components = []
    creation = <Date 2020-10-07.23:31:59.517>
    creator = 'Zeturic'
    dependencies = []
    files = ['49499', '49503', '49507', '49508', '49511', '49512', '49518', '49545', '49577', '49578', '49674', '49746']
    hgrepos = []
    issue_num = 41972
    keywords = ['patch']
    message_count = 77.0
    messages = ['378197', '378209', '378210', '378221', '378222', '378225', '378229', '378233', '378260', '378263', '378301', '378329', '378372', '378420', '378423', '378470', '378472', '378534', '378537', '378541', '378543', '378556', '378576', '378578', '378579', '378582', '378583', '378584', '378585', '378590', '378600', '378635', '378652', '378736', '378737', '378739', '378742', '378750', '378751', '378753', '378793', '378828', '378834', '378836', '378842', '378843', '378844', '378847', '378916', '378917', '378920', '378922', '378923', '378924', '378979', '379388', '379391', '379422', '379494', '379822', '379825', '380473', '380474', '380476', '380477', '380487', '382905', '385169', '387717', '387790', '387791', '387792', '387813', '397277', '397787', '397788', '397805']
    nosy_count = 14.0
    nosy_names = ['gvanrossum', 'tim.peters', 'rhettinger', 'gregory.p.smith', 'Carl.Friedrich.Bolz', 'taleinat', 'pmpp', 'lukasz.langa', 'serhiy.storchaka', 'josh.r', 'ammar2', 'corona10', 'Dennis Sweeney', 'Zeturic']
    pr_nums = ['22679', '22904', '24672', '27091']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue41972'
    versions = ['Python 3.9', 'Python 3.10']

    @Zeturic
    Mannequin Author

    Zeturic mannequin commented Oct 7, 2020

    Sorry for the vague title. I'm not sure how to succinctly describe this issue.

    The following code:

    with open("data.bin", "rb") as f:
        data = f.read()
    
    base = 15403807 * b'\xff'
    longer = base + b'\xff'
    
    print(data.find(base))
    print(data.find(longer))
    

    Always hangs on the second call to find.

    It might complete eventually, but I've left it running and never seen it do so. Because of the structure of data.bin, it should find the same position as the first call to find.

    The first call to find completes and prints near instantly, which makes the pathological performance of the second (which is only searching for one b"\xff" more than the first) even more mystifying.

    I attempted to upload the data.bin file I was working with as an attachment here, but it failed multiple times. I assume it's too large for an attachment; it's a 32MiB file consisting only of 00 bytes and FF bytes.

    Since I couldn't attach it, I uploaded it to a gist. I hope that's okay.

    https://gist.github.com/Zeturic/7d0480a94352968c1fe92aa62e8adeaf

    I wasn't able to trigger the pathological runtime behavior with other sequences of bytes, which is why I uploaded the file in the first place. Randomly generated data, for example, doesn't trigger it.

    I've verified that this happens on multiple versions of CPython (as well as PyPy) and on multiple computers / operating systems.

    @Zeturic Zeturic mannequin added performance Performance or resource usage 3.8 only security fixes 3.9 only security fixes labels Oct 7, 2020
    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Oct 8, 2020

    Can reproduce on Alpine Linux, with CPython 3.8.2 (running under WSLv2), so it's not just you. CPU usage is high; seems like it must be stuck in an infinite loop.

    @MojoVampire MojoVampire mannequin added type-bug An unexpected behavior, bug, or error and removed performance Performance or resource usage labels Oct 8, 2020
    @tim-one
    Member

    tim-one commented Oct 8, 2020

    Also reproduced on 64-bit Win10 with just-released 3.9.0.

    Note that string search tries to incorporate a number of tricks (pieces of Boyer-Moore, Sunday, etc) to speed searches. The "skip tables" here are probably computing a 0 by mistake. The file here contains only 2 distinct byte values, and the relatively huge string to search for has only 1, so there's plenty of opportunity for those tricks to get confused ;-)

    @serhiy-storchaka
    Member

    What is the content of "data.bin"? How can we reproduce the issue?

    @serhiy-storchaka serhiy-storchaka added 3.10 only security fixes labels Oct 8, 2020
    @ammaraskar
    Member

    It's available from the GitHub gist that Kevin posted. Here is a direct link that you can curl/wget:

    https://gist.github.com/Zeturic/7d0480a94352968c1fe92aa62e8adeaf/raw/6daebaabedaa903016810c2c04d0d1f0b1af1ed3/data.bin

    @sweeneyde
    Member

    Adding some hasty printf-debugging to fastsearch.h, I see this:

        >>> with open('data.bin', 'rb') as f:
        ...     s = f.read()
        ...
        >>> base = 15403807 * b'\xff'
        >>> longer = base + b'\xff'
        >>> base in s
        
        Bloom has 0x00: 0
        Bloom has 0xff: -2147483648
        skip = 0
        Checking bloom filter for next char 0x00... not found in pattern. Skipping a bunch.
        Candidate at 15403808
    True
    
    >>> longer in s
    
    Bloom has 0x00: No
    Bloom has 0xff: Yes
    skip = 0
    Checking bloom filter for next char 0xff... found in pattern; increment by only one.
    Candidate at 1
    Candidate at 2
    Candidate at 3
    Candidate at 4
    Candidate at 5
    Candidate at 6
    Candidate at 7
    (...)
    

    Trying the same base, I also get this:

        >>> with open('data.bin', 'rb') as f:
        ...     s = f.read()
        ...
        >>> base = 15403807 * b'\xff'
        >>> s1 = s[1:]
        >>> base in s1
    Bloom has 0x00: No
    Bloom has 0xff: Yes
    skip = 0
    Checking bloom filter for next char 0xff... found in pattern; increment by only one.
    Candidate at 1
    Candidate at 2
    Candidate at 3
    Candidate at 4
    Candidate at 5
    Candidate at 6
    Candidate at 7
    (...)
    

    So maybe part of the issue is just that the algorithm is getting particularly
    unlucky regarding how the strings line up to start with.

    The fact that "skip" is 0 in these cases seems to just be a limitation of the algorithm:
    we only skip to line up the second-to-last occurrence of the last character,
    but with these strings, that's just the second-to-last character.

    Computing the entire Boyer-Moore table would be better in cases like these, but
    that would require dynamic memory allocation (length of pattern) which I think is why it is avoided.
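
    For reference, here is a rough pure-Python paraphrase of the preprocessing in fastsearch.h
    (reconstructed from memory of the pre-3.10 code, so details may differ slightly; the function
    names are just for illustration). The "Bloom filter" is a bit mask recording which (masked)
    characters occur in the pattern, and skip is the distance from the second-to-last occurrence
    of the pattern's last character to the end:

        BLOOM_WIDTH = 64  # 8 * sizeof(long) on typical 64-bit builds

        def bloom_add(mask, ch):
            return mask | (1 << (ch & (BLOOM_WIDTH - 1)))

        def bloom_has(mask, ch):
            return bool(mask & (1 << (ch & (BLOOM_WIDTH - 1))))

        def preprocess(needle):
            m = len(needle)
            mlast = m - 1
            mask = 0
            skip = mlast - 1
            for i in range(mlast):
                mask = bloom_add(mask, needle[i])
                if needle[i] == needle[mlast]:
                    # distance from this occurrence of the last character
                    # to the end of the needle
                    skip = mlast - i - 1
            mask = bloom_add(mask, needle[mlast])
            return mask, skip

        # For an all-0xff needle, the second-to-last character equals the last
        # one, so skip comes out as 0 and the search can only creep forward one
        # position at a time once the Bloom filter stops helping.
        mask, skip = preprocess(b"\xff" * 10)
        print(skip)  # 0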

    @vstinner
    Member

    vstinner commented Oct 8, 2020

    Does the code really hang? Or does it complete if you wait for an infinite amount of time?

    I understand that your concern is that bytes.find() is inefficient for a very specific pattern.

    @sweeneyde
    Member

    Indeed, this is just a very unlucky case.

        >>> n = len(longer)
        >>> from collections import Counter
        >>> Counter(s[:n])
        Counter({0: 9056995, 255: 6346813})
        >>> s[n-30:n+30].replace(b'\x00', b'.').replace(b'\xff', b'@')
        b'..............................@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
        >>> Counter(s[n:])
        Counter({255: 18150624})

    When checking "base", we're in this situation

    pattern:     @@@@@@@@
     string:     .........@@@@@@@@
    Algorithm says:     ^ these last characters don't match.
                         ^ this next character is not in the pattern
                         Therefore, skip ahead a bunch:
    
     pattern:              @@@@@@@@
      string:     .........@@@@@@@@
    
     This is a match!
    

    Whereas when checking "longer", we're in this situation:

    pattern:     @@@@@@@@@
     string:     .........@@@@@@@@
    Algorithm says:      ^ these last characters don't match.
                          ^ this next character *is* in the pattern.
                          We can't jump forward.
    
     pattern:       @@@@@@@@
      string:     .........@@@@@@@@
    
     Start comparing at every single alignment...
    

    I'm attaching reproducer.py, which replicates this from scratch without loading data from a file.
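
    (For anyone who can't grab the 32 MiB gist: the following is purely an illustration of the
    data shape described above, not necessarily what the attached reproducer.py does, and the
    sizes are made up. A prefix made of long 0xff runs broken up by occasional 0x00 bytes,
    followed by a solid 0xff run containing the needle, shows the same behaviour.)

        # Illustrative only: sizes are arbitrary, chosen so the prefix contains
        # long runs of the needle's byte interrupted by other bytes, followed
        # by a solid run long enough to hold the needle.
        RUN = 10_000          # length of each 0xff run in the prefix
        PREFIX_RUNS = 1_500   # number of interrupted runs before the real match
        NEEDLE_LEN = 5_000_000

        haystack = (b"\xff" * RUN + b"\x00") * PREFIX_RUNS + b"\xff" * (NEEDLE_LEN + 10)
        needle = b"\xff" * NEEDLE_LEN

        # With the pre-3.10 algorithm, each alignment inside the prefix matches
        # up to RUN bytes before hitting a 0x00 and then advances by only one
        # position, so this appears to hang; on fixed versions it is near-instant.
        print(haystack.find(needle))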

    @tim-one
    Member

    tim-one commented Oct 8, 2020

    Good sleuthing, Dennis! Yes, Fredrik was not willing to add "potentially expensive" (in time or in space) tricks:

    http://effbot.org/zone/stringlib.htm

    So worst-case time is proportional to the product of the arguments' lengths, and the cases here appear to, essentially, hit that. It _was_ a goal that it always be at least as fast as the dirt-dumb search algorithm it replaced, and in good real-life (not just contrived) cases to be much faster. It met the goals it had.

    @tim-one
    Member

    tim-one commented Oct 8, 2020

    Just FYI, the original test program did get the right answer for the second search on my box - after about 3 1/2 hours :-)

    @tim-one
    Member

    tim-one commented Oct 9, 2020

    BTW, this initialization in the FASTSEARCH code appears to me to be a mistake:

        skip = mlast - 1;

    That's "mistake" in the sense of "not quite what was intended, and so confusing", not in the sense of "leads to a wrong result".

    I believe skip should be initialized to plain mlast instead. Setting it to something smaller than that (like to mlast - 1) doesn't cause an error, but fails to exploit as much skipping as possible.

    mlast - 1 is the same value skip is set to if the following loop finds that the last character of the pattern is also the pattern's first character, but doesn't appear elsewhere in the pattern. It can be one larger if the last pattern character is unique in the pattern (which is the initial value of skip set by the line shown above).

    @tim-one
    Member

    tim-one commented Oct 9, 2020

    The attached fastsearch.diff removes the speed hit in the original test case and in the constructed one.

    I don't know whether it "should be" applied, and really can't make time to dig into it.

    The rationale: when the last characters of the pattern string and the current search window don't match, this is what happens: if the "Bloom filter" can prove that the character one beyond the search window isn't in the pattern at all, the search window is advanced by len(pattern) + 1 positions. Else it's advanced only 1 position.

    What changes here: if the Bloom filter thinks the character one beyond the search window may be in the pattern, this goes on to see whether the Bloom filter thinks the last character in the search window may be in the pattern (by case assumption, we already know it's not the _last_ character in the pattern). If not, the search window can be advanced by len(pattern) positions.

    So it adds an extra check in the last-characters-don't-match and next-character-may-be-in-the-pattern case: if the last character is not in the pattern, it's irrelevant that the next character may be - there's no possible overall match including the last character, so we can move the search window beyond it.

    The question is whether this costs more than it saves overall. It saves literally hours of CPU time for one search in this report's case ;-) But worst-case time remains O(m*n) regardless, just somewhat harder to provoke.
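
    A rough pure-Python sketch of the idea (not the attached diff itself; the real code works on
    a C bit-mask "Bloom filter", which a plain set stands in for here, and the function name is
    just for illustration):

        def find_with_extra_check(haystack, needle):
            n, m = len(haystack), len(needle)
            if m == 0:
                return 0
            mlast = m - 1
            in_needle = set(needle)
            skip = mlast - 1
            for i in range(mlast):
                if needle[i] == needle[mlast]:
                    skip = mlast - i - 1

            i = 0
            while i <= n - m:
                if haystack[i + mlast] == needle[mlast]:
                    # Last characters match: check the rest of the window.
                    if haystack[i:i + mlast] == needle[:mlast]:
                        return i
                    if i + m < n and haystack[i + m] not in in_needle:
                        i += m + 1      # char beyond the window can't be in any match
                    else:
                        i += skip + 1
                else:
                    # Last characters don't match.
                    if i + m < n and haystack[i + m] not in in_needle:
                        i += m + 1      # char beyond the window can't be in any match
                    elif haystack[i + mlast] not in in_needle:
                        # The extra check: the last window character isn't in
                        # the needle at all, so no match can include it.
                        i += m
                    else:
                        i += 1
            return -1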

    @sweeneyde
    Member

    I agree that skip could do 1 better.

    ---

    I don't know whether it "should be" applied

    I don't think I'm convinced: the second check fixes only the very specific case when s[len(p):].startswith(p).
    Perturbations of reproducer.py are still very slow with the patch:

    - pattern2 = b"B" * BIG
    + pattern2 = b"B" * (BIG + 10)
    

    In fact, it's even slower than before due to the double-checking.

    ---

    I'd be curious how something like Crochemore and Perrin's "two-way" algorithm would do (constant space, compares < 2n-m):

    http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
    https://en.wikipedia.org/wiki/Two-way_string-matching_algorithm
    

    Apparently it's been used as glibc's memmem() since 2008. There, it additionally computes a table, but only for long needles where it really pays off:

    https://code.woboq.org/userspace/glibc/string/str-two-way.h.html
    https://code.woboq.org/userspace/glibc/string/memmem.c.html
    

    Although there certainly seems to be a complexity rabbit hole here that would be easy to overdo.
    I might look more into this.

    @tim-one
    Member

    tim-one commented Oct 11, 2020

    Ya, the case for the diff is at best marginal. Note that while it may be theoretically provable that the extra test would make the worst cases slower, that's almost certainly not measurable. The extra test would almost never be executed in the worst cases: in those the last characters of the needle and haystack DO match, and we spend almost all the total time in the

                for (j = 0; j < mlast; j++)
    

    loop. The extra test is only in the branch where the last characters _don't_ match.

    In that branch, it's already certain that the last haystack character does not match the last needle character, so in a probabilistic sense, for "random" data that increases the odds that the last haystack character isn't in the needle at all. In which case the payoff is relatively large compared to the cost of the test (can skip len(needle) positions at once, instead of only 1).

    I don't believe this can be out-thought. It would require timing on "typical" real-life searches. Which are likely impossible to obtain ;-)

    Offhand do you know what the best timing for two-way search is in a pattern-not-found case? The algorithm is too complicated for that to be apparent to me at a glance. As Fredrik said in the post of his I linked to, he was very keen to have an algorithm with best case sublinear time. For example, the one we have takes best-case not-found O(n/m) time (where n is the haystack length and m the needle length). For example, searching for 'B' * 1_000_000 in 'A' * 10_000_000 fails after just 9 character comparisons (well, not counting the million less 1 compares to initialize skip to the ultimately useless 0).

    @sweeneyde
    Member

    Offhand do you know what the _best_ timing for two-way search is in a pattern-not-found case?

    Looking at the glibc implementation, in the top-level "else" clause
    (for when the needle isn't completely periodic),
    we have:

        period = MAX (suffix, needle_len - suffix) + 1;

    so period > needle_len / 2.

    Then during each iteration of the j-loop (where j is the index into the haystack),
    j is sometimes incremented by period, which would probably give O(n/m) best case.

    Here's a sublinear example for the two-way algorithm:

        N = 10**7
        haystack = "ABC" * N
        needle1 = "ABC" * (N // 10) + "DDD"
        needle2 = "ABC" * 10 + "DDD"
        
    === Results ===
    Pure Python Two-Way, shorter needle:     1.7142754
    Pure Python Two-Way, longer needle:      0.0018845000000000667
    Python builtin, shorter needle:          0.031071400000000082
    Python builtin, longer needle:           0.030566099999999707

    This case is surprisingly better than the builtin:

        N = 10**7
        haystack = "ABC" * N
        needle1 = "DDD" + "ABC" * (N // 10)
        needle2 = "DDD" + "ABC" * 10
    === Results ===
    Pure Python Two-Way, shorter needle:     0.0020895000000000774
    Pure Python Two-Way, longer needle:      0.0017878999999999534
    Python builtin, shorter needle:          0.029998100000000028
    Python builtin, longer needle:           0.02963439999999995
    

    This was measured using the attached fastsearch.py. There are some manipulations in there like string slicing that would make more sense as pointer math in C, so especially accounting for that, I'm guessing it would be pretty competitive in a lot of cases.

    A lot of the implementations look like they use a complete Boyer-Moore table to go sublinear in even more cases, which it seems we want to avoid. I'm not certain, but it's conceivable that the existing techniques of using just one delta-value or the "Bloom filter" could be thrown in there to get some of the same improvements. Although that could be redundant -- I'm not sure.

    @tim-one
    Member

    tim-one commented Oct 12, 2020

    Impressive, Dennis! Nice work.

    FYI, on the OP's original test data, your fastsearch() completes each search in under 20 seconds using CPython, and in well under 1 second using PyPy.

    Unfortunately, that's so promising it can't just be dismissed offhandedly ;-) Looks like something quite worth pursuing here!

    @sweeneyde
    Member

    Here is a C implementation of the two-way algorithm that should work as a drop-in replacement for Objects/stringlib/fastsearch.h.

    Benchmarking so far, it looks like it is a bit slower in a lot of cases. But it's also a bit faster in some other cases and oodles faster in the really bad cases.

    I wonder if there's a good heuristic cutoff (for the needle size?) where the two-way usually becomes better.

    @sweeneyde
    Member

    PR 22679 is a draft that does the two-way algorithm but also adds both of the tricks from Fredrik's implementation: a bit-set "bloom filter" and remembering the skip-distance between some pair of characters.

    @sweeneyde
    Member

    Dennis, I think that's expected, right? Two-way on its own can exploit nothing about individual characters - it only preprocesses the needle to break the possibility for quadratic-time behavior due to periods in the needle.

    Yes, that's correct.

    It sounds like you switched the current code's Sunday-ish approach to a more Boyer-Moore-ish approach. That probably hurt, and especially for shorter needles (where Sunday can yield a maximum shift of len(needle)+1, not len(needle) - the "+1" is more significant the smaller len(needle)).

    I'm definitely open to switching things up, but I'm not totally sure there would be a difference. In Sunday, the pattern is somehow scanned at each step, and then upon finding a mismatch, we look up in the table the first character outside the window. With the PR as written, when looking up in the table the last character of the window, we haven't actually produced a mismatch yet; the 'continue' statements jump ahead until the last characters match (mod 128), and then the two-way scanning begins, which determines how the window gets shifted after that. But after this two-way-determined shift, the last character in the new window immediately gets looked up again in the table at the top of the loop, effectively letting the two-way shift immediately be extended to a longer shift. I suppose that the line j += i - suffix + 1; (which is very often an increment by 1) could be changed to a table lookup of the character just beyond the window, but I don't quite understand why that would be better than incrementing j and then doing the table lookups at the top of the loop.

    @sweeneyde
    Member

    I posted the example thinking that having a concrete walkthrough might be good for discussion, and it looks like it was. ;-)

    This makes me curious how a simplified-but-not-as-simplified-as-the-status-quo Sunday algorithm would fare: using the Sunday algorithm, but with a shift lookup modulo 128 rather than a boolean lookup. It might not be provably O(n), but it might be faster for some cases. Although if the common case is already fast enough, maybe we do want a provably-linear algorithm for the big stuff.

    @tim-one
    Member

    tim-one commented Oct 19, 2020

    I confess I _assumed_ all along that you were generalizing the current code's Sunday trick to 7-bit equivalence classes (up from 32 bits total) and 64K possible shift counts (up from just 2 total possibilities: 1 or len(needle)+1). The Sunday trick couldn't care less where or when the mismatch occurs, just that a mismatch occurred somewhere.

    In my head I was picturing the paper's code, not the PR. Whenever it makes a shift, it could compare it to the "Sunday-ish shift", and pick the larger. That should have no effect on worst case O() behavior - it would always shift at least as much as Crochemore-Perrin when a mismatch was hit.

    I can't say how it would relate to details of the PR's spelling, because I'm still trying to fully understand the paper ;-) I don't believe I can usefully review the code before then.

    @sweeneyde
    Member

    Unfortunately due to licensing issues, it looks like I'll have to ditch the PR and start from scratch: it was too heavily based on the glibc implementation, which has a stricter license. It may take a good deal of time before I have a potential replacement, but I'll try to work on a "clean-room" implementation. Lesson learned!

    @sweeneyde
    Member

    I attached a new PR, with a lot of the same ideas.

    The major differences between this and the last PR:

    • The first character to be checked at each alignment is the first character of the right half, rather than the last.

    • If that first character does not match, then the character immediately following the window is looked up in the table, and we jump forward accordingly (Sunday's trick).

    I'll post some more benchmarks soon, but preliminarily, it seems like this swapping of the character to first be matched is better for some needles, worse for others, which makes sense. Stringbench.py for example has some needles that have a unique character at the end, which prefers the status quo and old PR, but other strings work better for this PR.

    @tim-one
    Member

    tim-one commented Oct 23, 2020

    Note that Sunday doesn't care (at all) where mismatches occur. The "natural" way to add Sunday: follow pure C-P unless/until it finds a mismatching position. Pure C-P then computes a specific shift. Nothing about that changes. But something is added: also compute the shift Sunday suggests, and pick the larger of that and what C-P computed.

    C-P and Sunday both have cases where they can justify "supernaturally large" shifts (i.e., as long as len(needle), or even that +1 for Sunday), and they're not always the same cases.

    @sweeneyde
    Member

    This is the approach in the PR:

        # JUMP_BOTH
        while ...:
            if haystack[j + cut] != needle[cut]:
                # Sunday skip
                j += table[haystack[j + len(needle)]]
                continue
            j += rest_of_the_two_way_algorithm()

    If I understand correctly, this is what you're suggesting:

        # JUMP_MAX
        while ...:
            shift = rest_of_the_two_way_algorithm()
            j += max(shift, table[haystack[j + len(needle)]])

    I implemented this, and on my Zipf benchmark JUMP_BOTH was faster on 410 cases (at most 1.86x faster), while JUMP_MAX was faster on 179 cases (at most 1.3x faster).

    Some things to notice about the JUMP_BOTH approach:

    • The if-statement is actually the first step of the rest_of_the_two_way_algorithm, so there's no extra comparison over the pure two-way algorithm.

    • The if-block only executes in situations where the two-way algorithm would say to skip ahead by only 1. In all other situations, the two-way algorithm skips by at least 2.

    The typical purpose of the Sunday skip (as far as I can tell) is that in Sunday's algorithm, we find a mismatch, then have no a priori idea of how far ahead to skip, other than knowing that it has to be at least 1, so we check the character 1 to the right of the window. Another way of thinking about this would be to increment the window by 1, look at the last character in the new window, and jump ahead to line it up.

    However, with the hybrid method of the PR, we *do* have some a priori skip, bigger than 1, known before we check the table. So instead of doing the maximum of the two-way skip and the Sunday skip, why not do both? As in: do the two-way shift, then extend that with a Sunday shift in the next iteration.

    I tried another version also with this in mind, something like:

        # JUMP_AFTER
        while ...:
            # Guaranteed to jump to a mismatched position
            j += rest_of_the_two_way_algorithm() - 1
            j += table[haystack[j + len(needle)]]

    But this resulted in slightly-worse performance: JUMP_BOTH was faster in 344 cases (at most 1.9x faster), while JUMP_AFTER was faster in 265 cases (at most 1.32x faster). My guess for the explanation of this is that since most of the time is spent on Sunday shifts lining up the cut character, it's beneficial for the compiler to generate specialized code for that case.

    @sweeneyde
    Member

    Here are those zipf-distributed benchmarks for PR 22904: https://pastebin.com/raw/qBaMi2dm

    Ignoring differences of <5%, there are 33 cases that get slower, but 477 cases that got faster.

    Here's a stringbench.py run for PR 22904: https://pastebin.com/raw/ABm32bA0

    It looks like the stringbench times get a bit worse on a few cases, but I would attribute that to the benchmarks having many "difficult" cases with a unique character at the end of the needle, such as:

        s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E"),

    which the status quo already handles as well as possible, whereas the PR best handles the case where some middle "cut" character is unique. Who knows how common these cases are.

    @taleinat
    Contributor

    After spending quite a while setting up my dev machine for (hopefully) reliable benchmarking, I can say with a high level of certainty that the latest PR (GH-22904, commit fe9e9d9) runs stringbench.py significantly faster than the master branch (commit fb5db7e). See attached results.

    I've also run the zipf-distributed random string benchmarks provided by Dennis, with a slight addition of random.seed(1) in the do_timings function. Those show significant, consistent improvements on needles of length over 100, but not on shorter needles. The results are also found in the attached file.

    My conclusion is that the current two-way implementation is certainly better for needles of length over 100, but for shorter needles the situation is still unclear.

    System details:
    Dell XPS 9560
    Intel i7-7700HQ
    Ubuntu 20.04

    Steps taken to prepare system for performance tuning:

    • Enable CPU isolation for two physical cores
    • Disable Intel p-state driver
    • Disable turbo-boost
    • Run pyperf system tune
    • Run benchmarks with CPU pinning

    @taleinat
    Contributor

    I should also mention that I configured and compiled Python for each of the above-mentioned versions using the `--enable-optimizations` flag.

    @sweeneyde
    Member

    I am attempting to better understand the performance characteristics to determine where a cutoff should go.

    Attached is a colorful table of benchmarks comparing the existing algorithm to the PR with the cutoff changed to if (1) (always two-way) or if (0) (always status quo), tested on a variety of needle lengths and haystack lengths.

    @sweeneyde
    Member

    I used the following script, and the raw results are attached.

        import pyperf
        from itertools import product

        # zipf_string() is the Zipf-distributed random-string helper from the
        # attached benchmark script (not shown here; a hypothetical stand-in
        # follows below).
        runner = pyperf.Runner()
    
        ms = [12, 16, 24, 32, 48, 64, 96, 128, 
              192, 256, 384, 512, 768, 1024, 1536]
    
        ns = [2000, 3000, 4000, 6000, 8000,
              12000, 16000, 24000, 32000, 48000,
              64000, 96000]
    
        for n, m in product(ns, ms):
            runner.timeit(f"needle={m}, haystack={n}",
                          setup="needle = zipf_string(m); haystack = zipf_string(n)",
                          stmt="needle in haystack",
                          globals=locals() | globals())
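
    Since zipf_string() isn't shown, here is a purely hypothetical stand-in (the attached
    benchmark defines its own version, which may differ): build a small vocabulary of random
    words and sample them with Zipf-like weights until the requested length is reached.

        import random
        import string

        def zipf_string(length, vocab_size=1000, word_len=5, seed=1):
            rng = random.Random(seed)
            words = ["".join(rng.choices(string.ascii_lowercase, k=word_len))
                     for _ in range(vocab_size)]
            # Zipf-ish weights: word i gets probability proportional to 1/(i+1).
            weights = [1.0 / (i + 1) for i in range(vocab_size)]
            parts, total = [], 0
            while total < length:
                word = rng.choices(words, weights=weights)[0]
                parts.append(word)
                total += len(word) + 1
            return " ".join(parts)[:length]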

    @tim-one
    Member

    tim-one commented Nov 6, 2020

    I'm sorry I haven't been able to give more time to this. I love what's been done, but am just overwhelmed :-(

    The main thing here is to end quadratic-time disasters, without doing significant damage in other cases. Toward that end it would be fine to use "very high" cutoffs, and save tuning for later. It's clear that we _can_ get significant benefit even in many non-disastrous cases, but unclear exactly how to exploit that. But marking the issue report as "fixed" doesn't require resolving that - and I want to see this improvement in the next Python release.

    An idea that hasn't been tried: in the world of sorting, quicksort is usually the fastest method - but it's prone to quadratic-time disasters. OTOH, heapsort never suffers disasters, but is almost always significantly slower in ordinary cases. So the more-or-less straightforward "introsort" compromises, by starting with a plain quicksort, but with "introspection" code added to measure how quickly it's making progress. If quicksort appears to be thrashing, it switches to heapsort.

    It's possible an "introsearch" would work well too. Start with pure brute force (not even with the current preprocessing), and switch to something else if it's making slow progress. That's easy to measure: compare the number of character comparisons we've made so far to how many character positions we've advanced the search window so far. If that ratio exceeds (say) 3, we're heading out of linear-time territory, so should switch to a different approach.

    With no preprocessing at all, that's likely to be faster than even the current code for very short needles, and in all cases where the needle is found at the start of the haystack.

    This context is trickier, though, in that the current code, and pure C&P, can also enjoy sub-linear time cases. It's always something ;-)
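
    A minimal sketch of that introspection idea (not CPython code; the function name, the ratio,
    and the slack constant are made up, and str.find stands in for the linear-time fallback):

        def introspective_find(haystack, needle, ratio=3, slack=100):
            # Dumb quadratic search that tracks how many character comparisons
            # it has made versus how far the window has advanced; if progress
            # gets too slow, hand the rest off to a guaranteed-linear search.
            n, m = len(haystack), len(needle)
            comparisons = 0
            i = 0
            while i <= n - m:
                j = 0
                while j < m and haystack[i + j] == needle[j]:
                    j += 1
                comparisons += j + 1
                if j == m:
                    return i
                i += 1
                if comparisons > ratio * i + slack:
                    # Progress is too slow; switch algorithms for the rest.
                    return haystack.find(needle, i)
            return -1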

    @tim-one
    Member

    tim-one commented Nov 6, 2020

    But that also suggests a different approach: start with the current code, but add introspection to switch to your enhancement of C&P if the current code is drifting out of linear-time territory.

    @sweeneyde
    Member

    Toward that end it would be fine to use "very high" cutoffs, and save tuning for later.

    This feels reasonable to me -- I changed the cutoff to the more cautious if (m >= 100 && n - m >= 5000), where the averages are very consistently faster by my measurements, and it seems that Tal confirms that, at least for the m >= 100 part. More tuning may be worth exploring later, but this seems pretty safe for now, and it should fix all of the truly catastrophic cases like in the original post.

    @sweeneyde
    Member

    For convenience, attached is a quick and dirty Tkinter GUI that lets you step through the Crochemore/Perrin Algorithm on your choice of inputs, just for play/discovery.

    A good illustration of the memory for periodic needles can be found by testing:
    haystack="456,123,123,456,123,123,456"
    needle="123,123,123,"

    The GUI program does not implement the Boyer-Moore/Horspool/Sunday-style shift-table. In the current PR 22904, this table is used in exactly those situations where the GUI says "Matched 0, so jump ahead by 1".

    @sweeneyde
    Member

    PR 22904 now adds a text document explaining how the Two-Way algorithm works in much more detail.

    I was looking at more benchmarking results, and I came to a couple of conclusions about cutoffs.

    There's a consistent benefit to using the two-way algorithm whenever both strings are long enough and the needle is less than 20% the length of the haystack. If that percentage is higher, the preprocessing cost starts to dominate, and the Two-Way algorithm is slower than the status quo on average in such cases. This doesn't change the fact that in cases like OP's original example, the Two-way algorithm can be thousands of times faster than the status quo, even though the needle is a significant percentage of the length of the haystack.

    So to handle that, I tried an adaptive/introspective algorithm like Tim suggested: it counts the number of matched characters that don't lead to full matches, and once that total exceeds O(m), the Two-Way preprocessing cost will surely be worth it. The only way I found to avoid damaging the speed of the usual small-string cases was to duplicate that code. See comparison.jpg for how much better the adaptive version is as opposed to always using two-way on this Zipf benchmark, while still guaranteeing O(m + n).
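
    A rough pure-Python model of that switch (the real code in the PR is C and its bookkeeping
    and thresholds differ; unlike the comparison/advance ratio sketched earlier, this counts only
    characters that matched at failed alignments, and str.find stands in for the two-way search):

        def adaptive_find(haystack, needle):
            n, m = len(haystack), len(needle)
            wasted = 0          # characters matched at alignments that failed
            i = 0
            while i <= n - m:
                j = 0
                while j < m and haystack[i + j] == needle[j]:
                    j += 1
                if j == m:
                    return i
                wasted += j
                i += 1
                if wasted > m:
                    # Enough wasted work that the O(m) two-way preprocessing
                    # is certainly worth it, so switch algorithms.
                    return haystack.find(needle, i)
            return -1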

    @cfbolz
    Mannequin

    cfbolz mannequin commented Feb 26, 2021

    BTW, this initialization in the FASTSEARCH code appears to me to be a mistake:

    skip = mlast - 1;

    Thanks for pointing that out Tim! Turns out PyPy had copied that mindlessly and I just fixed it.

    (I'm also generally following along with this issue, I plan to implement the two-way algorithm for PyPy as well, once you all have decided on a heuristic. We are occasionally in a slightly easier situation, because for constant-enough needles we can have the JIT do the pre-work on the needle during code generation.)

    @sweeneyde
    Member

    Any chance PR 22904 can make it into 3.10 before the May 03 feature freeze? The text document in that PR has some notes on how the algorithm works, so that may be a good place to start reviewing if anyone is interested.

    @gvanrossum
    Member

    If Tim approves we might get it into alpha 6 which goes out Monday.

    @tim-one
    Member

    tim-one commented Feb 27, 2021

    I'm very sorry for not keeping up with this - my health has been poor, and I just haven't been able to make enough time.

    Last time I looked to a non-trivial depth, I was quite happy, and just quibbling about possible tradeoffs.

    I can't honestly commit to doing better in the near future, so where does that leave us? I'd personally "risk it".

    I just added Raymond to the nosy list, on the off chance he can make some "emergency time" to give a more informed yes/no. He's routinely sucked up weeks of my life with random review requests, so I figure turnabout is fair play ;-)

    @tim-one
    Member

    tim-one commented Feb 28, 2021

    New changeset 73a85c4 by Dennis Sweeney in branch 'master':
    bpo-41972: Use the two-way algorithm for string searching (GH-22904)

    @sweeneyde
    Member

    In "Fast String Searching" (1991), A. Hume and D. Sunday emphasize that most of the runtime of string-search algorithms occurs in a code path that skips over immediately-disqualified alignments. As such, that paper recommends extracting a hot "Horspool" code path and keeping it small. For that reason, I'm posting a PR which boils down that fast path to the following:

        while (skip > 0 && window_last < haystack_end) {
            window_last += skip;
            skip = table[(*window_last) & 63];
        }

    This uses two memory accesses, whereas the last implementation used three (the extra to get window[cut]). This requires the skip table to change slightly, since it is indexed by the last character in the current haystack window rather than one character later ("fast" versus "sd1" in the paper). The paper also recommends unrolling this loop 3x, but my benchmarks found no benefit to unrolling.
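
    To illustrate the shape of that hot path, here is a pure-Python sketch using the same masked
    indexing (this is not the PR's code and the names are made up; in the PR the verification step
    is the two-way scan and the shift after a failed verification uses the precomputed gap):

        def make_skip_table(needle, mask=63):
            m = len(needle)
            table = [m] * (mask + 1)              # class absent from the needle
            for i in range(m - 1):                # rightmost occurrence wins
                table[needle[i] & mask] = m - 1 - i
            table[needle[m - 1] & mask] = 0       # last char's class: stop and verify
            return table

        def masked_horspool_find(haystack, needle, mask=63):
            # bytes version; for str, mask ord(ch) instead of the byte value
            m, n = len(needle), len(haystack)
            if m == 0:
                return 0
            if m > n:
                return -1
            table = make_skip_table(needle, mask)
            i = m - 1                             # index of the window's last character
            while i < n:
                skip = table[haystack[i] & mask]
                if skip:                          # the hot two-access loop
                    i += skip
                    continue
                start = i - m + 1
                if haystack[start:i + 1] == needle:
                    return start
                i += 1                            # failed verification: advance by one
            return -1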

    The PR also refactors some of the main fastsearch code into separate functions, and computes and then uses a similar gap/skip integer used already in the default implementation (Hume and Sunday call this "md2"). It retains a linear-time constant space worst-case with the Two-Way algorithm.

    There are a bunch of benchmarks here, both on Zipf-distributed random strings and on a few c/rst/python/binary files in the cpython repo. They compare the current main branch to the PR:

    https://gist.github.com/sweeneyde/fc20ed5e72dc6b0f3b41c0abaf4ec3be
    

    Summary of those results:

    On the Zipf strings:
    * 83 cases slower (at most 1.59x slower)
    * 654 cases faster (up to 4.49x faster)
    * Geometric mean: 1.10x faster

    On the "Real world" source files:
    * 76 cases slower (at most 2.41x slower)
    * 420 cases faster (up to 18.12x faster)
    * Geometric mean: 1.33x faster

    @ambv
    Contributor

    ambv commented Jul 19, 2021

    I checked the original example in this issue and the newest change in #71278 makes the data.find(base) case 8.2% faster compared to main while the data.find(longer) case is 10.8% faster.

    @ambv
    Contributor

    ambv commented Jul 19, 2021

    New changeset d01dceb by Dennis Sweeney in branch 'main':
    bpo-41972: Tweak fastsearch.h string search algorithms (GH-27091)

    @ambv
    Contributor

    ambv commented Jul 19, 2021

    Looks like this can be closed now: the original issue is fixed, the original patch is merged for 3.10, and the improved patch is merged for 3.11.

    Thanks! ✨ 🍰 ✨
