Message 397277 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Dennis Sweeney
Recipients	Carl.Friedrich.Bolz, Dennis Sweeney, Zeturic, ammar2, corona10, gregory.p.smith, gvanrossum, josh.r, pmpp, rhettinger, serhiy.storchaka, taleinat, tim.peters
Date	2021-07-12.03:57:27
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1626062247.6.0.499668817634.issue41972@roundup.psfhosted.org>
In-reply-to

Content
In "Fast String Searching" (1991), A. Hume and D. Sunday emphasize that most of the runtime of string-search algorithms occurs in a code path that skips over immediately-disqualified alignments. As such, that paper recommends extracting a hot "Horspool" code path and keeping it small. For that reason, I'm posting a PR which boils down that fast path to the following: while (skip > 0 && window_last < haystack_end) { window_last += skip; skip = table[(window_last) & 63]; } This uses two memory accesses, whereas the last implementation used three (the extra to get window[cut]). This requires the skip table to change slightly, since it is indexed by the last character in the current haystack window rather than one character later ("fast" versus "sd1" in the paper). The paper also recommends unrolling this loop 3x, but my benchmarks found no benefit to unrolling. The PR also refactors some of the main fastsearch code into separate functions, and computes and then uses a similar gap/skip integer used already in the default implementation (Hume and Sunday call this "md2"). It retains a linear-time constant space worst-case with the Two-Way algorithm. There are a bunch of benchmarks here, both on Zipf-distributed random strings and on a few c/rst/python/binary files in the cpython repo. They compare the current main branch to the PR: https://gist.github.com/sweeneyde/fc20ed5e72dc6b0f3b41c0abaf4ec3be Summary of those results: On the Zipf strings: 83 cases slower (at most 1.59x slower) * 654 cases faster (up to 4.49x faster) * Geometric mean: 1.10x faster On the "Real world" source files: * 76 cases slower (at most 2.41x slower) * 420 cases faster (up to 18.12x faster) * Geometric mean: 1.33x faster

In "Fast String Searching" (1991), A. Hume and D. Sunday emphasize that most of the runtime of string-search algorithms occurs in a code path that skips over immediately-disqualified alignments. As such, that paper recommends extracting a hot "Horspool" code path and keeping it small. For that reason, I'm posting a PR which boils down that fast path to the following:

    while (skip > 0 && window_last < haystack_end) {
        window_last += skip;
        skip = table[(*window_last) & 63];
    }

This uses two memory accesses, whereas the last implementation used three (the extra to get window[cut]). This requires the skip table to change slightly, since it is indexed by the last character in the current haystack window rather than one character later ("fast" versus "sd1" in the paper). The paper also recommends unrolling this loop 3x, but my benchmarks found no benefit to unrolling.

The PR also refactors some of the main fastsearch code into separate functions, and computes and then uses a similar gap/skip integer used already in the default implementation (Hume and Sunday call this "md2"). It retains a linear-time constant space worst-case with the Two-Way algorithm.

There are a bunch of benchmarks here, both on Zipf-distributed random strings and on a few c/rst/python/binary files in the cpython repo. They compare the current main branch to the PR:

    https://gist.github.com/sweeneyde/fc20ed5e72dc6b0f3b41c0abaf4ec3be

Summary of those results:

On the Zipf strings:
    * 83 cases slower (at most 1.59x slower)
    * 654 cases faster (up to 4.49x faster)
    * Geometric mean: 1.10x faster

On the "Real world" source files:
    * 76 cases slower (at most 2.41x slower)
    * 420 cases faster (up to 18.12x faster)
    * Geometric mean: 1.33x faster

History
Date	User	Action	Args
2021-07-12 03:57:27	Dennis Sweeney	set	recipients: + Dennis Sweeney, gvanrossum, tim.peters, rhettinger, gregory.p.smith, Carl.Friedrich.Bolz, taleinat, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Zeturic
2021-07-12 03:57:27	Dennis Sweeney	set	messageid: <1626062247.6.0.499668817634.issue41972@roundup.psfhosted.org>
2021-07-12 03:57:27	Dennis Sweeney	link	issue41972 messages
2021-07-12 03:57:27	Dennis Sweeney	create