Message 378423 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Dennis Sweeney
Recipients	Dennis Sweeney, Zeturic, ammar2, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date	2020-10-11.06:51:32
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1602399093.18.0.0798662657201.issue41972@roundup.psfhosted.org>
In-reply-to

Content
> Offhand do you know what the _best_ timing for two-way search is in a pattern-not-found case? Looking at the glibc implementation, in the top-level "else" clause (for when the string isn't completely periodic), we have: period = MAX (suffix, needle_len - suffix) + 1; so period > needle_len / 2. Then during each iteration of the j-loop (where j is the index into the haystack), j is sometimes incremented by `period`, which would probably give O(n/m) best case. Here's a sublinear example for the two-way algorithm: N = 10*7 haystack = "ABC" N needle1 = "ABC" * (N // 10) + "DDD" needle2 = "ABC" * 10 + "DDD" === Results === Pure Python Two-Way, shorter needle: 1.7142754 Pure Python Two-Way. Longer needle: 0.0018845000000000667 Python builtin, shorter needle: 0.031071400000000082 Python builtin, longer needle: 0.030566099999999707 This case is surprisingly better than the builtin: N = 10*7 haystack = "ABC" N needle1 = "DDD" + "ABC" * (N // 10) needle2 = "DDD" + "ABC" * 10 === Results === Pure Python Two-Way, shorter needle: 0.0020895000000000774 Pure Python Two-Way. Longer needle: 0.0017878999999999534 Python builtin, shorter needle: 0.029998100000000028 Python builtin, longer needle: 0.02963439999999995 This was measured using the attached fastsearch.py. There are some manipulations in there like string slicing that would make more sense as pointer math in C, so especially accounting for that, I'm guessing it would be pretty competitive in a lot of cases. A lot of the implementations look like they use a complete Boyer-Moore table to go sublinear in even more cases, which it seems we want to avoid. I'm not certain, but it's conceivable that the existing techniques of using just one one delta-value or the "Bloom filter" could be thrown in there to get some of the same improvements. Although that could be redundant -- I'm not sure.

> Offhand do you know what the _best_ timing for two-way search is in a pattern-not-found case?

Looking at the glibc implementation, in the top-level "else" clause
(for when the string isn't completely periodic),
we have:

    period = MAX (suffix, needle_len - suffix) + 1;

so period > needle_len / 2.

Then during each iteration of the j-loop (where j is the index into the haystack),
j is sometimes incremented by `period`, which would probably give O(n/m) best case.

Here's a sublinear example for the two-way algorithm:

    N = 10**7
    haystack = "ABC" * N
    needle1 = "ABC" * (N // 10) + "DDD"
    needle2 = "ABC" * 10 + "DDD"
    
    === Results ===
    Pure Python Two-Way, shorter needle:     1.7142754
    Pure Python Two-Way. Longer needle:      0.0018845000000000667
    Python builtin, shorter needle:          0.031071400000000082
    Python builtin, longer needle:           0.030566099999999707

This case is surprisingly better than the builtin:

    N = 10**7
    haystack = "ABC" * N
    needle1 = "DDD" + "ABC" * (N // 10)
    needle2 = "DDD" + "ABC" * 10

    === Results ===
    Pure Python Two-Way, shorter needle:     0.0020895000000000774
    Pure Python Two-Way. Longer needle:      0.0017878999999999534
    Python builtin, shorter needle:          0.029998100000000028
    Python builtin, longer needle:           0.02963439999999995

This was measured using the attached fastsearch.py. There are some manipulations in there like string slicing that would make more sense as pointer math in C, so especially accounting for that, I'm guessing it would be pretty competitive in a lot of cases.

A lot of the implementations look like they use a complete Boyer-Moore table to go sublinear in even more cases, which it seems we want to avoid. I'm not certain, but it's conceivable that the existing techniques of using just one one delta-value or the "Bloom filter" could be thrown in there to get some of the same improvements. Although that could be redundant -- I'm not sure.

History
Date	User	Action	Args
2020-10-11 06:51:33	Dennis Sweeney	set	recipients: + Dennis Sweeney, tim.peters, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, Zeturic
2020-10-11 06:51:33	Dennis Sweeney	set	messageid: <1602399093.18.0.0798662657201.issue41972@roundup.psfhosted.org>
2020-10-11 06:51:33	Dennis Sweeney	link	issue41972 messages
2020-10-11 06:51:33	Dennis Sweeney	create