This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author Dennis Sweeney
Recipients Dennis Sweeney, Zeturic, ammar2, josh.r, pmpp, serhiy.storchaka, tim.peters
Date 2020-10-08.09:09:01
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1602148141.7.0.959963365329.issue41972@roundup.psfhosted.org>
In-reply-to
Content
Adding some hasty printf-debugging to fastsearch.h, I see this:

    >>> with open('data.bin', 'rb') as f:
    ...     s = f.read()
    ...
    >>> base = 15403807 * b'\xff'
    >>> longer = base + b'\xff'
    >>> base in s
    
    Bloom has 0x00: 0
    Bloom has 0xff: -2147483648
    skip = 0
    Checking bloom filter for next char 0x00... not found in pattern. Skipping a bunch.
    Candidate at 15403808

    True

    >>> longer in s

    Bloom has 0x00: No
    Bloom has 0xff: Yes
    skip = 0
    Checking bloom filter for next char 0xff... found in pattern; increment by only one.
    Candidate at 1
    Candidate at 2
    Candidate at 3
    Candidate at 4
    Candidate at 5
    Candidate at 6
    Candidate at 7
    (...)

Trying the same base, I also get this:

    >>> with open('data.bin', 'rb') as f:
    ...     s = f.read()
    ...
    >>> base = 15403807 * b'\xff'
    >>> s1 = s[1:]
    >>> base in s1

    Bloom has 0x00: No
    Bloom has 0xff: Yes
    skip = 0
    Checking bloom filter for next char 0xff... found in pattern; increment by only one.
    Candidate at 1
    Candidate at 2
    Candidate at 3
    Candidate at 4
    Candidate at 5
    Candidate at 6
    Candidate at 7
    (...)

So maybe part of the issue is just that the algorithm is getting particularly
unlucky regarding how the strings line up to start with.

The fact that "skip" is 0 in these cases seems to just be a limitation of the algorithm:
we only skip to line up the second-to-last occurrence of the last character,
but with these strings, that's just the second-to-last character.

Computing the entire Boyer-Moore table would be better in cases like these, but
that would require dynamic memory allocation (length of pattern) which I think is why it is avoided.
History
Date User Action Args
2020-10-08 09:09:01Dennis Sweeneysetrecipients: + Dennis Sweeney, tim.peters, pmpp, serhiy.storchaka, josh.r, ammar2, Zeturic
2020-10-08 09:09:01Dennis Sweeneysetmessageid: <1602148141.7.0.959963365329.issue41972@roundup.psfhosted.org>
2020-10-08 09:09:01Dennis Sweeneylinkissue41972 messages
2020-10-08 09:09:01Dennis Sweeneycreate