Message378225
Adding some hasty printf-debugging to fastsearch.h, I see this:
>>> with open('data.bin', 'rb') as f:
... s = f.read()
...
>>> base = 15403807 * b'\xff'
>>> longer = base + b'\xff'
>>> base in s
Bloom has 0x00: 0
Bloom has 0xff: -2147483648
skip = 0
Checking bloom filter for next char 0x00... not found in pattern. Skipping a bunch.
Candidate at 15403808
True
>>> longer in s
Bloom has 0x00: No
Bloom has 0xff: Yes
skip = 0
Checking bloom filter for next char 0xff... found in pattern; increment by only one.
Candidate at 1
Candidate at 2
Candidate at 3
Candidate at 4
Candidate at 5
Candidate at 6
Candidate at 7
(...)
Trying the same base, I also get this:
>>> with open('data.bin', 'rb') as f:
... s = f.read()
...
>>> base = 15403807 * b'\xff'
>>> s1 = s[1:]
>>> base in s1
Bloom has 0x00: No
Bloom has 0xff: Yes
skip = 0
Checking bloom filter for next char 0xff... found in pattern; increment by only one.
Candidate at 1
Candidate at 2
Candidate at 3
Candidate at 4
Candidate at 5
Candidate at 6
Candidate at 7
(...)
So maybe part of the issue is just that the algorithm is getting particularly
unlucky regarding how the strings line up to start with.
The fact that "skip" is 0 in these cases seems to just be a limitation of the algorithm:
we only skip to line up the second-to-last occurrence of the last character,
but with these strings, that's just the second-to-last character.
Computing the entire Boyer-Moore table would be better in cases like these, but
that would require dynamic memory allocation (length of pattern) which I think is why it is avoided. |
|
Date |
User |
Action |
Args |
2020-10-08 09:09:01 | Dennis Sweeney | set | recipients:
+ Dennis Sweeney, tim.peters, pmpp, serhiy.storchaka, josh.r, ammar2, Zeturic |
2020-10-08 09:09:01 | Dennis Sweeney | set | messageid: <1602148141.7.0.959963365329.issue41972@roundup.psfhosted.org> |
2020-10-08 09:09:01 | Dennis Sweeney | link | issue41972 messages |
2020-10-08 09:09:01 | Dennis Sweeney | create | |
|