Author Dennis Sweeney
Recipients Dennis Sweeney, Zeturic, ammar2, corona10, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date 2020-10-13.22:47:20
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1602629240.85.0.745692176318.issue41972@roundup.psfhosted.org>
In-reply-to
Content
Another algorithmic possibility: Instead of the bitset, we could have a stack-allocated

    uint8_t jump[32]; // maybe 64? Maybe uint16_t?

It would say this: If the last character lined up in the haystack is congruent to i mod (1 << 8), then jump ahead by (neede_len if jump[i]==255 else jump[i]), where jump[i] gives the distance between the end of the needle and the last occurrence in the needle of a character congruent to i.

Is this sort of constant-but-more-than-a-few-bytes stack-space an acceptable constant memory cost? If so, I believe that could be a big improvement.

There are also a bunch of little tweaks that could be done here: For example, should a hypothetical jump-table jump to line up the last character in the needle, or jump to line up the middle character in the needle? The max from two tables? Should we search for the last characters to be equal, or just make sure the last character is in the needle-bit-set (like in the PR)? Should there be a second bit-set for the right half of the string? Should there be a skip value computed for the last character as well as the middle character (middle character only in the PR)? etc. etc. I'll be sure to try some of these things.
History
Date User Action Args
2020-10-13 22:47:20Dennis Sweeneysetrecipients: + Dennis Sweeney, tim.peters, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Zeturic
2020-10-13 22:47:20Dennis Sweeneysetmessageid: <1602629240.85.0.745692176318.issue41972@roundup.psfhosted.org>
2020-10-13 22:47:20Dennis Sweeneylinkissue41972 messages
2020-10-13 22:47:20Dennis Sweeneycreate