Author tim.peters
Recipients Dennis Sweeney, Zeturic, ammar2, corona10, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date 2020-10-14.01:06:05
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1602637566.35.0.0761568422299.issue41972@roundup.psfhosted.org>
In-reply-to
Content
Dennis, I'm delighted that the timing harness pointed out an actual glitch, and that it was so (seemingly) straightforward to identify the algorithmic cause. This gives me increased confidence that this project can be pushed to adoption, and your name will be hallowed in Python's history :-)

I have no problem with increasing the constant space. Fredrik was Python's Unicode pioneer too, so was naturally repelled by any scheme that required additional space proportional to the alphabet size. The "Bloom filter" is the tiny bit left of Daniel Sunday's algorithm, which actually has no such thing ;-) Along the lines you suggested, Sunday precomputes a vector indexed by characters, each entry giving the distance to that character's rightmost index in the needle.  The "Bloom filter" throws that vector away and only saves hashes of the indices where the vector holds its maximum possible value.

Note that there's nothing magical about "right to left" in Sunday's algorithm (or, really, in our current use of the Bloom filter). Characters can be compared in any order, and when there's a mismatch, the skip vector can be indexed by the character one beyond the search window to often find a decent amount to skip.  Indeed, in apps where the expected frequency of characters is known, Sunday's algorithm is often adapted to compare the least-frequently expected needle character first.

The downside isn't really the space, but that stack space is uninitialized trash. Initializing it to a known value increases preprocessing overhead, albeit independent of needle length. So the relative overhead is higher the shorter the needle.

I agree expanding it beyond the tiny bit vector is likely to be significantly helpful.  There's no discomfort at all to me if, e.g., it stored 32-bit counts and is indexed by the last 6 bits of the character.  That's a measly 256 bytes in all.

It's also possible that a more capable "Sunday-ish vector" of this kind would render the current `skip` trick usually useless by comparison. Or not ;-)
History
Date User Action Args
2020-10-14 01:06:06tim.peterssetrecipients: + tim.peters, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Dennis Sweeney, Zeturic
2020-10-14 01:06:06tim.peterssetmessageid: <1602637566.35.0.0761568422299.issue41972@roundup.psfhosted.org>
2020-10-14 01:06:06tim.peterslinkissue41972 messages
2020-10-14 01:06:05tim.peterscreate