Message 378590 - Python tracker

Author	tim.peters
Recipients	Dennis Sweeney, Zeturic, ammar2, corona10, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date	2020-10-14.01:06:05
Message-id	<1602637566.35.0.0761568422299.issue41972@roundup.psfhosted.org>

Content
Dennis, I'm delighted that the timing harness pointed out an actual glitch, and that it was so (seemingly) straightforward to identify the algorithmic cause. This gives me increased confidence that this project can be pushed to adoption, and your name will be hallowed in Python's history :-)

I have no problem with increasing the constant space. Fredrik was Python's Unicode pioneer too, so was naturally repelled by any scheme that required additional space proportional to the alphabet size. The "Bloom filter" is the tiny bit left of Daniel Sunday's algorithm, which actually has no such thing ;-) Along the lines you suggested, Sunday precomputes a vector indexed by characters, each entry giving the distance to that character's rightmost index in the needle. The "Bloom filter" throws that vector away and only saves hashes of the indices where the vector holds its maximum possible value.
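To make the contrast concrete, here is a minimal Python sketch of the two structures. The function names, the dict-based table, and the 64-bit mask width are illustrative assumptions, not the actual C code in CPython's string search. The first function builds the per-character vector Sunday precomputes; the second keeps only a one-bit-per-bucket summary, so a clear bit means "this character would have had the maximum shift".

```python
def sunday_shift_table(needle):
    """Per-character shifts in the style of Sunday's algorithm (illustrative).

    The value for a character is how far the window may slide when that
    character is the one just past the window's right edge. Characters
    absent from the table get the maximum shift, len(needle) + 1.
    """
    m = len(needle)
    # Later occurrences overwrite earlier ones, so the rightmost index wins.
    return {ch: m - i for i, ch in enumerate(needle)}


def bloom_mask(needle, bits=64):
    """Collapse the table to one bit per bucket (assumed width: 64 bits).

    `bits` must be a power of two so the low bits can be taken with a mask.
    """
    mask = 0
    for ch in needle:
        mask |= 1 << (ord(ch) & (bits - 1))
    return mask


def bloom_may_contain(mask, ch, bits=64):
    """False means ch is definitely not in the needle; True means maybe."""
    return bool(mask & (1 << (ord(ch) & (bits - 1))))
```

With the mask, a clear bit still licenses the full `len(needle) + 1` jump, but a set bit says nothing about how far to jump, and distinct characters that share their low bits (for example `'a'` and `chr(161)`) are indistinguishable.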

Note that there's nothing magical about "right to left" in Sunday's algorithm (or, really, in our current use of the Bloom filter). Characters can be compared in any order, and when there's a mismatch, the skip vector can be indexed by the character one beyond the search window to often find a decent amount to skip. Indeed, in apps where the expected frequency of characters is known, Sunday's algorithm is often adapted to compare the least-frequently expected needle character first.
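A small Python sketch of that idea, under stated assumptions: `sunday_find` and its `order` parameter are hypothetical names introduced here, and the code illustrates the technique rather than anything in CPython. The window is checked in whatever index order the caller supplies, and on a mismatch the shift is looked up by the character one beyond the window.

```python
def sunday_find(haystack, needle, order=None):
    """Return the lowest index of needle in haystack, or -1 (illustrative).

    `order` is the sequence of needle indices to compare, in the order to
    compare them. It must cover every index of the needle; the default is
    left to right.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    order = tuple(range(m)) if order is None else tuple(order)
    # Shift for the character just past the window: distance from its
    # rightmost occurrence in the needle to one past the needle's end.
    shift = {ch: m - i for i, ch in enumerate(needle)}
    i = 0
    while i + m <= n:
        if all(haystack[i + j] == needle[j] for j in order):
            return i
        if i + m == n:
            break  # no character beyond the window is left to index by
        # A character absent from the needle allows the maximum shift.
        i += shift.get(haystack[i + m], m + 1)
    return -1
```

Passing, say, `order=range(len(needle) - 1, -1, -1)` compares right to left, and an application that knows its character frequencies could put the index of the rarest needle character first. The result is the same either way; only the number of comparisons before a mismatch changes.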

The downside isn't really the space, but that stack space is uninitialized trash. Initializing it to a known value increases preprocessing overhead, albeit independent of needle length. So the relative overhead is higher the shorter the needle.

I agree expanding it beyond the tiny bit vector is likely to be significantly helpful. There's no discomfort at all to me if, e.g., it stored 32-bit counts and was indexed by the last 6 bits of the character. That's 64 entries of 4 bytes each, a measly 256 bytes in all.
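One way such a table could look, as a hedged Python sketch: the names `build_shift_table` and `find_with_table` are hypothetical, and reading the "counts" as Sunday-style shift distances is an assumption. Characters whose low 6 bits collide share a slot, so each slot must hold the smallest shift among them to stay safe.

```python
from array import array

TABLE_BITS = 6
TABLE_SIZE = 1 << TABLE_BITS   # 64 slots
TABLE_MASK = TABLE_SIZE - 1    # keeps the last 6 bits of a code point


def build_shift_table(needle):
    """Fixed-size shift table indexed by the low 6 bits of each character."""
    m = len(needle)
    # Every slot starts at the maximum shift. This fill is the fixed
    # preprocessing cost, independent of needle length. 'I' is usually a
    # 4-byte unsigned int, which gives 64 * 4 = 256 bytes.
    table = array("I", [m + 1]) * TABLE_SIZE
    for i, ch in enumerate(needle):
        # m - i shrinks as i grows, so overwriting leaves each slot with
        # the smallest shift of any needle character that hashes to it.
        table[ord(ch) & TABLE_MASK] = m - i
    return table


def find_with_table(haystack, needle):
    """Sunday-style search driven by the fixed-size table (illustrative)."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    table = build_shift_table(needle)
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        if i + m == n:
            break
        i += table[ord(haystack[i + m]) & TABLE_MASK]
    return -1
```

A collision can only make a shift smaller than the exact per-character table would allow, never larger, so matches are not skipped; the cost of a collision is a shorter jump.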

It's also possible that a more capable "Sunday-ish vector" of this kind would render the current `skip` trick usually useless by comparison. Or not ;-)
