Message 378842 - Python tracker

Author	gvanrossum
Recipients	Dennis Sweeney, Zeturic, ammar2, corona10, gregory.p.smith, gvanrossum, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date	2020-10-18.00:12:42
Message-id	<1602979963.33.0.943364507556.issue41972@roundup.psfhosted.org>
In-reply-to

Content
This may be irrelevant at this point, but trying to understand the original reproducer, I wanted to add my 1c worth.

It seems Dennis' reproducer.py is roughly this:

(I'm renaming BIG//3 to K to simplify the math.)

aaaaaBBBBBaaaaaBBBBBBBBBBBBBBB (haystack, length K*6)
BBBBBBBBBBBBBBB                (needle, length K*3)

The needle matches exactly once, at the end.  (Dennis uses BIG==10**6, which leaves a remainder of 1 after dividing by 3, but that turns out to be irrelevant -- it works with BIG==999999 as well.)
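For anyone who wants to poke at the shape of the input, here is a minimal sketch that builds it for a small K. The helper name build_reproducer is made up for illustration and is not taken from Dennis' reproducer.py; the layout simply follows the picture above.

```python
def build_reproducer(k):
    """Build the haystack/needle layout sketched above (hypothetical helper)."""
    # K a's, K B's, K a's, then a final run of 3*K B's.
    haystack = "a" * k + "B" * k + "a" * k + "B" * (3 * k)
    # The needle is as long as that final run, so it can only fit at the end.
    needle = "B" * (3 * k)
    return haystack, needle


haystack, needle = build_reproducer(5)
print(haystack)
print(needle)
# Index of the only match, and the number of matches.
print(haystack.find(needle), haystack.count(needle))  # 15 1

# BIG == 10**6 is not a multiple of 3; BIG == 999999 is.
print(10**6 % 3, 999999 % 3)  # 1 0
```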

The reproducer falls prey to the fact that it shifts the needle by 1 each time (for the reason Tim already explained).  At each position probed, the sequence of comparisons is (regardless of the bloom filter or skip size, and stopping at the first mismatch):

- last byte of needle
- first, second, third, etc. byte of needle

As long as the needle's first character corresponds to an 'a' (i.e., K times) this is just two comparisons until failure, but once it hits the first run of 'B's it does K+1 comparisons, then shifts by 1, does another K+1 comparisons, and so on, for a total of K times. That's K**2 + K, the source of the slowdown. Then come K more quick misses, followed by the final success.
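The counting argument above can be checked with a toy model. This is only a sketch under stated assumptions: it always shifts by one, ignores the bloom filter and skip entirely, and compares the last byte of the needle first and then the rest left to right. It is not CPython's fastsearch code, and count_comparisons is a name invented here.

```python
def count_comparisons(haystack, needle):
    """Return (index, comparisons) for a shift-by-one search that checks
    the needle's last character first, then the others left to right.

    Toy model only: no bloom filter, no skip table.
    """
    n, m = len(haystack), len(needle)
    comparisons = 0
    for i in range(n - m + 1):
        # Probe the last character of the needle first.
        comparisons += 1
        if haystack[i + m - 1] != needle[m - 1]:
            continue
        # Then the first, second, third, ... characters, up to the first mismatch.
        for j in range(m - 1):
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
        else:
            return i, comparisons
    return -1, comparisons


for k in (100, 200, 400):
    haystack = "a" * k + "B" * k + "a" * k + "B" * (3 * k)
    needle = "B" * (3 * k)
    print(k, count_comparisons(haystack, needle))
```

In this model the run of matching B's gets one shorter with every shift, so the total comes out nearer K**2/2 than the rough K**2 + K estimate. Doubling K still roughly quadruples the count, which is the quadratic behaviour that matters.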

(Do we know how the OP found this reproducer? The specific length of their needle seems irrelevant, and I don't dare look in their data file.)

Anyway, thinking about this, for the current (unpatched) code, here's a somewhat simpler reproducer along the same lines:

BBBBBaaaaaBBBBB (haystack, length K*3)
BBBBBBBBBB      (needle, length K*2)

This immediately starts doing K sets of K+1 comparisons, i.e. K**2 + K again, followed by failure.
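A rough way to try the simpler reproducer against a real interpreter is to time str.find for a few values of K. This is a sketch, with helper names (simple_reproducer, time_find) invented for illustration. The timings depend on the machine and on the Python build: quadratic growth should only be expected from the unpatched search code, not from a build that already includes a fix.

```python
import timeit


def simple_reproducer(k):
    """Build the simpler haystack/needle pair (hypothetical helper)."""
    haystack = "B" * k + "a" * k + "B" * k
    needle = "B" * (2 * k)  # longer than any run of B's, so it never matches
    return haystack, needle


def time_find(k, repeat=3):
    """Best-of-`repeat` wall time, in seconds, for one haystack.find(needle)."""
    haystack, needle = simple_reproducer(k)
    timings = timeit.repeat(lambda: haystack.find(needle), number=1, repeat=repeat)
    return min(timings)


if __name__ == "__main__":
    for k in (5_000, 10_000, 20_000):
        # If the search is quadratic, each doubling of K costs about 4x the time.
        print(k, time_find(k))
```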


I am confident this has no relevance to the Two-Way algorithm.
