This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author tim.peters
Recipients Dennis Sweeney, Zeturic, ammar2, corona10, gregory.p.smith, gvanrossum, josh.r, pmpp, serhiy.storchaka, taleinat, tim.peters, vstinner
Date 2020-11-06.21:28:40
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1604698120.79.0.131756045932.issue41972@roundup.psfhosted.org>
In-reply-to
Content
I'm sorry I haven't been able to give more time to this. I love what's been done, but am just overwhelmed :-(

The main thing here is to end quadratic-time disasters, without doing significant damage in other cases. Toward that end it would be fine to use "very high" cutoffs, and save tuning for later. It's clear that we _can_ get significant benefit even in many non-disastrous cases, but unclear exactly how to exploit that. But marking the issue report as "fixed" doesn't require resolving that - and I want to see this improvement in the next Python release.

An idea that hasn't been tried: in the world of sorting, quicksort is usually the fastest method - but it's prone to quadratic-time disasters. OTOH, heapsort never suffers disasters, but is almost always significantly slower in ordinary cases. So the more-or-less straightforward "introsort" compromises, by starting with a plain quicksort, but with "introspection" code added to measure how quickly it's making progress. If quicksort appears to be thrashing, it switches to heapsort.

It's possible an "introsearch" would work well too. Start with pure brute force (not even with the current preprocessing), and switch to something else if it's making slow progress. That's easy to measure: compare the number of character comparisons we've made so far to how many character positions we've advanced the search window so far. If that ratio exceeds (say) 3, we're heading out of linear-time territory, so should switch to a different approach.

With no preprocessing at all, that's likely to be faster than even the current code for very short needles, and in all cases where the needle is found at the start of the haystack.

This context is trickier, though, in that the current code, and pure C&P, can also enjoy sub-linear time cases. It's always something ;-)
History
Date User Action Args
2020-11-06 21:28:40tim.peterssetrecipients: + tim.peters, gvanrossum, gregory.p.smith, vstinner, taleinat, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Dennis Sweeney, Zeturic
2020-11-06 21:28:40tim.peterssetmessageid: <1604698120.79.0.131756045932.issue41972@roundup.psfhosted.org>
2020-11-06 21:28:40tim.peterslinkissue41972 messages
2020-11-06 21:28:40tim.peterscreate