Message 378583 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Dennis Sweeney
Recipients	Dennis Sweeney, Zeturic, ammar2, corona10, josh.r, pmpp, serhiy.storchaka, tim.peters, vstinner
Date	2020-10-13.22:47:20
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1602629240.85.0.745692176318.issue41972@roundup.psfhosted.org>
In-reply-to

Content
Another algorithmic possibility: Instead of the bitset, we could have a stack-allocated uint8_t jump[32]; // maybe 64? Maybe uint16_t? It would say this: If the last character lined up in the haystack is congruent to i mod (1 << 8), then jump ahead by (neede_len if jump[i]==255 else jump[i]), where jump[i] gives the distance between the end of the needle and the last occurrence in the needle of a character congruent to i. Is this sort of constant-but-more-than-a-few-bytes stack-space an acceptable constant memory cost? If so, I believe that could be a big improvement. There are also a bunch of little tweaks that could be done here: For example, should a hypothetical jump-table jump to line up the last character in the needle, or jump to line up the middle character in the needle? The max from two tables? Should we search for the last characters to be equal, or just make sure the last character is in the needle-bit-set (like in the PR)? Should there be a second bit-set for the right half of the string? Should there be a skip value computed for the last character as well as the middle character (middle character only in the PR)? etc. etc. I'll be sure to try some of these things.

Another algorithmic possibility: Instead of the bitset, we could have a stack-allocated

    uint8_t jump[32]; // maybe 64? Maybe uint16_t?

It would say this: If the last character lined up in the haystack is congruent to i mod (1 << 8), then jump ahead by (neede_len if jump[i]==255 else jump[i]), where jump[i] gives the distance between the end of the needle and the last occurrence in the needle of a character congruent to i.

Is this sort of constant-but-more-than-a-few-bytes stack-space an acceptable constant memory cost? If so, I believe that could be a big improvement.

There are also a bunch of little tweaks that could be done here: For example, should a hypothetical jump-table jump to line up the last character in the needle, or jump to line up the middle character in the needle? The max from two tables? Should we search for the last characters to be equal, or just make sure the last character is in the needle-bit-set (like in the PR)? Should there be a second bit-set for the right half of the string? Should there be a skip value computed for the last character as well as the middle character (middle character only in the PR)? etc. etc. I'll be sure to try some of these things.

History
Date	User	Action	Args
2020-10-13 22:47:20	Dennis Sweeney	set	recipients: + Dennis Sweeney, tim.peters, vstinner, pmpp, serhiy.storchaka, josh.r, ammar2, corona10, Zeturic
2020-10-13 22:47:20	Dennis Sweeney	set	messageid: <1602629240.85.0.745692176318.issue41972@roundup.psfhosted.org>
2020-10-13 22:47:20	Dennis Sweeney	link	issue41972 messages
2020-10-13 22:47:20	Dennis Sweeney	create