Issue 25898: Check for subsequence inside a sequence

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70086

classification

Title:	Check for subsequence inside a sequence
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.6

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	abarry, ethan.furman, josh.r, r.david.murray, rhettinger, seblin, socketpair
Priority:	normal	Keywords:	patch

Created on 2015-12-18 01:45 by seblin, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
has_subsequence.patch	seblin, 2015-12-18 01:45		review
subseq_index.patch	seblin, 2015-12-20 16:09		review

Messages (10)
msg256633 - (view)	Author: Sebastian Linke (seblin)	Date: 2015-12-18 01:45
With the attached patch I propose to add a new function to the "collections" module. It is meant to be used for determining whether a given subsequence is part of a particular sequence (with respect to the ordering of the subsequence). Doing this in pure Python (using the CPython interpreter) is relatively slow. So I did the implementation in C.
msg256638 - (view)	Author: Anilyka Barry (abarry) *	Date: 2015-12-18 03:58
Reviewed the Python code (unlike what my email said, it doesn't LGTM; I'm tired and just forgot). I checked the C code for the obvious pitfalls (didn't spot any), but it will require someone else better than me to look at it.
msg256639 - (view)	Author: Марк Коренберг (socketpair) *	Date: 2015-12-18 04:01
Really optimal search algorithm should be something like that: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
msg256641 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2015-12-18 06:14
A utility like this seems like it would belong in `itertools`, not `collections`. It should also ideally avoid fully realizing the sequence so it could work with iterators/generators as well; PySequence_Fast will force creation of a `list`/`tuple` of the whole sequence when in practice, a `deque` with a maxlen could be used to only maintain the necessary window into the "haystack". It would also help to have a pure Python implementation (and until you have one, it's probably overkill to write the C accelerator) for other Python distributions, and to serve as a baseline for comparison to see if a C accelerator is justified. Something like this might be a decent point of comparison: def has_subsequence(it, searchseq, *, all=all, map=map, eq=operator.eq): searchseq = tuple(searchseq) if not searchseq: return True # Empty sequence in everything window = collections.deque(itertools.islice(it, len(searchseq)-1), len(searchseq)) for x in it: window.append(x) if all(map(eq, window, searchseq)): return True return False
msg256642 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2015-12-18 06:18
Aho Corasick doesn't seem likely to be useful here; it's good if the haystack is huge (or you have many haystacks to search) and you have many needles to look for (and the needles never change), but it pays a fairly steep setup cost; for a utility that searches for a single subsequence once, with no history, Aho-Corasick wouldn't help much. A variant on Boyer-Moore (which involves less preprocessing work on the needle) might help, but I'm not sure the feature is compelling enough to warrant acceleration in the first place.
msg256672 - (view)	Author: Sebastian Linke (seblin)	Date: 2015-12-18 11:35
@Josh What is the point of using iterators here? If the user wants an iterator to be checked then he can throw that iterator right away because it was exhausted by the function. I think that in most cases the user wants to do some postprocessing on a hairstack after being checked. This is why I restricted the arguments to be sequences. Keeping that restriction in mind a pure Python implementation could look like this: def has_subsequence(seq, subseq): if not subseq: return True if len(subseq) > len(seq): return False start = 0 while True: try: start = seq.index(subseq[0], start) except ValueError: # seq.index() did not match return False stop = start + len(subseq) if seq[stop - 1] == subseq[-1] and seq[start:stop] == subseq: return True start += 1 Unfortunately, this version potentially creates lots of subseqences for checking equality with the `subseq` argument. I tried to minimise the occurrence of those cases by pre-checking the first and the last element of a candidate. Anyway, this seems to be faster for small needles compared to your `deque`-based suggestion.
msg256689 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-18 16:15
If I were trying to channel Raymond I'd suggest posting the python as a recipe and see if there is uptake. However, I could be wrong and he might be interested. (I can't say that I've ever needed this check myself.) I'd be inclined to agree that it should only work on sequences and be in collections. If you can come up with a real use case where you wouldn't need the realized haystack afterwards I'd be interested to hear it :) Regardless, I don't see how it could be composed with other itertools, so it doesn't seem to belong there. As long as one is finding the needle, one might as well return the index, though. So subsequence_index instead of has_subsequence. And then you'd want an optional start index, and maybe an rsubsequence_index :) So, it could be a method of Sequence, but is it useful enough to be worth adding? Do any other languages have such a function as part of their built in toolkit? (Aside: you can avoid creating a subsequence by using a loop. You'd have to test with a bunch of variant sizes to see which is more efficient, but I'm guessing it would be the loop.)
msg256698 - (view)	Author: Ethan Furman (ethan.furman) *	Date: 2015-12-18 18:31
+1 for sequences +1 for subsequence_index instead of has_subsequence +1 for returning None when not found ;)
msg256770 - (view)	Author: Sebastian Linke (seblin)	Date: 2015-12-20 16:09
Slightly modified the implementation to return the index rather than a boolean.
msg350247 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-23 05:29
> If I were trying to channel Raymond I'd suggest posting the > python as a recipe and see if there is uptake. However, I > could be wrong and he might be interested. (I can't say > that I've ever needed this check myself.) Yes, exactly :-) Sebastian, thanks for the suggestion. I am going to decline this for collections and itertools based on the relative infrequency of need (for us, it usually only comes up in the context of strings). FWIW, there may be algorithmically more advanced techniques that might be worth looking into: knuth-morris-pratt, rabin-karp, etc. Please do post this to the Python Package Index or someother place where people can get to it if the need arises.

History
Date	User	Action	Args
2022-04-11 14:58:25	admin	set	github: 70086
2019-08-23 05:29:14	rhettinger	set	status: open -> closed resolution: rejected messages: + msg350247 stage: patch review -> resolved
2015-12-20 16:09:42	seblin	set	files: + subseq_index.patch messages: + msg256770
2015-12-18 18:31:49	ethan.furman	set	nosy: + ethan.furman messages: + msg256698
2015-12-18 16:15:48	r.david.murray	set	nosy: + r.david.murray messages: + msg256689
2015-12-18 11:35:56	seblin	set	messages: + msg256672
2015-12-18 06:18:39	josh.r	set	messages: + msg256642
2015-12-18 06:14:52	josh.r	set	nosy: + josh.r messages: + msg256641
2015-12-18 05:42:11	serhiy.storchaka	set	assignee: rhettinger
2015-12-18 04:01:32	socketpair	set	nosy: + socketpair messages: + msg256639
2015-12-18 03:58:16	abarry	set	nosy: + abarry messages: + msg256638 stage: patch review
2015-12-18 01:56:07	ethan.furman	set	nosy: + rhettinger
2015-12-18 01:45:11	seblin	create