classification
Title: Check for subsequence inside a sequence
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: abarry, ethan.furman, josh.r, r.david.murray, rhettinger, seblin, socketpair
Priority: normal Keywords: patch

Created on 2015-12-18 01:45 by seblin, last changed 2019-08-23 05:29 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
has_subsequence.patch seblin, 2015-12-18 01:45 review
subseq_index.patch seblin, 2015-12-20 16:09 review
Messages (10)
msg256633 - (view) Author: Sebastian Linke (seblin) Date: 2015-12-18 01:45
With the attached patch I propose to add a new function to the "collections" module. It is meant to be used for determining whether a given subsequence is part of a particular sequence (with respect to the ordering of the subsequence). 

Doing this in pure Python (using the CPython interpreter) is relatively slow. So I did the implementation in C.
msg256638 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2015-12-18 03:58
Reviewed the Python code (unlike what my email said, it doesn't LGTM; I'm tired and just forgot). I checked the C code for the obvious pitfalls (didn't spot any), but it will require someone else better than me to look at it.
msg256639 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-18 04:01
Really optimal search algorithm should be something like that:

https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
msg256641 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-12-18 06:14
A utility like this seems like it would belong in `itertools`, not `collections`. It should also ideally avoid fully realizing the sequence so it could work with iterators/generators as well; PySequence_Fast will force creation of a `list`/`tuple` of the whole sequence when in practice, a `deque` with a maxlen could be used to only maintain the necessary window into the "haystack".

It would also help to have a pure Python implementation (and until you have one, it's probably overkill to write the C accelerator) for other Python distributions, and to serve as a baseline for comparison to see if a C accelerator is justified.  Something like this might be a decent point of comparison:

def has_subsequence(it, searchseq, *, all=all, map=map, eq=operator.eq):
    searchseq = tuple(searchseq)
    if not searchseq:
        return True  # Empty sequence in everything
    window = collections.deque(itertools.islice(it, len(searchseq)-1), len(searchseq))
    for x in it:
        window.append(x)
        if all(map(eq, window, searchseq)):
            return True
    return False
msg256642 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-12-18 06:18
Aho Corasick doesn't seem likely to be useful here; it's good if the haystack is huge (or you have many haystacks to search) and you have many needles to look for (and the needles never change), but it pays a fairly steep setup cost; for a utility that searches for a single subsequence once, with no history, Aho-Corasick wouldn't help much.

A variant on Boyer-Moore (which involves less preprocessing work on the needle) might help, but I'm not sure the feature is compelling enough to warrant acceleration in the first place.
msg256672 - (view) Author: Sebastian Linke (seblin) Date: 2015-12-18 11:35
@Josh
What is the point of using iterators here? If the user wants an iterator to be checked then he can throw that iterator right away because it was exhausted by the function. I think that in most cases the user wants to do some postprocessing on a hairstack after being checked. This is why I restricted the arguments to be sequences.

Keeping that restriction in mind a pure Python implementation could look like this:

def has_subsequence(seq, subseq):
    if not subseq:
        return True
    if len(subseq) > len(seq):
        return False
    start = 0
    while True:
        try:
            start = seq.index(subseq[0], start)
        except ValueError:
            # seq.index() did not match
            return False
        stop = start + len(subseq)
        if seq[stop - 1] == subseq[-1] and seq[start:stop] == subseq:
            return True
        start += 1

Unfortunately, this version potentially creates lots of subseqences for checking equality with the `subseq` argument. I tried to minimise the occurrence of those cases by pre-checking the first and the last element of a candidate. Anyway, this seems to be faster for small needles compared to your `deque`-based suggestion.
msg256689 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-18 16:15
If I were trying to channel Raymond I'd suggest posting the python as a recipe and see if there is uptake.  However, I could be wrong and he might be interested.  (I can't say that I've ever needed this check myself.)

I'd be inclined to agree that it should only work on sequences and be in collections.  If you can come up with a real use case where you wouldn't need the realized haystack afterwards I'd be interested to hear it :)  Regardless, I don't see how it could be composed with other itertools, so it doesn't seem to belong there.

As long as one is finding the needle, one might as well return the index, though.  So subsequence_index instead of has_subsequence.  And then you'd want an optional start index, and maybe an rsubsequence_index :)  So, it could be a method of Sequence, but is it useful enough to be worth adding?  Do any other languages have such a function as part of their built in toolkit?

(Aside: you can avoid creating a subsequence by using a loop.  You'd have to test with a bunch of variant sizes to see which is more efficient, but I'm guessing it would be the loop.)
msg256698 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2015-12-18 18:31
+1 for sequences
+1 for subsequence_index instead of has_subsequence
+1 for returning None when not found  ;)
msg256770 - (view) Author: Sebastian Linke (seblin) Date: 2015-12-20 16:09
Slightly modified the implementation to return the index rather than a boolean.
msg350247 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-23 05:29
> If I were trying to channel Raymond I'd suggest posting the 
> python as a recipe and see if there is uptake.  However, I
>  could be wrong and he might be interested.  (I can't say 
> that I've ever needed this check myself.)

Yes, exactly :-)

Sebastian, thanks for the suggestion.

I am going to decline this for collections and itertools based on the relative infrequency of need (for us, it usually only comes up in the context of strings).

FWIW, there may be algorithmically more advanced techniques that might be worth looking into: knuth-morris-pratt, rabin-karp, etc.

Please do post this to the Python Package Index or someother place where people can get to it if the need arises.
History
Date User Action Args
2019-08-23 05:29:14rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg350247

stage: patch review -> resolved
2015-12-20 16:09:42seblinsetfiles: + subseq_index.patch

messages: + msg256770
2015-12-18 18:31:49ethan.furmansetnosy: + ethan.furman
messages: + msg256698
2015-12-18 16:15:48r.david.murraysetnosy: + r.david.murray
messages: + msg256689
2015-12-18 11:35:56seblinsetmessages: + msg256672
2015-12-18 06:18:39josh.rsetmessages: + msg256642
2015-12-18 06:14:52josh.rsetnosy: + josh.r
messages: + msg256641
2015-12-18 05:42:11serhiy.storchakasetassignee: rhettinger
2015-12-18 04:01:32socketpairsetnosy: + socketpair
messages: + msg256639
2015-12-18 03:58:16abarrysetnosy: + abarry

messages: + msg256638
stage: patch review
2015-12-18 01:56:07ethan.furmansetnosy: + rhettinger
2015-12-18 01:45:11seblincreate