classification
Title: difflib.SequenceMatcher.get_matching_blocks omits common strings
Type: behavior Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: terry.reedy Nosy List: Springem Springsbee, miss-islington, terry.reedy, tim.peters, xtreak
Priority: normal Keywords: patch

Created on 2018-10-26 19:00 by Springem Springsbee, last changed 2018-10-27 03:23 by terry.reedy. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 10144 merged terry.reedy, 2018-10-27 02:56
PR 10145 merged miss-islington, 2018-10-27 03:03
PR 10146 merged miss-islington, 2018-10-27 03:03
PR 10147 merged miss-islington, 2018-10-27 03:04
Messages (9)
msg328593 - (view) Author: Springem Springsbee (Springem Springsbee) Date: 2018-10-26 19:00
Hello, I'm using difflib's SequenceMatcher to locate common substrings. It seems like the matcher is missing a common substrings. I'm guessing this is a rather low-level issue in difflib. The autojunk parameter has no effect for obvious reasons. Alternate pairwise comparisons between the following 3 strings omit the 2-character match 'AC'

GATTACA
TAGACCA
ATACA

The following Github gist captures the issue, which I'll repeat here for redundancy https://gist.github.com/MatthewRalston/b0ab6ac1dbe322cb12063310ccdbb786

>import difflib
>string1 = "TAGACCA"
>string2 = "ATACA"
>s = difflib.SequenceMatcher(None, string1, string2)
>blox = s.get_matching_blocks()
>print(blox)
[Match(a=0, b=1, size=2), Match(a=5, b=3, size=2), Match(a=7, b=5, size=0)] # Missing Match(a=3, b=2, size=2)
>print([string1[x.a:x.a+x.size] for x in blox if x.size > 1])
['TA', 'CA'] # Missing the substring 'CA'
msg328595 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-26 19:19
Sorry, I find this too hard to follow.  At the end:

> ['TA', 'CA'] # Missing the substring 'CA'

the comment claims it's missing 'CA', while the output plainly shows it's _not_ missing 'CA'.

If your complaint is that's missing 'AC', yes, it is.  It's not intended to find _overlapping_ matches at all (read the docs).  The longest matching blocks in

TAGACCA
ATACA

are in fact TA, CA, and AC, but after the leftmost-longest TA is matched first, AC no longer exists _to_ match in the second string.  Only CA does.
msg328602 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-10-26 20:25
We can assume that "substring 'CA'" was meant to be "substring 'AC'", but as explained, missing 'AC' is not a bug.  (Tim wrote the module.)

I read the doc, and 'non-overlapping' is implied in the SequenceMatcher entry at the top of the file.

"The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; ... The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence."

However, a user of SequenceMatcher could easily miss that, and its implication, as Springem did.  For clarity, I think we should add 'non-overlapping to the first line of the .get_matching_blocks entry, which is in the middle of the page. "Return list of triples describing non-overlapping matching subsequences."

I also think "i+n != i' or j+n != j'" should be changed to "i+n < i' or j+n < j'" as '>' would mean overlapping.  So != must mean <.

I will prepare a doc PR later.
msg328606 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-26 21:08
I don't object to spelling it out more, and your (Terry's) suggestions are fine.  On the other hand, this module has been around for a loooong time now, and this is the first instance I've seen of someone surprised that it doesn't produce overlapping matches - it's obvious from "The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence" that a matching subsequence is wholly eliminated from further consideration.

At some point of ever-more tedious elaboration, the docs risk missing the forest for the trees.  I don't think _these_ docs are quite there yet - although the docs for `find_longest_match()` are.  Speaking of which, that method _could_ be used to find overlapping matches, one at a time, by passing appropriate slice indices.

Which can be horridly inefficient; e.g., find all overlapping matches between

'A' * 100

and

'A' * 150
msg328627 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-10-27 03:00
Tim, I share your concern about bloating docs, but think this one word worthwhile.  I suspect that people are conditioned to accept 'non-overlapping' because that is ofter (usually?) the default for linear searches and regex matching.
msg328628 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-10-27 03:03
New changeset d9bff4e81b8ca36fe6c4e90c0b9cf02bc020e713 by Terry Jan Reedy in branch 'master':
 bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
https://github.com/python/cpython/commit/d9bff4e81b8ca36fe6c4e90c0b9cf02bc020e713
msg328629 - (view) Author: miss-islington (miss-islington) Date: 2018-10-27 03:07
New changeset cb920c1442bf3b0899fee51915b4bf6614f2c1d7 by Miss Islington (bot) in branch '3.7':
bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
https://github.com/python/cpython/commit/cb920c1442bf3b0899fee51915b4bf6614f2c1d7
msg328630 - (view) Author: miss-islington (miss-islington) Date: 2018-10-27 03:09
New changeset 5282125650a70811f0d64ab231e2a6c8ac20c96b by Miss Islington (bot) in branch '3.6':
bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
https://github.com/python/cpython/commit/5282125650a70811f0d64ab231e2a6c8ac20c96b
msg328631 - (view) Author: miss-islington (miss-islington) Date: 2018-10-27 03:09
New changeset e389de8e3e897fa5ba4f71b0f711355fb7158049 by Miss Islington (bot) in branch '2.7':
bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
https://github.com/python/cpython/commit/e389de8e3e897fa5ba4f71b0f711355fb7158049
History
Date User Action Args
2018-10-27 03:23:10terry.reedysetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-10-27 03:09:14miss-islingtonsetmessages: + msg328631
2018-10-27 03:09:00miss-islingtonsetmessages: + msg328630
2018-10-27 03:07:46miss-islingtonsetnosy: + miss-islington
messages: + msg328629
2018-10-27 03:04:04miss-islingtonsetpull_requests: + pull_request9474
2018-10-27 03:03:52miss-islingtonsetpull_requests: + pull_request9473
2018-10-27 03:03:39miss-islingtonsetstage: needs patch -> patch review
pull_requests: + pull_request9472
2018-10-27 03:03:14terry.reedysetmessages: + msg328628
2018-10-27 03:00:58terry.reedysetmessages: + msg328627
stage: patch review -> needs patch
2018-10-27 02:56:22terry.reedysetkeywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request9471
2018-10-26 21:08:46tim.peterssetmessages: + msg328606
2018-10-26 20:25:26terry.reedysetversions: + Python 3.8, - Python 3.4, Python 3.5
messages: + msg328602

assignee: terry.reedy
components: + Documentation
stage: needs patch
2018-10-26 19:22:00xtreaksetnosy: + xtreak
2018-10-26 19:19:12tim.peterssetnosy: + tim.peters
messages: + msg328595
2018-10-26 19:00:41Springem Springsbeecreate