Issue 35079: difflib.SequenceMatcher.get_matching_blocks omits common strings

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79260

classification

Title:	difflib.SequenceMatcher.get_matching_blocks omits common strings
Type:	behavior	Stage:	resolved
Components:	Documentation, Library (Lib)	Versions:	Python 3.8, Python 3.7, Python 3.6, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	terry.reedy	Nosy List:	Springem Springsbee, miss-islington, terry.reedy, tim.peters, xtreak
Priority:	normal	Keywords:	patch

Created on 2018-10-26 19:00 by Springem Springsbee, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 10144	merged	terry.reedy, 2018-10-27 02:56
PR 10145	merged	miss-islington, 2018-10-27 03:03
PR 10146	merged	miss-islington, 2018-10-27 03:03
PR 10147	merged	miss-islington, 2018-10-27 03:04

Messages (9)
msg328593 - (view)	Author: Springem Springsbee (Springem Springsbee)	Date: 2018-10-26 19:00
Hello, I'm using difflib's SequenceMatcher to locate common substrings. It seems like the matcher is missing a common substrings. I'm guessing this is a rather low-level issue in difflib. The autojunk parameter has no effect for obvious reasons. Alternate pairwise comparisons between the following 3 strings omit the 2-character match 'AC' GATTACA TAGACCA ATACA The following Github gist captures the issue, which I'll repeat here for redundancy https://gist.github.com/MatthewRalston/b0ab6ac1dbe322cb12063310ccdbb786 >import difflib >string1 = "TAGACCA" >string2 = "ATACA" >s = difflib.SequenceMatcher(None, string1, string2) >blox = s.get_matching_blocks() >print(blox) [Match(a=0, b=1, size=2), Match(a=5, b=3, size=2), Match(a=7, b=5, size=0)] # Missing Match(a=3, b=2, size=2) >print([string1[x.a:x.a+x.size] for x in blox if x.size > 1]) ['TA', 'CA'] # Missing the substring 'CA'
msg328595 - (view)	Author: Tim Peters (tim.peters) *	Date: 2018-10-26 19:19
Sorry, I find this too hard to follow. At the end: > ['TA', 'CA'] # Missing the substring 'CA' the comment claims it's missing 'CA', while the output plainly shows it's _not_ missing 'CA'. If your complaint is that's missing 'AC', yes, it is. It's not intended to find _overlapping_ matches at all (read the docs). The longest matching blocks in TAGACCA ATACA are in fact TA, CA, and AC, but after the leftmost-longest TA is matched first, AC no longer exists _to_ match in the second string. Only CA does.
msg328602 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-10-26 20:25
We can assume that "substring 'CA'" was meant to be "substring 'AC'", but as explained, missing 'AC' is not a bug. (Tim wrote the module.) I read the doc, and 'non-overlapping' is implied in the SequenceMatcher entry at the top of the file. "The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; ... The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence." However, a user of SequenceMatcher could easily miss that, and its implication, as Springem did. For clarity, I think we should add 'non-overlapping to the first line of the .get_matching_blocks entry, which is in the middle of the page. "Return list of triples describing non-overlapping matching subsequences." I also think "i+n != i' or j+n != j'" should be changed to "i+n < i' or j+n < j'" as '>' would mean overlapping. So != must mean <. I will prepare a doc PR later.
msg328606 - (view)	Author: Tim Peters (tim.peters) *	Date: 2018-10-26 21:08
I don't object to spelling it out more, and your (Terry's) suggestions are fine. On the other hand, this module has been around for a loooong time now, and this is the first instance I've seen of someone surprised that it doesn't produce overlapping matches - it's obvious from "The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence" that a matching subsequence is wholly eliminated from further consideration. At some point of ever-more tedious elaboration, the docs risk missing the forest for the trees. I don't think _these_ docs are quite there yet - although the docs for `find_longest_match()` are. Speaking of which, that method _could_ be used to find overlapping matches, one at a time, by passing appropriate slice indices. Which can be horridly inefficient; e.g., find all overlapping matches between 'A' * 100 and 'A' * 150
msg328627 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-10-27 03:00
Tim, I share your concern about bloating docs, but think this one word worthwhile. I suspect that people are conditioned to accept 'non-overlapping' because that is ofter (usually?) the default for linear searches and regex matching.
msg328628 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-10-27 03:03
New changeset d9bff4e81b8ca36fe6c4e90c0b9cf02bc020e713 by Terry Jan Reedy in branch 'master': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/d9bff4e81b8ca36fe6c4e90c0b9cf02bc020e713
msg328629 - (view)	Author: miss-islington (miss-islington)	Date: 2018-10-27 03:07
New changeset cb920c1442bf3b0899fee51915b4bf6614f2c1d7 by Miss Islington (bot) in branch '3.7': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/cb920c1442bf3b0899fee51915b4bf6614f2c1d7
msg328630 - (view)	Author: miss-islington (miss-islington)	Date: 2018-10-27 03:09
New changeset 5282125650a70811f0d64ab231e2a6c8ac20c96b by Miss Islington (bot) in branch '3.6': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/5282125650a70811f0d64ab231e2a6c8ac20c96b
msg328631 - (view)	Author: miss-islington (miss-islington)	Date: 2018-10-27 03:09
New changeset e389de8e3e897fa5ba4f71b0f711355fb7158049 by Miss Islington (bot) in branch '2.7': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/e389de8e3e897fa5ba4f71b0f711355fb7158049

History
Date	User	Action	Args
2022-04-11 14:59:07	admin	set	github: 79260
2018-10-27 03:23:10	terry.reedy	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2018-10-27 03:09:14	miss-islington	set	messages: + msg328631
2018-10-27 03:09:00	miss-islington	set	messages: + msg328630
2018-10-27 03:07:46	miss-islington	set	nosy: + miss-islington messages: + msg328629
2018-10-27 03:04:04	miss-islington	set	pull_requests: + pull_request9474
2018-10-27 03:03:52	miss-islington	set	pull_requests: + pull_request9473
2018-10-27 03:03:39	miss-islington	set	stage: needs patch -> patch review pull_requests: + pull_request9472
2018-10-27 03:03:14	terry.reedy	set	messages: + msg328628
2018-10-27 03:00:58	terry.reedy	set	messages: + msg328627 stage: patch review -> needs patch
2018-10-27 02:56:22	terry.reedy	set	keywords: + patch stage: needs patch -> patch review pull_requests: + pull_request9471
2018-10-26 21:08:46	tim.peters	set	messages: + msg328606
2018-10-26 20:25:26	terry.reedy	set	versions: + Python 3.8, - Python 3.4, Python 3.5 messages: + msg328602 assignee: terry.reedy components: + Documentation stage: needs patch
2018-10-26 19:22:00	xtreak	set	nosy: + xtreak
2018-10-26 19:19:12	tim.peters	set	nosy: + tim.peters messages: + msg328595
2018-10-26 19:00:41	Springem Springsbee	create