classification
Title: SequenceMatcher bug with insert/delete block after "replace"
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Peter.Waller, chipx86, collinwinter, eli.bendersky, ggenellina, loewis, r.david.murray, rhettinger, terry.reedy, tfaing, tim.peters
Priority: normal Keywords: patch

Created on 2007-05-03 10:24 by chipx86, last changed 2011-06-15 18:35 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
test-sequencematcher.py chipx86, 2007-05-03 10:24 Testcase
difflib_patch.patch Peter.Waller, 2010-02-28 13:57 Patch against 78514 to fix Issue1711800
Messages (10)
msg31938 - (view) Author: Christian Hammond (chipx86) Date: 2007-05-03 10:24
difflib.SequenceMatcher fails to distinguish between a "replace" block and an "insert" or "delete" block when the "insert/delete" immediately follows a "replace". It will lump both changes together as one big "replace" block.

This happens due to how get_opcodes() works. get_opcodes() loops through the matching blocks, grouping them into tags and ranges. However, if a block of text is changed and then new text is immediately added, it can't see this. All it knows is that the next matching block is after the added text.

As an example, consider these strings:

"ABC"

"ABCD
EFG."

Any diffing program will show that the first line was replaced and the second was inserted. SequenceMatcher, however, just shows that there was one replace, and includes both lines in the range.

I've attached a testcase that reproduces this for both replace>insert and replace>delete blocks.
msg31939 - (view) Author: Gabriel Genellina (ggenellina) Date: 2007-05-07 05:40
Maybe you are more interested on a Differ object. These are the results from your example using '\n'.join(difflib.Differ().compare(a,b))

- This is my old file, containing only one line.
+ This is my new file.
+ It contains multiple lines.
+ SequenceMatcher should see two blocks as a result.

and 

+ This is my new file, containing only one line.
- This is my old file.
- It contains multiple lines.
- SequenceMatcher should see two blocks as a result.
msg31940 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2007-06-05 23:40
Thanks for the test case! Is there any chance you could also provide a patch to fix it?
msg76192 - (view) Author: Andrew Inglis (tfaing) Date: 2008-11-21 16:59
Are there any updates for this issue? I am having the same problem, or,
lack of feature, with SequenceMatcher. Gabriel, Differ doesn't seem to
have the functionality, or an easy way to patch it to get the
functionality, that Christian is talking about. Christian, what are
these "Any diffing program"[s] that you speak of that give the same
output of SequenceMatcher with this issue fixed? Any solutions you found
for python?

Thanks!
msg77519 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-12-10 09:10
Without a patch, this is not a candidate for 2.5.3.
msg77529 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-12-10 10:13
Tim, any comments or interest?
msg100203 - (view) Author: Peter Waller (Peter.Waller) Date: 2010-02-28 13:57
Here is a patch against 2.6.4.

A brief description of how it works:

If we're emitting a "replace" with unequal lengths, then instead emit a replace followed by an insert or delete to make up for the difference in length.

My code to see that it works:

a = "aa bbb cc D"
b = "aa czbb cc E"

sm = difflib.SequenceMatcher(None, a, b).get_opcodes()

print "A = ", a
print "B = ", b

for o in opcodes:
    tag, i1, i2, j1, j2 = o
    print "%30r %-20s %-20s" % (o, a[i1:i2], b[j1:j2])
msg138380 - (view) Author: Peter Waller (Peter.Waller) Date: 2011-06-15 15:07
Apologies for the bump, but it has been more than a year and I did attach a patch! :-)

What next?
msg138383 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-15 15:45
No need to apologize for the bump.  The trick is catching the interest of someone who feels qualified to judge the patch.  I've added a couple people to nosy who worked on difflib recently.  If no one speaks up in the next few days, it might be time to post to python-dev.
msg138387 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-06-15 18:34
I believe this issue should be closed and have set it to pending.

The original report of a 'bug' and the two 'testcases' were and are invalid as they are based on an incorrect understanding of SequenceMatcher. It is not a diff program and in particular not a line diff program but the function used internally by such. It works on sequences of characters, lines, numbers, or anything else. Given the character strings 'abc' and 'abcd\def' it correctly reports that the second is a copy of 3 chars from the first plus insertion of 5 more.

Gabriel correctly suggested the above in suggesting that if one wants to compare sequences of text lines, one might use Differ. One could also use SequenceMatcher directly, but this loses the diff-like formatting and report of within-line differences. I think this issue should have been closed then.

I do not know what functionality Andrew thinks Christian was talking about. Using Differ with

a = ['abc\n']
b = ['abcd\n', 'def\n']
for line in difflib.Differ().compare(a,b): print(line, end='')

# prints
- abc
+ abcd
?    +
+ def

One line is replaced with two, with the extra info that the first new line is the old line with an extra char. I do not believe that 'any diffing program' will report the latter. The '?' lines are easily filtered out if not wanted.

The patch by Peter has no motivation that I can see other than the idea that replacing a subsequence with one of a different length is somehow bad. Tim Peters did not think so and neither do I -- or Guido. Unequal replacement is built into the syntax of Python:

>>> s = [1,2,3]
>>> s[1:2] = [4,5,6]
>>> s
[1, 4, 5, 6, 3]

I would not be surprised it the proposed change broke some existing application or degraded performance a bit.
History
Date User Action Args
2011-06-15 18:35:18rhettingersetstatus: pending -> closed
2011-06-15 18:34:15terry.reedysetstatus: open -> pending
assignee: tim.peters ->
resolution: not a bug
messages: + msg138387
2011-06-15 15:45:18r.david.murraysetnosy: + terry.reedy, eli.bendersky, r.david.murray
messages: + msg138383
2011-06-15 15:07:05Peter.Wallersetmessages: + msg138380
2010-02-28 13:57:21Peter.Wallersetfiles: + difflib_patch.patch

nosy: + Peter.Waller
messages: + msg100203

keywords: + patch
2008-12-10 10:13:44rhettingersetassignee: tim.peters
messages: + msg77529
nosy: + rhettinger, tim.peters
2008-12-10 09:10:39loewissetnosy: + loewis
messages: + msg77519
versions: + Python 2.7, - Python 2.5.3
2008-11-21 16:59:29tfaingsetnosy: + tfaing
type: enhancement
messages: + msg76192
versions: + Python 2.5.3, - Python 2.6
2007-05-03 10:24:47chipx86create