Author vbr
Recipients georg.brandl, gjb1002, hagna, mrotondo, pitrou, r.david.murray, tim.peters, vbr
Date 2010-04-19.23:25:04
SpamBayes Score 0.0
Marked as misclassified No
Message-id <1271719509.15.0.934934599383.issue2986@psf.upfronthosting.co.za>
In-reply-to
Content
I just stumbled on some seemingly different unexpected behaviour of
difflib.SequenceMatcher, but it turns out, it may have the same cause, i.e. the "popular" heuristics.
I hopefully managed to replicate it on an illustrative sample text - in as included in the attached file. (I also mentioned this issue in hte python-list 
http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as there were no replies I eventually found, this might be more appropriate place.)
Both strings differ in a minimal way, each having one extra character
in a "strategic" position, which probably meets some pathological case
for difflib.
Instead of just reporting the insertion and deletion of these single
characters (which works well for most cases - with most other
positions of the differing characters), the output of the
SequenceMatcher decides to delete a large part of the string in
between the differences and to insert the almost same text after that.
The attached code simply prints the results of the comparison with the
respective tags, and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on windows XPp SP3.
I didn't find any plausible mentions of such cases in the documentation, but after some searching I found several reports in the bug tracker mentioning the erroneous output of SequenceMatcher on longer repetitive sequences.

besides this
http://bugs.python.org/issue2986
e.g.
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074

In my case, disabling the "popular" heuristics as mentioned by John Machin in
http://bugs.python.org/issue1528074#msg29269

seems to have solved the problem; with a modified version of difflib containing:

                if 0:   # disable popular heuristics
                    if n >= 200 and len(indices) * 100 > n:
                        populardict[elt] = 1
                        del indices[:]

the comparison catches the differences in the test strings as expected - i.e. one character addition and deletion only. It is likely, that some other use cases for difflib may rely on the "popular"-heuristics but it also seems useful to have some control over this behaviour, which might not be appropriate in all cases.
(The issue seems to be the same in python 2.5, 2.6 and 3.1.)

regards,
   vbr
History
Date User Action Args
2010-04-19 23:25:09vbrsetrecipients: + vbr, tim.peters, georg.brandl, gjb1002, pitrou, hagna, r.david.murray, mrotondo
2010-04-19 23:25:09vbrsetmessageid: <1271719509.15.0.934934599383.issue2986@psf.upfronthosting.co.za>
2010-04-19 23:25:07vbrlinkissue2986 messages
2010-04-19 23:25:05vbrcreate