Message 103660 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	vbr
Recipients	georg.brandl, gjb1002, hagna, mrotondo, pitrou, r.david.murray, tim.peters, vbr
Date	2010-04-19.23:25:04
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<1271719509.15.0.934934599383.issue2986@psf.upfronthosting.co.za>
In-reply-to

Content
I just stumbled on some seemingly different unexpected behaviour of difflib.SequenceMatcher, but it turns out, it may have the same cause, i.e. the "popular" heuristics. I hopefully managed to replicate it on an illustrative sample text - in as included in the attached file. (I also mentioned this issue in hte python-list http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as there were no replies I eventually found, this might be more appropriate place.) Both strings differ in a minimal way, each having one extra character in a "strategic" position, which probably meets some pathological case for difflib. Instead of just reporting the insertion and deletion of these single characters (which works well for most cases - with most other positions of the differing characters), the output of the SequenceMatcher decides to delete a large part of the string in between the differences and to insert the almost same text after that. The attached code simply prints the results of the comparison with the respective tags, and substrings. No junk function is used. I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on windows XPp SP3. I didn't find any plausible mentions of such cases in the documentation, but after some searching I found several reports in the bug tracker mentioning the erroneous output of SequenceMatcher on longer repetitive sequences. besides this http://bugs.python.org/issue2986 e.g. http://bugs.python.org/issue1711800 http://bugs.python.org/issue4622 http://bugs.python.org/issue1528074 In my case, disabling the "popular" heuristics as mentioned by John Machin in http://bugs.python.org/issue1528074#msg29269 seems to have solved the problem; with a modified version of difflib containing: if 0: # disable popular heuristics if n >= 200 and len(indices) * 100 > n: populardict[elt] = 1 del indices[:] the comparison catches the differences in the test strings as expected - i.e. one character addition and deletion only. It is likely, that some other use cases for difflib may rely on the "popular"-heuristics but it also seems useful to have some control over this behaviour, which might not be appropriate in all cases. (The issue seems to be the same in python 2.5, 2.6 and 3.1.) regards, vbr

I just stumbled on some seemingly different unexpected behaviour of
difflib.SequenceMatcher, but it turns out, it may have the same cause, i.e. the "popular" heuristics.
I hopefully managed to replicate it on an illustrative sample text - in as included in the attached file. (I also mentioned this issue in hte python-list
http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as there were no replies I eventually found, this might be more appropriate place.)
Both strings differ in a minimal way, each having one extra character
in a "strategic" position, which probably meets some pathological case
for difflib.
Instead of just reporting the insertion and deletion of these single
characters (which works well for most cases - with most other
positions of the differing characters), the output of the
SequenceMatcher decides to delete a large part of the string in
between the differences and to insert the almost same text after that.
The attached code simply prints the results of the comparison with the
respective tags, and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on windows XPp SP3.
I didn't find any plausible mentions of such cases in the documentation, but after some searching I found several reports in the bug tracker mentioning the erroneous output of SequenceMatcher on longer repetitive sequences.

besides this
http://bugs.python.org/issue2986
e.g.
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074

In my case, disabling the "popular" heuristics as mentioned by John Machin in
http://bugs.python.org/issue1528074#msg29269

seems to have solved the problem; with a modified version of difflib containing:

if 0: # disable popular heuristics
if n >= 200 and len(indices) * 100 > n:
populardict[elt] = 1
del indices[:]

the comparison catches the differences in the test strings as expected - i.e. one character addition and deletion only. It is likely, that some other use cases for difflib may rely on the "popular"-heuristics but it also seems useful to have some control over this behaviour, which might not be appropriate in all cases.
(The issue seems to be the same in python 2.5, 2.6 and 3.1.)

regards,
vbr

History
Date	User	Action	Args
2010-04-19 23:25:09	vbr	set	recipients: + vbr, tim.peters, georg.brandl, gjb1002, pitrou, hagna, r.david.murray, mrotondo
2010-04-19 23:25:09	vbr	set	messageid: <1271719509.15.0.934934599383.issue2986@psf.upfronthosting.co.za>
2010-04-19 23:25:07	vbr	link	issue2986 messages
2010-04-19 23:25:05	vbr	create