Author mrotondo
Recipients georg.brandl, hagna, mrotondo, tim.peters
Date 2009-03-30.00:40:13
SpamBayes Score 2.16493e-15
Marked as misclassified No
Message-id <>
From the source, it seems that there is undocumented behavior to
SequenceMatcher which is causing this error. If b is longer than 200
characters, it will consider any element x in b that takes up more than
1% of it's contents as "popular", and thus junk. 

So, in this case, difflib is treating each individual digit as an
element of your sequences, and each one takes up more than 1% of the
complete sequence b. Therefore, each one is "popular", and therefore

A snippet which demonstrates this:

from difflib import SequenceMatcher
for i in range(1, 202)[::10]:
  a = "a" * i
  b = "b" + "a" * i
  s = SequenceMatcher(None, a, b)
  print s.find_longest_match(0, len(a), 0, len(b))

Up til i=200, the strings match, but afterwards they do not because "a"
is "popular". 

Strangely, if you get rid of the "b" at the beginning of b, they
continue to match at lengths greater than 200. This may be a bug, I'll
keep looking into it but someone who knows more should probably take a
look too.

The comments from say some interesting things:
 # b2j also does not contain entries for "popular" elements, meaning 
 # elements that account for more than 1% of the total elements, and
 # when the sequence is reasonably large (>= 200 elements); this can
 # be viewed as an adaptive notion of semi-junk, and yields an enormous
 # speedup when, e.g., comparing program files with hundreds of
 # instances of "return NULL;"

This seems to mean that you won't actually get an accurate diff in
certain cases, which seems odd. At the very least, this behavior should
probably be documented. Do people think it should be changed to get rid
of the "popularity" heuristic?
Date User Action Args
2009-03-30 00:40:18mrotondosetrecipients: + mrotondo, tim.peters, georg.brandl, hagna
2009-03-30 00:40:17mrotondosetmessageid: <>
2009-03-30 00:40:15mrotondolinkissue2986 messages
2009-03-30 00:40:14mrotondocreate