New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
difflib.SequenceMatcher.find_longest_match() wrong result #43714
Comments
A short example script is attached. If one byte is chopped off the end of each string (len Other observations, none of which may be relevant:
if 0:
# if n >= 200 and len(indices) * 100 > n:
populardict[elt] = 1
del indices[:]
else:
indices.append(i)
5. The callable self.isbpopular is created but appears
to be unused.
6. The determination of "popular" doesn't accord with
the comments, which say 1%. However with len==500,
takes 6 to be popular. |
The quick test for this bug is: for i in xrange(190, 200):
text1 = "a" + "b"*i
text2 = "b"*i + "c"
m = difflib.SequenceMatcher(None, text1, text2)
(aptr,bptr,l) = m.find_longest_match(0, len(text1), 0, len(text2))
print "i:", i, " l:", l, " aptr:", aptr, " bptr:", bptr
assert l == i The assertion will fail when i==199 (the length of the texts will be 200). |
I have sent a testcase for this bug into the SourceForge. The ID is bpo-1678339. |
By the way, I found that the implementation should better be changed completely. The current one has a O(n^2) computational complexity, while the one, based on suffix trees using Ukkonen's algorithm would use only O(n) |
I think the bug is documentation only. According to the comments (but not the docstring) find_longest_match returns the longest _interesting_ match. A single element appearing too often is likely to cause spurious matches, and is therefore not interesting. I do agree that this should be noted more prominently, so people don't try things like comparing text strings letter-by-letter (where 1% is far too low a threshhold for a 26-character alphabest). And yes, the comments on popular are correct -- it ignores elements which constitute *more* than 1%. I recommend closing this set of tracker items. If you could submit changes to the documentation (docstrings and/or help files; maybe even the comments), I would recommend applying them. |
hi all, just got bitten by this, so i took the time to reiterate the issue. according to the docs: http://docs.python.org/library/difflib.html find_longest_match() should return the longest matching string: "If isjunk was omitted or None, find_longest_match() returns (i, j, k) but after a couple of hours debugging i finally convinced myself that any ideas on how to work around this bug/feature, and just get the from the comments (which i couldn't follow entirely), does it use some many thanks!
ps.: using ubuntu's python 2.5.2 ps2.: and example of a string pair where the issue shows up: s1='Floor Box SystemsFBS Floor Box Systems - Manufacturer & supplier
of FBS floor boxes, electrical ... experience, FBS Floor Box Systems
continue ... raceways, floor box. ...www.floorboxsystems.com'
s2='FBS Floor Box SystemsFBS Floor Box Systems - Manufacturer &
supplier of FBS floor boxes, electrical floor boxes, wood floor box,
concrete floor box, surface mount floor box, raised floor
...www.floorboxsystems.com' |
This appears to be one of at least three duplicate issues: bpo-1528074, bpo-2986, and bpo-4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) bpo-1711800 appears to be slightly different (if not, it could be closed also.) Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: