Issue 1528074: difflib.SequenceMatcher.find_longest_match() wrong result

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43714

classification

Title:	difflib.SequenceMatcher.find_longest_match() wrong result
Type:		Stage:
Components:	Library (Lib)	Versions:

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	difflib.SequenceMatcher not matching long sequences View: 2986
Assigned To:	tim.peters	Nosy List:	janpf, jimjjewett, rtvd, sjmachin, terry.reedy, tim.peters
Priority:	normal	Keywords:

Created on 2006-07-24 23:59 by sjmachin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
koara2.py	sjmachin, 2006-07-24 23:59	short self-contained script which shows bug

Messages (7)
msg29269 - (view)	Author: John Machin (sjmachin)	Date: 2006-07-24 23:59
A short example script is attached. Two strings, each 500 bytes long. Longest match is the first 32 bytes of each string. Result produced is a 10-byte match later in the string. If one byte is chopped off the end of each string (len now 499 each), the correct result is produced. Other observations, none of which may be relevant: 1. Problem may be in the heuristic for "popular" elements in the __chain_b method. In this particular example, the character '0' (which has frequency of 6 in the "b" string) is not "popular" with a len of 500 but is "popular" with a len of 499. 2. '0' is the last byte of the correct longest match. 3. The correct longest match is at the start of the each of the strings. 4. Disabling the "popular" heuristic (like below) appears to make the problem go away: if 0: # if n >= 200 and len(indices) * 100 > n: populardict[elt] = 1 del indices[:] else: indices.append(i) 5. The callable self.isbpopular is created but appears to be unused. 6. The determination of "popular" doesn't accord with the comments, which say 1%. However with len==500, takes 6 to be popular.
msg29270 - (view)	Author: Denys Rtveliashvili (rtvd)	Date: 2007-03-10 17:24
The quick test for this bug is: for i in xrange(190, 200): text1 = "a" + "b"i text2 = "b"i + "c" m = difflib.SequenceMatcher(None, text1, text2) (aptr,bptr,l) = m.find_longest_match(0, len(text1), 0, len(text2)) print "i:", i, " l:", l, " aptr:", aptr, " bptr:", bptr assert l == i The assertion will fail when i==199 (the length of the texts will be 200). And yes, the bug is clearly "populardict"-related.
msg29271 - (view)	Author: Denys Rtveliashvili (rtvd)	Date: 2007-03-11 15:29
I have sent a testcase for this bug into the SourceForge. The ID is #1678339. Also I have submitted a fix for this bug (ID #1678345), but the fix reduces the performance significantly.
msg29272 - (view)	Author: Denys Rtveliashvili (rtvd)	Date: 2007-03-14 20:11
By the way, I found that the implementation should better be changed completely. The current one has a O(n^2) computational complexity, while the one, based on suffix trees using Ukkonen's algorithm would use only O(n)
msg29273 - (view)	Author: Jim Jewett (jimjjewett)	Date: 2007-03-19 01:51
I think the bug is documentation only. According to the comments (but not the docstring) find_longest_match returns the longest _interesting_ match. A single element appearing too often is likely to cause spurious matches, and is therefore not interesting. I do agree that this should be noted more prominently, so people don't try things like comparing text strings letter-by-letter (where 1% is far too low a threshhold for a 26-character alphabest). And yes, the comments on popular are correct -- it ignores elements which constitute more than 1%. I recommend closing this set of tracker items. If you could submit changes to the documentation (docstrings and/or help files; maybe even the comments), I would recommend applying them.
msg81334 - (view)	Author: Jan (janpf)	Date: 2009-02-07 10:50
hi all, just got bitten by this, so i took the time to reiterate the issue. according to the docs: http://docs.python.org/library/difflib.html find_longest_match() should return the longest matching string: "If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b." but after a couple of hours debugging i finally convinced myself that the bug was in the library ... and i ended up here :) any ideas on how to work around this bug/feature, and just get the longest matching string ? (from a normal/newbie user perspective, that is, without patching the C++ library code and recompiling?) from the comments (which i couldn't follow entirely), does it use some concept of popularity that is not exposed by the API ? How is "popularity" defined ? many thanks! - jan ps.: using ubuntu's python 2.5.2 ps2.: and example of a string pair where the issue shows up: s1='Floor Box SystemsFBS Floor Box Systems - Manufacturer & supplier of FBS floor boxes, electrical ... experience, FBS Floor Box Systems continue ... raceways, floor box. ...www.floorboxsystems.com' s2='FBS Floor Box SystemsFBS Floor Box Systems - Manufacturer & supplier of FBS floor boxes, electrical floor boxes, wood floor box, concrete floor box, surface mount floor box, raised floor ...www.floorboxsystems.com'
msg108634 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-06-25 21:55
This appears to be one of at least three duplicate issues: #1528074, #2986, and #4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) #1711800 appears to be slightly different (if not, it could be closed also.) Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great.

History
Date	User	Action	Args
2022-04-11 14:56:19	admin	set	github: 43714
2010-06-25 21:55:07	terry.reedy	set	status: open -> closed nosy: + terry.reedy messages: + msg108634 superseder: difflib.SequenceMatcher not matching long sequences resolution: duplicate
2009-02-07 10:50:37	janpf	set	nosy: + janpf messages: + msg81334
2006-07-24 23:59:52	sjmachin	create