Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

difflib.SequenceMatcher.find_longest_match() wrong result #43714

Closed
sjmachin mannequin opened this issue Jul 24, 2006 · 7 comments
Closed

difflib.SequenceMatcher.find_longest_match() wrong result #43714

sjmachin mannequin opened this issue Jul 24, 2006 · 7 comments
Assignees
Labels
stdlib Python modules in the Lib dir

Comments

@sjmachin
Copy link
Mannequin

sjmachin mannequin commented Jul 24, 2006

BPO 1528074
Nosy @tim-one, @terryjreedy
Superseder
  • bpo-2986: difflib.SequenceMatcher not matching long sequences
  • Files
  • koara2.py: short self-contained script which shows bug
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/tim-one'
    closed_at = <Date 2010-06-25.21:55:07.722>
    created_at = <Date 2006-07-24.23:59:52.000>
    labels = ['library']
    title = 'difflib.SequenceMatcher.find_longest_match()  wrong result'
    updated_at = <Date 2010-06-25.21:55:07.720>
    user = 'https://bugs.python.org/sjmachin'

    bugs.python.org fields:

    activity = <Date 2010-06-25.21:55:07.720>
    actor = 'terry.reedy'
    assignee = 'tim.peters'
    closed = True
    closed_date = <Date 2010-06-25.21:55:07.722>
    closer = 'terry.reedy'
    components = ['Library (Lib)']
    creation = <Date 2006-07-24.23:59:52.000>
    creator = 'sjmachin'
    dependencies = []
    files = ['2076']
    hgrepos = []
    issue_num = 1528074
    keywords = []
    message_count = 7.0
    messages = ['29269', '29270', '29271', '29272', '29273', '81334', '108634']
    nosy_count = 6.0
    nosy_names = ['tim.peters', 'terry.reedy', 'jimjjewett', 'sjmachin', 'rtvd', 'janpf']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '2986'
    type = None
    url = 'https://bugs.python.org/issue1528074'
    versions = []

    @sjmachin
    Copy link
    Mannequin Author

    sjmachin mannequin commented Jul 24, 2006

    A short example script is attached.
    Two strings, each 500 bytes long. Longest match is the
    first 32 bytes of each string. Result produced is a
    10-byte match later in the string.

    If one byte is chopped off the end of each string (len
    now 499 each), the correct result is produced.

    Other observations, none of which may be relevant:

    1. Problem may be in the heuristic for "popular"
      elements in the __chain_b method. In this particular
      example, the character '0' (which has frequency of 6 in
      the "b" string) is not "popular" with a len of 500 but
      is "popular" with a len of 499.
    2. '0' is the last byte of the correct longest match.
    3. The correct longest match is at the start of the
      each of the strings.
    4. Disabling the "popular" heuristic (like below)
      appears to make the problem go away:
                    if 0:
                    # if n >= 200 and len(indices) * 100 > n:
                        populardict[elt] = 1
                        del indices[:]
                    else:
                        indices.append(i)
    5. The callable self.isbpopular is created but appears
    to be unused.
    6. The determination of "popular" doesn't accord with
    the comments, which say 1%. However with len==500,
    takes 6 to be popular.

    @sjmachin sjmachin mannequin assigned tim-one Jul 24, 2006
    @sjmachin sjmachin mannequin added the stdlib Python modules in the Lib dir label Jul 24, 2006
    @sjmachin sjmachin mannequin assigned tim-one Jul 24, 2006
    @sjmachin sjmachin mannequin added the stdlib Python modules in the Lib dir label Jul 24, 2006
    @rtvd
    Copy link
    Mannequin

    rtvd mannequin commented Mar 10, 2007

    The quick test for this bug is:

    for i in xrange(190, 200):
    	text1 = "a" + "b"*i
    	text2 = "b"*i + "c"
    	m = difflib.SequenceMatcher(None, text1, text2)
    	(aptr,bptr,l) = m.find_longest_match(0, len(text1), 0, len(text2))
    	print "i:", i, "  l:", l, "  aptr:", aptr, "  bptr:", bptr
    	assert l == i

    The assertion will fail when i==199 (the length of the texts will be 200).
    And yes, the bug is clearly "populardict"-related.

    @rtvd
    Copy link
    Mannequin

    rtvd mannequin commented Mar 11, 2007

    I have sent a testcase for this bug into the SourceForge. The ID is bpo-1678339.
    Also I have submitted a fix for this bug (ID bpo-1678345), but the fix reduces the performance significantly.

    @rtvd
    Copy link
    Mannequin

    rtvd mannequin commented Mar 14, 2007

    By the way, I found that the implementation should better be changed completely. The current one has a O(n^2) computational complexity, while the one, based on suffix trees using Ukkonen's algorithm would use only O(n)

    @jimjjewett
    Copy link
    Mannequin

    jimjjewett mannequin commented Mar 19, 2007

    I think the bug is documentation only.

    According to the comments (but not the docstring) find_longest_match returns the longest _interesting_ match. A single element appearing too often is likely to cause spurious matches, and is therefore not interesting.

    I do agree that this should be noted more prominently, so people don't try things like comparing text strings letter-by-letter (where 1% is far too low a threshhold for a 26-character alphabest).

    And yes, the comments on popular are correct -- it ignores elements which constitute *more* than 1%.

    I recommend closing this set of tracker items. If you could submit changes to the documentation (docstrings and/or help files; maybe even the comments), I would recommend applying them.

    @janpf
    Copy link
    Mannequin

    janpf mannequin commented Feb 7, 2009

    hi all,

    just got bitten by this, so i took the time to reiterate the issue.

    according to the docs:

    http://docs.python.org/library/difflib.html

    find_longest_match() should return the longest matching string:

    "If isjunk was omitted or None, find_longest_match() returns (i, j, k)
    such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi
    and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those
    conditions, the additional conditions k >= k', i <= i', and if i == i',
    j <= j' are also met. In other words, of all maximal matching blocks,
    return one that starts earliest in a, and of all those maximal matching
    blocks that start earliest in a, return the one that starts earliest in b."

    but after a couple of hours debugging i finally convinced myself that
    the bug was in the library ... and i ended up here :)

    any ideas on how to work around this bug/feature, and just get the
    longest matching string ? (from a normal/newbie user perspective, that
    is, without patching the C++ library code and recompiling?)

    from the comments (which i couldn't follow entirely), does it use some
    concept of popularity that is not exposed by the API ? How is
    "popularity" defined ?

    many thanks!

    • jan

    ps.: using ubuntu's python 2.5.2

    ps2.: and example of a string pair where the issue shows up:

    s1='Floor Box SystemsFBS Floor Box Systems - Manufacturer &amp; supplier
    of FBS floor boxes, electrical ... experience, FBS Floor Box Systems
    continue ... raceways, floor box. ...www.floorboxsystems.com'
    
    s2='FBS Floor Box SystemsFBS Floor Box Systems - Manufacturer &amp;
    supplier of FBS floor boxes, electrical floor boxes, wood floor box,
    concrete floor box, surface mount floor box, raised floor
    ...www.floorboxsystems.com'

    @terryjreedy
    Copy link
    Member

    This appears to be one of at least three duplicate issues: bpo-1528074, bpo-2986, and bpo-4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) bpo-1711800 appears to be slightly different (if not, it could be closed also.)

    Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants