Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

difflib.SequenceMatcher.get_matching_blocks omits common strings #79260

Closed
SpringemSpringsbee mannequin opened this issue Oct 26, 2018 · 9 comments
Closed

difflib.SequenceMatcher.get_matching_blocks omits common strings #79260

SpringemSpringsbee mannequin opened this issue Oct 26, 2018 · 9 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@SpringemSpringsbee
Copy link
Mannequin

SpringemSpringsbee mannequin commented Oct 26, 2018

BPO 35079
Nosy @tim-one, @terryjreedy, @miss-islington, @tirkarthi
PRs
  • bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc #10144
  • [3.7] bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) #10145
  • [3.6] bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) #10146
  • [2.7] bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) #10147
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/terryjreedy'
    closed_at = <Date 2018-10-27.03:23:10.962>
    created_at = <Date 2018-10-26.19:00:41.504>
    labels = ['3.7', '3.8', 'type-bug', 'library', 'docs']
    title = 'difflib.SequenceMatcher.get_matching_blocks omits common strings'
    updated_at = <Date 2018-10-27.03:23:10.961>
    user = 'https://bugs.python.org/SpringemSpringsbee'

    bugs.python.org fields:

    activity = <Date 2018-10-27.03:23:10.961>
    actor = 'terry.reedy'
    assignee = 'terry.reedy'
    closed = True
    closed_date = <Date 2018-10-27.03:23:10.962>
    closer = 'terry.reedy'
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2018-10-26.19:00:41.504>
    creator = 'Springem Springsbee'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35079
    keywords = ['patch']
    message_count = 9.0
    messages = ['328593', '328595', '328602', '328606', '328627', '328628', '328629', '328630', '328631']
    nosy_count = 5.0
    nosy_names = ['tim.peters', 'terry.reedy', 'miss-islington', 'xtreak', 'Springem Springsbee']
    pr_nums = ['10144', '10145', '10146', '10147']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35079'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7', 'Python 3.8']

    @SpringemSpringsbee
    Copy link
    Mannequin Author

    SpringemSpringsbee mannequin commented Oct 26, 2018

    Hello, I'm using difflib's SequenceMatcher to locate common substrings. It seems like the matcher is missing a common substrings. I'm guessing this is a rather low-level issue in difflib. The autojunk parameter has no effect for obvious reasons. Alternate pairwise comparisons between the following 3 strings omit the 2-character match 'AC'

    GATTACA
    TAGACCA
    ATACA

    The following Github gist captures the issue, which I'll repeat here for redundancy https://gist.github.com/MatthewRalston/b0ab6ac1dbe322cb12063310ccdbb786

    import difflib
    string1 = "TAGACCA"
    string2 = "ATACA"
    s = difflib.SequenceMatcher(None, string1, string2)
    blox = s.get_matching_blocks()
    print(blox)
    [Match(a=0, b=1, size=2), Match(a=5, b=3, size=2), Match(a=7, b=5, size=0)] # Missing Match(a=3, b=2, size=2)
    print([string1[x.a:x.a+x.size] for x in blox if x.size > 1])
    ['TA', 'CA'] # Missing the substring 'CA'

    @SpringemSpringsbee SpringemSpringsbee mannequin added 3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 26, 2018
    @tim-one
    Copy link
    Member

    tim-one commented Oct 26, 2018

    Sorry, I find this too hard to follow. At the end:

    ['TA', 'CA'] # Missing the substring 'CA'

    the comment claims it's missing 'CA', while the output plainly shows it's _not_ missing 'CA'.

    If your complaint is that's missing 'AC', yes, it is. It's not intended to find _overlapping_ matches at all (read the docs). The longest matching blocks in

    TAGACCA
    ATACA

    are in fact TA, CA, and AC, but after the leftmost-longest TA is matched first, AC no longer exists _to_ match in the second string. Only CA does.

    @terryjreedy
    Copy link
    Member

    We can assume that "substring 'CA'" was meant to be "substring 'AC'", but as explained, missing 'AC' is not a bug. (Tim wrote the module.)

    I read the doc, and 'non-overlapping' is implied in the SequenceMatcher entry at the top of the file.

    "The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; ... The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence."

    However, a user of SequenceMatcher could easily miss that, and its implication, as Springem did. For clarity, I think we should add 'non-overlapping to the first line of the .get_matching_blocks entry, which is in the middle of the page. "Return list of triples describing non-overlapping matching subsequences."

    I also think "i+n != i' or j+n != j'" should be changed to "i+n < i' or j+n < j'" as '>' would mean overlapping. So != must mean <.

    I will prepare a doc PR later.

    @terryjreedy terryjreedy added docs Documentation in the Doc dir 3.8 only security fixes labels Oct 26, 2018
    @terryjreedy terryjreedy self-assigned this Oct 26, 2018
    @tim-one
    Copy link
    Member

    tim-one commented Oct 26, 2018

    I don't object to spelling it out more, and your (Terry's) suggestions are fine. On the other hand, this module has been around for a loooong time now, and this is the first instance I've seen of someone surprised that it doesn't produce overlapping matches - it's obvious from "The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence" that a matching subsequence is wholly eliminated from further consideration.

    At some point of ever-more tedious elaboration, the docs risk missing the forest for the trees. I don't think these docs are quite there yet - although the docs for find_longest_match() are. Speaking of which, that method could be used to find overlapping matches, one at a time, by passing appropriate slice indices.

    Which can be horridly inefficient; e.g., find all overlapping matches between

    'A' * 100

    and

    'A' * 150

    @terryjreedy
    Copy link
    Member

    Tim, I share your concern about bloating docs, but think this one word worthwhile. I suspect that people are conditioned to accept 'non-overlapping' because that is ofter (usually?) the default for linear searches and regex matching.

    @terryjreedy
    Copy link
    Member

    New changeset d9bff4e by Terry Jan Reedy in branch 'master':
    bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
    d9bff4e

    @miss-islington
    Copy link
    Contributor

    New changeset cb920c1 by Miss Islington (bot) in branch '3.7':
    bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
    cb920c1

    @miss-islington
    Copy link
    Contributor

    New changeset 5282125 by Miss Islington (bot) in branch '3.6':
    bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
    5282125

    @miss-islington
    Copy link
    Contributor

    New changeset e389de8 by Miss Islington (bot) in branch '2.7':
    bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144)
    e389de8

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants