Title: Junks in difflib
Components: Documentation Versions: Python 3.8
Assigned To: docs@python Nosy List: docs@python, hubertbdlb, terry.reedy, tim.peters
Created on 2021-03-11 09:37 by hubertbdlb, last changed 2022-04-11 14:59 by admin.

Author: Hubert Bonnisseur-De-La-Bathe (hubertbdlb) Date: 2021-03-11 09:37
On first reading the difflib documentation, I thought that using junk would produce this result:

>>> s = SequenceMatcher(lambda x: x == " ", "abcd efgh", "abcdefgh")
>>> s.get_matching_blocks()
[Match(a=0, b=0, size=8)]

On a second reading, it is clear that this call in fact returns two matches of length 4.
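
For reference, a minimal sketch of what the call returns, using only the standard library (the variable names are illustrative). The final zero-length block is the sentinel that get_matching_blocks() always appends.

```python
from difflib import SequenceMatcher

# Treat the space character as junk, as in the example above.
s = SequenceMatcher(lambda x: x == " ", "abcd efgh", "abcdefgh")

blocks = s.get_matching_blocks()
for block in blocks:
    print(block)

# Prints:
# Match(a=0, b=0, size=4)
# Match(a=5, b=4, size=4)
# Match(a=9, b=8, size=0)
```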

Would it be nicer to have get_matching_blocks return the length-8 match?

Don't know if it's in the spirit of the lib, I'm just asking.
Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-03-13 07:27
Currently, a returned triple (i, j, n) means that a[i:i+n] == b[j:j+n], so both matching blocks have the same length.
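
A small sketch of that invariant, assuming the two strings from the original report. It checks every returned triple and also shows that a hypothetical (0, 0, 8) triple would not satisfy the equality.

```python
from difflib import SequenceMatcher

a, b = "abcd efgh", "abcdefgh"
s = SequenceMatcher(lambda x: x == " ", a, b)

# Every triple (i, j, n) describes two equal slices, each of length n.
invariant_holds = all(
    a[i:i + n] == b[j:j + n] for i, j, n in s.get_matching_blocks()
)
print(invariant_holds)  # True

# A hypothetical length-8 match starting at (0, 0) would break the invariant:
# a[0:8] is "abcd efg", while b[0:8] is "abcdefgh".
hypothetical_ok = a[0:8] == b[0:8]
print(hypothetical_ok)  # False
```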

This would not be the case if a has an ignored space and b does not. Changing the current definition would break existing code, and the function would have to return quadruples in order to carry two different lengths. This would require either a new parameter for the function to select the behavior or a new function with a new name.

Either option would require justification by actual use cases. I cannot see what they might be. One way to have junk chars completely ignored is to strip them from both strings before calling SequenceMatcher.
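
A minimal sketch of the stripping approach, assuming the space is the only junk character. The helper strip_junk is hypothetical and is not part of difflib.

```python
from difflib import SequenceMatcher


def strip_junk(text, junk=" "):
    """Hypothetical helper: drop every character that appears in `junk`."""
    return "".join(ch for ch in text if ch not in junk)


a = strip_junk("abcd efgh")
b = strip_junk("abcdefgh")

# No junk function is needed once the junk characters are gone.
s = SequenceMatcher(None, a, b)
blocks = s.get_matching_blocks()
print(blocks)
# [Match(a=0, b=0, size=8), Match(a=8, b=8, size=0)]
```

Note that the returned indices refer to the stripped strings, not to the originals, so mapping a match back to the original text would need extra bookkeeping.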
