Title: Junks in difflib
Type: enhancement Stage:
Components: Documentation Versions: Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, hubertbdlb, terry.reedy, tim.peters
Priority: normal Keywords:

Created on 2021-03-11 09:37 by hubertbdlb, last changed 2022-04-11 14:59 by admin.

Messages (2)
msg388491 - Author: Hubert Bonnisseur-De-La-Bathe (hubertbdlb) Date: 2021-03-11 09:37
On first reading the difflib documentation, I thought that using a junk function would have produced this result:

from difflib import SequenceMatcher
s = SequenceMatcher(lambda x: x == " ", "abcd efgh", "abcdefgh")
s.get_matching_blocks()
>>> [Match(a=0, b=0, size=8)]

On a second reading, it is clear that this call will in fact return two matches of length 4.

Would it be nicer to have get_matching_blocks return the length 8 match?

Don't know if it's in the spirit of the lib, I'm just asking.
msg388595 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-03-13 07:27
Currently, a returned tuple (i, j, n) means that a[i:i+n] == b[j:j+n], so both matching blocks have the same length.
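
As a minimal sketch of that invariant, the snippet below runs the call from the first message and checks that every returned block satisfies a[i:i+n] == b[j:j+n]. The variable names (matcher, blocks) are chosen here for illustration only. Note that the last block is the zero-length sentinel that get_matching_blocks always appends.

```python
from difflib import SequenceMatcher

a = "abcd efgh"
b = "abcdefgh"

# Same call as in the original report: spaces are declared junk.
matcher = SequenceMatcher(lambda ch: ch == " ", a, b)
blocks = matcher.get_matching_blocks()

# Every block (i, j, n) describes two equal slices of the same length n.
for i, j, n in blocks:
    assert a[i:i + n] == b[j:j + n]

print(blocks)
# [Match(a=0, b=0, size=4), Match(a=5, b=4, size=4), Match(a=9, b=8, size=0)]
```

The space in a is skipped between the two blocks rather than absorbed into one block, which is why two matches of length 4 come back instead of one of length 8.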

This would not be the case if a has an ignored space and b does not. Changing the current definition would break existing code and would require quadruples to return two different lengths.  This would require either a new parameter for the function to select the behavior or a new function with a new name.

Either option would require justification by actual use cases.  I cannot see what they might be.  A way to have junk chars completely ignored is to strip them from both strings before calling SequenceMatcher.
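
The stripping approach could look roughly like the sketch below. The helper strip_junk and the predicate is_space are hypothetical names introduced for this example; they are not part of difflib. One caveat under this approach: the reported indices refer to the stripped strings, not to the originals.

```python
from difflib import SequenceMatcher


def strip_junk(text, is_junk):
    """Return text with every character for which is_junk(ch) is true removed.

    Hypothetical helper for illustration; not provided by difflib.
    """
    return "".join(ch for ch in text if not is_junk(ch))


def is_space(ch):
    return ch == " "


a = strip_junk("abcd efgh", is_space)
b = strip_junk("abcdefgh", is_space)

# No junk function is needed any more: the junk is already gone.
blocks = SequenceMatcher(None, a, b).get_matching_blocks()

print(blocks)
# [Match(a=0, b=0, size=8), Match(a=8, b=8, size=0)]
```

This yields the single length 8 match asked for in the first message, at the cost of having to map positions back to the unstripped strings if they are needed.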
Date                 User         Action  Args
2022-04-11 14:59:42  admin        set     github: 87639
2021-03-13 07:27:23  terry.reedy  set     nosy: + terry.reedy
                                          messages: + msg388595
2021-03-11 10:11:20  xtreak       set     nosy: + tim.peters
2021-03-11 09:37:51  hubertbdlb   create