Classification
Title: Add ratio_min() function to the difflib library
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.11

Process
Status: closed
Resolution: rejected
Dependencies: (none)
Superseder: (none)
Assigned To: tim.peters
Nosy List: gibu, python-dev, taleinat, tim.peters
Priority: normal
Keywords: patch

Created on 2021-12-15 17:37 by gibu, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
PR 30125: closed (linked by python-dev, 2021-12-15 17:40)
Messages (4)
msg408622 - Author: Giacomo (gibu) * Date: 2021-12-15 17:37
I propose a new SequenceMatcher method, .ratio_min(self, m).

.ratio_min(self, m) extends difflib's .ratio(self). Like .ratio(), it returns a measure of the two sequences' similarity (a float in [0, 1]). Unlike .ratio(), it ignores matched substrings shorter than a given threshold m, which is the method's second parameter.

This helps avoid spuriously high similarity scores that come from many short, incidental matches.
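To make the motivation concrete, here is a small sketch that breaks the score from .ratio() down into the matching blocks it is built from. The two strings are made up for illustration; how many short blocks appear depends entirely on the input.

```python
from difflib import SequenceMatcher

# Two made-up strings, chosen only for illustration.
a = "the quick brown fox"
b = "a lazy dog sat here"

matcher = SequenceMatcher(None, a, b)
total = len(a) + len(b)

# get_matching_blocks() ends with a zero-size sentinel block; skip it.
blocks = [blk for blk in matcher.get_matching_blocks() if blk.size]

# Show each matched fragment and its length.
for blk in blocks:
    print(repr(a[blk.a:blk.a + blk.size]), "size", blk.size)

# ratio() is 2.0 * (sum of block sizes) / (total number of elements).
matched = sum(blk.size for blk in blocks)
print("ratio():", matcher.ratio())
print("from block sizes:", 2.0 * matched / total)
```

Every block counts toward the score regardless of its length, which is the behaviour the proposed threshold m is meant to restrict.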

    # NEW FUNCTION: 

    def ratio_min(self, m):
        """Return a measure of the sequences' similarity (float in [0,1]).
        Where T is the total number of elements in both sequences, and
        M_min is the number of matched elements in matching blocks of
        length at least m, this is 2.0*M_min / T.
        Note that this is 1 if the sequences are identical (and at least
        m elements long), and 0 if they have no substring of length m or
        more in common.
        .ratio_min() is similar to .ratio().
        .ratio_min(1) is equivalent to .ratio().
        
        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio_min(1)
        0.75
        >>> s.ratio_min(2)
        0.75
        >>> s.ratio_min(3)
        0.75
        >>> s.ratio_min(4)
        0.0
        """

        # Count only the elements in matching blocks of size >= m.
        matches = sum(triple[-1] for triple in self.get_matching_blocks() if triple[-1] >= m)
        return _calculate_ratio(matches, len(self.a) + len(self.b))
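The method above is written as part of difflib and relies on the module's private _calculate_ratio helper. As a minimal sketch of the same idea outside the standard library, the function below uses only the public SequenceMatcher API. The free function ratio_min(matcher, m) is a hypothetical name for illustration, not part of difflib, and the empty-input branch is an assumption made to mirror what .ratio() returns for two empty sequences.

```python
from difflib import SequenceMatcher


def ratio_min(matcher, m):
    """Similarity in [0, 1] counting only matching blocks of size >= m.

    Hypothetical standalone helper; not part of difflib.
    """
    total = len(matcher.a) + len(matcher.b)
    if total == 0:
        # Mirror SequenceMatcher.ratio(), which returns 1.0 for two empty sequences.
        return 1.0
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= m
    )
    return 2.0 * matched / total


s = SequenceMatcher(None, "abcd", "bcde")
print(ratio_min(s, 3))  # the only matching block, "bcd", has size 3 and is counted
print(ratio_min(s, 4))  # no block reaches size 4, so nothing is counted
```

Because it takes an existing SequenceMatcher, the helper reuses the matcher's cached matching blocks, so calling it with several thresholds does not repeat the comparison.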
msg408629 - Author: Alex Waygood (AlexWaygood) * (Python triager) Date: 2021-12-15 18:08
I am removing 3.10 from the "versions" field, since additions to the standard library are only considered for unreleased versions of Python.
msg410104 - Author: Tal Einat (taleinat) * (Python committer) Date: 2022-01-08 18:39
Thanks for the suggestion and the PR, Giacomo!

However, in my opinion, this is better suited to be something like a cookbook recipe.  The number of use cases for this will be low, and there would be little advantage to having this in the stdlib rather than elsewhere.
msg411048 - Author: Tal Einat (taleinat) * (Python committer) Date: 2022-01-20 22:15
I'm closing this for now since nobody has followed up and to the best of my understanding this wouldn't be an appropriate addition to the stdlib. 

This can be re-opened in the future if needed, of course.
History
Date  User  Action  Args
2022-04-11 14:59:53  admin  set  github: 90244
2022-01-20 22:15:08  taleinat  set  status: open -> closed; resolution: rejected; messages: + msg411048; stage: patch review -> resolved
2022-01-08 18:39:37  taleinat  set  nosy: + taleinat; messages: + msg410104
2021-12-15 20:53:03  rhettinger  set  assignee: tim.peters
2021-12-15 18:08:19  AlexWaygood  set  nosy: - AlexWaygood
2021-12-15 18:08:06  AlexWaygood  set  nosy: + AlexWaygood, tim.peters; messages: + msg408629; versions: - Python 3.10
2021-12-15 17:40:50  python-dev  set  keywords: + patch; nosy: + python-dev; pull_requests: + pull_request28344; stage: patch review
2021-12-15 17:37:14  gibu  create