
Author gibu
Recipients gibu
Date 2021-12-15.17:37:14
Message-id <1639589834.45.0.869314940446.issue46086@roundup.psfhosted.org>
In-reply-to
Content
I propose a new SequenceMatcher method, .ratio_min(self, m).

.ratio_min(self, m) extends difflib's .ratio(self). Like .ratio(), it returns a measure of the two sequences' similarity (a float in [0, 1]). Unlike .ratio(), it ignores matching blocks shorter than a given threshold m, which is passed as the method's argument.

This helps avoid spuriously high similarity scores, which can arise when many short, incidental matches add up.
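To illustrate the kind of score meant here, the following sketch uses only the existing difflib API. The helper name `block_sizes` and the two sample strings are made up for this example; they are not part of the proposal.

```python
from difflib import SequenceMatcher


def block_sizes(a, b):
    """Return the sizes of the matching blocks, without the final sentinel."""
    matcher = SequenceMatcher(None, a, b)
    # get_matching_blocks() always ends with a zero-size sentinel block.
    return [block.size for block in matcher.get_matching_blocks() if block.size > 0]


# Hypothetical inputs: the strings share only isolated single characters.
a = "abcdefgh"
b = "a1b2c3d4"

sizes = block_sizes(a, b)
score = SequenceMatcher(None, a, b).ratio()

print(sizes)  # every matching block is a single character
print(score)  # 0.5, although no common substring is longer than one character
```

With a threshold of m = 2, none of these blocks would be counted, so the proposed method would return 0.0 for this pair.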

    # NEW FUNCTION: 

    def ratio_min(self, m):
        """Return a measure of the sequences' similarity (float in [0,1]).

        Where T is the total number of elements in both sequences, and
        M_min is the number of matched elements, counting only matching
        blocks of length at least m, this is 2.0*M_min / T.
        Note that this is 1 if the sequences are identical and contain at
        least m elements, and 0 if they have no substring of length m or
        more in common.

        .ratio_min(1) is equivalent to .ratio().
        
        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio_min(1)
        0.75
        >>> s.ratio_min(2)
        0.75
        >>> s.ratio_min(3)
        0.75
        >>> s.ratio_min(4)
        0.0
        """

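        # Sum the sizes of the matching blocks that are at least m elements
        # long; the zero-size sentinel that ends the list contributes nothing.
        matches = sum(block.size for block in self.get_matching_blocks()
                      if block.size >= m)
        return _calculate_ratio(matches, len(self.a) + len(self.b))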
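The method above relies on `_calculate_ratio`, a private helper inside difflib, so it only works when placed in the module itself. To try the idea without patching the standard library, a minimal sketch as a subclass could look like this. The class name `MinBlockSequenceMatcher` is hypothetical, and the ratio is computed inline on the assumption that it should follow the same convention as `.ratio()` (two empty sequences score 1.0).

```python
from difflib import SequenceMatcher


class MinBlockSequenceMatcher(SequenceMatcher):
    """SequenceMatcher with a ratio that ignores short matching blocks."""

    def ratio_min(self, m):
        """Like ratio(), but count only matching blocks of size >= m."""
        matches = sum(
            block.size
            for block in self.get_matching_blocks()
            if block.size >= m
        )
        total = len(self.a) + len(self.b)
        # Assumed convention, mirroring ratio(): empty inputs count as identical.
        return 2.0 * matches / total if total else 1.0


s = MinBlockSequenceMatcher(None, "abcd", "bcde")
print(s.ratio_min(1))  # 0.75, same as s.ratio()
print(s.ratio_min(3))  # 0.75, the single block "bcd" has length 3
print(s.ratio_min(4))  # 0.0, no block is long enough
```

A subclass keeps the experiment separate from difflib and leaves `.ratio()` untouched for comparison.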
History
Date                 User  Action  Args
2021-12-15 17:37:14  gibu  set     recipients: + gibu
2021-12-15 17:37:14  gibu  set     messageid: <1639589834.45.0.869314940446.issue46086@roundup.psfhosted.org>
2021-12-15 17:37:14  gibu  link    issue46086 messages
2021-12-15 17:37:14  gibu  create