
Author tim.peters
Recipients Allan Lewis, Lewis Haley, tim.peters
Date 2015-10-13.18:01:47
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1444759307.74.0.61011312541.issue25391@psf.upfronthosting.co.za>
In-reply-to
Content
Do note that this is not an "edit distance" (like Levenshtein) algorithm.  It works as documented instead ;-), searching (in effect recursively) for the leftmost longest contiguous matching blocks.  Both "leftmost" and "contiguous" are crucial to understanding what it does.

I expect you're most surprised by the 2nd example, comparing:

location,location,location
location.location,location

The longest contiguous matching block is "location,location", in the first string at slice 0:17 (leftmost!) and in the second string at slice 9:26.  That leaves a wholly unmatched ",location" at the end of the first string and a wholly unmatched "location." at the start of the second string.  That's why the ratio is so low.  We have a total of 17*2 = 34 matching characters (in the single matching block) out of 2*26 = 52 characters total, so the ratio is 34/52 ~= 0.65.
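The arithmetic above can be checked directly with the stdlib's SequenceMatcher (a minimal sketch):

```python
from difflib import SequenceMatcher

a = "location,location,location"
b = "location.location,location"

sm = SequenceMatcher(None, a, b)

# The single real matching block: a[0:17] == b[9:26] == "location,location".
# (The final zero-length Match is a sentinel difflib always appends.)
print(sm.get_matching_blocks())  # [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]

# ratio() = 2*M / T: M = 17 matched characters, T = 26 + 26 = 52.
print(sm.ratio())  # 34/52 ~= 0.65
```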

Had it searched for the _rightmost_ longest matching blocks instead, then the trailing "location,location" pieces of both strings would have matched first, and then the leading "location" pieces of both strings, giving a ratio of about 0.96 instead.  Indeed, that's essentially what happens in your 3rd example.
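One way to see the "rightmost" behavior (an illustrative trick, not a difflib feature) is to reverse both strings, which turns what were rightmost matches into leftmost ones:

```python
from difflib import SequenceMatcher

a = "location,location,location"
b = "location.location,location"

# On the reversed strings the matcher first finds the (originally trailing)
# "location,location" block, then the (originally leading) "location" block:
# 17 + 8 = 25 matched characters out of 26 per string.
rev = SequenceMatcher(None, a[::-1], b[::-1])
print(rev.ratio())  # 2*25/52 ~= 0.96
```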

.quick_ratio() and .real_quick_ratio() use entirely different algorithms, and - again as documented - their only purpose is to give an upper bound on what .ratio() returns (but do so faster).
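For these particular strings the upper-bound property is easy to check (a small sketch):

```python
from difflib import SequenceMatcher

a = "location,location,location"
b = "location.location,location"
sm = SequenceMatcher(None, a, b)

# quick_ratio() only compares character multisets; the strings differ by a
# single ',' vs '.', so it is far higher than ratio().
# real_quick_ratio() only compares lengths, so here it is exactly 1.0.
assert sm.ratio() <= sm.quick_ratio() <= sm.real_quick_ratio()
print(sm.ratio(), sm.quick_ratio(), sm.real_quick_ratio())
```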

Anyway, a couple of things to take from this:

1. Some apps really want an "edit distance" kind of algorithm instead.  I would welcome one, but nobody so far has volunteered to implement one.  "A problem" is that there are many such algorithms (e.g., computational biology has driven many developments in this area).

2. It's far too late to change what current difflib functions implement.  The primary use case for the "leftmost longest contiguous matches" design was to improve the quality (as perceived by human eyes) of diffs generated for text files containing code (like C or Python source code), and it works very well for that purpose most times.  It doesn't work well for all purposes at all times.  Note that "works well (as perceived by human eyes)" is largely subjective, so arguing about that won't get far ;-)
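For reference, the kind of edit-distance algorithm mentioned in point 1 fits in a few lines (a minimal Levenshtein sketch; the `levenshtein` name is mine, not anything in difflib), and it scores these strings the way the report expected:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # Classic dynamic-programming recurrence, keeping only one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

a = "location,location,location"
b = "location.location,location"
d = levenshtein(a, b)
print(d)                             # 1: a single ',' -> '.' substitution
print(1 - d / max(len(a), len(b)))   # similarity ~0.96
```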

Since it's functioning as designed & documented, I'm closing this as "not a bug".  It may make sense to open a separate "enhancement" issue instead, asking for a different algorithm (or algorithms).
History
Date                 User        Action  Args
2015-10-13 18:01:47  tim.peters  set     recipients: + tim.peters, Lewis Haley, Allan Lewis
2015-10-13 18:01:47  tim.peters  set     messageid: <1444759307.74.0.61011312541.issue25391@psf.upfronthosting.co.za>
2015-10-13 18:01:47  tim.peters  link    issue25391 messages
2015-10-13 18:01:47  tim.peters  create