
Author: yeti-dn
Date:   2002-11-29 10:54:30

The algorithm used for approximate string matching
doesn't find the optimal edit sequence (it greedily
picks the longest matching blocks instead).

Example:

>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher()
>>> sm.set_seqs('axfot', 'aoftax')
>>> sm.ratio()
0.36363636363636365
>>> sm.get_matching_blocks()
[(0, 4, 2), (5, 6, 0)]
>>> sm.get_opcodes()
[('insert', 0, 0, 0, 4), ('equal', 0, 2, 4, 6),
('delete', 2, 5, 6, 6)]

What's wrong?

The Levenshtein distance with weight 2 for item
replacement is only 5 (weight 2 is the cost model
corresponding to what ratio() is supposed to compute;
the classic Levenshtein distance, with weight 1, is 4).
Since the combined length of the two strings is 11, one
would expect a similarity (i.e. ratio()) of
(11-5)/11 = 6/11 = 0.545454545454..., and not
only 4/11.
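
For reference, here is a minimal dynamic-programming
sketch (my own illustration, not difflib code; the helper
name levenshtein is made up) that reproduces these numbers,
with insertions and deletions costing 1 and replacements
costing sub_cost:

def levenshtein(a, b, sub_cost=2):
    # d[i][j] = minimal cost of turning a[:i] into b[:j]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                 # i deletions
    for j in range(1, n + 1):
        d[0][j] = j                 # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,      # delete a[i-1]
                          d[i][j - 1] + 1,      # insert b[j-1]
                          d[i - 1][j - 1] + (0 if same else sub_cost))
    return d[m][n]

>>> levenshtein('axfot', 'aoftax')               # weight 2 for replacement
5
>>> levenshtein('axfot', 'aoftax', sub_cost=1)   # classic distance
4
>>> (11 - 5) / 11.0                              # expected similarity
0.5454545454545454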

Indeed, the maximal matching blocks are:
[(0, 0, 1), (2, 2, 1), (4, 3, 1)]
and the minimal edit sequence is:
[('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2),
('equal', 2, 3, 2, 3), ('delete', 3, 4, 3, 3),
('equal', 4, 5, 3, 4), ('insert', 5, 5, 4, 6)]
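
That this sequence is valid can be checked mechanically by
replaying it; a small sketch (apply_opcodes is a made-up
helper, consuming the same (tag, i1, i2, j1, j2) format as
get_opcodes()):

def apply_opcodes(a, b, opcodes):
    # Rebuild b from a: 'equal' copies from a, 'replace' and
    # 'insert' take text from b, 'delete' contributes nothing.
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            out.append(a[i1:i2])
        elif tag in ('replace', 'insert'):
            out.append(b[j1:j2])
    return ''.join(out)

>>> apply_opcodes('axfot', 'aoftax',
...               [('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2),
...                ('equal', 2, 3, 2, 3), ('delete', 3, 4, 3, 3),
...                ('equal', 4, 5, 3, 4), ('insert', 5, 5, 4, 6)])
'aoftax'

Counting the operations with weight 2 for the replacement
gives 2 + 1 + 2 = 5, the distance claimed above.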

The impact of this "feature" on diff-like
applications may even be positive, because the edit
sequence then consists of a smaller number of operations
on larger chunks.  Thus I'm not sure whether this is
something that should be fixed.  However, it should at
least be noted in the documentation that the ratio()
function gives only a lower bound on the string
similarity (so people like me won't be tempted to use
it to check the results of their own Levenshtein
distance/string similarity implementations).
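
For anyone who wants to compare, the exact similarity that
ratio() bounds from below can be computed from the longest
common subsequence; a sketch (lcs_similarity is a made-up
name):

def lcs_similarity(a, b):
    # similarity = 2*LCS / (len(a) + len(b)); with optimal
    # matching blocks, ratio() would return exactly this.
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return 2.0 * L[m][n] / (m + n)

>>> from difflib import SequenceMatcher
>>> lcs_similarity('axfot', 'aoftax')                  # 6/11, the optimal value
0.5454545454545454
>>> SequenceMatcher(None, 'axfot', 'aoftax').ratio()   # the lower bound
0.36363636363636365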