This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author mcft
Recipients mcft
Date 2018-03-20.20:09:08
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1521576548.54.0.467229070634.issue33112@psf.upfronthosting.co.za>
In-reply-to
Content
difflib.SequenceMatcher fails to make a proper alignment between 2 sequences with only 3 single letter changes. Its performance is completely off with a similarity ratio of 0.16, in stead of the more accurate 0.99.

Here is a snippet to replicate the failure:
>>> aa_ref = 'MTLFTTLLVLIFERLFKLGEHWQLDHRLEAFFRRVKHFSLGRTLGMTIIAMGVTFLLLRALQGVLFNVPTLLVWLLIGLLCIGAGKVRLHYHAYLTAASRNDSHARATMAGELTMIHGVPAGCDEREYLRELQNALLWINFRFYLAPLFWLIVGGTWGPVTLMGYAFLRAWQYWLARYQTPHHRLQSGIDAVLHVLDWVPVRLAGVVYALIGHGEKALPAWFASLGDFHTSQYQVLTRLAQFSLAREPHVDKVETPKAAVSMAKKTSFVVVVVIALLTIYGALV'
>>> aa_seq = 'MTLFTTLLVLIFERLFKLGEHWQLDHRLEAFFRRVKHFSLGRTLCMTIIAMGVTFLLLRALQGVLFNVPTLLVWLLIGLLCIGAGKVRLHYHAYLTAASRNDSHAHATMAGELTMIHGVPAGCDEREYLRELQNALLWINFRFYLAPLFWLIVGGTWGPVTLMGYAFLRAWQYWLARYQTPHHRLQSGIDAVLHALDWVPVRLAGVVYALIGHGEKALPAWFASLGDFHTSQYQVLTRLAQFSLAREPHVDKVETPKAAVSMAKKTSFVVVVVIALLTIYGALV'
>>> sum(a!=b for a, b in zip(aa_ref, aa_seq))
3
>>> match = SequenceMatcher(a=aa_ref, b=aa_seq)
>>> match.ratio()
0.1619718309859155
>>> match.get_opcodes()
[('equal', 0, 43, 0, 43), ('delete', 43, 79, 43, 43), ('equal', 79, 81, 43, 45), ('replace', 81, 122, 45, 80), ('equal', 122, 123, 80, 81), ('replace', 123, 284, 81, 284)]
History
Date User Action Args
2018-03-20 20:09:08mcftsetrecipients: + mcft
2018-03-20 20:09:08mcftsetmessageid: <1521576548.54.0.467229070634.issue33112@psf.upfronthosting.co.za>
2018-03-20 20:09:08mcftlinkissue33112 messages
2018-03-20 20:09:08mcftcreate