Message314163
difflib.SequenceMatcher fails to align two sequences properly when they differ by only 3 single-letter changes. It reports a similarity ratio of 0.16 instead of the expected value of about 0.99.
Here is a snippet to replicate the failure:
>>> aa_ref = 'MTLFTTLLVLIFERLFKLGEHWQLDHRLEAFFRRVKHFSLGRTLGMTIIAMGVTFLLLRALQGVLFNVPTLLVWLLIGLLCIGAGKVRLHYHAYLTAASRNDSHARATMAGELTMIHGVPAGCDEREYLRELQNALLWINFRFYLAPLFWLIVGGTWGPVTLMGYAFLRAWQYWLARYQTPHHRLQSGIDAVLHVLDWVPVRLAGVVYALIGHGEKALPAWFASLGDFHTSQYQVLTRLAQFSLAREPHVDKVETPKAAVSMAKKTSFVVVVVIALLTIYGALV'
>>> aa_seq = 'MTLFTTLLVLIFERLFKLGEHWQLDHRLEAFFRRVKHFSLGRTLCMTIIAMGVTFLLLRALQGVLFNVPTLLVWLLIGLLCIGAGKVRLHYHAYLTAASRNDSHAHATMAGELTMIHGVPAGCDEREYLRELQNALLWINFRFYLAPLFWLIVGGTWGPVTLMGYAFLRAWQYWLARYQTPHHRLQSGIDAVLHALDWVPVRLAGVVYALIGHGEKALPAWFASLGDFHTSQYQVLTRLAQFSLAREPHVDKVETPKAAVSMAKKTSFVVVVVIALLTIYGALV'
>>> sum(a!=b for a, b in zip(aa_ref, aa_seq))
3
>>> from difflib import SequenceMatcher
>>> match = SequenceMatcher(a=aa_ref, b=aa_seq)
>>> match.ratio()
0.1619718309859155
>>> match.get_opcodes()
[('equal', 0, 43, 0, 43), ('delete', 43, 79, 43, 43), ('equal', 79, 81, 43, 45), ('replace', 81, 122, 45, 80), ('equal', 122, 123, 80, 81), ('replace', 123, 284, 81, 284)]
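One possible explanation, not confirmed in this message, is the documented `autojunk` heuristic: when the second sequence has 200 or more items, SequenceMatcher treats any item whose occurrences exceed roughly 1% of that length as "popular" and stops using it to anchor matches. The opcodes above end at 284, so both sequences are over that threshold, and most amino-acid letters repeat many times. The sketch below uses synthetic data rather than the sequences above; `compare_ratios`, `pattern`, `seq_a` and `seq_b` are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher
import string


def compare_ratios(a, b):
    """Return (ratio with default autojunk, ratio with autojunk=False)."""
    default = SequenceMatcher(a=a, b=b).ratio()
    no_autojunk = SequenceMatcher(a=a, b=b, autojunk=False).ratio()
    return default, no_autojunk


# Synthetic input (an assumption, not the data from this report):
# 60 distinct characters repeated 5 times gives a 300-item sequence
# in which most characters occur often enough to be "popular".
pattern = (string.ascii_letters + string.digits)[:60]
seq_a = pattern * 5

# Introduce exactly 3 single-character substitutions, using a
# character that does not occur anywhere in seq_a.
chars = list(seq_a)
for pos in (75, 150, 225):
    chars[pos] = "#"
seq_b = "".join(chars)

mismatches = sum(x != y for x, y in zip(seq_a, seq_b))
ratio_default, ratio_no_autojunk = compare_ratios(seq_a, seq_b)

print("mismatches:", mismatches)
print("ratio (default autojunk):", ratio_default)
print("ratio (autojunk=False):", ratio_no_autojunk)
```

If the same heuristic is responsible here, calling `SequenceMatcher(a=aa_ref, b=aa_seq, autojunk=False)` should bring the ratio much closer to the expected value, though that has not been verified in this message.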
Date | User | Action | Args
2018-03-20 20:09:08 | mcft | set | recipients: + mcft
2018-03-20 20:09:08 | mcft | set | messageid: <1521576548.54.0.467229070634.issue33112@psf.upfronthosting.co.za>
2018-03-20 20:09:08 | mcft | link | issue33112 messages
2018-03-20 20:09:08 | mcft | create |