Author rtvd
Date 2007-03-11.15:28:30
Content
This is a fix for bug #1528074 in difflib's SequenceMatcher, which (among other possible effects) sometimes makes find_longest_match return invalid results.

The previously submitted test case for this bug has ID #1678339.

The find_longest_match and __chain_b methods in SequenceMatcher are heavily optimized. find_longest_match would be both fast and correct if only __chain_b did not break its assumptions.

find_longest_match assumes that b2j maps every non-junk element of b to the list of its indices in b. However, when __chain_b builds b2j, it removes popular elements from the mapping and records them as popular in "populardict". As a result, find_longest_match cannot find the indices of popular elements, and they are effectively treated as junk.
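
A minimal sketch of the symptom, for illustration only. It assumes Python 3.2 or later, where the popularity heuristic can be switched off with the autojunk constructor argument and the purged elements are exposed as bpopular; neither existed when this message was written. The sample strings are made up for the example.

```python
from difflib import SequenceMatcher

# "b" occurs 200 times in a 200-element sequence, so the heuristic
# treats it as popular and drops it from b2j.
a = "a" + "b" * 200
b = "b" * 200

heuristic = SequenceMatcher(None, a, b, autojunk=True)
plain = SequenceMatcher(None, a, b, autojunk=False)

# With the heuristic on, b2j has no entry for "b", so no match is found.
m_heur = heuristic.find_longest_match(0, len(a), 0, len(b))
# With the heuristic off, the full 200-character run is matched.
m_plain = plain.find_longest_match(0, len(a), 0, len(b))

print("b" in heuristic.b2j, "b" in heuristic.bpopular)  # False True
print(m_heur)   # Match(a=0, b=0, size=0)
print(m_plain)  # Match(a=1, b=0, size=200)
```

The second result is the one a caller would expect: a[1:201] and b[0:200] are identical.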

I tried fixing the bug both by changing find_longest_match and by changing __chain_b. No matter how hard I tried, either change slowed matching down by a factor of 5-10. The performance impact of changing find_longest_match was larger, so I decided to send a patch for __chain_b.

Please note that although the method now works properly and the test cases pass on my computer, the patch undoes the ingenious optimizations that were in place, so it would be great if an expert in Python code optimization could try to improve things a bit.

One more point: if the indices are not removed, memory consumption on large strings can become quite high. If this is a serious concern, the fix will need to be made in find_longest_match instead. However, that fix would probably be far less efficient than this one.
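
To give a rough feel for the memory trade-off, the sketch below counts how many indices b2j holds with and without the purge. It again assumes Python 3.2 or later (the autojunk argument postdates this message), and the input string is a made-up example; the count is only a proxy for memory use, not a measurement.

```python
from difflib import SequenceMatcher

# A 10000-element sequence made of two elements, both of them "popular".
b = "ab" * 5000

purged = SequenceMatcher(None, "", b, autojunk=True)
full = SequenceMatcher(None, "", b, autojunk=False)

# Total number of indices stored across all b2j lists.
purged_count = sum(len(idxs) for idxs in purged.b2j.values())
full_count = sum(len(idxs) for idxs in full.b2j.values())

print(purged_count, full_count)
```

Without the purge, and with no junk function, b2j holds one index per element of b, so its size grows linearly with the input.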
History
Date User Action Args
2007-08-23 15:57:30 admin link issue1678345 messages
2007-08-23 15:57:30 admin create