classification
Title: A fix for the bug #1528074 [warning: quite slow]
Type: Stage:
Components: Library (Lib) Versions: Python 2.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: georg.brandl, jimjjewett, jorend, rtvd
Priority: normal Keywords: patch

Created on 2007-03-11 15:28 by rtvd, last changed 2008-01-21 17:50 by georg.brandl. This issue is now closed.

Files
File name Uploaded Description Edit
python-bug-1528074-fix.diff rtvd, 2007-03-11 15:28 A fix for the bug #1528074 (covered by the testcase #1678339)
Messages (6)
msg52175 - (view) Author: Denys Rtveliashvili (rtvd) Date: 2007-03-11 15:28
This is a fix for the bug #1528074 in the difflib's SequenceMatcher which makes (among other possible things) find_longest_match return invalid results sometimes.

The previously submitted test case for this bug has the #1678339 ID.

The find_longest_match and __chain_b methods in the SequenceMatcher are perfectly optimized. The find_longest_match would work both fast and correctly if only the __chain_b did not break it's assumptions.

The find_longest_match assumes that the b2j mapping has a mapping of all non-junk elements in b to lists of their indices in the "b" list. However, when __chain_b creates the b2j mapping, it removes popular elements from the list and marking the elements as popular in the "populardict". As a result, the find_longest_match method can't find the indices for the popular elements and they become automatically considered as something like a junk.

I tried to fix the bug by both changing the find_longest_match and __chain_b methods. No matter how hard I tried, the change dropped the performance and slowed down the matching by 5-10 times. The impact of find_longest_match method was larger, so I decided to send a patch for the __chain_b.

Please, note, that even though the method starts to work properly and the test cases pass on my computer just fine, the ingenious optimizations performed before become broken, so it would be great if a guru in Python code optimization tries to improve the things a bit.

One more point: if the indices are not removed, the memory consumption on the large strings can become quite great. If this is a serious concern, a fix in the find_longest_match will need to be done instead. However that fix would probably be far less efficient that this one.
msg52176 - (view) Author: Jim Jewett (jimjjewett) Date: 2007-03-19 01:59
I think the bug is documentation only.  

According to the comments (but not the docstring) find_longest_match returns the longest _interesting_ match.  A single element appearing too often is likely to cause spurious matches, and is therefore not interesting.

I do agree that this should be noted more prominently, so people don't try things like comparing text strings letter-by-letter (where 1% is far too low a threshhold for a 26-character alphabet).

And yes, the comments on popular are correct -- it ignores elements which constitute *more* than 1%.  

I recommend closing this set of tracker items. If you could submit changes to the documentation (docstrings and/or help files; maybe even the comments), I would recommend applying them.
msg52177 - (view) Author: Jason Orendorff (jorend) Date: 2007-04-20 21:13
Recommend closing this.
msg52178 - (view) Author: Denys Rtveliashvili (rtvd) Date: 2007-04-21 06:34
I am sure this is NOT a bug in the documentation. There is no such thing as "interesting" match unless it can be described mathematically. Instead, the documentation 
specifies _precisely_ what exactly the function must do in a very precise manner.

The problem is that it does a different thing.

Another issue of this algorithm is that it uses a quite slow algorithm. It's computational complexity is O(n*m). Another one based on generalized suffix trees has a much more modest complexity O(n+m). The details are here:
http://en.wikipedia.org/wiki/Longest-common_substring_problem
msg52179 - (view) Author: Jason Orendorff (jorend) Date: 2007-04-23 12:06
rtvd,  Thanks for taking the time to post this patch and participate in the discussion.

This patch changes intentional, by-design behavior.  It would hurt difflib's performance and reduce its usability, by making the matches worse.  Subjectively worse, admittedly, but with many programmers sharing similar subjective views.  Read the comments in the source code and you'll get the gist.

And even a mathematician would dispute your assertion that there's no such thing as "interesting".  That's a pretty extreme view.
msg61429 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-21 17:50
Rejecting as per discussion.
History
Date User Action Args
2008-01-21 17:50:48georg.brandlsetstatus: open -> closed
nosy: + georg.brandl
resolution: rejected
messages: + msg61429
2007-03-11 15:28:30rtvdcreate