Author eli.bendersky
Recipients LambertDW, eli.bendersky, georg.brandl, ggenellina, gjb1002, hagna, janpf, jimjjewett, mrotondo, pitrou, r.david.murray, rtvd, sjmachin, terry.reedy, tim.peters, vbr
Date 2010-07-07.02:40:11
SpamBayes Score 0.0
Marked as misclassified No
Message-id <AANLkTikQa-LXUqa6nzKoEHbIN6MOLdP7C3OpDQKvcJq1@mail.gmail.com>
In-reply-to <1278458304.94.0.760078714631.issue2986@psf.upfronthosting.co.za>
Content
Thanks!
Now let's see what the other devs say. The first response seems not to have
understood what you meant completely :-)

Eli

On Wed, Jul 7, 2010 at 01:18, Terry J. Reedy <report@bugs.python.org> wrote:

>
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
>
> [Also posted to pydev for additional input, with Subject line
> Issue 2986: difflib.SequenceMatcher is partly broken
> Developed with input from Eli Bendersky, who will write patchfile(s) for
> whichever change option is chosen.]
>
> Summary: difflib.SeqeunceMatcher was developed, documented, and originally
> operated as "a flexible class for comparing pairs of sequences of any
> [hashable] type". An "experimental" heuristic was added in 2.3a1 to speed up
> its application to sequences of code lines, which are selected from an
> unbounded set of possibilities. As explained below, this heuristic partly to
> completely disables SequenceMatcher for realistic-length sequences from a
> small finite alphabet. The regression is easy to fix. The docs were never
> changed to reflect the effect of the heuristic, but should be, with whatever
> additional change is made.
>
> In the commit message for revision 26661, which added the heuristic, Tim
> Peters wrote "While I like what I've seen of the effects so far, I still
> consider this experimental.  Please give it a try!" Several people who have
> tried it discovered the problem with small alphabets and posted to the
> tracker. Issues #1528074, #1678339. #1678345, and #4622 are now-closed
> duplicates of #2986. The heuristic needs revision.
>
> Open questions (discussed after the examples): what exactly to do, which
> versions to do it too, and who will do it.
>
> ---
> Some minimal difference examples:
>
> from difflib import SequenceMatcher as SM
>
> # base example
> print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
> # should be and is 0.9975 (rounded)
>
> # make 'y' junk
> print(SM(lambda c:c=='y', 'x' + 'y'*199, 'y'*199).ratio())
> # should be and is 0.0
>
> # Increment b by 1 char
> print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
> # should be .995, but now is 0.0 because y is treated as junk
>
> # Reverse a and b, which increments b
> print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
> # should be .9975, as before, but now is 0.0 because y is junked
>
> The reason for the bug is the heuristic: if the second sequence is at least
> 200 items long then any item occurring more than one percent of the time in
> the second sequence is treated as junk. This was aimed at recurring code
> lines like 'else:' and 'return', but can be fatal for small alphabets where
> common items are necessary content.
>
> A more realistic example than the above is comparing DNA gene sequences.
> Without the heuristic SequenceMatcher.get_opcodes() reports an appropriate
> sequence of matches and edits and .ratio works as documented and expected.
>  For 1000/2000/6000 bases, the times on a old Athlon 2800 machine are
> <1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic
> and practical use.
>
> With the heuristic, everything is junk and there is only one match, ''==''
> augmented by the initial prefix of matching bases. This is followed by one
> edit: replace the rest of the first sequence with the rest of the second
> sequence. A much faster way to find the first mismatch would be
>   i = 0
>   while first[i] == second[i]:
>      i+=1
> The match ratio, based on the initial matching prefix only, is spuriously
> low.
>
> ---
> Questions:
>
> 1: what change should be make.
>
> Proposed fix: Disentangle the heuristic from the calculation of the
> internal b2j dict that maps items to indexes in the second sequence b. Only
> apply the heuristic (or not) afterward.
>
> Version A: Modify the heuristic to only eliminate common items when there
> are more than, say, 100 items (when len(b2j)> 100 where b2j is first
> calculated without popularity deletions).
>
> The would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone.
> On the other hand, realistic sequences of more than 200 code lines should
> have at least 100 different lines, and so the heuristic should continue to
> be applied when it (mostly?) 'should' be. This change leaves the API
> unchanged and does not require a user decision.
>
> Version B: add a parameter to .__init__ to make the heuristic optional. If
> the default were True ('use it'), then the code would run the same as now
> (even when bad). With the heuristic turned off, users would be able to get
> the .ratio they may expect and need. On the other hand, users would have to
> understand the heuristic to know when and when not to use it.
>
> Version C: A more radical alternative would be to make one or more of the
> tuning parameters user settable, with one setting turning it off.
>
> 2. What type of issue is this, and what version get changed.
>
> I see the proposal as partial reversion of a change that sometimes causes a
> regression, in order to fix the regression. Such would usually be called a
> bugfix. Other tracker reviewers claim this issue is a feature request, not a
> bugfix. Either way, 3.2 gets the fix. The practical issue is whether at
> least 2.7(.1) should get the fix, or whether the bug should forever continue
> in 2.x.
>
> 3. Who will make the change.
>
> Eli will write a patch and I will check it. However, Georg Brandel assigned
> the issue to Tim Peters, with a request for comment, but Tim never
> responded. Is there an active committer who will grab the issue and do a
> commit review when a patch is ready?
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue2986>
> _______________________________________
>
Files
File name Uploaded
unnamed eli.bendersky, 2010-07-07.02:40:05
History
Date User Action Args
2010-07-08 02:18:04terry.reedyunlinkissue2986 messages
2010-07-07 02:40:13eli.benderskysetrecipients: + eli.bendersky, tim.peters, georg.brandl, terry.reedy, jimjjewett, sjmachin, gjb1002, ggenellina, pitrou, rtvd, vbr, LambertDW, hagna, r.david.murray, janpf, mrotondo
2010-07-07 02:40:11eli.benderskylinkissue2986 messages
2010-07-07 02:40:11eli.benderskycreate