This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: difflib.SequenceMatcher not matching long sequences
Type: behavior Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder: difflib.SequenceMatcher: expose junk sets, deprecate undocumented isb... functions.
View: 10534
Assigned To: terry.reedy Nosy List: LambertDW, barry, eli.bendersky, georg.brandl, ggenellina, gjb1002, hagna, hodgestar, janpf, jcea, jimjjewett, mrotondo, pitrou, r.david.murray, rtvd, sjmachin, terry.reedy, tim.peters, vbr
Priority: high Keywords: patch

Created on 2008-05-27 20:29 by hagna, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
difflib_test_inq.py vbr, 2010-04-19 23:25 test file for difflib.SequenceMatcher comparing strings with minimal differences
issue2986.docs26.1.patch eli.bendersky, 2010-07-24 04:59 review
issue2986.fix27.4.patch eli.bendersky, 2010-09-03 04:45
issue2986.docs31.1.patch eli.bendersky, 2010-11-08 05:10 review
issue2986.fix27.5.patch eli.bendersky, 2010-11-11 08:24
issue2986.fix32.5.patch hodgestar, 2010-11-20 15:18 Version of issue2986.fix27.5.patch that applies and passes tests in Python 3.2a.
Pull Requests
URL Status Linked
PR 17082 closed python-dev, 2019-11-07 16:24
Messages (37)
msg67428 - (view) Author: Nate (hagna) Date: 2008-05-27 20:29
The following code shows no matches though the strings clearly match.

from difflib import SequenceMatcher

a = \
'''3904320338155955662857322172779218727992471109386112515279452352973279311752006856588512503244702012502812653160306927721351031250270279878152125021081471125246894603319162986283456469448293252335442814953964029718671705515246437056879456095915444174665464026255415736754542680178373675412998898571410483714801783736754144828361714801783736754133068408714801783736754140859665714801783736754153851004471480178373675415715864371410690714801783736754147488890714801783736205957668017837367545448801783104170539154677705102536314736754477780178373675415217103227148017837367541737811137714801783736754172791151671480178373675417692995271480178373675417575983571480178373675417398965871480178310417055026467770551235573705687945609591544562532964082675415736300610425832914520311514810301595721999571547897879113780178373618951021983280377781981989237498913678981414213198924949892679989164882577810944751102884217048258978791137801783104170511836542073627327981801279360326159714801783736171798080178310415420736447510213871790638471586131412631592131012571210126718031314200414571314893700123874777987006697747115770067074789312578013869801783104120529166337056879456095918495136604565251349544838956219513495753741344870733943253617458316356794745831634651172458316348316144586052838244151360641656349118903581890331689038658903263218549028909605134957536316060'''
b = \
'''4634320338155955662857322172779218727992471109386112515279452352973279311752006856588512503244702012502812653160306927721351031250270279878152125021081471125246894603319162986283456469448293252335442814953964029718671705515246437056879456095915444174665464026255415736754542680178373675412998898571410483714801783736754144828361714801783736754133068408714801783736754140859665714801783736754153851004471480178373675415715864371410690714801783736754147488890714801783736205957668017837367545448801783104170539154677705102536314736754477780178373675413182108117148017837367541737811137714801783736754172791151671480178373675417692995271480178373675417575983571480178373675417398965871480178310417055026467770551235573705687945609591544562532964082675415736300610425832914520311514810301595721999571547897879113780178373618951021983280377781981989237498913678981414213198924949892679989164882577810944751102884217048258978791137801783104170511836542073627327981801279360326159714801783736171798080178310415420736447510213871790638471412131420041457131485122165131466702097131466731723131466741536131466751581131466771649131466761975131467212090131467261974131467231858131467201556131467212538131467221553131467221943131467231748131466711452131467271787131412578013869801783104154307361718482280178373638585436251621338931320893185072980138084820801545115716861861152948618615002682261422349251058108327767521397977810837298017831041205291663370568794560959184951366045652513495448389562195134957537413448707339432536174583163'''
lst = [(a, b)]
for a, b in lst:
    print("---------------------------")
    s = SequenceMatcher(None, a, b)
    print("length of a is %d" % len(a))
    print("length of b is %d" % len(b))
    print(s.find_longest_match(0, len(a), 0, len(b)))
    print(s.ratio())
    for block in s.get_matching_blocks():
        m = a[block[0]:block[0] + block[2]]
        print("a[%d] and b[%d] match for %d elements and it is \"%s\""
              % (block[0], block[1], block[2], m))
msg84387 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-03-29 12:59
Tim, I think you've had some enlightening comments about difflib issues
in the past.
msg84446 - (view) Author: Mike Rotondo (mrotondo) Date: 2009-03-30 00:40
From the source, it seems that undocumented behavior in
SequenceMatcher is causing this error. If b is at least 200
elements long, it will consider any element x in b that makes up more than
1% of its contents as "popular", and thus junk.

So, in this case, difflib is treating each individual digit as an
element of your sequences, and each one takes up more than 1% of the
complete sequence b. Therefore, each one is "popular", and therefore
ignored.

A snippet which demonstrates this:

from difflib import SequenceMatcher

for i in range(1, 202, 10):
    a = "a" * i
    b = "b" + "a" * i
    s = SequenceMatcher(None, a, b)
    print(s.find_longest_match(0, len(a), 0, len(b)))

Up to i=200, the strings match, but afterwards they do not because "a"
is "popular". 

Strangely, if you get rid of the "b" at the beginning of b, they
continue to match at lengths greater than 200. This may be a bug, I'll
keep looking into it but someone who knows more should probably take a
look too.

The comments from difflib.py say some interesting things:
 # b2j also does not contain entries for "popular" elements, meaning 
 # elements that account for more than 1% of the total elements, and
 # when the sequence is reasonably large (>= 200 elements); this can
 # be viewed as an adaptive notion of semi-junk, and yields an enormous
 # speedup when, e.g., comparing program files with hundreds of
 # instances of "return NULL;"

This seems to mean that you won't actually get an accurate diff in
certain cases, which seems odd. At the very least, this behavior should
probably be documented. Do people think it should be changed to get rid
of the "popularity" heuristic?
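The rule described in that comment can be sketched outside difflib. The helper below is hypothetical (`popular_elements` is not a difflib function) and follows the comment's wording, "at least 200 elements" and "more than 1%", rather than the module's exact bookkeeping:

```python
from collections import Counter


def popular_elements(b, min_len=200, max_share=0.01):
    """Return the elements of b that the comment would call "popular".

    Hypothetical helper: it applies the rule as worded in the comment,
    which may differ slightly from what difflib itself computes.
    """
    n = len(b)
    if n < min_len:
        return set()          # short sequences are never filtered
    counts = Counter(b)
    return {elt for elt, count in counts.items() if count > max_share * n}


digits = "0123456789" * 30                    # 300 characters, each digit is 10%
print(sorted(popular_elements(digits)))       # every digit qualifies
print(popular_elements("0123456789" * 10))    # only 100 characters: nothing does
```

With a ten-symbol alphabet every symbol clears the 1% bar as soon as the sequence is long enough, which is why the original report sees no matches at all.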
msg84449 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-03-30 01:33
On Mon, 30 Mar 2009 at 00:40, Mike Rotondo wrote:
> This seems to mean that you won't actually get an accurate diff in
> certain cases, which seems odd. At the very least, this behavior should
> probably be documented. Do people think it should be changed to get rid
> of the "popularity" heuristic?

A better way, I think, would be to provide a way to turn
it off (and then document it, of course).
msg93438 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-02 10:58
The popularity heuristic could be tuned to depend on the number N of
distinct elements in the sequence, and kick in if an element appears say
more than 1/(N**0.5) of the time.
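A minimal sketch of that suggestion, assuming "of the time" means an element's share of the sequence. The function name is made up for illustration, and this rule was never adopted in difflib:

```python
from collections import Counter


def popular_by_sqrt(b):
    """Elements whose share of b exceeds 1/sqrt(N), N = distinct elements.

    Hypothetical illustration of the proposal above only.
    """
    n = len(b)
    if n == 0:
        return set()
    counts = Counter(b)
    threshold = 1.0 / (len(counts) ** 0.5)
    return {elt for elt, count in counts.items() if count / n > threshold}


# Ten distinct digits: threshold is about 31.6%, each digit is only 10%.
print(popular_by_sqrt("0123456789" * 30))

# 151 distinct "lines": threshold is about 8.1%, the repeated line is 25%.
lines = ["return NULL;"] * 50 + ["line %d" % i for i in range(150)]
print(popular_by_sqrt(lines))
```

Under this rule a small alphabet raises the threshold, so evenly distributed symbols stay in play while a heavily repeated code line is still filtered.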
msg103660 - (view) Author: Vlastimil Brom (vbr) Date: 2010-04-19 23:25
I just stumbled on some seemingly different unexpected behaviour of
difflib.SequenceMatcher, but it turns out it may have the same cause, i.e. the "popular" heuristic.
I hopefully managed to replicate it on an illustrative sample text, included in the attached file. (I also mentioned this issue on the python-list
http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as there were no replies, I eventually found that this might be a more appropriate place.)
Both strings differ in a minimal way, each having one extra character
in a "strategic" position, which probably meets some pathological case
for difflib.
Instead of just reporting the insertion and deletion of these single
characters (which works well for most cases - with most other
positions of the differing characters), the output of the
SequenceMatcher decides to delete a large part of the string in
between the differences and to insert the almost same text after that.
The attached code simply prints the results of the comparison with the
respective tags, and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on Windows XP SP3.
I didn't find any plausible mentions of such cases in the documentation, but after some searching I found several reports in the bug tracker mentioning the erroneous output of SequenceMatcher on longer repetitive sequences.

besides this
http://bugs.python.org/issue2986
e.g.
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074

In my case, disabling the "popular" heuristics as mentioned by John Machin in
http://bugs.python.org/issue1528074#msg29269

seems to have solved the problem; with a modified version of difflib containing:

                if 0:   # disable popular heuristics
                    if n >= 200 and len(indices) * 100 > n:
                        populardict[elt] = 1
                        del indices[:]

the comparison catches the differences in the test strings as expected, i.e. one character addition and one deletion only. It is likely that some other use cases for difflib rely on the "popular" heuristic, but it also seems useful to have some control over this behaviour, which might not be appropriate in all cases.
(The issue seems to be the same in python 2.5, 2.6 and 3.1.)

regards,
   vbr
msg108636 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-06-25 21:56
This appears to be one of at least three duplicate issues: #1528074, #2986, and #4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) #1711800 appears to be slightly different (if not, it could be closed also.)

Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great.
msg108856 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-06-28 19:35
The discussion on #1528074 references two other closed tracker issues:
#1678339 Test case that currently fails
#1678345 Patch to change behavior - rejected because crippled behavior is supposedly intentional and removing the change would slow things down.

The patch simply removes the internal heuristic. I think a better patch would be to make it optional, with a tunable popularity threshold.

I say 'supposedly intentional' because the code comments only justify the popularity hack for code line comparison and give no indication of awareness that it disables SequenceMatcher for general purpose use, and in particular, for non-toy finite character set comparisons of the type (ascii) used in all the examples.
msg109090 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-07-02 07:16
The new "junk heuristic" was added to difflib.py in SVN revision 26661 in 2002 (which is, incidentally, the last revision to modify difflib.py). Its commit log says:

---------------------------------------------
Mostly in SequenceMatcher.{__chain_b, find_longest_match}:
This now does a dynamic analysis of which elements are so frequently
repeated as to constitute noise.  The primary benefit is an enormous
speedup in find_longest_match, as the innermost loop can have factors
of 100s less potential matches to worry about, in cases where the
sequences have many duplicate elements.  In effect, this zooms in on
sequences of non-ubiquitous elements now.

While I like what I've seen of the effects so far, I still consider
this experimental.  Please give it a try!
---------------------------------------------
msg109442 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-06 23:18
[Also posted to pydev for additional input, with Subject line
Issue 2986: difflib.SequenceMatcher is partly broken
Developed with input from Eli Bendersky, who will write patchfile(s) for whichever change option is chosen.]

Summary: difflib.SequenceMatcher was developed, documented, and originally operated as "a flexible class for comparing pairs of sequences of any [hashable] type". An "experimental" heuristic was added in 2.3a1 to speed up its application to sequences of code lines, which are selected from an unbounded set of possibilities. As explained below, this heuristic partly to completely disables SequenceMatcher for realistic-length sequences from a small finite alphabet. The regression is easy to fix. The docs were never changed to reflect the effect of the heuristic, but should be, with whatever additional change is made.

In the commit message for revision 26661, which added the heuristic, Tim Peters wrote "While I like what I've seen of the effects so far, I still consider this experimental.  Please give it a try!" Several people who have tried it discovered the problem with small alphabets and posted to the tracker. Issues #1528074, #1678339, #1678345, and #4622 are now-closed duplicates of #2986. The heuristic needs revision.

Open questions (discussed after the examples): what exactly to do, which versions to do it to, and who will do it.

---
Some minimal difference examples:

from difflib import SequenceMatcher as SM

# base example
print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.9975 (rounded)

# make 'y' junk
print(SM(lambda c:c=='y', 'x' + 'y'*199, 'y'*199).ratio())
# should be and is 0.0

# Increment b by 1 char
print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
# should be .995, but now is 0.0 because y is treated as junk

# Reverse a and b, which increments b
print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
# should be .9975, as before, but now is 0.0 because y is junked

The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.
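The "should be" figures in the examples follow from the formula documented for ratio(), 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. A small worked check (the helper name is only for illustration):

```python
def expected_ratio(matches, len_a, len_b):
    """2*M/T, the similarity formula documented for SequenceMatcher.ratio()."""
    return 2.0 * matches / (len_a + len_b)


print(expected_ratio(199, 200, 199))  # base example: 398/399, about 0.9975
print(expected_ratio(199, 200, 200))  # b one char longer: 398/400 = 0.995
print(expected_ratio(0, 200, 200))    # heuristic discards every 'y': 0.0
```

In each case the 199 shared 'y' characters are the only possible matches, so a result of 0.0 means none of them were considered.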

A more realistic example than the above is comparing DNA gene sequences. Without the heuristic SequenceMatcher.get_opcodes() reports an appropriate sequence of matches and edits and .ratio works as documented and expected.  For 1000/2000/6000 bases, the times on an old Athlon 2800 machine are <1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic and practical use.

With the heuristic, everything is junk and there is only one match, ''=='' augmented by the initial prefix of matching bases. This is followed by one edit: replace the rest of the first sequence with the rest of the second sequence. A much faster way to find the first mismatch would be
   i = 0
   while i < min(len(first), len(second)) and first[i] == second[i]:
      i += 1
The match ratio, based on the initial matching prefix only, is spuriously low.

---
Questions:

1. What change should be made.

Proposed fix: Disentangle the heuristic from the calculation of the internal b2j dict that maps items to indexes in the second sequence b. Only apply the heuristic (or not) afterward.

Version A: Modify the heuristic to only eliminate common items when there are more than, say, 100 items (when len(b2j)> 100 where b2j is first calculated without popularity deletions).

This would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone. On the other hand, realistic sequences of more than 200 code lines should have at least 100 different lines, and so the heuristic should continue to be applied when it (mostly?) 'should' be. This change leaves the API unchanged and does not require a user decision.
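Version A could look roughly like the following. This is a sketch under stated assumptions: `build_b2j` and its parameters are hypothetical names, and the popularity test uses the "more than one percent" wording from the summary above rather than difflib's exact code.

```python
def build_b2j(b, min_distinct=100, min_len=200):
    """Map each element of b to its indexes, then optionally drop popular ones.

    Hypothetical sketch of "Version A": the popularity purge only runs
    when b contains more than min_distinct different elements.
    """
    # First pass: full mapping, with no popularity deletions.
    b2j = {}
    for i, elt in enumerate(b):
        b2j.setdefault(elt, []).append(i)

    # Second pass: apply the heuristic only to large alphabets.
    n = len(b)
    if n >= min_len and len(b2j) > min_distinct:
        popular = [elt for elt, idxs in b2j.items() if len(idxs) * 100 > n]
        for elt in popular:
            del b2j[elt]
    return b2j


dna = "ACGT" * 100                       # 4 distinct items: left alone
print(sorted(build_b2j(dna)))

lines = ["return NULL;"] * 20 + ["line %d" % i for i in range(300)]
print("return NULL;" in build_b2j(lines))  # 301 distinct items: purged
```

Separating the two passes is what makes the alphabet-size check possible, since the number of distinct items is only known once the whole mapping has been built.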

Version B: add a parameter to .__init__ to make the heuristic optional. If the default were True ('use it'), then the code would run the same as now (even when bad). With the heuristic turned off, users would be able to get the .ratio they may expect and need. On the other hand, users would have to understand the heuristic to know when and when not to use it. 

Version C: A more radical alternative would be to make one or more of the tuning parameters user settable, with one setting turning it off.

2. What type of issue is this, and which versions get changed.

I see the proposal as partial reversion of a change that sometimes causes a regression, in order to fix the regression. Such would usually be called a bugfix. Other tracker reviewers claim this issue is a feature request, not a bugfix. Either way, 3.2 gets the fix. The practical issue is whether at least 2.7(.1) should get the fix, or whether the bug should forever continue in 2.x.

3. Who will make the change.

Eli will write a patch and I will check it. However, Georg Brandl assigned the issue to Tim Peters, with a request for comment, but Tim never responded. Is there an active committer who will grab the issue and do a commit review when a patch is ready?
msg109507 - (view) Author: Vlastimil Brom (vbr) Date: 2010-07-07 23:17
I guess, I am not supposed to post to python-dev - not being a python developer, hopefully it is appropriate to add a comment here - only based on my current usage of (a modified) difflib.SequenceMatcher.
It seems, the mentions of text comparison in that thread, e.g. 
http://mail.python.org/pipermail/python-dev/2010-July/101515.html
etc. rather imply line-by-line comparison, and possibly character comparison of matched lines.
For me the direct character-wise comparison is more useful in most cases.
With the popular heuristics disabled the results look pretty well.
(the script only involves changing the background colour of the compared texts - based on the SequenceMatcher - get_opcodes() )
Just now, I only need to disable the popular check, currently I use a monkey-patched subclass of SequenceMatcher with extended signature and modified __chain_b function.
cf. http://mail.python.org/pipermail/python-list/2010-June/1247907.html

I would vote for extending the SequenceMatcher API to enable adjustments (leaving the default values as the current ones) - enable/disable popular check, set the thresholds for string length and "popular" frequency (and eventually other parameters, which might be added).

Are there some restrictions on API changes in a library due to a moratorium - even if the default behaviour remains unchanged?
Otherwise, what might be the disadvantages of this approach?
If the current behaviour is considered appropriate for the original usecases, other uses would be also made possible/easier - only at the cost of learning the meaning of the added parameters - from the enhanced docs, of course.

vbr
msg109636 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-08 22:47
Anyone can post on Python-dev, but non-developers should do so judiciously and with respect for the purpose of the list. It is also polite to introduce oneself with the first post. In any case, Tim Peters has approved making some change. The remaining question is exactly what.

There is no problem with extending the API in 3.2. The debate there is over 2.7.

My fourth proposal, detailed on pydev, is to introduce a fourth parameter, 'common', to set the frequency threshold to None or int 1-99.
msg109639 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-08 22:52
> There is no problem with extending the API in 3.2. The debate there is
> over 2.7.

We could extend the API as long as it stays backwards-compatible (that
is, the default value for the new argument produces the same behaviour
as before).
msg109654 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-09 01:12
My proposal F, to expose the common frequency threshold as a fourth positional parameter with default 1, would do that: repeat current behavior. We should, and Eli and I would, add some of the anomalous cases to the test suite and verify that the default is to reproduce the current anomalies, and that passing None changes the result.

Any opinions, anyone, on 'common', 'thresh', 'threshold', or anything else as the new parameter name?

We will have to explain in the doc patch that the parameter is new in 2.7.1 to fix a partial bug and that giving any explicit value will make code not run with 2.7 (.0).

Exposing the set of common values as an instance attribute, as I proposed on pydev, would be a new feature not needed to fix the bug. So it should be limited to 3.2.
msg110251 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-14 01:45
[copied from pydev post]

Summary: adding an autojunk heuristic to difflib without also adding a way to turn it off was a bug because it disabled running code.

2.6 and 3.1 each have, most likely, one final version each. Don't fix for these but add something to the docs explaining the problem and future fix.

2.7 will have several more versions over several years and will be used by newcomers who might encounter the problem but not know to diagnose it and patch a private copy of the module. So it should have a fix.  Solutions thought of so far.

1. Modify the heuristic to somewhat fix the problem. Bad (unacceptable) because this would silently change behavior and could break tests.

2. Add a parameter that defaults to using the heuristic but allows turning it off. Perhaps better, but code that used the new API would crash if run on 2.7.0

3.
Tim Peters
> Think the most pressing thing is to give people a way to turn the damn
> thing off.  An ugly way would be to trigger on an unlikely
> input-output behavior of the existing isjunk argument.  For example,
> if
> 
>      isjunk("what's the airspeed velocity of an unladen swallow?")
> 
> returned
> 
>      "don't use auto junk!"
> 
> and 2.7.1 recognized that as meaning "don't use auto junk", code could
> be written under 2.7.1 that didn't blow up under 2.7.  It could
> _behave_ differently, although that's true of any way of disabling the
> auto-junk heuristics.

Ugly, but perhaps crazy brilliant. Use of such a hack would obviously be temporary. Perhaps its use could be made to issue a -3 warning if such were enabled.

I would simplify the suggestion to something like
    isjunk("disable!heuristic") == True
so one could pass
    lambda s:s=="disable!heuristic"
It should be something easy to document and write. This issue is the only place such a string should appear, so it should be safe.
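For illustration only, the check such a hack would need inside the matcher might look like this. Everything here is hypothetical: the function name is invented, and no released difflib inspects isjunk this way.

```python
SENTINEL = "disable!heuristic"  # the string suggested in this message


def autojunk_wanted(isjunk):
    """Return False if isjunk signals that the heuristic should be skipped.

    Hypothetical sketch of the proposal; not part of any difflib release.
    """
    if isjunk is None:
        return True
    # Only an explicit True for the sentinel string disables the heuristic.
    return isjunk(SENTINEL) is not True


print(autojunk_wanted(None))                                 # True
print(autojunk_wanted(lambda s: s == "disable!heuristic"))   # False
print(autojunk_wanted(lambda s: s in " \t"))                 # True
```

Code written this way would still run on 2.7.0, since passing an isjunk function is already supported there, but it would silently keep the heuristic on.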

Tim and Antoine: if you two can agree on what to do for 2.7, Eli and I will code it.

This suggestion amounts to a suggestion that the fix for 2.7 be decoupled from a better fix for 3.2. I agree. The latter can be discussed once 2.7 is settled.
msg110261 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-14 09:06
Le mercredi 14 juillet 2010 à 01:45 +0000, Terry J. Reedy a écrit :
> 
> 2. Add a parameter that defaults to using the heuristic but allows
> turning it off. Perhaps better, but code that used the new API would
> crash if run on 2.7.0

Yes, but this is an exceptional situation. We normally don't add new
APIs in bugfix versions. We'll have to live with it.

> 3.
> [...]
> Ugly, but perhaps crazy brilliant. Use of such a hack would obviously
> be temporary. Perhaps its use could be made to issue a -3 warning if
> such were enabled.

It's still incredibly ugly. Besides, code written for 2.7.1 might not
"blow up" with 2.7, but it will still have different behaviour.
If you are using the new parameter, it's because you *need* it, hence
different behaviour will be unacceptable; therefore, better to raise an
error as the API change proposal does.
msg111372 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-23 18:31
For 2.6 and 3.1, this is a documentation only issue.
For 2.7, this is a doc + behavior issue.
For 3.2, this is a doc + behavior + new feature issue.

For 2.6.6 (release candidate due Aug 2, 10 days), I propose to add the following paragraph after the current 'Timing:' paragraph in the SequenceMatcher entry ('Heuristic:' should be bold-faced, like 'Timing:')

Heuristic: To speed matching, items that appear more than 1% of the time in sequences of at least 200 items are treated as junk. This has the unfortunate side-effect of giving bad results for sequences constructed from a small set of items. An option to turn off the heuristic will be added to a future version.

I would have said 'to 2.7.1' but that has not happened yet. I thought about putting the heuristic paragraph first, but I think it fits better after the discussion of quadratic run time. I think it should be a separate paragraph and not tacked on the end of the previous paragraph so people will be more likely to take notice.

I have marked this a release blocker because at least 6 issues have been filed for this bug and so I think it important that the explanation be added to the next released doc. I plan to temporarily reassign this to docs@python in a few days.
msg111425 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-07-24 04:59
Here's a patch for Doc/library/difflib.rst of the 2.6 branch, following Terry's suggested addition to the docs of the SequenceMatcher class.

Tested 'make html'.
msg112116 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-31 07:06
Deferring to after 3.2a1.
msg112120 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-31 08:00
Committed 2.6 patch in r83314.
msg112490 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2010-08-02 16:17
Georg committed this patch to the 2.6 tree, and besides, this doesn't seem like a blocking issue, so I'm kicking 2.6 off the list and knocking the priority down.
msg115335 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-09-01 21:32
While refactoring the code for 2.7, I discovered that the description of the heuristic for 2.6 and in the code comments is off by 1. "items that appear more than 1% of the time" should actually be "items whose duplicates (after the first) appear more than 1% of the time". The discrepancy arises because in the following code

        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

len(indices) is retrieved *before* the index i of the current elt is added. Whatever one might think the heuristic 'should' have been (and by the nature of heuristics, there is no right answer), the default behavior must remain as it is, so we adjusted the code and doc to match that.
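The off-by-one can be made concrete with a small predicate. The helper name is made up, and the cross-check at the end assumes a Python version (3.2 or later) in which SequenceMatcher exposes a `bpopular` attribute:

```python
from difflib import SequenceMatcher


def is_popular(count, n):
    """Popularity test as the loop above effectively applies it.

    An element seen `count` times in a sequence of length n is purged
    when its duplicates after the first (count - 1) exceed 1% of n.
    """
    return n >= 200 and (count - 1) * 100 > n


# With n == 200, three occurrences (1.5% of the items) survive; four do not.
print(is_popular(3, 200))  # False
print(is_popular(4, 200))  # True

# Cross-check against difflib itself (assumes the bpopular attribute exists).
b = ["y"] * 4 + list(range(196))   # 200 items, "y" appears four times
print(SequenceMatcher(None, [], b).bpopular)
```

So an item that makes up 1.5% of a 200-item sequence is kept even though the documentation wording "more than 1% of the time" suggests it would be dropped.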
msg115419 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-09-03 04:45
Attaching a patch (developed jointly with Terry Reedy) for 2.7 that adds an 'autojunk' parameter to SequenceMatcher's constructor. The parameter is True by default which retains the current behavior in 2.6 and earlier, but can be set by the user to False to disable the popularity heuristic. The patch also fixes some documentation inconsistencies that Terry raised in this message.

Notes:
1. Tests run successfully. Added new test class in test_difflib for testing with the new autojunk parameter False
2. Patch generated vs. Hg mirror
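As a usage sketch, the new keyword applied to one of the minimal examples from earlier in this issue (requires a Python version that includes the patch, i.e. 2.7.1 or 3.2 and later):

```python
from difflib import SequenceMatcher

a = "x" + "y" * 199
b = "y" * 200

default = SequenceMatcher(None, a, b)                 # autojunk defaults to True
exact = SequenceMatcher(None, a, b, autojunk=False)   # heuristic disabled

print(default.ratio())  # 0.0: every "y" in b is treated as popular
print(exact.ratio())    # 0.995
```

Passing autojunk on an interpreter without the patch raises a TypeError for the unexpected keyword, which matches the preference expressed above for an error over silently different results.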
msg115787 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-09-07 16:02
The patch changes the internal function that constructs the dict mapping b items to indexes to read as follows:
  create b2j mapping
  if isjunk function, move junk items to junk set
  if autojunk, move popular items to popular set
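
Those three steps could be sketched as a standalone function. The name, signature, and return value are illustrative only, not a copy of the patched method:

```python
def chain_b(b, isjunk=None, autojunk=True):
    """Return (b2j, junk, popular) following the three steps listed above.

    Sketch only: a free function standing in for the internal method.
    """
    # Step 1: map every element of b to the list of indexes where it occurs.
    b2j = {}
    for i, elt in enumerate(b):
        b2j.setdefault(elt, []).append(i)

    # Step 2: move elements flagged by the caller's isjunk function.
    junk = set()
    if isjunk is not None:
        junk = {elt for elt in b2j if isjunk(elt)}
        for elt in junk:
            del b2j[elt]

    # Step 3: move popular elements (duplicates after the first exceed 1%).
    popular = set()
    n = len(b)
    if autojunk and n >= 200:
        popular = {elt for elt, idxs in b2j.items()
                   if (len(idxs) - 1) * 100 > n}
        for elt in popular:
            del b2j[elt]

    return b2j, junk, popular


b = "x" + "y" * 199 + " "
print(chain_b(b, isjunk=lambda c: c == " "))
print(chain_b(b, autojunk=False)[2])   # empty set: nothing counted as popular
```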

I helped write and test the 2.7 patch and verify that default behavior remains unchanged. I believe it is ready to commit.

3.1 and 3.2 patches will follow.
msg120713 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-11-08 05:10
Adding a documentation patch for 3.1 which is similar to the 2.6 documentation patch that's been committed by Georg into 2.6
msg120927 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-10 18:20
Tim told me to continue with this as he has no time.
rev86401 - apply 3.1 doc fix

I cannot apply the 2.7 patch. It has different header lines. In particular, TortoiseSVN cannot fetch nonexistent revision "Mon Aug 30 06:37:52 2010 +0300". Please regenerate against current 2.7 with the method used for 2.6/3.1.
msg120939 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-11-11 08:24
Attaching a new patch for 2.7 freshly generated vs. current 2.7 maintenance branch from SVN.
msg120992 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-12 00:22
issue2986.fix27.5.patch applied, with version note added to doc, as
rev86418

Only thing left is patch for 3.2, which Eli and I will produce.
msg121079 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-12 21:10
r86437 - correct and replicate version-added message
msg121596 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-11-20 05:45
Terry, when is the deadline for producing the patch for 3.2? Perhaps we should at least submit the 2.7 patch for now so that it goes in for sure?
msg121662 - (view) Author: Simon Cross (hodgestar) Date: 2010-11-20 15:18
I made the minor changes needed to get Eli Bendersky's patch to apply against 3.2. Diff attached.
msg121697 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-20 17:55
Deadline is probably next Fri. However I will apply this or slight revision thereof in a couple of days to make sure this much is in. I have to fixup some work stuff today.
msg121902 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-11-21 11:04
Simon's patch fix for 3.2 looks good to me - applies cleanly to py3k and tests pass.
msg122335 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-25 06:27
Since I am not sure I will be able to do any more before the 3.2b1 feature freeze, I went ahead with the minimal patch after checking the differences from the 2.7 version and redoing the Misc/News entry.
(I suspect putting a new entry immediately after the appropriate heading, instead of between other headings, is probably least likely to fatally conflict with intervening changes.) r86745 Thank you Eli and Simon.

Leaving this open for possible further changes.
msg122337 - (view) Author: Simon Cross (hodgestar) Date: 2010-11-25 06:48
My vote is that this bug be closed and a new feature request be opened. Failing that, it would be good to have a concise description of what else we would like done (and the priority should be downgraded, I guess).
msg122338 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2010-11-25 06:59
Terry, I agree with Simon re closing and opening a new feature request. This issue has too much baggage in it, and we can always link to it. A new feature request should be opened strictly for 3.2.

If you want I can close this issue and open a new one, but I'm waiting for your approval.
msg122401 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-25 20:23
Agreed. #10534. This is really a 'follow-on' rather than 'superseder',
but the forward reference should be easy for anyone to find.
History
Date User Action Args
2022-04-11 14:56:35 admin set github: 47235
2019-11-07 16:24:52 python-dev set pull_requests: + pull_request16592
2011-01-09 03:22:36 terry.reedy set nosy: tim.peters, barry, georg.brandl, terry.reedy, jcea, jimjjewett, sjmachin, gjb1002, ggenellina, pitrou, rtvd, vbr, LambertDW, hodgestar, hagna, r.david.murray, eli.bendersky, janpf, mrotondo
stage: needs patch -> resolved
2011-01-09 02:31:29 jcea set nosy: + jcea
2010-11-25 20:23:36 terry.reedy set status: open -> closed
versions: + Python 2.6, Python 3.1, Python 2.7
resolution: fixed
messages: + msg122401
superseder: difflib.SequenceMatcher: expose junk sets, deprecate undocumented isb... functions.
type: enhancement -> behavior
2010-11-25 06:59:23 eli.bendersky set messages: + msg122338
2010-11-25 06:48:02 hodgestar set messages: + msg122337
2010-11-25 06:27:28 terry.reedy set type: behavior -> enhancement
messages: + msg122335
2010-11-21 11:04:43 eli.bendersky set messages: + msg121902
2010-11-20 17:55:13 terry.reedy set messages: + msg121697
2010-11-20 15:18:19 hodgestar set files: + issue2986.fix32.5.patch
nosy: + hodgestar
messages: + msg121662
2010-11-20 05:45:30 eli.bendersky set messages: + msg121596
2010-11-12 21:10:57 terry.reedy set messages: + msg121079
2010-11-12 00:22:41 terry.reedy set stage: commit review -> needs patch
messages: + msg120992
versions: - Python 2.7
2010-11-11 08:24:06 eli.bendersky set files: + issue2986.fix27.5.patch
messages: + msg120939
2010-11-10 19:55:37 terry.reedy set versions: - Python 3.1
2010-11-10 19:54:59 terry.reedy set messages: - msg120925
2010-11-10 18:20:56 terry.reedy set messages: + msg120927
2010-11-10 18:13:04 terry.reedy set assignee: tim.peters -> terry.reedy
messages: + msg120925
2010-11-08 05:10:24 eli.bendersky set files: + issue2986.docs31.1.patch
messages: + msg120713
2010-09-07 16:02:28 terry.reedy set messages: + msg115787
stage: test needed -> commit review
2010-09-03 04:46:06 eli.bendersky set files: + issue2986.fix27.4.patch
messages: + msg115419
2010-09-01 21:32:58 terry.reedy set messages: + msg115335
2010-08-02 16:17:13 barry set priority: release blocker -> high
messages: + msg112490
versions: - Python 2.6
2010-07-31 18:24:27 georg.brandl set priority: deferred blocker -> release blocker
2010-07-31 08:00:47 georg.brandl set messages: + msg112120
2010-07-31 07:06:02 georg.brandl set priority: release blocker -> deferred blocker
messages: + msg112116
2010-07-24 04:59:19 eli.bendersky set files: + issue2986.docs26.1.patch
keywords: + patch
messages: + msg111425
2010-07-23 18:31:32 terry.reedy set priority: normal -> release blocker
versions: + Python 2.6, Python 3.1, Python 2.7
nosy: + barry
messages: + msg111372
type: enhancement -> behavior
2010-07-14 09:06:44 pitrou set messages: + msg110261
2010-07-14 01:45:22 terry.reedy set messages: + msg110251
2010-07-09 01:12:03 terry.reedy set messages: + msg109654
2010-07-08 22:52:45 pitrou set messages: + msg109639
2010-07-08 22:47:55 terry.reedy set messages: + msg109636
2010-07-08 02:18:21 terry.reedy set messages: - msg109450
2010-07-08 02:18:04 terry.reedy set messages: - msg109449
2010-07-07 23:17:01 vbr set messages: + msg109507
2010-07-07 02:46:20 eli.bendersky set messages: + msg109450
2010-07-07 02:43:07 eli.bendersky set files: - unnamed
2010-07-07 02:40:11 eli.bendersky set files: + unnamed
messages: + msg109449
2010-07-06 23:18:23 terry.reedy set messages: + msg109442
2010-07-02 07:16:09 eli.bendersky set messages: + msg109090
2010-06-28 19:35:37 terry.reedy set messages: + msg108856
2010-06-25 21:56:19 terry.reedy set nosy: + LambertDW, jimjjewett, terry.reedy, rtvd, janpf, ggenellina, sjmachin, eli.bendersky
messages: + msg108636
versions: - Python 2.7
2010-06-25 21:55:55 terry.reedy link issue4622 superseder
2010-06-25 21:55:07 terry.reedy link issue1528074 superseder
2010-04-19 23:25:07 vbr set files: + difflib_test_inq.py
nosy: + vbr
messages: + msg103660
2009-10-02 10:58:47 pitrou set nosy: + pitrou
messages: + msg93438
2009-10-01 12:24:49 gjb1002 set nosy: + gjb1002
2009-05-28 01:09:05 r.david.murray set versions: + Python 3.2, - Python 2.5
nosy: tim.peters, georg.brandl, hagna, r.david.murray, mrotondo
priority: normal
components: + Documentation, Library (Lib), - Extension Modules
type: enhancement
stage: test needed
2009-03-30 01:33:30 r.david.murray set nosy: + r.david.murray
messages: + msg84449
2009-03-30 00:40:16 mrotondo set nosy: + mrotondo
messages: + msg84446
versions: + Python 2.7
2009-03-29 12:59:19 georg.brandl set assignee: tim.peters
messages: + msg84387
nosy: + georg.brandl, tim.peters
2008-05-27 20:29:56 hagna create