This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: dreadful performance in difflib: ndiff and HtmlDiff
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.4, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Christopher.Cabanne, eric.araujo, funkyhat, gruszczy, heidar.rafn, jackdied, neologix, terry.reedy, tim.peters, zitrax
Priority: normal Keywords: patch

Created on 2009-09-17 14:54 by heidar.rafn, last changed 2022-04-11 14:56 by admin.

Files
File name Uploaded Description Edit
python.difflib.bug.tgz heidar.rafn, 2009-09-17 14:54 a gzipped tar with 6 testfiles - see comment above
11740.patch gruszczy, 2011-04-08 20:28 review
Messages (14)
msg92768 - (view) Author: Heiðar Rafn Harðarson (heidar.rafn) Date: 2009-09-17 14:54
Relatively small set of lines with differences in most lines can destroy
the performance of difflib.HtmlDiff.make_table and difflib.ndiff.
I am using it like this:
    ...
    htmldiffer = HtmlDiff()
    return htmldiffer.make_table(src_lines, dst_lines, 
        fromdesc="file1",
        todesc="file2",
        context=True)

I have written the src_lines and dst_lins to files and tried this with
the Tools/scripts/diff.py wrapper with same results when using the
switches -m or -n.
The performance is fine when using difflib.unified_diff or switch -u on
diff.py

Attached are files that show this clearly.
left200.txt,right200.txt - 200 lines of text - duration 11 seconds.
left500.txt,right500.txt - 500 lines of text - duration 2min 58 sec
left1000.txt,right1000.txt - 1000 lines of text - duration 29min 4sec

tested on Intel dualcore T2500 2GHz with 2 GB of memory, python 2.5.2 on
Ubuntu. Same problom on python 2.6 on Fedora-11
For reference, the kdiff3 utility performs beautifully on these files.
msg99883 - (view) Author: Jack Diederich (jackdied) * (Python committer) Date: 2010-02-23 00:18
Here is a profile run of the 200 line case, run on the 3.x trunk, and  with all the trivial functions removed.  quick_ratio and __contains__ dominate the time.  The process was CPU bound, memory usage stayed low.


17083154 function calls (17066360 primitive calls) in 248.779 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10660375   70.516    0.000   70.516    0.000 :0(__contains__)
   114482    1.048    0.000    1.048    0.000 :0(append)
  4434776   29.254    0.000   29.254    0.000 :0(get)
685047/685042    4.400    0.000    4.400    0.000 :0(len)
   197349    1.512    0.000    1.512    0.000 :0(min)
   197133    1.452    0.000    1.452    0.000 difflib.py:228(set_seq1)
     2500    1.632    0.001    3.092    0.001 difflib.py:299(__chain_b)
     1082    2.520    0.002    4.768    0.004 difflib.py:350(find_longest_match)
   339143    2.580    0.000    2.580    0.000 difflib.py:40(_calculate_ratio)
   141727  120.599    0.001  220.970    0.002 difflib.py:661(quick_ratio)
   196736    6.956    0.000   12.549    0.000 difflib.py:690(real_quick_ratio)
 8974/794    5.052    0.001  248.431    0.313 difflib.py:945(_fancy_replace)
msg102345 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-04 16:59
This is because difflib.ndiff (called by difflib.HtmlDiff.make_table), contrarily to difflib.unified_diff (and probably kdiff3), doesn't restrict itself to contiguous lines, and searches diff even inside lines, so the complexity is much worse (see how many times _fancy_replace and quick_ratio are being called).
It might be a good idea to allow the user to specify the type of diff needed (ndiff vs unified_diff).
msg112910 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-04 22:15
An api addition makes this a feature request.

Heiðar, Please copy of difflib.py, hand patch HtmlDiff to  use unified_diff instead of ndiff, and see if that solves the problem.
msg132797 - (view) Author: Filip Gruszczyński (gruszczy) Date: 2011-04-02 14:04
Check also this:

http://bugs.python.org/issue11740
msg133329 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-08 17:24
> Check also this:
>
> http://bugs.python.org/issue11740

You should indicate it as duplicate.
msg133330 - (view) Author: Filip Gruszczyński (gruszczy) Date: 2011-04-08 17:25
I have no idea how I should do this. If you explain to me, how it should be done, I'll be happy to do this from now on :-)
msg133339 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-04-08 20:25
Filip, you can look at the changed Benjamin made. One needs to be OP or have admin privileges to change the headers.

Could you reupload your patch to this issue?
msg133340 - (view) Author: Filip Gruszczyński (gruszczy) Date: 2011-04-08 20:28
Here it is.
msg139891 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-07-05 16:10
The patch by Filip does not add new features, so I’m adjusting versions.

I cannot review the patch only by reading it, but if someone gives me a timeit command I can post a benchmark for my Debian machine.
msg222588 - (view) Author: Christopher Cabanne (Christopher.Cabanne) Date: 2014-07-08 23:21
I ran into the slowness of difflib.HtmlDiff.make_table yesterday, and see this issue is three years stale now; is there any update or chance that the fix will be applied?
msg222589 - (view) Author: Christopher Cabanne (Christopher.Cabanne) Date: 2014-07-08 23:28
Here is a timeit command to test with the attached test files: 

python -m timeit -n 10 "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('righ500.txt').readlines())"
msg222590 - (view) Author: Christopher Cabanne (Christopher.Cabanne) Date: 2014-07-08 23:29
Sorry about the typo; here is the corrected timeit command: 

python -m timeit -n 10  "import difflib; difflib.HtmlDiff().make_table(open('left500.txt').readlines(), open('right500.txt').readlines())"
msg223459 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-07-19 02:39
I'm sympathetic, but I don't see a good solution here without using incompatible code.

ndiff was built to generate "the highest quality diff possible", for text written and edited by humans, where "quality" is measured by human judgment (not an abstract mathematical measure).  That was its goal, and I think it does well at that.  It wasn't intended to unwind mechanical changes made to machine-generated gibberish (to human eyes) text.

Not that such a thing has no value, but it was never ndiff's _intent_ to handle such stuff.  So the best solution here would be to use a different differencing engine.

As already noted, unified_diff skips any attempt to find "merely similar" lines, and that's the great source of expense here:  ndiff takes time essentially cubic in the number of lines, given inputs with a great many similar (but no identical) lines.  It was noted that kdiff3 does fine on these inputs very quickly, but, from the kdiff3 FAQ:  "If similar lines appear next to each other, this actually is coincidence but this fortunately is often the case."  It just so happens that in the inputs attached to this report, the first lines in each file pair are similar, and the second lines likewise, etc etc.  This allows kdiff3 to do a wonderful job without any time at all spent _trying_ to find similar lines.

But there's nothing coincidental here in what ndiff does:  it finds "_the_ most similar pair", and that's a quadratic-time operation (which turns into cubic time when repeated over & over).

I didn't write any of the HTML-generating functions, nor have I ever used them, so I'm not in a good position to judge what they "should do".  If similar lines aren't interesting in this context, then switching to unified_diff would save gobs of time.  If similar lines are interesting, then new code would be required to get at _some_ notion of that less expensively than ndiff does it (e.g., the similarity scores for all pairs could be computed once and saved away, reducing worst-case cubic time to worst-case quadratic time, but at the cost of needing additional memory quadratic in the number of lines).

I'm -1 on 11740.patch:  there is no reason, in general, to give up looking for the most similar pair - and give up on intraline differencing - simply because the total number of lines in the two blocks exceeds 99.  It would reduce diff quality for all uses of the Differ class, not just for the specific uses of Differ made indirectly (via ndiff) by the make_table function - and it would be backward-incompatible anyway.
History
Date User Action Args
2022-04-11 14:56:53adminsetgithub: 51180
2017-05-06 22:48:36gregory.p.smithsetnosy: - gregory.p.smith
2017-05-06 22:48:22gregory.p.smithsetassignee: gregory.p.smith ->
2014-07-19 02:39:25tim.peterssetmessages: + msg223459
2014-07-09 04:57:35rhettingersetnosy: + tim.peters
2014-07-09 00:44:05gregory.p.smithsetversions: + Python 3.4, Python 3.5, - Python 3.2, Python 3.3
2014-07-09 00:43:30gregory.p.smithsetassignee: gregory.p.smith

nosy: + gregory.p.smith
2014-07-08 23:29:36Christopher.Cabannesetmessages: + msg222590
2014-07-08 23:28:01Christopher.Cabannesetmessages: + msg222589
2014-07-08 23:21:32Christopher.Cabannesetnosy: + Christopher.Cabanne
messages: + msg222588
2013-03-21 11:16:18funkyhatsetnosy: + funkyhat
2011-07-05 16:10:35eric.araujosetnosy: + eric.araujo

messages: + msg139891
versions: + Python 2.7, Python 3.2
2011-04-08 20:28:42gruszczysetfiles: + 11740.patch
keywords: + patch
messages: + msg133340
2011-04-08 20:25:25terry.reedysetmessages: + msg133339
versions: + Python 3.3, - Python 3.2
2011-04-08 17:49:23benjamin.petersonlinkissue11740 superseder
2011-04-08 17:25:55gruszczysetmessages: + msg133330
2011-04-08 17:24:20neologixsetmessages: + msg133329
2011-04-02 14:04:20gruszczysetnosy: + gruszczy
messages: + msg132797
2010-08-04 22:15:04terry.reedysetversions: + Python 3.2, - Python 2.6, Python 2.5
nosy: + terry.reedy

messages: + msg112910

type: performance -> enhancement
2010-06-21 14:06:10zitraxsetnosy: + zitrax
2010-04-04 16:59:19neologixsetnosy: + neologix
messages: + msg102345
2010-02-23 00:18:35jackdiedsetnosy: + jackdied
messages: + msg99883
2009-09-17 15:01:54heidar.rafnsettitle: awful performance in difflib: ndiff and HtmlDiff -> dreadful performance in difflib: ndiff and HtmlDiff
2009-09-17 14:54:57heidar.rafncreate