
classification
Title: difflib html diff takes extremely long
Type: performance Stage:
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: dreadful performance in difflib: ndiff and HtmlDiff
View: 6931
Assigned To: Nosy List: gruszczy, mkorourk@adobe.com, ysj.ray
Priority: normal Keywords: patch

Created on 2011-04-01 21:25 by mkorourk@adobe.com, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
Example.zip mkorourk@adobe.com, 2011-04-01 21:25
11740.patch gruszczy, 2011-04-02 11:56 review
Messages (3)
msg132767 - Author: Michael O'Rourke (mkorourk@adobe.com) Date: 2011-04-01 21:25
Diffing the attached files with difflib's HtmlDiff takes 10 minutes or more. By comparison, other diff tools such as WinDiff and Araxis Merge show the diff within a second.

Example code I'm using is:


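import difflib

# Read both files as lists of lines. The original used mode "rU";
# universal newlines are the default for text mode in Python 3.
with open("source.xml") as f:
    sourceText = f.readlines()
with open("target.xml") as f:
    targetText = f.readlines()

html_diff = difflib.HtmlDiff(tabsize=4)
# This call is the slow step for the attached files.
result = html_diff.make_file(sourceText, targetText, "Source", "Target", context=True, numlines=10)
with open('c:/libdiff_html.html', 'w') as f:
    f.write(result)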
msg132786 - Author: ysj.ray (ysj.ray) Date: 2011-04-02 04:15
Reproduced in 3.3
msg132794 - Author: Filip Gruszczyński (gruszczy) Date: 2011-04-02 11:56
The culprit seems to be Differ._fancy_replace. It contains a quadratic loop with fairly complex internals. I have made a quick fix that lets the example run in under a second, at the expense of not calling _fancy_replace for longer chunks and using _plain_replace instead.
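
The idea of skipping _fancy_replace for long chunks can be sketched roughly as below. This is not the content of 11740.patch; it is a minimal illustration that assumes CPython's private Differ methods _fancy_replace and _plain_replace keep the signature (a, alo, ahi, b, blo, bhi). The class name CappedDiffer and the cutoff MAX_FANCY_LINES are hypothetical.

```python
import difflib

# Hypothetical cutoff, chosen for illustration only (not taken from the patch).
MAX_FANCY_LINES = 100


class CappedDiffer(difflib.Differ):
    """Differ that skips intraline matching for large replaced blocks."""

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        # The fancy path compares lines of a[alo:ahi] against lines of
        # b[blo:bhi] pairwise, so its cost grows with the product of the sizes.
        if (ahi - alo) * (bhi - blo) > MAX_FANCY_LINES ** 2:
            # Large block: emit plain '-' and '+' lines, no '?' hint lines.
            yield from self._plain_replace(a, alo, ahi, b, blo, bhi)
        else:
            yield from super()._fancy_replace(a, alo, ahi, b, blo, bhi)
```

Calling CappedDiffer().compare(a, b) yields the same kind of delta as difflib.Differ, just without intraline hints for large replaced blocks. Note that HtmlDiff obtains its line diff through difflib.ndiff, so plugging such a subclass into HtmlDiff would need additional wiring that is not shown here.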

Another solution for long chunks would be to split them into smaller parts and process each part separately. The quadratic cost would then apply only within each part, and we could still benefit from the _fancy_helper logic.
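
One way the splitting idea might look, as a rough sketch only: a long replaced block is cut into fixed-size slices, and each slice of the old lines is paired with the slice of the new lines at the same position. The pairing strategy, the class name ChunkedDiffer, and the constant CHUNK are assumptions made for illustration; the sketch also relies on CPython's private Differ._fancy_replace and Differ._dump helpers.

```python
import difflib

# Hypothetical slice size, chosen for illustration only.
CHUNK = 50


class ChunkedDiffer(difflib.Differ):
    """Differ that runs intraline matching on bounded slices of a block."""

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        # Pair up slices of at most CHUNK lines from each side, so the
        # pairwise comparison is bounded by CHUNK * CHUNK per slice.
        while alo < ahi and blo < bhi:
            amid = min(alo + CHUNK, ahi)
            bmid = min(blo + CHUNK, bhi)
            yield from super()._fancy_replace(a, alo, amid, b, blo, bmid)
            alo, blo = amid, bmid
        # Whatever is left on one side is a pure deletion or insertion.
        if alo < ahi:
            yield from self._dump('-', a, alo, ahi)
        if blo < bhi:
            yield from self._dump('+', b, blo, bhi)
```

The trade-off is that lines are only matched against lines in the slice at the same position, so a similar line that has drifted far from its counterpart is reported as a separate deletion and insertion.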
History
Date User Action Args
2022-04-11 14:57:15 admin set github: 55949
2011-04-08 17:49:23 benjamin.peterson set status: open -> closed; resolution: duplicate; superseder: dreadful performance in difflib: ndiff and HtmlDiff
2011-04-02 11:56:11 gruszczy set files: + 11740.patch; keywords: + patch; messages: + msg132794
2011-04-02 04:15:01 ysj.ray set versions: + Python 3.1, Python 3.2, Python 3.3; nosy: + ysj.ray; messages: + msg132786; components: + Library (Lib), - None
2011-04-02 00:08:49 gruszczy set nosy: + gruszczy
2011-04-01 21:25:38 mkorourk@adobe.com create