This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Provide a way for potentially long runtime difflib algorithms to be aborted by the caller (and report progress?)
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: gregory.p.smith, jftuga, rbcollins, rhettinger, tim.peters
Priority: normal Keywords: patch

Created on 2015-08-20 22:18 by jftuga, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
difflib-diff.patch jftuga, 2015-08-20 22:18 Adds timeout feature to difflib SequenceMatcher ratio() and quick_ratio()
Messages (5)
msg248919 - (view) Author: John Taylor (jftuga) * Date: 2015-08-20 22:18
SequenceMatcher in the difflib module contain ratio() and quick_ratio() methods which can take a long time to run with certain input.  One example is two slightly different versions of jquery.min.js.

I have written a patch against python-350b4 that adds a timeout to these methods.  The new functionality also has the capability to "fall through" to the next quickest comparison method should a timeout occur. If a timeout does occur and using a fall through method is not desired, then -1 is returned for the ratio.

I'd like this to be incorporated into Python 3.5.0 if it is not too late.
msg248920 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-08-20 22:39
I'm not sure that it's a good idea to add a timeout to such algorithm. It can be very surprising to have a difference result depending on the system load (CPU usage of _other_ applications) and on the CPU performances.

If you really want this result, I would prefer to design the feature outside the Python stdlib. You might modify the stdlib to allow incremental computation.

About the patch itself, which kind of timer should be used? Monotonic clock? System clock? Process time (CPU time)?

Maybe we can optimize the code?
msg248921 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-08-20 22:47
In general, it isn't good design to incorporate timeout logic in computation logic.  What would be better is a general purpose, reusable, decoupled tool: run_with_time_limit(some_computation, some_args, time_limit).  Such a tool might be based on separate process that can be timed or killed, it might use signals, or may be based on threading.Timer.

I did a quick look around the net.  Timeouts on diff APIs aren't common (i.e GNU diff doesn't have a timeout) but there are a couple of precedents (you aren't the first to have had concerns about the running time for unfavorable inputs).
msg248922 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2015-08-20 23:08
So - I'm with Victor and Raymond here. I think modifying difflib to provide external control over the poor-O components would permit many more benefits than just controlling time: you could wrap them in a timer module to get what this patch does, you could replace them with alternative implementations (e.g. parallel ones).
msg412718 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-02-07 08:22
The way I'd go about this is to support some form of periodic checkpoint in the algorithm where it checks in with code supplied by the difflib user. Traditionally that'd have been via callbacks, there might be an async style way to express that these days. Those could indicate that they want the operation to be aborted. If it is possible to estimate progress, supplying that as input to the checkpoint API would be useful.

This leaves decision logic on when to abort something entirely up to the user rather than being clock based which is often not what the user wants.

I'm re-titling the issue as the original patch and proposal of a timeout isn't the direction several core devs have suggested we head.
History
Date User Action Args
2022-04-11 14:58:20adminsetgithub: 69092
2022-02-07 13:36:55vstinnersetnosy: - vstinner
2022-02-07 08:22:58gregory.p.smithsetnosy: + gregory.p.smith
title: Patch: add timeout to difflib SequenceMatcher ratio() and quick_ratio() -> Provide a way for potentially long runtime difflib algorithms to be aborted by the caller (and report progress?)
messages: + msg412718

stage: needs patch
2015-08-20 23:08:06rbcollinssetnosy: + rbcollins
messages: + msg248922
2015-08-20 22:47:37rhettingersetnosy: + tim.peters
2015-08-20 22:47:20rhettingersetnosy: + rhettinger
messages: + msg248921
2015-08-20 22:39:52vstinnersetnosy: + vstinner
messages: + msg248920
2015-08-20 22:18:07jftugacreate