Title: Patch: add timeout to difflib SequenceMatcher ratio() and quick_ratio()
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.5
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: jftuga, rbcollins, rhettinger, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2015-08-20 22:18 by jftuga, last changed 2015-08-20 23:08 by rbcollins.

File name Uploaded Description Edit
difflib-diff.patch jftuga, 2015-08-20 22:18 Adds timeout feature to difflib SequenceMatcher ratio() and quick_ratio()
Messages (4)
msg248919 - (view) Author: John Taylor (jftuga) * Date: 2015-08-20 22:18
SequenceMatcher in the difflib module contain ratio() and quick_ratio() methods which can take a long time to run with certain input.  One example is two slightly different versions of jquery.min.js.

I have written a patch against python-350b4 that adds a timeout to these methods.  The new functionality also has the capability to "fall through" to the next quickest comparison method should a timeout occur. If a timeout does occur and using a fall through method is not desired, then -1 is returned for the ratio.

I'd like this to be incorporated into Python 3.5.0 if it is not too late.
msg248920 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-08-20 22:39
I'm not sure that it's a good idea to add a timeout to such algorithm. It can be very surprising to have a difference result depending on the system load (CPU usage of _other_ applications) and on the CPU performances.

If you really want this result, I would prefer to design the feature outside the Python stdlib. You might modify the stdlib to allow incremental computation.

About the patch itself, which kind of timer should be used? Monotonic clock? System clock? Process time (CPU time)?

Maybe we can optimize the code?
msg248921 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-08-20 22:47
In general, it isn't good design to incorporate timeout logic in computation logic.  What would be better is a general purpose, reusable, decoupled tool: run_with_time_limit(some_computation, some_args, time_limit).  Such a tool might be based on separate process that can be timed or killed, it might use signals, or may be based on threading.Timer.

I did a quick look around the net.  Timeouts on diff APIs aren't common (i.e GNU diff doesn't have a timeout) but there are a couple of precedents (you aren't the first to have had concerns about the running time for unfavorable inputs).
msg248922 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2015-08-20 23:08
So - I'm with Victor and Raymond here. I think modifying difflib to provide external control over the poor-O components would permit many more benefits than just controlling time: you could wrap them in a timer module to get what this patch does, you could replace them with alternative implementations (e.g. parallel ones).
Date User Action Args
2015-08-20 23:08:06rbcollinssetnosy: + rbcollins
messages: + msg248922
2015-08-20 22:47:37rhettingersetnosy: + tim.peters
2015-08-20 22:47:20rhettingersetnosy: + rhettinger
messages: + msg248921
2015-08-20 22:39:52vstinnersetnosy: + vstinner
messages: + msg248920
2015-08-20 22:18:07jftugacreate