Issue 24384: difflib.SequenceMatcher faster quick_ratio with lower bound specification

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68572

classification

Title:	difflib.SequenceMatcher faster quick_ratio with lower bound specification
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	floyd, taleinat, tim.peters
Priority:	normal	Keywords:

Created on 2015-06-04 20:12 by floyd, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
difflib_SequenceMatcher_quick_ratio_ge.py	floyd, 2015-06-04 20:12	Example for how a potential difflib.SequenceMatcher.quick_ratio_ge method could be implemented

Messages (7)
msg244840 - (view)	Author: floyd (floyd) *	Date: 2015-06-04 20:12
I guess a lot of users of difflib call the SequenceMatcher in the following way (where a and b often have different lengths): if difflib.SequenceMatcher.quick_ratio(None, a, b) >= threshold: However, for this use case the current quick_ratio is quite a performance loss. Therefore I propose to add an additional, optimized version quick_ratio_ge which would be called like this: if difflib.SequenceMatcher.quick_ratio_ge(None, a, b, threshold): As we are able to calculate upper bounds for threshold depending on the lengths of a and b this function would return much faster in a lot of cases. An example of how quick_ratio_ge could be implemented is attached.
msg244841 - (view)	Author: floyd (floyd) *	Date: 2015-06-04 20:22
Now that I gave it another thought, I think it would be better if we simply add threshold as a named parameter of quick_ratio
msg244979 - (view)	Author: Tal Einat (taleinat) *	Date: 2015-06-07 22:47
Your second suggestion of adding a 'threshold' parameter to quick_ratio() is a bad idea in my opinion, since it would then be two significantly different functions crammed into one. The separate function would be possible. However, is there a compelling reason for this to be in the stdlib, rather than an online code recipe?
msg244990 - (view)	Author: floyd (floyd) *	Date: 2015-06-08 07:18
Agree with the separate function (especially as the return value would change from float to bool). In my experience this is one of the most often occuring use cases for difflib in practice. Another reason is that it is not obvious that the user can optimize it with the appended version. Some more opinions would be nice. If this suggestion is rejected we could include a performance warning for this use case in the docs and/or I'll write an online code recipe which can be linked to.
msg244991 - (view)	Author: Tal Einat (taleinat) *	Date: 2015-06-08 07:36
You should post this on the python-ideas mailing list if you think this should be added to the stdlib. Make sure to reference this issue if you do so.
msg319304 - (view)	Author: Tal Einat (taleinat) *	Date: 2018-06-11 14:22
Since this is a small enhancement proposal that is not sure to be approved, and there has been no followup for years, I vote to close this.
msg319305 - (view)	Author: floyd (floyd) *	Date: 2018-06-11 14:40
Yes, I agree this should be closed. Especially because my proposed code is so incredibly bad (e.g. regarding performance) that it should be rejected. Back then I was horribly wrong and didn't understand the problem well enough yet. If somebody would like to have such a function, this is all it needs: def quick_ratio_ge(self, a, b, threshold): return threshold <= 2.0*(len(a))/(len(a)+len(b)) Here is how I actually use it in code: https://github.com/modzero/burp-ResponseClusterer/blob/master/ResponseClusterer.py#L343 Sorry for the fuzz

History
Date	User	Action	Args
2022-04-11 14:58:17	admin	set	github: 68572
2018-06-11 14:40:32	floyd	set	status: open -> closed resolution: rejected messages: + msg319305 stage: resolved
2018-06-11 14:22:18	taleinat	set	messages: + msg319304
2015-06-08 07:36:44	taleinat	set	messages: + msg244991
2015-06-08 07:18:11	floyd	set	messages: + msg244990
2015-06-07 22:47:11	taleinat	set	nosy: + taleinat messages: + msg244979
2015-06-05 02:27:15	rhettinger	set	nosy: + tim.peters
2015-06-04 20:22:45	floyd	set	messages: + msg244841
2015-06-04 20:12:55	floyd	create