This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Difflib quick_ratio() could use Counter()
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: later
Dependencies: Superseder:
Assigned To: Nosy List: mscuthbert, rhettinger, serhiy.storchaka, wolma
Priority: normal Keywords: patch

Created on 2016-05-02 04:29 by mscuthbert, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
difflib_quick_ratio.patch mscuthbert, 2016-05-02 04:29 quick_ratio patch for difflib review
fasterDifflib.py mscuthbert, 2016-05-03 10:04
newQuickRatio.py mscuthbert, 2017-09-08 06:25
Messages (7)
msg264618 - (view) Author: Michael Cuthbert (mscuthbert) * Date: 2016-05-02 04:29
The implementation used in difflib.SequenceMatcher().quick_ratio() counts how often each member of the sequence (character, list entry, etc.) appears in order to calculate its lower bound.
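For context, the existing dict-based counting can be sketched roughly like this (a simplified illustration, not the exact stdlib code; function name is hypothetical):

```python
def quick_ratio_old(a, b):
    """Simplified sketch of the dict-based lower-bound calculation."""
    # Count occurrences of each element of b with a plain dict.
    fullbcount = {}
    for elt in b:
        fullbcount[elt] = fullbcount.get(elt, 0) + 1
    # avail[x] tracks how many matches of x are still available.
    avail = {}
    matches = 0
    for elt in a:
        numb = avail.get(elt, fullbcount.get(elt, 0))
        avail[elt] = numb - 1
        if numb > 0:
            matches += 1
    if not a and not b:
        return 1.0
    return 2.0 * matches / (len(a) + len(b))
```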

Counting how often an entry appears in an iterable has been sped up in CPython with the _count_elements() C function in _collections.

This patch uses collections.Counter() as a way of getting this speedup (and any future speedups in CPython or other implementations); the code is also somewhat simplified.
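The Counter-based idea can be sketched as follows (a minimal illustration of the approach; the actual patch also caches the count of self.b, and the function name here is hypothetical):

```python
from collections import Counter

def quick_ratio_counter(a, b):
    """Lower bound on ratio() via Counter intersection."""
    if not a and not b:
        return 1.0
    # Intersecting two Counters keeps the minimum count per element,
    # which is exactly the number of available matches.
    matches = sum((Counter(a) & Counter(b)).values())
    return 2.0 * matches / (len(a) + len(b))
```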

Performance:
There is a slight overhead to creating two collections.Counter() objects rather than simple dicts.  On two Mac systems (Python 3.5 on a stock MacBook Air, and Python 3.6a0 latest on a recent Mac Pro), the new implementation overtakes the previous one when the length of the iterable is around 60 items.  As the number of items increases, the performance gains grow significantly: by 400 items the new implementation is about 3x the speed of the old, and it seems to approach 3.6x asymptotically.

Below 60 items, the older implementation is faster, reaching a maximum of 8x the speed of the new one when comparing a one-element string against a one-element string.  (The degenerate case of comparing one or two empty iterables is checked for in this implementation and is thus faster than the old implementation.)  I believe that users of quick_ratio() are more likely to be looking for speed improvements on larger sequences.

Like the previous implementation, the count of the items in seq2 (self.b) is cached; if run again, it is about 41% faster (compared to 47% faster before).

This is my first patch submission, so any suggestions would be appreciated.
msg264621 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-05-02 05:34
Could you provide a script or a data used by you for benchmarking, so we can repeat this?
msg264636 - (view) Author: Wolfgang Maier (wolma) * Date: 2016-05-02 09:30
Given your comment about sum((fullacount & fullbcount).values()), why not use its in-place version:

fullacount &= fullbcount
matches = sum(fullacount.values())

?
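The suggested in-place variant avoids allocating a third Counter for the intersection result; a sketch (hypothetical function name, fresh Counters so no cached state is mutated):

```python
from collections import Counter

def quick_ratio_inplace(a, b):
    """Counter-based lower bound using in-place intersection (&=)."""
    if not a and not b:
        return 1.0
    fullacount = Counter(a)
    fullbcount = Counter(b)
    # Counter.__iand__ intersects in place, mutating fullacount
    # instead of building a new Counter for the result.
    fullacount &= fullbcount
    matches = sum(fullacount.values())
    return 2.0 * matches / (len(a) + len(b))
```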
msg264701 - (view) Author: Michael Cuthbert (mscuthbert) * Date: 2016-05-03 09:11
@wolma -- you're right that the in-place __iand__ version of Counter is substantially faster, though it is still slower than the current code (since it is still basically a superset of it).  However, testing shows it is close enough to the current code to potentially be worth using, in order to avoid the same issue that arose here (a small speed tweak in this code prevents it from taking advantage of larger improvements later).

If you think that is worth changing, I can make a new patch that incorporates it.  I don't have formal benchmarks on it, but quick tests show a 10% regression from the current speedups to using &= instead of the custom summation method.
msg264709 - (view) Author: Michael Cuthbert (mscuthbert) * Date: 2016-05-03 10:04
Here are the results I obtained along with the test code I used to get them.  The test code also includes a "hybrid" variant which I did not propose, but maybe I should have: it uses the old code for very short (but not degenerate) inputs and the new code for larger ones.  It does add code complexity, however.
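The "hybrid" variant can be sketched as a simple length-based dispatch; the 60-element cutoff comes from the crossover observed in the benchmarks, and the names here are hypothetical:

```python
from collections import Counter

_CUTOFF = 60  # approximate crossover point observed in the benchmarks

def quick_ratio_hybrid(a, b):
    """Dispatch between dict-based and Counter-based counting by size."""
    if not a and not b:
        return 1.0
    if len(b) < _CUTOFF:
        # Small inputs: plain-dict counting, as in the old implementation.
        fullbcount = {}
        for elt in b:
            fullbcount[elt] = fullbcount.get(elt, 0) + 1
        avail = {}
        matches = 0
        for elt in a:
            numb = avail.get(elt, fullbcount.get(elt, 0))
            avail[elt] = numb - 1
            if numb > 0:
                matches += 1
    else:
        # Large inputs: Counter intersection wins.
        matches = sum((Counter(a) & Counter(b)).values())
    return 2.0 * matches / (len(a) + len(b))
```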

Results: 

SpeedResult(wordLen=0, tests=1000001, cached=True, oldNew='old', time='1.22 µsec')
SpeedResult(wordLen=0, tests=1000001, cached=True, oldNew='new', time='525 nsec')
SpeedResult(wordLen=0, tests=1000001, cached=True, oldNew='hybrid', time='522 nsec')
SpeedResult(wordLen=0, tests=1000001, cached=False, oldNew='old', time='1.78 µsec')
SpeedResult(wordLen=0, tests=1000001, cached=False, oldNew='new', time='605 nsec')
SpeedResult(wordLen=0, tests=1000001, cached=False, oldNew='hybrid', time='617 nsec')
SpeedResult(wordLen=1, tests=500001, cached=True, oldNew='old', time='1.76 µsec')
SpeedResult(wordLen=1, tests=500001, cached=True, oldNew='new', time='11.4 µsec')
SpeedResult(wordLen=1, tests=500001, cached=True, oldNew='hybrid', time='2.64 µsec')
SpeedResult(wordLen=1, tests=500001, cached=False, oldNew='old', time='3.07 µsec')
SpeedResult(wordLen=1, tests=500001, cached=False, oldNew='new', time='19.6 µsec')
SpeedResult(wordLen=1, tests=500001, cached=False, oldNew='hybrid', time='3.9 µsec')
SpeedResult(wordLen=5, tests=166667, cached=True, oldNew='old', time='3.2 µsec')
SpeedResult(wordLen=5, tests=166667, cached=True, oldNew='new', time='14.4 µsec')
SpeedResult(wordLen=5, tests=166667, cached=True, oldNew='hybrid', time='3.98 µsec')
SpeedResult(wordLen=5, tests=166667, cached=False, oldNew='old', time='5.16 µsec')
SpeedResult(wordLen=5, tests=166667, cached=False, oldNew='new', time='21.8 µsec')
SpeedResult(wordLen=5, tests=166667, cached=False, oldNew='hybrid', time='6.14 µsec')
SpeedResult(wordLen=20, tests=47620, cached=True, oldNew='old', time='9.01 µsec')
SpeedResult(wordLen=20, tests=47620, cached=True, oldNew='new', time='22.4 µsec')
SpeedResult(wordLen=20, tests=47620, cached=True, oldNew='hybrid', time='9.79 µsec')
SpeedResult(wordLen=20, tests=47620, cached=False, oldNew='old', time='14.2 µsec')
SpeedResult(wordLen=20, tests=47620, cached=False, oldNew='new', time='32.2 µsec')
SpeedResult(wordLen=20, tests=47620, cached=False, oldNew='hybrid', time='15.8 µsec')
SpeedResult(wordLen=60, tests=16394, cached=True, oldNew='old', time='22.2 µsec')
SpeedResult(wordLen=60, tests=16394, cached=True, oldNew='new', time='34.9 µsec')
SpeedResult(wordLen=60, tests=16394, cached=True, oldNew='hybrid', time='35.9 µsec')
SpeedResult(wordLen=60, tests=16394, cached=False, oldNew='old', time='39.4 µsec')
SpeedResult(wordLen=60, tests=16394, cached=False, oldNew='new', time='51.9 µsec')
SpeedResult(wordLen=60, tests=16394, cached=False, oldNew='hybrid', time='50.7 µsec')
SpeedResult(wordLen=200, tests=4976, cached=True, oldNew='old', time='68.8 µsec')
SpeedResult(wordLen=200, tests=4976, cached=True, oldNew='new', time='55.8 µsec')
SpeedResult(wordLen=200, tests=4976, cached=True, oldNew='hybrid', time='54.9 µsec')
SpeedResult(wordLen=200, tests=4976, cached=False, oldNew='old', time='116 µsec')
SpeedResult(wordLen=200, tests=4976, cached=False, oldNew='new', time='75.4 µsec')
SpeedResult(wordLen=200, tests=4976, cached=False, oldNew='hybrid', time='76.6 µsec')
SpeedResult(wordLen=1000, tests=1000, cached=True, oldNew='old', time='320 µsec')
SpeedResult(wordLen=1000, tests=1000, cached=True, oldNew='new', time='109 µsec')
SpeedResult(wordLen=1000, tests=1000, cached=True, oldNew='hybrid', time='111 µsec')
SpeedResult(wordLen=1000, tests=1000, cached=False, oldNew='old', time='529 µsec')
SpeedResult(wordLen=1000, tests=1000, cached=False, oldNew='new', time='183 µsec')
SpeedResult(wordLen=1000, tests=1000, cached=False, oldNew='hybrid', time='175 µsec')
SpeedResult(wordLen=10000, tests=100, cached=True, oldNew='old', time='3.15 msec')
SpeedResult(wordLen=10000, tests=100, cached=True, oldNew='new', time='651 µsec')
SpeedResult(wordLen=10000, tests=100, cached=True, oldNew='hybrid', time='673 µsec')
SpeedResult(wordLen=10000, tests=100, cached=False, oldNew='old', time='5.18 msec')
SpeedResult(wordLen=10000, tests=100, cached=False, oldNew='new', time='1.25 msec')
SpeedResult(wordLen=10000, tests=100, cached=False, oldNew='hybrid', time='1.24 msec')
SpeedResult(wordLen=100000, tests=10, cached=True, oldNew='old', time='33.8 msec')
SpeedResult(wordLen=100000, tests=10, cached=True, oldNew='new', time='8.35 msec')
SpeedResult(wordLen=100000, tests=10, cached=True, oldNew='hybrid', time='8.33 msec')
SpeedResult(wordLen=100000, tests=10, cached=False, oldNew='old', time='53.3 msec')
SpeedResult(wordLen=100000, tests=10, cached=False, oldNew='new', time='15.2 msec')
SpeedResult(wordLen=100000, tests=10, cached=False, oldNew='hybrid', time='15.3 msec')
SpeedResult(wordLen=1000000, tests=1, cached=True, oldNew='old', time='551 msec')
SpeedResult(wordLen=1000000, tests=1, cached=True, oldNew='new', time='168 msec')
SpeedResult(wordLen=1000000, tests=1, cached=True, oldNew='hybrid', time='158 msec')
SpeedResult(wordLen=10000000, tests=1, cached=True, oldNew='old', time='5.48 sec')
SpeedResult(wordLen=10000000, tests=1, cached=True, oldNew='new', time='1.61 sec')
SpeedResult(wordLen=10000000, tests=1, cached=True, oldNew='hybrid', time='1.62 sec')
msg293746 - (view) Author: Michael Cuthbert (mscuthbert) * Date: 2017-05-16 03:29
Poking to see if there's still interest in getting this into 3.7.  Thanks!
msg301681 - (view) Author: Michael Cuthbert (mscuthbert) * Date: 2017-09-08 06:25
I've tried to keep the new code from being slower on small inputs by not creating a Counter for sequences of fewer than 60 items.  I managed to get within 10% of the old speed for small sequences while keeping the 3-3.6x speedup for big comparisons, and verified that the results were unchanged, but the code became far too unmanageable for such a small function as quick_ratio, with branching that would be hard to maintain.

I'll upload the stub of the replacement for quick_ratio for future developments, in case the creation of Counter objects (the bottleneck) ever disappears, in which case this code will be simpler and faster than the current implementation, but until then, I think it's best to close this issue.
History
Date User Action Args
2022-04-11 14:58:30adminsetgithub: 71091
2017-09-08 06:25:50mscuthbertsetstatus: open -> closed
files: + newQuickRatio.py
messages: + msg301681

resolution: later
stage: patch review -> resolved
2017-05-16 03:29:13mscuthbertsetmessages: + msg293746
versions: + Python 3.7, - Python 3.6
2016-05-03 10:04:53mscuthbertsetfiles: + fasterDifflib.py

messages: + msg264709
2016-05-03 09:11:46mscuthbertsetmessages: + msg264701
2016-05-02 09:30:35wolmasetnosy: + wolma
messages: + msg264636
2016-05-02 05:34:14serhiy.storchakasetnosy: + rhettinger, serhiy.storchaka

messages: + msg264621
stage: patch review
2016-05-02 04:29:13mscuthbertcreate