Issue 45851: statistics.multimode is inefficient (time and space) (mode somewhat, too)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90009

classification

Title:	statistics.multimode is inefficient (time and space) (mode somewhat, too)
Type:	performance	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.10

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	Stefan Pochmann, rhettinger, steven.daprano
Priority:	normal	Keywords:	patch

Created on 2021-11-20 01:42 by Stefan Pochmann, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
multimode_mode.py	Stefan Pochmann, 2021-11-20 01:44	Benchmarks

Pull Requests
URL	Status	Linked	Edit
PR 29662	merged	rhettinger, 2021-11-20 15:41

Messages (5)
msg406638 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2021-11-20 01:42
The current implementation is: def multimode(data): counts = Counter(iter(data)).most_common() maxcount, mode_items = next(groupby(counts, key=itemgetter(1)), (0, [])) return list(map(itemgetter(0), mode_items)) The most_common() does a complete sort of Counter item tuples, taking O(n log n) time and quite big O(n) extra space (mostly for all those tuples). When Raymond Hettinger suggested it in https://bugs.python.org/issue35892#msg336338 he said it should have "running speed that is optimal for the desired result". But then he detailed that with "Slow O(n log n), loads all data in memory, full sort". Which seems like a mistake, as that's not optimal. It's easy to do in O(n) time and O(1) extra memory (in addition to the Counter and result, I mean): def multimode(data): counts = Counter(iter(data)) if not counts: return [] maxcount = max(counts.values()) return [value for value, count in counts.items() if count == maxcount] If there are only very few different values then the time/space after creating the Counter is insignificant compared to the Counter creation. But if there are many different values, it can be significant. statistics.mode takes O(n) time and O(1) space, which is optimal, but I found an apparently faster way anyway (code at end). For example for data = random.choices(range(n), k=n): \| multimode \| mode n \| current proposal \| current proposal -----------+-------------------+------------------ 1 \| 131% 70% \| 125% 58% 10 \| 144% 73% \| 119% 53% 100 \| 126% 71% \| 108% 29% 1,000 \| 123% 65% \| 62% 22% 10,000 \| 172% 55% \| 53% 18% 100,000 \| 164% 44% \| 55% 20% 1,000,000 \| 85% 20% \| 22% 4% 10,000,000 \| 56% 12% \| 11% 4% All four start with Counter(iter(data)), so I took that as baseline and the above results show relative additional times. For example 55% means if Counter construction alone took 10 seconds, the function took 15.5 seconds. An extreme case, data = list(range(n)): \| multimode \| mode n \| current proposal \| current proposal -----------+------------------+----------------- 1 \| 128% 67% \| 124% 56% 10 \| 187% 93% \| 141% 52% 100 \| 316% 149% \| 181% 45% 1,000 \| 380% 174% \| 213% 46% 10,000 \| 349% 111% \| 146% 30% 100,000 \| 397% 128% \| 159% 34% 1,000,000 \| 336% 95% \| 112% 24% 10,000,000 \| 349% 97% \| 109% 23% I also tried a bunch of other cases, didn't find one where my versions weren't quite a bit faster. My mode() version: from operator import indexOf from itertools import islice def mode(data): counts = Counter(iter(data)) if not counts: raise StatisticsError('no mode for empty data') from None maxcount = max(counts.values()) index = indexOf(counts.values(), maxcount) return next(islice(counts, index, None))
msg406639 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2021-11-20 01:44
(somehow the benchmark script didn't get attached, trying again)
msg406669 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-11-20 16:04
New changeset 04e03f496cf7da48ce4f545b41579d7d45f59ad2 by Raymond Hettinger in branch 'main': bpo-45851: Avoid full sort in statistics.multimode() (#29662) https://github.com/python/cpython/commit/04e03f496cf7da48ce4f545b41579d7d45f59ad2
msg406670 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-11-20 16:05
Accepting the suggestion for multimode() to use max() instead of a full sort. This is a nice improvement. Thank you. Leaving mode() as-is. The existing code is cleaner and does its work in a single pass over the counter.
msg406675 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2021-11-20 17:27
Ok, thanks, had somewhat expected that, as the multimode proposal was rather clearly better but the mode proposal not so much.

History
Date	User	Action	Args
2022-04-11 14:59:52	admin	set	github: 90009
2021-11-20 17:27:37	Stefan Pochmann	set	messages: + msg406675
2021-11-20 16:05:47	rhettinger	set	status: open -> closed resolution: fixed messages: + msg406670 stage: patch review -> resolved
2021-11-20 16:04:44	rhettinger	set	messages: + msg406669
2021-11-20 15:41:00	rhettinger	set	keywords: + patch stage: patch review pull_requests: + pull_request27904
2021-11-20 14:21:59	rhettinger	set	assignee: rhettinger
2021-11-20 13:02:12	AlexWaygood	set	nosy: + rhettinger, steven.daprano
2021-11-20 01:44:51	Stefan Pochmann	set	files: + multimode_mode.py messages: + msg406639
2021-11-20 01:42:08	Stefan Pochmann	create