This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: statistics.multimode is inefficient (time and space) (mode somewhat, too)
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Stefan Pochmann, rhettinger, steven.daprano
Priority: normal Keywords: patch

Created on 2021-11-20 01:42 by Stefan Pochmann, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
multimode_mode.py Stefan Pochmann, 2021-11-20 01:44 Benchmarks
Pull Requests
URL Status Linked Edit
PR 29662 merged rhettinger, 2021-11-20 15:41
Messages (5)
msg406638 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2021-11-20 01:42
The current implementation is:

def multimode(data):
    counts = Counter(iter(data)).most_common()
    maxcount, mode_items = next(groupby(counts, key=itemgetter(1)), (0, []))
    return list(map(itemgetter(0), mode_items))

The most_common() does a complete sort of Counter item tuples, taking O(n log n) time and quite big O(n) extra space (mostly for all those tuples).

When Raymond Hettinger suggested it in https://bugs.python.org/issue35892#msg336338 he said it should have "running speed that is optimal for the desired result". But then he detailed that with "Slow O(n log n), loads all data in memory, full sort".

Which seems like a mistake, as that's not optimal. It's easy to do in O(n) time and O(1) extra memory (in addition to the Counter and result, I mean):

def multimode(data):
    counts = Counter(iter(data))
    if not counts:
        return []
    maxcount = max(counts.values())
    return [value for value, count in counts.items() if count == maxcount]

If there are only very few *different* values then the time/space after creating the Counter is insignificant compared to the Counter creation. But if there are many different values, it can be significant.

statistics.mode takes O(n) time and O(1) space, which is optimal, but I found an apparently faster way anyway (code at end).

For example for data = random.choices(range(n), k=n):

           |     multimode     |       mode
     n     | current  proposal | current  proposal
-----------+-------------------+------------------
         1 |    131%    70%    |    125%   58%
        10 |    144%    73%    |    119%   53%
       100 |    126%    71%    |    108%   29%
     1,000 |    123%    65%    |     62%   22%
    10,000 |    172%    55%    |     53%   18%
   100,000 |    164%    44%    |     55%   20%
 1,000,000 |     85%    20%    |     22%    4%
10,000,000 |     56%    12%    |     11%    4%

All four start with Counter(iter(data)), so I took that as baseline and the above results show relative additional times. For example 55% means if Counter construction alone took 10 seconds, the function took 15.5 seconds.

An extreme case, data = list(range(n)):

           |     multimode    |       mode
     n     | current proposal | current proposal
-----------+------------------+-----------------
         1 |   128%   67%     |   124%   56%
        10 |   187%   93%     |   141%   52%
       100 |   316%  149%     |   181%   45%
     1,000 |   380%  174%     |   213%   46%
    10,000 |   349%  111%     |   146%   30%
   100,000 |   397%  128%     |   159%   34%
 1,000,000 |   336%   95%     |   112%   24%
10,000,000 |   349%   97%     |   109%   23%

I also tried a bunch of other cases, didn't find one where my versions weren't quite a bit faster.

My mode() version:

from operator import indexOf
from itertools import islice

def mode(data):
    counts = Counter(iter(data))
    if not counts:
        raise StatisticsError('no mode for empty data') from None
    maxcount = max(counts.values())
    index = indexOf(counts.values(), maxcount)
    return next(islice(counts, index, None))
msg406639 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2021-11-20 01:44
(somehow the benchmark script didn't get attached, trying again)
msg406669 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-11-20 16:04
New changeset 04e03f496cf7da48ce4f545b41579d7d45f59ad2 by Raymond Hettinger in branch 'main':
bpo-45851: Avoid full sort in statistics.multimode() (#29662)
https://github.com/python/cpython/commit/04e03f496cf7da48ce4f545b41579d7d45f59ad2
msg406670 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-11-20 16:05
Accepting the suggestion for multimode() to use max() instead of a full sort.  This is a nice improvement.  Thank you.

Leaving mode() as-is.  The existing code is cleaner and does its work in a single pass over the counter.
msg406675 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2021-11-20 17:27
Ok, thanks, had somewhat expected that, as the multimode proposal was rather clearly better but the mode proposal not so much.
History
Date User Action Args
2022-04-11 14:59:52adminsetgithub: 90009
2021-11-20 17:27:37Stefan Pochmannsetmessages: + msg406675
2021-11-20 16:05:47rhettingersetstatus: open -> closed
resolution: fixed
messages: + msg406670

stage: patch review -> resolved
2021-11-20 16:04:44rhettingersetmessages: + msg406669
2021-11-20 15:41:00rhettingersetkeywords: + patch
stage: patch review
pull_requests: + pull_request27904
2021-11-20 14:21:59rhettingersetassignee: rhettinger
2021-11-20 13:02:12AlexWaygoodsetnosy: + rhettinger, steven.daprano
2021-11-20 01:44:51Stefan Pochmannsetfiles: + multimode_mode.py

messages: + msg406639
2021-11-20 01:42:08Stefan Pochmanncreate