Author rhettinger
Recipients rhettinger, steven.daprano
Date 2019-02-03.18:51:25
Message-id <1549219885.57.0.86866159357.issue35892@roundup.psfhosted.org>
In-reply-to
Content
The current code for mode() does a good deal of extra work to support its two error outcomes (empty input and multimodal input). The latter case is informative, but it provides no reasonable way to get just one of those modes when any of the most popular values would suffice. This arises in nearest-neighbor algorithms, for example. I suggest adding an option to the API:

   from collections import Counter

   def mode(seq, *, first_tie=False):
       if first_tie:
           # CHOOSE FIRST x ∈ S | ∄ y ∈ S : x ≠ y ∧ count(y) > count(x)
           return Counter(seq).most_common(1)[0][0]
       ...

Use it like this:

    >>> data = 'ABBAC'
    >>> assert mode(data, first_tie=True) == 'A'

With the current API, there is no reasonable way to get to 'A' from 'ABBAC'.
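For anyone who wants to try the idea outside the statistics module, here is a minimal, self-contained sketch. The name mode_sketch and the strict branch are assumptions made for illustration; they are not the stdlib implementation, only an approximation of the two behaviors discussed here (raise on empty or tied input, or return the first-encountered winner when first_tie is set).

```python
from collections import Counter
from statistics import StatisticsError


def mode_sketch(seq, *, first_tie=False):
    """Hypothetical stand-in for the proposed mode(); not the stdlib code."""
    counts = Counter(seq)
    if not counts:
        raise StatisticsError("no mode for empty data")
    if first_tie:
        # most_common(1) returns the first-encountered item among those
        # tied for the highest count (Counter keeps insertion order, 3.7+).
        return counts.most_common(1)[0][0]
    # Strict path: refuse to choose when several values share the top count.
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise StatisticsError("no unique mode")
    return ranked[0][0]
```

With first_tie=True the 'ABBAC' example yields 'A'; without it, the same input raises StatisticsError in this sketch.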

Also, the new code path is much faster than the existing code path because it extracts only the single most common item using max(), rather than ranking all n items, which requires sorting the whole items() list.  New path: O(n).  Existing path: O(n log n).
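To make the cost difference concrete, here is a rough sketch of the two strategies applied to an already-built Counter. The helper names top_by_sort and top_by_max are made up for this illustration; they only mimic the shape of the two paths (a full sort versus a single linear pass with max()) and are not the actual statistics internals.

```python
from collections import Counter
from operator import itemgetter


def top_by_sort(counts):
    # Ranks every (item, count) pair: O(n log n) in the number of distinct items.
    # sorted() is stable even with reverse=True, so ties keep insertion order.
    return sorted(counts.items(), key=itemgetter(1), reverse=True)[0]


def top_by_max(counts):
    # Single linear pass: O(n). max() returns the first maximal pair it sees.
    return max(counts.items(), key=itemgetter(1))


counts = Counter('ABBAC')
first_sorted = top_by_sort(counts)
first_max = top_by_max(counts)
```

Both helpers pick the same pair for 'ABBAC', so the cheaper pass does not change which tied value wins.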

Note, the current API is somewhat awkward to use.  In general, a user can't know in advance that the data only contains a single mode.  Accordingly, every call to mode() has to be wrapped in a try-except.  And if the user just wants one of those modal values, there is no way to get to it.  See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html for comparison.

There may be better names for the flag.  "tie_goes_to_first_encountered" seemed a bit long though ;-)
History
Date                 User        Action  Args
2019-02-03 18:51:29  rhettinger  set     recipients: + rhettinger, steven.daprano
2019-02-03 18:51:25  rhettinger  set     messageid: <1549219885.57.0.86866159357.issue35892@roundup.psfhosted.org>
2019-02-03 18:51:25  rhettinger  link    issue35892 messages
2019-02-03 18:51:25  rhettinger  create