classification
Title: Fix awkwardness of statistics.mode() for multimodal datasets
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: Windson Yang, francismb, rhettinger, steven.daprano
Priority: normal Keywords:

Created on 2019-02-03 18:51 by rhettinger, last changed 2019-02-17 19:27 by rhettinger.

Messages (11)
msg334796 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-03 18:51
The current code for mode() does a good deal of extra work to support its two error outcomes (empty input and multimodal input).  That latter case is informative but doesn't provide any reasonable way to find just one of those modes, where any of the most popular would suffice.  This arises in nearest neighbor algorithms for example. I suggest adding an option to the API:

   def mode(seq, *, first_tie=False):       
       if tie_goes_to_first:
           # CHOOSE FIRST x ∈ S | ∄ y ∈ S : x ≠ y ∧ count(y) > count(x)
           return return Counter(seq).most_common(1)[0][0]
       ...

Use it like this:

    >>> data = 'ABBAC'
    >>> assert mode(data, first_tie=True) == 'A'

With the current API, there is no reasonable way to get to 'A' from 'ABBAC'.

Also, the new code path is much faster than the existing code path because it extracts only the 1 most common using min() rather than the n most common which has to sort the whole items() list.  New path: O(n).  Existing path: O(n log n).

Note, the current API is somewhat awkward to use.  In general, a user can't know in advance that the data only contains a single mode.  Accordingly, every call to mode() has to be wrapped in a try-except.  And if the user just wants one of those modal values, there is no way to get to it.  See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html for comparison.

There may be better names for the flag.  "tie_goes_to_first_encountered" seemed a bit long though ;-)
msg335147 - (view) Author: Francis MB (francismb) * Date: 2019-02-10 10:15
>> There may be better names for the flag.  "tie_goes_to_first_encountered" seemed a bit long though ;-)

Could it may be an alternative to set the mode tie case in a form like:

def mode(seq, *, case=CHOOSE_FIRST):
  [...]

(or TIE_CHOOSE_FIRST, ...) where CHOOSE_FIRST is an enum ?

Thanks!
msg335182 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-11 00:12
Thanks Raymond for the interesting use-case.

The original design of mode() was support only the basic form taught in 
secondary schools, namely a single unique mode for categorical data or 
discrete numerical data.

I think it is time to consider a richer interface to support more uses, 
such as estimating the mode of continuous numerical data, and the 
multi-mode case you bring up.

One reason I've been hesitant is that deciding what is the right 
behaviour is quite hard (or at least it has been for me). I think there 
are a few cases to consider:

- the existing behaviour (which may not be very useful?) which is to
  raise an exception unless the mode is unique; 

- would it be more useful to return an out-of-band value like a NAN 
  or None?

- the multi-mode case where you want some arbitrary(?) mode, possibly
  the left-most (smallest) for numeric data;

- the multi-mode case where you want all the modes.

I like Francis' suggestion to use an enum to select the behavioural, er, 
mode (pun intended). What do you think?
msg335679 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-16 10:48
I would stick with "first_tie=False".  That caters to the common case, avoids API complications, and does something similar to what other stats packages are doing.

----- API Survey ----

Maple: """This function is only guaranteed to return one potential mode - in cases where multiple modes exist, only the first detected mode is guaranteed to be returned.""" -- https://www.maplesoft.com/support/help/maple/view.aspx?path=Statistics%2fMode

R: 'R does not have a standard in-built function to calculate mode.' -- https://www.tutorialspoint.com/r/r_mean_median_mode.htm

Matlab: "When there are multiple values occurring equally frequently, mode returns the smallest of those values. ... If A is an empty 0-by-0 matrix, mode(A) returns NaN." -- https://www.mathworks.com/help/matlab/ref/mode.html

Mathematica: "The mode of a set of data is implemented in the Wolfram Language as Commonest[data]. ... When several elements occur with equal frequency, Commonest[list,…] picks first the ones that occur first in list." -- http://mathworld.wolfram.com/Mode.html and https://reference.wolfram.com/language/ref/Commonest.html

SciPy: "If there is more than one such value, only the smallest is returned." -- https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.mode.html

SAS: "Most frequent value (if not unique, the smallest mode)" -- https://documentation.sas.com/?docsetId=qcug&docsetTarget=qcug_capability_sect225.htm&docsetVersion=14.2&locale=en
msg335680 - (view) Author: Francis MB (francismb) * Date: 2019-02-16 11:28
Good point Raymond!

Only a minor observation on the packages API:  


[1] SciPy: scipy.stats.mode(a, axis=0, nan_policy='propagate')
"Returns an array of the modal (most common) **value** in the passed array." --> Here it claims to return just ONE value

And use of different policies on parameters :
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, ‘omit’ performs the calculations ignoring nan values. Default is ‘propagate’.

Equivalent one could say 'multivalue_policy'


[2] Matlab: Mode: "Most frequent **values** in array"

...returns the sample mode of A, which is the most frequently occurring *value* in A...."

IMHO it seems inconsistent *values* vs. *value* (or a doc-bug ?).


An a question:
Does it that mean that mode in that case really should potentially return an array of values, e.g. all the values with equal frequency?

In that case the user has the chance to get the first, the last or just all, ...


----
[1] https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.mode.html

[2] https://la.mathworks.com/help/matlab/ref/mode.html
msg335684 - (view) Author: Windson Yang (Windson Yang) * Date: 2019-02-16 13:36
IMHO, we don't need to add the option. We can return the smallest value from the **table** instead of the code below.

    if len(table) == 1:
        return table[0][0]

[1] https://github.com/python/cpython/blob/master/Lib/statistics.py#L502
msg335704 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-16 17:44
> We can return the smallest value from the **table** instead
> of the code below.

Internally, that does too much work and then throws most of it away.

The difference between Counter(data).most_common()[1] and Counter(data).most_common(1) is that the former materializes the entire items iterator into a list and does a full sort, while the latter is memory friendly and makes a single pass with min().
msg335726 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-16 22:52
I've been thinking about this a bit more.  ISTM that for now, it's best to get mode() working for everyday use, returning just the first mode encountered.  This keeps the signature simple (Iterable -> Scalar).

In the future, if needed, there is room to add separate functions with their own clean signatures and use cases.  For example, having a separate multimode() that always returns a list would be clearer and easier to use than setting a flag on the current scalar valued mode() function. 

Another example would be mode_estimate() that returns a scalar float
that estimates the mode given data sampled from a continuous variable. This warrants its own function because the signature, use cases, options, and implementation method are all completely different from a mode based on counting.

Categorical, binned, or ordinal data:

   mode(data: Iterable, *, first_tie=False) -> object
   multimode(data: Iterable) -> List[object]

Continuous data:

   mode(data: Iterable[Real]) -> Real
msg335746 - (view) Author: Windson Yang (Windson Yang) * Date: 2019-02-17 02:27
I only tested stats.mode() from scipy, data = 'BBAAC' should also return 'A'. But in your code **return return Counter(seq).most_common(1)[0][0]** will return B. Did I miss something?
msg335764 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-17 09:10
> Did I miss something?

Yes.  It doesn't really matter which mode is returned as long as it is deterministically chosen.  We're proposing to return the first mode rather than the smallest mode.  

Scipy returns the smallest mode because that is convenient given that the underlying operation is np.unique() which returns unique values in sorted order [1]. 

We want to return the first mode encountered because that is convenient given that the underlying operation is max() which returns the first maximum value it encounters.

Another advantage of return-first rather than return-smallest is that our mode() would work for data values that are hashable but not orderable (i.e. frozensets).

[1] https://github.com/scipy/scipy/blob/v0.19.1/scipy/stats/stats.py#L378-L443
msg335785 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-17 19:27
The attraction to having a first_tie=False flag is that it is backwards compatible with the current API.

However on further reflection, people would almost never want the existing behavior of raising an exception rather than returning a useful result.

So, we could make first_tie the default behavior and not have a flag at all.  We would also add a separate multimode() function with a clean signature, always returning a list.  FWIW, MS Excel evolved to this solution as well (it has MODE.SGNL and MODE.MULT).

This would give us a clean, fast API with no flags:

    mode(Iterable) -> scalar
    multimode(Iterable) -> list  

ISTM this is an all round win:

* Fixes the severe usability problems with mode().  Currently, it can't be used without a try/except. The except-clause can't distinguish between an empty data condition and no unique mode condition, nor can the except clause access the dataset counts to get to a useful answer.

* It makes mode() memory friendly and faster.  Currently, mode() has to convert the counter items into a list() and run a full sort.  However, if we only need the first mode, we can feed the items iterator to max() and make only a single pass, never having to create a list.

However, there would be an incompatibility in the API change.  It is possible that a person really does want an exception instead of a value when there are multiple modes.  In that case, this would break their code so that they have to switch to multimode() as a remediation.  OTOH, it is also possible that people have existing code that uses mode() but their code has a latent bug because they weren't using a try/except and haven't yet encountered a multi-modal dataset. 

So here are some options:

* Change mode() to return the first tie instead of raising an exception.  This is a behavior change but leaves you with the cleanest API going forward.

* Add a Deprecation warning to the current behavior of mode() when it finds multimodal data.  Then change the behavior in Python 3.9.  This is uncomfortable for one release, but does get us to the cleanest API and best performance.

* Change mode() to have a flag where first_tie defaults to False.  This is backwards compatible, but it doesn't fix latent bugs, it leaves us with on-going API complexities, and the default case remains the least usable option.

* Leave mode() as-is and just document its limitations.

For any of those options, we should still add a separate multimode() function.
History
Date User Action Args
2019-02-17 19:27:53rhettingersetmessages: + msg335785
2019-02-17 09:10:34rhettingersetmessages: + msg335764
2019-02-17 02:27:07Windson Yangsetmessages: + msg335746
2019-02-16 22:52:05rhettingersetmessages: + msg335726
2019-02-16 17:44:35rhettingersetmessages: + msg335704
2019-02-16 13:36:18Windson Yangsetnosy: + Windson Yang
messages: + msg335684
2019-02-16 11:28:23francismbsetmessages: + msg335680
2019-02-16 10:48:06rhettingersetmessages: + msg335679
2019-02-11 00:12:13steven.dapranosetmessages: + msg335182
2019-02-10 10:15:23francismbsetnosy: + francismb
messages: + msg335147
2019-02-03 18:51:25rhettingercreate