classification
Title: return list of modes for a multimodal distribution instead of raising a StatisticsError
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: rhettinger, scotchka, sria91, steven.daprano, terry.reedy, wolma
Priority: normal Keywords: patch

Created on 2016-12-13 04:21 by sria91, last changed 2019-03-11 11:01 by steven.daprano. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 49 sria91, 2016-12-13 04:21
PR 50 sria91, 2016-12-13 09:56
PR 5732 closed sria91, 2018-02-18 10:05
Messages (15)
msg283071 - (view) Author: Srikanth Anantharam (sria91) * Date: 2016-12-13 04:21
Return the minimum of the modes for a multimodal distribution instead of raising a StatisticsError.
msg283085 - (view) Author: Wolfgang Maier (wolma) * Date: 2016-12-13 08:50
What's the justification for this proposed change? Isn't it better to report the fact that there isn't an unambiguous result instead of returning a rather arbitrary one?
msg283089 - (view) Author: Srikanth Anantharam (sria91) * Date: 2016-12-13 09:35
A better choice would be to return a tuple of values (sliced from the
table). And let the user decide which one to use.

Hope that's justifiable...

msg283090 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-12-13 09:54
On Tue, Dec 13, 2016 at 09:35:22AM +0000, Srikanth Anantharam wrote:
> 
> Srikanth Anantharam added the comment:
> 
> A better choice would be to return a tuple of values (sliced from the
> table). And let the user decide which one to use.

The current mode() function is designed for a very basic use-case, where 
you have an obvious single mode from discrete data.

The problem with dealing with multiple modes is that it's not easy to 
tell the difference between a genuinely multi-modal sample and one which 
just happens to have a few samples with the same value:

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Assuming the sampling is fair, 8 is clearly the mode; but is it bimodal 
with 4 the second mode? Or perhaps even four modes, 8, 4, 7 and 9?

I have plans for introducing a binning function to collect data into 
bins and run statistics on the bins. That might be a better way to deal 
with multi-modal samples: if you bin the data (for discrete data, use a 
bin size of 1) and then look at the frequencies, you can decide how many 
modes there are.
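
As a rough sketch of the discrete case described above (a bin size of 1), the frequencies can be tabulated with collections.Counter. The helper name frequency_table is made up for illustration; it is not part of the statistics module or of any patch attached to this issue.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count) pairs, most frequent first.

    Hypothetical helper: with a bin size of 1, each distinct value is its own bin.
    """
    return Counter(values).most_common()


data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
table = frequency_table(data)

# Print a crude text histogram so the peaks can be compared by eye.
for value, count in table:
    print(value, "#" * count)
```

The table only lists the counts (8 occurs seven times, 4 three times, 7 and 9 twice each); deciding how many of those peaks count as modes is still left to the reader.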

Thanks for the suggestion.
msg283091 - (view) Author: Srikanth Anantharam (sria91) * Date: 2016-12-13 10:08
Please see the updated pull request PR 50, with the changes.

msg283092 - (view) Author: Srikanth Anantharam (sria91) * Date: 2016-12-13 10:17
@steven:

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
is clearly unimodal with mode 8

data would have been bimodal if 4 repeated exactly the same (7) number of
times as 8, like this:
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

in which case the new patch in PR 50 would return a tuple
(4, 8)
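
A minimal sketch of that tie rule, written for illustration and not taken from PR 50: a value counts as a mode only if it shares the highest frequency exactly. The function name tied_modes is hypothetical.

```python
from collections import Counter


def tied_modes(data):
    """Return a tuple of every value that shares the highest count.

    Hypothetical helper; values appear in first-seen order.
    """
    counts = Counter(data)
    if not counts:
        return ()
    highest = max(counts.values())
    return tuple(value for value, count in counts.items() if count == highest)


unimodal = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
bimodal = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

print(tied_modes(unimodal))  # (8,)
print(tied_modes(bimodal))   # (4, 8)
```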

msg283154 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-12-14 01:55
On Tue, Dec 13, 2016 at 10:08:10AM +0000, Srikanth Anantharam wrote:
> Please see the updated pull request PR 50, with the changes.

I'm rejecting that pull request. As I said, mode() intentionally 
returns only the single, unique mode. I may add a more advanced API or a 
second function for dealing with multi-modal samples, but even if I do, 
your suggestion wouldn't be sufficient. Your pull request merely returns 
the entire list of unique values:

    return tuple(value for value, frequency in table)

with no way for the caller to tell which values might be a mode and 
which are not.

(By the way, even if this function behaviour was acceptable, which I 
stress it is not, this would not be sufficient for me to accept as a 
patch. You should preferably update the documentation and the tests as 
well. At the very least, you should update the function's docstring to 
explain the changed return value.)

I'm sorry that I have to reject this; I am interested in having better 
support for multiple modes. I'm not closing this issue just yet. If you 
are interested in continuing the discussion, what would be *VERY* 
valuable for me would be for you or some other volunteer to do some 
research into numerical techniques for objectively determining the 
number and value of modes, rather than just plotting a graph and 
subjectively deciding whether a value is a peak or not.

Thanks for your interest.
msg283155 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-12-14 02:00
On Tue, Dec 13, 2016 at 10:17:21AM +0000, Srikanth Anantharam wrote:
> 
> Srikanth Anantharam added the comment:
> 
> @steven:
> 
> data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
> is clearly unimodal with mode 8
> 
> data would have been bimodal if 4 repeated exactly the same (7) number of
> times as 8, like this:
> data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Bimodal distributions do not require both modes to be exactly the same 
height. And certainly when you have a sample from a bimodal 
distribution, you should not expect exactly the same frequency for the 
two modes. Just from random sampling error you will expect one or the 
other to have a larger frequency.

You shouldn't take my example too literally. With such a small sample of 
discrete values, it becomes a (hard) matter of personal judgement. The 
point I was attempting to make was that identifying sample modes outside 
of the simplest unimodal case is tricky and requires much thought.
msg283453 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-12-17 00:22
Srikanth, when you reply by email, please remove the quotation of the previous message.  On the web page, it is just noise.  The only exception should be when you reply to a specific sentence and need to quote that sentence for context.

In my particular experience, mode() is usually reserved for crudely describing unordered categorical data, where the concept of 'minimum' does not apply.

Mode is useful for determining the winner of a vote (or other decision process), but in general, it is not a substitute for a more comprehensive look at a dataset.

Problems with possibly returning a tuple of data items instead of a data item include:

1. The user then has to be prepared to handle a tuple instead of a data item.  It would be better then to always return a tuple, even for 1 item.

2. Data items can be tuples, making a tuple return ambiguous.  Example use case: planar points with int coordinates.

>>> mode(((0,0), (0,0), (0,1)))
(0, 0)
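
To illustrate both problems, here is a sketch with two hypothetical functions (neither is an existing or proposed API, and both assume non-empty data). mode_or_tuple returns a bare item for a unique mode and a tuple for ties; modes_as_list always returns a list.

```python
from collections import Counter


def mode_or_tuple(data):
    """Hypothetical: one item if the mode is unique, else a tuple of modes."""
    counts = Counter(data)
    highest = max(counts.values())
    modes = tuple(v for v, c in counts.items() if c == highest)
    return modes[0] if len(modes) == 1 else modes


def modes_as_list(data):
    """Hypothetical: always a list of modes, even when there is only one."""
    counts = Counter(data)
    highest = max(counts.values())
    return [v for v, c in counts.items() if c == highest]


points = [(0, 1), (0, 1), (0, 0)]  # one mode: the point (0, 1)
numbers = [0, 0, 1, 1]             # two modes: 0 and 1

# Same return value, two different meanings.
print(mode_or_tuple(points))   # (0, 1)
print(mode_or_tuple(numbers))  # (0, 1)

# A container that is always returned removes the ambiguity.
print(modes_as_list(points))   # [(0, 1)]
print(modes_as_list(numbers))  # [0, 1]
```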

So, while StatisticsError is a nuisance, so are the apparent alternatives.  I think we should leave mode alone and close this.
msg312303 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-02-18 10:08
What makes the minimum mode better than the maximum?
msg312305 - (view) Author: Srikanth Anantharam (sria91) * Date: 2018-02-18 10:13
Please review the new PR with tests.
I'll update the documentation if the PR is acceptable.
msg337620 - (view) Author: Henry Chen (scotchka) * Date: 2019-03-10 16:42
The problem remains that the function can return a number or a list for input that is a list of numbers. This means the user will need to handle both possibilities every time, which is a heavy burden for such a simple function.

SciPy's mode function does return the minimum mode when there is a tie, which as far as I can tell is an arbitrary choice. But in that context, since the input is almost always numerical, a minimum is at least well defined, which is not true for an input with a mix of types.
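
A small sketch of why the minimum is not well defined for mixed types. The function minimum_mode is hypothetical and is not SciPy's implementation; it only shows the tie-break rule applied to plain Python values, assuming non-empty data.

```python
from collections import Counter


def minimum_mode(data):
    """Hypothetical: break ties by returning the smallest of the tied modes."""
    counts = Counter(data)
    highest = max(counts.values())
    return min(v for v, c in counts.items() if c == highest)


print(minimum_mode([3, 3, 1, 1, 2]))  # 1

try:
    minimum_mode(["red", "red", 1, 1])
except TypeError as exc:
    # str and int cannot be ordered, so there is no minimum to return.
    print("no minimum for mixed types:", exc)
```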

For the general use case, the current behavior - raising an exception - in case of tie conveys the most information.
msg337622 - (view) Author: Henry Chen (scotchka) * Date: 2019-03-10 17:10
Yes, the mode function could ALWAYS return a list, but that breaks backward compatibility, as does the currently proposed change.
msg337625 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-10 18:04
See the competing proposal and PR at https://bugs.python.org/issue35892 and https://github.com/python/cpython/pull/12089
msg337656 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-03-11 11:01
I'm closing this issue in favour of Raymond's #35892. Thank you to everyone, even if your PRs didn't get used; I appreciate your efforts.
History
Date                 User            Action  Args
2019-03-11 11:01:24  steven.daprano  set     status: open -> closed; resolution: rejected; messages: + msg337656; stage: patch review -> resolved
2019-03-10 18:04:21  rhettinger      set     nosy: + rhettinger; messages: + msg337625
2019-03-10 17:10:13  scotchka        set     messages: + msg337622
2019-03-10 16:42:03  scotchka        set     nosy: + scotchka; messages: + msg337620
2018-02-18 10:13:00  sria91          set     messages: + msg312305; title: return minimum of modes for a multimodal distribution instead of raising a StatisticsError -> return list of modes for a multimodal distribution instead of raising a StatisticsError
2018-02-18 10:08:48  steven.daprano  set     messages: + msg312303
2018-02-18 10:05:23  sria91          set     keywords: + patch; stage: patch review; pull_requests: + pull_request5512
2016-12-17 00:22:30  terry.reedy     set     nosy: + terry.reedy; messages: + msg283453
2016-12-14 02:00:57  steven.daprano  set     messages: + msg283155
2016-12-14 01:55:37  steven.daprano  set     messages: + msg283154
2016-12-13 10:17:21  sria91          set     messages: + msg283092
2016-12-13 10:08:10  sria91          set     messages: + msg283091
2016-12-13 09:56:46  sria91          set     pull_requests: + pull_request4
2016-12-13 09:54:04  steven.daprano  set     messages: + msg283090
2016-12-13 09:35:22  sria91          set     messages: + msg283089
2016-12-13 08:50:12  wolma           set     nosy: + steven.daprano, wolma; messages: + msg283085; versions: + Python 3.7, - Python 3.5
2016-12-13 04:21:41  sria91          create