Issue 28956: return list of modes for a multimodal distribution instead of raising a StatisticsError

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73142

classification

Title:	return list of modes for a multimodal distribution instead of raising a StatisticsError
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.7

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	rhettinger, scotchka, sria91, steven.daprano, terry.reedy, wolma
Priority:	normal	Keywords:	patch

Created on 2016-12-13 04:21 by sria91, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked
PR 49		sria91, 2016-12-13 04:21
PR 50		sria91, 2016-12-13 09:56
PR 5732	closed	sria91, 2018-02-18 10:05

Messages (15)
msg283071 - (view)	Author: Srikanth Anantharam (sria91) *	Date: 2016-12-13 04:21
return minimum of modes for a multimodal distribution instead of raising a StatisticsError
msg283085 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2016-12-13 08:50
What's the justification for this proposed change? Isn't it better to report the fact that there isn't an unambiguous result instead of returning a rather arbitrary one?
msg283089 - (view)	Author: Srikanth Anantharam (sria91) *	Date: 2016-12-13 09:35
A better choice would be to return a tuple of values (sliced from the table). And let the user decide which one to use. Hope that's justifiable...
msg283090 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2016-12-13 09:54
On Tue, Dec 13, 2016 at 09:35:22AM +0000, Srikanth Anantharam wrote:
> A better choice would be to return a tuple of values (sliced from the
> table). And let the user decide which one to use.

The current mode() function is designed for a very basic use-case, where you have an obvious single mode from discrete data. The problem with dealing with multiple modes is that it's not easy to tell the difference between a genuinely multi-modal sample and one which just happens to have a few samples with the same value:

    data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Assuming the sampling is fair, 8 is clearly the mode; but is it bimodal with 4 the second mode? Or perhaps even four modes, 8, 4, 7 and 9?

I have plans for introducing a binning function to collect data into bins and run statistics on the bins. That might be a better way to deal with multi-modal samples: if you bin the data (for discrete data, use a bin size of 1) and then look at the frequencies, you can decide how many modes there are.

Thanks for the suggestion.
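The "bin size of 1, then look at the frequencies" idea from the message above can be sketched with collections.Counter. This is only an illustration of the frequency table for the sample data, not the binning function Steven mentions planning; the variable name `frequencies` is an assumption made for the example.

```python
from collections import Counter

data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# With a bin size of 1, each distinct value is its own bin.
frequencies = Counter(data)

# Print a crude text histogram, one row per bin, in value order.
for value in sorted(frequencies):
    print(f"{value}: {'*' * frequencies[value]}")
```

The histogram makes the judgement call visible: 8 stands out, while 4, 7 and 9 are smaller bumps that a reader may or may not count as modes.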
msg283091 - (view)	Author: Srikanth Anantharam (sria91) *	Date: 2016-12-13 10:08
Please see the updated pull request PR 50, with the changes.
msg283092 - (view)	Author: Srikanth Anantharam (sria91) *	Date: 2016-12-13 10:17
@steven:

    data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

is clearly unimodal with mode 8.

data would have been bimodal if 4 repeated exactly the same (7) number of times as 8, like this:

    data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

in which case the new patch in PR 50 would return a tuple (4, 8).
msg283154 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2016-12-14 01:55
On Tue, Dec 13, 2016 at 10:08:10AM +0000, Srikanth Anantharam wrote:
> Please see the updated pull request PR 50, with the changes.

I'm rejecting that pull request. As I said, mode() intentionally returns only the single, unique mode. I may add a more advanced API or a second function for dealing with multi-modal samples, but even if I do, your suggestion wouldn't be sufficient. Your pull request merely returns the entire list of unique values:

    return tuple(value for value, frequency in table)

with no way for the caller to tell which values might be a mode and which are not.

(By the way, even if this function behaviour was acceptable, which I stress it is not, this would not be sufficient for me to accept as a patch. You should preferably update the documentation and the tests as well. At the very least, you should update the function's docstring to explain the changed return value.)

I'm sorry that I have to reject this; I am interested in having better support for multiple modes. I'm not closing this issue just yet. If you are interested in continuing the discussion, what would be VERY valuable for me would be for you or some other volunteer to do some research into numerical techniques for objectively determining the number and value of modes, rather than just plotting a graph and subjectively deciding whether a value is a peak or not.

Thanks for your interest.
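The objection above can be illustrated with a small sketch. It assumes, purely for illustration, that `table` holds a (value, frequency) pair for every distinct value; the names `all_values`, `max_frequency` and `tied_modes` are hypothetical and are not taken from the pull request.

```python
from collections import Counter

data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

# Assumption: `table` lists (value, frequency) pairs for every distinct value.
table = Counter(data).most_common()

# Returning every value in the table discards the frequency information,
# so the caller cannot tell modes from non-modes.
all_values = tuple(value for value, frequency in table)

# Keeping only the values tied for the highest frequency instead.
max_frequency = max(frequency for value, frequency in table)
tied_modes = tuple(
    sorted(value for value, frequency in table if frequency == max_frequency)
)
```

Under that assumption, `all_values` contains all nine distinct values, while `tied_modes` contains only 4 and 8.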
msg283155 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2016-12-14 02:00
On Tue, Dec 13, 2016 at 10:17:21AM +0000, Srikanth Anantharam wrote:
> @steven:
>
> data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
> is clearly unimodal with mode 8
>
> data would have been bimodal if 4 repeated exactly the same (7) number of
> times as 8, like this:
> data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]

Bimodal distributions do not require both modes to be exactly the same height. And certainly when you have a sample from a bimodal distribution, you should not expect exactly the same frequency for the two modes. Just from random sampling error you will expect one or the other to have a larger frequency.

You shouldn't take my example too literally. With such a small sample of discrete values, it becomes a (hard) matter of personal judgement. The point I was attempting to make was that identifying sample modes outside of the simplest unimodal case is tricky and requires much thought.
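To see why peaks of unequal height still matter, here is one naive heuristic, offered only as a sketch: treat a value as a peak if its frequency is strictly higher than that of both neighbouring integers. The function `local_peaks` is hypothetical and is not a technique proposed anywhere in this thread; it assumes integer data with a bin size of 1.

```python
from collections import Counter


def local_peaks(data):
    """Return values whose frequency exceeds that of both neighbouring values.

    Naive heuristic for integer data with a bin size of 1; a missing
    neighbour counts as frequency 0.
    """
    frequencies = Counter(data)
    return [
        value
        for value in sorted(frequencies)
        if frequencies[value] > frequencies[value - 1]
        and frequencies[value] > frequencies[value + 1]
    ]


data = [1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9]
peaks = local_peaks(data)
```

On the original sample this flags both 4 and 8, even though 4 occurs only three times against seven for 8. Whether 4 is a genuine second mode or sampling noise is exactly the judgement call described above, and a different neighbourhood rule would give a different answer.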
msg283453 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2016-12-17 00:22
Srikanth, when you reply by email, please remove the quotation of the previous message. On the web page, it is just noise. The only exception should be when you reply to a specific sentence and need to quote that sentence for context.

In my particular experience, mode() is usually reserved for crudely describing unordered categorical data, where the concept of 'minimum' does not apply. Mode is useful for determining the winner of a vote (or other decision process), but in general, it is not a substitute for a more comprehensive look at a dataset.

Problems with possibly returning a tuple of data items instead of a data item include:

1. The user then has to be prepared to handle a tuple instead of a data item. It would be better then to always return a tuple, even for 1 item.

2. Data items can be tuples, making a tuple return ambiguous. Example use case: planar points with int coordinates.

    >>> mode(((0,0), (0,0), (0,1)))
    (0, 0)

So, while StatisticsError is a nuisance, so are the apparent alternatives. I think we should leave mode alone and close this.
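Both problems listed above can be made concrete with a short sketch. The functions `mode_or_tuple` and `modes_as_tuple` are hypothetical illustrations written for this example; neither is the code from any of the pull requests, and both assume non-empty input.

```python
from collections import Counter


def mode_or_tuple(data):
    """Hypothetical: return the single mode, or a tuple of modes on a tie."""
    counts = Counter(data)
    highest = max(counts.values())
    modes = tuple(value for value, count in counts.items() if count == highest)
    return modes[0] if len(modes) == 1 else modes


def modes_as_tuple(data):
    """Hypothetical: always return a tuple of modes, even for a single mode."""
    counts = Counter(data)
    highest = max(counts.values())
    return tuple(value for value, count in counts.items() if count == highest)


# A single planar point as the mode, versus a tie between the ints 0 and 1.
point_result = mode_or_tuple([(0, 1), (0, 1), (0, 0)])
tie_result = mode_or_tuple([0, 0, 1, 1])

# Always returning a tuple keeps the two cases distinguishable.
point_modes = modes_as_tuple([(0, 1), (0, 1), (0, 0)])
tie_modes = modes_as_tuple([0, 0, 1, 1])
```

With the mixed return type, `point_result` and `tie_result` are the same tuple, so the caller cannot tell one mode from two. With the always-a-tuple variant the results differ, at the cost of changing the return type for every existing caller.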
msg312303 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2018-02-18 10:08
What makes the minimum mode better than the maximum?
msg312305 - (view)	Author: Srikanth Anantharam (sria91) *	Date: 2018-02-18 10:13
Please review the new PR with tests. I'll update the documentation if the PR is acceptable.
msg337620 - (view)	Author: Henry Chen (scotchka) *	Date: 2019-03-10 16:42
The problem remains that the function can return a number or a list for input that is a list of numbers. This means the user will need to handle both possibilities every time, which is a heavy burden for such a simple function. SciPy's mode function does return the minimum mode when there is a tie, which as far as I can tell is an arbitrary choice. But in that context, since the input is almost always numerical, a minimum is at least well defined, which is not true for an input with a mix of types. For the general use case, the current behavior - raising an exception - in case of a tie conveys the most information.
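The point about a minimum being well defined only for orderable input can be sketched as follows. The function `minimum_mode` is a hypothetical illustration of minimum-based tie-breaking and is not SciPy's implementation; it assumes non-empty input.

```python
from collections import Counter


def minimum_mode(data):
    """Hypothetical: break ties by returning the smallest of the tied modes."""
    counts = Counter(data)
    highest = max(counts.values())
    return min(value for value, count in counts.items() if count == highest)


# Numeric input: the tie between 4 and 8 resolves to the smaller value.
numeric_result = minimum_mode([4, 4, 8, 8, 9])

# Mixed-type input: str and int cannot be ordered, so min() fails.
try:
    minimum_mode(["red", "red", 3, 3])
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

For purely numeric data the tie-break is arbitrary but computable; for the mixed-type tie there is no minimum to return at all.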
msg337622 - (view)	Author: Henry Chen (scotchka) *	Date: 2019-03-10 17:10
Yes, the mode function could ALWAYS return a list, but that breaks backward compatibility, as does the currently proposed change.
msg337625 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-03-10 18:04
See the competing proposal and PR at https://bugs.python.org/issue35892 and https://github.com/python/cpython/pull/12089
msg337656 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2019-03-11 11:01
I'm closing this issue in favour of Raymond's #35892. Thank you to everyone, even if your PRs didn't get used; I appreciate your efforts.

History
Date	User	Action	Args
2022-04-11 14:58:40	admin	set	github: 73142
2019-03-11 11:01:24	steven.daprano	set	status: open -> closed resolution: rejected messages: + msg337656 stage: patch review -> resolved
2019-03-10 18:04:21	rhettinger	set	nosy: + rhettinger messages: + msg337625
2019-03-10 17:10:13	scotchka	set	messages: + msg337622
2019-03-10 16:42:03	scotchka	set	nosy: + scotchka messages: + msg337620
2018-02-18 10:13:00	sria91	set	messages: + msg312305 title: return minimum of modes for a multimodal distribution instead of raising a StatisticsError -> return list of modes for a multimodal distribution instead of raising a StatisticsError
2018-02-18 10:08:48	steven.daprano	set	messages: + msg312303
2018-02-18 10:05:23	sria91	set	keywords: + patch stage: patch review pull_requests: + pull_request5512
2016-12-17 00:22:30	terry.reedy	set	nosy: + terry.reedy messages: + msg283453
2016-12-14 02:00:57	steven.daprano	set	messages: + msg283155
2016-12-14 01:55:37	steven.daprano	set	messages: + msg283154
2016-12-13 10:17:21	sria91	set	messages: + msg283092
2016-12-13 10:08:10	sria91	set	messages: + msg283091
2016-12-13 09:56:46	sria91	set	pull_requests: + pull_request4
2016-12-13 09:54:04	steven.daprano	set	messages: + msg283090
2016-12-13 09:35:22	sria91	set	messages: + msg283089
2016-12-13 08:50:12	wolma	set	nosy: + steven.daprano, wolma messages: + msg283085 versions: + Python 3.7, - Python 3.5
2016-12-13 04:21:41	sria91	create