Issue 30999: statistics module: add a general selection function

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/75182

classification

Title:	statistics module: add a general selection function
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.11

process

Status:	closed	Resolution:	later
Dependencies:		Superseder:
Assigned To:		Nosy List:	gerion, godfryd, iritkatriel, remi.lapeyre, rhettinger, steven.daprano
Priority:	normal	Keywords:

Created on 2017-07-23 22:10 by gerion, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (12)
msg298918 - (view)	Author: (gerion)	Date: 2017-07-23 22:10
The statistics module was added in Python 3.4. It would be cool if the functions median_low(), median_high() and mode() accepted a "key" keyword argument, just like max() and min():

```
>>> median_low([(1, 2), (3, 3), (4, 1)], key=lambda elem: elem[0])
(3, 3)
```

These functions always choose a specific element of the list, so a "key" argument is meaningful. Maybe such a parameter makes sense for mean() as well, with the return value always being the computed result itself, but that is a separate point:

```
>>> mean([(1, 2), (3, 3), (4, 1)], key=lambda elem: elem[0])
2.6666666666666665
```
msg298924 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2017-07-24 00:43
Apart from being "cool", what is the purpose of this key argument?

For the example shown, where you extract an item from tuple data:

    >>> median_low([(1, 2), (3, 3), (4, 1)], key=lambda elem: elem[0])
    (3, 3)

I'm not sure I understand when you would use this, and why you would describe (3,3) as a median (a kind of average) of the given data.

By the way, although it's not (yet?) officially supported, it turns out that this works:

    py> median_low([(1, 2), (3, 3), (4, 1)])
    (3, 3)

Officially, median requires numeric data. If the median* functions were to support tuples, I would be inclined to return a new tuple with the median of each column, as such:

    median_low([(1, 2), (3, 3), (4, 1)])
    (3, 2)  # median of 1,3,4 and median of 2,3,1

I can think of uses for that, e.g. calculating the "Q" correlation coefficient.

What uses do you have for your suggested key argument?
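
As a rough sketch of the per-column behaviour described in the message above: the helper name `columnwise_median_low` is hypothetical and is not part of the statistics module. It only illustrates one way such a result could be computed today.

```python
from statistics import median_low


def columnwise_median_low(rows):
    """Return a tuple holding the low median of each column of `rows`.

    Hypothetical helper for illustration; not part of the statistics module.
    """
    # zip(*rows) transposes the rows into columns.
    return tuple(median_low(column) for column in zip(*rows))


result = columnwise_median_low([(1, 2), (3, 3), (4, 1)])
print(result)  # (3, 2): median of 1,3,4 and median of 2,3,1
```

Each column is treated independently, so the returned tuple need not be one of the input rows.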
msg298925 - (view)	Author: (gerion)	Date: 2017-07-24 01:46
My use case is side data that is somehow connected to the statistically relevant data. (I think this is more or less the same use case as with the min and max functions.)

A few examples:

The data structure is a list of tuples: (score, [list of people that have this score])

```
median = median_low([(1, ['Anna']), (3, ['Paul', 'Henry']), (4, ['Kate'])], key=lambda elem: elem[0])
for name in median[1]:
    print(f"{name} is one of the people that reach the median score.")
```

or you can enumerate:

```
data = [1, 3, 4]
median = median_low(enumerate(data), key=lambda elem: elem[1])
print(f"median is at position {median[0]}")
```

With the keyword argument, the input can also be a list of self-defined objects, where the median makes sense for some member variable or method, etc.:

```
>>> median_low(list_of_self_defined_objects, key=lambda elem: elem.get_score())
```
msg298992 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2017-07-24 16:54
Thanks for explaining your use-case.

Although the median_* functions don't perform arithmetic on their data, they are still conceptually mathematical functions that operate on numbers and I'm reluctant to support arbitrary objects with a key function without a solid reason.

In your example, I think there are existing ways to get the result you want:

(1) Use a dict:

    data = dict([(1, ['Anna']), (3, ['Paul', 'Henry']), (4, ['Kate'])])
    people = data[median_low(data)]

(2) Use a custom numeric type with associated data:

    class MyNum(int):
        def __new__(cls, num, data):
            instance = super().__new__(cls, num)
            instance.data = data
            return instance

    data = [MyNum(1, ['Anna']), MyNum(3, ['Paul', 'Henry']), MyNum(4, ['Kate'])]
    people = median_low(data).data

As for your second example, do you have a use-case for wanting to know the position of the median in the original, unsorted list? When would that be useful?

One other reason for my reluctance: although median_low and median_high guarantee to only return an actual data point, that's a fairly special case. There are other order statistics (such as quartiles, quantiles, etc) which are conceptually related to median but don't necessarily return a data value. Indeed, the regular median() function doesn't always do so. I would be reluctant for median() and median_low() to have different signatures without an excellent reason.

I'm not completely ruling this out. One thing which might sway me is if there are other languages or statistics libraries which offer this feature. (I say might, not that it definitely will.)
msg299013 - (view)	Author: (gerion)	Date: 2017-07-24 19:29
The position might be useful if you have a second list with some side data stored in it, rather than a list of tuples :).

I had the idea to file a bug when I had a list of coordinates and wanted to use the point with the median of the x-coordinates as a "representation" of the dataset. With max() and min() in mind, I used median_low() with a key argument and got the error that key is not a valid argument (my solution was to use dicts then). So I thought this would be a similar use case to max() and min() and, in fact, more consistent.

But I fully understand your concern that this breaks consistency with the other statistics functions.

This is not a killer feature, but in my opinion it is nice to have, because it changes nothing about the default (expected) behaviour, but provides a lot of flexibility with little code.

I cannot say anything about other languages.
msg299038 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2017-07-25 01:33
I've given this some more thought, and I think that a "key" argument would make sense for a general selection function.

The general selection problem is: given a set of items A, and a number k between 1 and the number of items, return the k-th item. In Python terms, we would use a list, and 0 <= k < len(A) instead.

https://www.cs.rochester.edu/~gildea/csc282/slides/C09-median.pdf

I've had the idea of adding a select(A, k) function to statistics for a while now. Then the median_low would be equivalent to select(A, len(A)//2) and median_high would be select(A, len(A)//2 + 1). I'd leave the median_* functions as they are, and possibly include a key function in select.

I don't think it makes sense to add key arguments to mode, mean, variance etc. I'm having trouble thinking of what that would even mean (no pun intended): it's unlikely that the mean will actually be a data value (except by accident, or by careful construction of the data). Variance has the wrong units (it is the units of your data, squared) and the stdev is conceptually a difference between data values, not a data value itself, so it doesn't even make sense to apply a key function and return one of the data points.

And mode counts objects, so it already applies to non-numeric data. It's even documented as applying to nominal data.
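
For illustration only, the general selection problem stated above can be solved without fully sorting the data. The sketch below uses quickselect, which is one well-known approach and not necessarily what was intended for the statistics module. The function name `quickselect` and its signature are assumptions, and the keys are assumed to be totally ordered (no NaNs).

```python
import random


def quickselect(items, k, *, key=None):
    """Return the item that would sit at 0-based index k if `items` were
    sorted by `key`. Hypothetical sketch; average O(n) time.
    """
    data = list(items)
    if not 0 <= k < len(data):
        raise IndexError("k is out of range")
    if key is None:
        key = lambda item: item
    while True:
        # Partition the remaining candidates around a randomly chosen pivot key.
        pivot = key(random.choice(data))
        smaller = [x for x in data if key(x) < pivot]
        equal = [x for x in data if key(x) == pivot]
        if k < len(smaller):
            data = smaller
        elif k < len(smaller) + len(equal):
            # Filtering preserves input order, so ties resolve the same way
            # a stable sort would resolve them.
            return equal[k - len(smaller)]
        else:
            k -= len(smaller) + len(equal)
            data = [x for x in data if key(x) > pivot]


records = [(4, 'd'), (1, 'a'), (3, 'c'), (7, 'e')]
print(quickselect(records, 1, key=lambda r: r[0]))  # (3, 'c')
```

The sketch calls `key` several times per element for brevity. A production version would probably compute each key once.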
msg299549 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2017-07-31 09:41
> I don't think it makes sense to add key arguments to mode, mean,
> variance etc. I'm having trouble thinking of what that would
> even mean

I concur. This proposal bends the concept of a key-function to where it is no longer obvious what it does.

> I've given this some more thought, and I think that a "key"
> argument would make sense for a general selection function.

Yes, that would make sense:

    select(A, k, key=somefunc) == sorted(A, key=somefunc)[k]
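
A minimal sketch of the identity above, assuming a hypothetical `select` that is simply defined through `sorted`. No such function exists in the statistics module. With 0-based ranks, the positions that reproduce the current median_low and median_high results are `(n - 1) // 2` and `n // 2`.

```python
from statistics import median_high, median_low


def select(items, k, *, key=None):
    """Hypothetical select(): the item at 0-based rank k under `key`."""
    return sorted(items, key=key)[k]


records = [(4, ['Kate']), (1, ['Anna']), (7, ['Omar']), (3, ['Paul', 'Henry'])]
score = lambda record: record[0]
n = len(records)

low = select(records, (n - 1) // 2, key=score)   # record at the low median
high = select(records, n // 2, key=score)        # record at the high median

# The selected records carry the same scores the median_* functions report.
scores = [score(r) for r in records]
assert low[0] == median_low(scores)
assert high[0] == median_high(scores)
print(low, high)  # (3, ['Paul', 'Henry']) (4, ['Kate'])
```

Unlike median_low, this returns the whole record, so the associated side data stays attached to the selected score.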
msg327931 - (view)	Author: Michal Nowikowski (godfryd)	Date: 2018-10-18 04:06
What is the progress of this issue? I'm also interested in this feature. I expected these functions to behave like the built-in min and max, which accept a key argument; see: https://docs.python.org/3/library/functions.html#max
msg329266 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-11-04 21:52
This issue (as originally proposed) should be closed. A key function for median() and mode() likely isn't a good idea. Those two functions should be kept parallel with mean() as returning simple descriptive statistics. Work towards a select() function with a key function can be pursued in a separate tracker item. That would suffice to locate a specific record occurring at a median (or quartile or decile). FWIW, that is how MS Excel approaches the problem as well (using RANK with INDEX to locate a record by its sort position, leaving AVERAGE, MODE.SNGL, and MEDIAN for straight descriptive statistics).
msg333965 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2019-01-18 14:07
I suggest we close this issue in favor of #35775 to discuss adding a selection function and the attached PR.
msg399970 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-08-20 14:10
Updating the subject according to the discussion.
msg399997 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-08-20 21:48
No one has shown any interest in this in a long time. Marking it as closed for now. This issue can be reopened if there is an interest and a reasonable use case that can't be reasonably handled with sorted, min, or max.
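
As an illustration of handling the use cases from this thread with `sorted` alone, here is a small sketch. The sample data and variable names are made up for the example.

```python
# Use case 1: pick the point whose x-coordinate is the (low) median.
points = [(5.0, 1.0), (1.0, 9.0), (3.0, 4.0), (8.0, 2.0), (2.0, 7.0)]
by_x = sorted(points, key=lambda p: p[0])
representative = by_x[(len(by_x) - 1) // 2]
print(representative)  # (3.0, 4.0)

# Use case 2: find the position of the (low) median in the unsorted list.
data = [4, 1, 3]
ranked = sorted(enumerate(data), key=lambda pair: pair[1])
position, value = ranked[(len(data) - 1) // 2]
print(position, value)  # 2 3
```

Both cases sort the full input, which costs O(n log n) but needs no new API.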

History
Date	User	Action	Args
2022-04-11 14:58:49	admin	set	github: 75182
2021-08-20 21:48:20	rhettinger	set	status: open -> closed resolution: later messages: + msg399997 stage: resolved
2021-08-20 18:11:20	mark.dickinson	set	nosy: - mark.dickinson
2021-08-20 14:10:55	iritkatriel	set	nosy: + iritkatriel title: statistics module: add "key" keyword argument to median, mode, ... -> statistics module: add a general selection function messages: + msg399970 versions: + Python 3.11, - Python 3.7
2019-01-18 14:07:22	remi.lapeyre	set	nosy: + remi.lapeyre messages: + msg333965
2018-11-04 21:52:30	rhettinger	set	messages: + msg329266
2018-10-18 04:06:23	godfryd	set	nosy: + godfryd messages: + msg327931
2017-07-31 09:41:31	rhettinger	set	nosy: + rhettinger messages: + msg299549
2017-07-25 01:33:09	steven.daprano	set	messages: + msg299038
2017-07-24 19:29:27	gerion	set	messages: + msg299013
2017-07-24 16:54:09	steven.daprano	set	messages: + msg298992
2017-07-24 06:46:55	mark.dickinson	set	nosy: + mark.dickinson
2017-07-24 01:46:03	gerion	set	messages: + msg298925
2017-07-24 00:43:59	steven.daprano	set	nosy: + steven.daprano messages: + msg298924
2017-07-23 22:10:22	gerion	create