
classification
Title: statistics module: add a general selection function
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.11

process
Status: closed
Resolution: later
Dependencies:
Superseder:
Assigned To:
Nosy List: gerion, godfryd, iritkatriel, remi.lapeyre, rhettinger, steven.daprano
Priority: normal
Keywords:

Created on 2017-07-23 22:10 by gerion, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (12)
msg298918 - (view) Author: (gerion) Date: 2017-07-23 22:10
The statistics module was added in Python 3.4. It would be cool if the functions:
median_low()
median_high()
mode()
had a "key" keyword argument, just like max() and min():
```
>>> median_low([(1, 2), (3, 3), (4, 1)], key=lambda elem: elem[0])
(3, 3)
```
These functions always choose a specific element of the list, so a "key" argument is meaningful.
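
For illustration, the requested behaviour can be approximated today with sorted(). This is only a minimal sketch, not how the statistics module is implemented, and the name `median_low_by` is hypothetical:

```python
def median_low_by(data, key=None):
    """Return the element whose key is the low median (hypothetical helper)."""
    ordered = sorted(data, key=key)  # key=None compares the elements directly
    if not ordered:
        raise ValueError("no median for empty data")
    # 0-based index of the low median: the middle element for odd lengths,
    # the smaller of the two middle elements for even lengths.
    return ordered[(len(ordered) - 1) // 2]


points = [(1, 2), (3, 3), (4, 1)]
print(median_low_by(points, key=lambda elem: elem[0]))  # (3, 3)
```

Because sorted() is stable, elements with equal keys keep their original relative order, so the choice among ties is deterministic.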


Maybe such a parameter makes sense for mean() as well, provided the return value is always the computed result itself, but that is a separate point:
```
>>> mean([(1, 2), (3, 3), (4, 1)], key=lambda elem: elem[0])
2.6666666666666665
```
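
As a side note, the mean() case can already be written without a new parameter by extracting the values before the call. A small sketch using only the existing statistics.mean():

```python
from statistics import mean

data = [(1, 2), (3, 3), (4, 1)]
# Pull out the first field of each tuple, then average the extracted values.
first_field_mean = mean(elem[0] for elem in data)
print(first_field_mean)
```

This works because mean() returns a computed value rather than one of the input elements, so there is nothing to map back to the original tuples.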
msg298924 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-07-24 00:43
Apart from being "cool", what is the purpose of this key argument?

For the example shown, where you extract an item from tuple data:

>>> median_low([(1, 2), (3, 3), (4, 1)], key=lambda elem: elem[0])
(3, 3)

I'm not sure I understand when you would use this, and why you would describe (3,3) as a median (a kind of average) of the given data.


By the way, although it's not (yet?) officially supported, it turns out that this works:

py> median_low([(1, 2), (3, 3), (4, 1)])
(3, 3)

Officially, median requires numeric data. If the median* functions were to support tuples, I would be inclined to return a new tuple with the median of each column, like so:

median_low([(1, 2), (3, 3), (4, 1)])
(3, 2)  # median of 1,3,4 and median of 2,3,1


I can think of uses for that, e.g. calculating the "Q" correlation coefficient. What uses do you have for your suggested key argument?
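
The column-wise reading described above can be sketched with the existing median_low(). This is a hedged illustration only; `columnwise_median_low` is a hypothetical name, and the statistics module does not provide it:

```python
from statistics import median_low


def columnwise_median_low(rows):
    """Return a tuple holding the low median of each column (hypothetical helper)."""
    # zip(*rows) transposes the rows into columns.
    return tuple(median_low(column) for column in zip(*rows))


print(columnwise_median_low([(1, 2), (3, 3), (4, 1)]))  # (3, 2)
```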
msg298925 - (view) Author: (gerion) Date: 2017-07-24 01:46
My use case is side data that is connected to the statistically relevant data.
(I think this is more or less the same use case as for the min and max functions.)

A few examples:

The data structure is a list of tuples: (score, [list of people that have this score])
```
median = median_low([(1, ['Anna']), (3, ['Paul', 'Henry']), (4, ['Kate'])], key=lambda elem: elem[0])
for name in median[1]:
    print(f"{name} is one of the people that reach the median score.")
```
or you can enumerate:
```
data = [1, 3, 4]
median = median_low(enumerate(data), key=lambda elem: elem[1])
print(f"median is at position {median[0]}")
```
With the keyword argument, the input can also be a list of user-defined objects, where the median makes sense for some attribute or method, etc.:
```
>>> median_low(list_of_self_defined_objects, key=lambda elem: elem.get_score())
```
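
Without a key argument, the position lookup from the enumerate() example can be reproduced with sorted() alone. A minimal sketch, assuming the low median is wanted; the helper name `median_low_position` is hypothetical:

```python
def median_low_position(values):
    """Return the index in `values` of the low median (hypothetical helper)."""
    if not values:
        raise ValueError("no median for empty data")
    # Sort the positions by the values they point to, then take the
    # low-median slot of that ordering.
    order = sorted(range(len(values)), key=values.__getitem__)
    return order[(len(order) - 1) // 2]


data = [1, 3, 4]
print(f"median is at position {median_low_position(data)}")  # median is at position 1
```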
msg298992 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-07-24 16:54
Thanks for explaining your use-case.

Although the median_* functions don't perform arithmetic on their data, 
they are still conceptually mathematical functions that operate on 
numbers and I'm reluctant to support arbitrary objects with a key 
function without a solid reason. In your example, I think there are 
existing ways to get the result you want:

(1) Use a dict:

data = dict([(1, ['Anna']), (3, ['Paul', 'Henry']), (4, ['Kate'])])
people = data[median_low(data)]

(2) Use a custom numeric type with associated data:

class MyNum(int):
    def __new__(cls, num, data):
        instance = super().__new__(cls, num)
        instance.data = data
        return instance

data = [MyNum(1, ['Anna']), MyNum(3, ['Paul', 'Henry']), 
        MyNum(4, ['Kate'])]

people = median_low(data).data

As for your second example, do you have a use-case for wanting to know 
the position of the median in the original, unsorted list? When would 
that be useful?

One other reason for my reluctance: although median_low and median_high 
guarantee to only return an actual data point, that's a fairly special 
case. There are other order statistics (such as quartiles, quantiles, 
etc) which are conceptually related to median but don't necessarily 
return a data value. Indeed, the regular median() function doesn't 
always do so. I would be reluctant for median() and median_low() to have 
different signatures without an excellent reason.

I'm not completely ruling this out. One thing which might sway me is if 
there are other languages or statistics libraries which offer this 
feature. (I say *might*, not that it definitely will.)
msg299013 - (view) Author: (gerion) Date: 2017-07-24 19:29
The position might be useful, if you have a second list with some side data stored in it, and not a list of tuples :).

I had the idea to file a bug when I had a list of coordinates and wanted to use the point with the median of the x-coordinates as a "representation" of the dataset. With max() and min() in mind, I used median_low() with a key argument and got the error that key is not a valid argument (my solution was to use dicts then).

So I thought this would be a similar use case to max() and min() and in fact more consistent. But I fully understand your concern that this breaks consistency with the other statistics functions.

This is not a killer feature, but in my opinion it is nice to have, because it changes nothing about the default (expected) behaviour, yet provides a lot of flexibility with less code.

I cannot say anything about other languages.
msg299038 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-07-25 01:33
I've given this some more thought, and I think that a "key" argument 
would make sense for a general selection function.

The general selection problem is: given a set of items A, and a number k 
between 1 and the number of items, return the k-th item. In Python 
terms, we would use a list, and 0 <= k < len(A) instead.

https://www.cs.rochester.edu/~gildea/csc282/slides/C09-median.pdf

I've had the idea of adding a select(A, k) function to statistics for a 
while now. With 0-based k, median_low would be equivalent to select(A, 
(len(A) - 1)//2) and median_high would be select(A, len(A)//2). I'd leave 
the median_* functions as they are, and possibly include a key function 
in select.
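
To make the relationship concrete, here is a rough sketch of such a select() built on sorted(). It is an assumption about what the function could look like, not an existing statistics API. With 0-based indices, the low and high medians sit at positions (n - 1)//2 and n//2:

```python
from statistics import median_high, median_low


def select(items, k, key=None):
    """Return the item at 0-based rank k in key order (hypothetical sketch)."""
    ordered = sorted(items, key=key)
    if not 0 <= k < len(ordered):
        raise IndexError("k must satisfy 0 <= k < len(items)")
    return ordered[k]


# Check the correspondence for one odd-length and one even-length input.
for values in ([5, 1, 4], [7, 2, 9, 4]):
    n = len(values)
    assert select(values, (n - 1) // 2) == median_low(values)
    assert select(values, n // 2) == median_high(values)
```

For odd n both indices coincide, which matches median_low() and median_high() returning the same middle value.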

I don't think it makes sense to add key arguments to mode, mean, 
variance etc. I'm having trouble thinking of what that would even mean 
(no pun intended): it's unlikely that the mean will actually be a data 
value (except by accident, or by careful construction of the data). 
Variance has the wrong units (it is the units of your data, squared) and 
the stdev is conceptually a difference between data values, not a data 
value itself, so it doesn't even make sense to apply a key function and 
return one of the data points.

And mode counts objects, so it already applies to non-numeric data. 
It's even documented as applying to nominal data.
msg299549 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-07-31 09:41
> I don't think it makes sense to add key arguments to mode, mean,
> variance etc. I'm having trouble thinking of what that would
> even mean

I concur.  This proposal bends the concept of a key-function to where it is no longer obvious what it does.

> I've given this some more thought, and I think that a "key"
> argument would make sense for a general selection function.

Yes, that would make sense:

    select(A, k, key=somefunc) == sorted(A, key=somefunc)[k]
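
One way to satisfy that identity without sorting the whole input is heapq.nsmallest(), which is documented as equivalent to sorted(iterable, key=key)[:n]. The following is a sketch under that assumption; the name `select_partial` is hypothetical:

```python
import heapq


def select_partial(items, k, key=None):
    """Return the item at 0-based rank k without a full sort (hypothetical sketch)."""
    items = list(items)
    if not 0 <= k < len(items):
        raise IndexError("k must satisfy 0 <= k < len(items)")
    # The last of the k + 1 smallest items is the item of rank k.
    return heapq.nsmallest(k + 1, items, key=key)[-1]


records = [("Kate", 4), ("Anna", 1), ("Paul", 3), ("Henry", 3)]


def by_score(record):
    return record[1]


# The result agrees with the sorted()-based definition for every rank.
for k in range(len(records)):
    assert select_partial(records, k, key=by_score) == sorted(records, key=by_score)[k]
```

This mainly helps when k is small relative to the input size; for ranks near the middle, a plain sorted() call is likely just as good.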
msg327931 - (view) Author: Michal Nowikowski (godfryd) Date: 2018-10-18 04:06
What is the status of this issue?
I'm also interested in this feature.
I expected these functions to behave like the built-in min and max,
which take a key argument; see here:
https://docs.python.org/3/library/functions.html#max
msg329266 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-11-04 21:52
This issue (as originally proposed) should be closed.  A key function for median() and mode() likely isn't a good idea.  Those two functions should be kept parallel with mean() as returning simple descriptive statistics.  

Work towards a select() function with a key function can be pursued in a separate tracker item.  That would suffice to locate a specific record occurring at a median (or quartile or decile).  FWIW, that is how MS Excel approaches the problem as well (using RANK with INDEX to locate a record by its sort position, leaving AVERAGE, MODE.SNGL, and MEDIAN for straight descriptive statistics).
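
As a rough illustration of locating a record by its sort position, the sketch below uses the nearest-rank method with integer percentiles. Quantile definitions differ between tools, so this is one assumed convention rather than what Excel or the statistics module does, and `record_at_percentile` is a hypothetical name:

```python
def record_at_percentile(records, percent, key=None):
    """Return the record at an integer percentile (1 to 100), nearest-rank method."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(records, key=key)
    if not ordered:
        raise ValueError("no records to select from")
    # 1-based nearest rank: ceil(percent * n / 100), computed with integers only.
    rank = -(-percent * len(ordered) // 100)
    return ordered[rank - 1]


people = [("Kate", 4), ("Anna", 1), ("Paul", 3), ("Henry", 3)]
print(record_at_percentile(people, 50, key=lambda person: person[1]))  # ('Paul', 3)
```

With this convention, the 50th percentile of an even-length input is the low median record.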
msg333965 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2019-01-18 14:07
I suggest we close this issue in favor of #35775 to discuss adding a selection function and the attached PR.
msg399970 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-20 14:10
Updating the subject according to the discussion.
msg399997 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-08-20 21:48
No one has shown any interest in this in a long time.  Marking it as closed for now.

This issue can be reopened if there is an interest and a reasonable use case that can't be reasonably handled with sorted, min, or max.
History
Date | User | Action | Args
2022-04-11 14:58:49 | admin | set | github: 75182
2021-08-20 21:48:20 | rhettinger | set | status: open -> closed; resolution: later; messages: + msg399997; stage: resolved
2021-08-20 18:11:20 | mark.dickinson | set | nosy: - mark.dickinson
2021-08-20 14:10:55 | iritkatriel | set | nosy: + iritkatriel; title: statistics module: add "key" keyword argument to median, mode, ... -> statistics module: add a general selection function; messages: + msg399970; versions: + Python 3.11, - Python 3.7
2019-01-18 14:07:22 | remi.lapeyre | set | nosy: + remi.lapeyre; messages: + msg333965
2018-11-04 21:52:30 | rhettinger | set | messages: + msg329266
2018-10-18 04:06:23 | godfryd | set | nosy: + godfryd; messages: + msg327931
2017-07-31 09:41:31 | rhettinger | set | nosy: + rhettinger; messages: + msg299549
2017-07-25 01:33:09 | steven.daprano | set | messages: + msg299038
2017-07-24 19:29:27 | gerion | set | messages: + msg299013
2017-07-24 16:54:09 | steven.daprano | set | messages: + msg298992
2017-07-24 06:46:55 | mark.dickinson | set | nosy: + mark.dickinson
2017-07-24 01:46:03 | gerion | set | messages: + msg298925
2017-07-24 00:43:59 | steven.daprano | set | nosy: + steven.daprano; messages: + msg298924
2017-07-23 22:10:22 | gerion | create |