classification
Title: Add a default to statistics.mean and related functions
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.9
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Yoni Lavi, mark.dickinson, rhettinger, steven.daprano, taleinat, vstinner
Priority: normal Keywords: patch

Created on 2019-12-19 03:06 by Yoni Lavi, last changed 2019-12-20 13:32 by Yoni Lavi. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 17657 closed Yoni Lavi, 2019-12-19 03:10
Messages (9)
msg358653 - Author: Yoni Lavi (Yoni Lavi) * Date: 2019-12-19 03:06
I would like to put forward an argument in favour of a `default` parameter in the statistics.mean function and the related functions.

What motivated me to open this is that, more often than not, my code wraps each mean calculation in a check (or try-except) that substitutes a default/sentinel value, and I felt that there should be a better way.

Please also note that we have a precedent for this in a similar parameter added to min & max in 3.4 (https://bugs.python.org/issue18111)
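For reference, the min/max precedent cited above can be sketched as follows. The `default` keyword is the real builtin parameter added in Python 3.4; the sample values are made up for illustration.

```python
# The `default` keyword of the builtins min() and max() (Python 3.4+).
empty = iter([])  # an empty iterator, e.g. the result of a filter

# Without `default`, an empty iterable raises ValueError.
try:
    max(iter([]))
except ValueError as exc:
    print("no default:", exc)

# With `default`, the fallback value is returned instead of raising.
print(max(empty, default=0))
print(min([3, 1, 2], default=0))
```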
msg358658 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-12-19 06:55
I vote -1.  We don't have defaults for stdev() or median() or mode().  And it isn't clear what one would use for a meaningful default value in most cases.  Also, I'm not seeing anything like this in Pandas, Excel, etc.  So, I recommend keeping the current simple and clean APIs.
msg358659 - Author: Tal Einat (taleinat) * (Python committer) Date: 2019-12-19 07:49
It seems to me that this would follow the same argument as in issue #18111: The real issue is that there's no good way to check if an arbitrary iterable is empty, unlike with sequences. Currently, callers need to wrap with try/except to handle empty iterators properly, or do non-trivial iterator "magic" to check whether the iterator is empty before passing it in.

I've tried to think of other solutions, such as a generic wrapper for such functions or a helper to check whether an iterable is empty, and they all turn out to be very clunky to use and un-Pythonic.

Since we provide first-class support for iterators, and many builtins return iterators, giving the tools to handle the case where they are empty elegantly and simply seems prudent.
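To illustrate the kind of iterator "magic" described above, here is a minimal sketch of an emptiness check for an arbitrary iterable. The helper name `peek_nonempty` is hypothetical and not part of any library; it is only meant to show why such checks tend to feel clunky at the call site.

```python
from itertools import chain
import statistics

def peek_nonempty(iterable):
    """Hypothetical helper: return (has_items, iterator).

    Consumes the first item to test for emptiness, then chains it back
    so the returned iterator still yields every original item.
    """
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return False, iter(())
    return True, chain([first], it)

# The caller has to unpack and rebind the iterator before using it.
has_items, values = peek_nonempty(x * 2 for x in range(4))
average = statistics.mean(values) if has_items else None
print(average)
```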
msg358671 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-12-19 10:54
What would the proposal look like for `statistics.stdev`? There you need at least two data points to compute a result, and a user might want to do different things for an empty dataset versus a single data point.
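A minimal sketch of the situation described above: `statistics.stdev` raises the same `StatisticsError` for an empty dataset and for a single data point, so a single fallback value cannot tell the two cases apart. The helper name `stdev_or_none` is hypothetical and used only for illustration.

```python
import statistics

def stdev_or_none(sample):
    """Hypothetical helper: sample standard deviation, or None if undefined."""
    try:
        return statistics.stdev(sample)
    except statistics.StatisticsError:
        # Raised for both zero and one data point.
        return None

for sample in ([], [1.5], [1.5, 2.5]):
    print(len(sample), "data point(s):", stdev_or_none(sample))
```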
msg358672 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-12-19 11:12
> I've tried to think of other solutions, such as a generic wrapper for such functions or a helper to check whether an iterable is empty, and they all turn out to be very clunky to use and un-Pythonic.

So the main use case would be to detect an empty iterable in an efficient fashion? Something like the following code?

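from statistics import mean

sentinel = object()  # unique marker that no real mean can equal
avg = mean(data, default=sentinel)  # `default` is the proposed parameter
if avg is sentinel:
   ... # special code path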

Why not add a statistics.StatisticsError subclass for the empty-input case (e.g. StatisticsEmptyError)? Something like:

try:
   avg = mean(data)
except statistics.StatisticsEmptyError:
   ... # special code path, ex: avg = default

Or is there another use case for the proposed default parameter?
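Either variant above can be approximated with the existing API by catching `statistics.StatisticsError`, which `mean` raises on empty input. The sketch below is one possible wrapper, not a proposed implementation; the names `mean_or_default` and `_MISSING` are hypothetical.

```python
import statistics

_MISSING = object()  # hypothetical private sentinel: "no default supplied"

def mean_or_default(data, default=_MISSING):
    """Hypothetical wrapper: mean of data, or `default` if data is empty.

    Works for iterators as well as sequences, because emptiness is
    detected through the exception rather than through truthiness.
    """
    try:
        return statistics.mean(data)
    except statistics.StatisticsError:
        if default is _MISSING:
            raise  # keep the current behaviour when no default is given
        return default

print(mean_or_default([1, 2, 3]))
print(mean_or_default(iter([]), default=0))
```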
msg358674 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-12-19 11:21
TL;DR: I'm not likely to accept this feature request without at least one of (1) a practical use-case, (2) prior art in other statistics software, or (3) a strong mathematical justification for why this is meaningful and useful.


I'm not categorically against this idea, but it seems a bit fishy to me. If you have no data, how do you know what default value to give that would be appropriate for your (non-existent) observations?

It might help if you could show a real-life example of how, and why, you would use this, and how you would choose the default?

Another possibility would be to find prior-art: another language, library or stats calculator which already offers this feature.

Alternatively, a mathematical/statistical justification for a default. For example, the empty sum is normally taken as 0 and the empty product as 1. R returns either a NAN or NA for the empty mean (depending on precisely how you calculate it).

While I'm personally sympathetic to the nuisance factor of having to wrap code in try...except blocks (my *personal* preference would have been for mean to return NAN on empty input) I think you will need to make a stronger case than just the analogy with min and max.
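The conventions mentioned above can be checked in Python itself. The sketch below shows the empty sum and empty product, and a NaN-returning mean as an illustration of the stated personal preference only; the helper name `mean_or_nan` is hypothetical.

```python
import math
import statistics

print(sum([]))        # empty sum
print(math.prod([]))  # empty product (math.prod exists since Python 3.8)

def mean_or_nan(data):
    """Hypothetical helper: return NaN instead of raising on empty input."""
    try:
        return statistics.mean(data)
    except statistics.StatisticsError:
        return math.nan

result = mean_or_nan([])
# NaN never compares equal to itself, so it must be tested with math.isnan().
print(math.isnan(result))
```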
msg358696 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-12-20 05:48
Thought experiment: Suppose someone proposed, "math.log(x) should take an optional default argument because it is inconvenient to catch a ValueError if the input is non-positive".   Or more generally, what if someone proposed, "every function in Python that can raise a ValueError should offer a default argument."  One could imagine a use case for both of these proposals but that doesn't mean that the API extensions would be warranted.

Also, ISTM the analogy to min() and max() is imperfect.  Those aren't descriptive statistics.  For min() and max() we can know a priori that a probability is never lower than 0.0 or greater than 1.0 for example.

Lastly, in common cases where the input is a sequence (rather than just an iterator), we already have a ternary operator that does the job nicely:

   central_value = mean(data) if data else 'unknown'

For the less common case, a try/except is not an undue burden; after all, it is a basic core language feature.
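A short sketch of the two cases distinguished above, using made-up sample data. The ternary works for sequences because an empty list is falsy; a generator is always truthy, so the less common iterator case needs the try/except.

```python
from statistics import StatisticsError, mean

data = [2, 4, 6]
print(mean(data) if data else 'unknown')  # non-empty sequence

data = []
print(mean(data) if data else 'unknown')  # empty sequence is falsy

lazy = (x for x in [])  # an empty generator
print(bool(lazy))       # generators are truthy even when empty

# The ternary does not protect mean() here, so fall back to try/except.
try:
    central_value = mean(lazy) if lazy else 'unknown'
except StatisticsError:
    central_value = 'unknown'
print(central_value)
```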
msg358700 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-12-20 10:02
I agree with Raymond's comments, except that because I'm sometimes a bit
of a pedant, I have to make one minor correction: max and min can be
descriptive statistics.

The sample minimum is the 1st order statistic, and the sample maximum is
the N-th order statistic:

https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec15.pdf

This doesn't invalidate the rest of what Raymond says.

Yoni Lavi, thank you for the suggestion, but I'm going to close this ticket. If you think you have a really strong argument for the feature, please feel free to make it here, and we will rethink the closure. But I don't want to give you false hope: it would have to be a very strong argument.
msg358708 - Author: Yoni Lavi (Yoni Lavi) * Date: 2019-12-20 13:32
Thanks for the good feedback everyone and apologies for the unresponsiveness over the past day.

I understand that my use cases may not reflect wider usage patterns and am not looking to argue against the closing. But anyway, for future reference, I'll add two real-life usage examples, which I should have included originally (again, apologies for the delay, things have been hectic).

The context is that I'm involved in running a coding bootcamp, and these are two recent cases where I needed a default of zero:

1. (Separately from the final grade calculations) We are interested in students' average grades on their projects as an indicator of their skills gained and their striving for excellence. When calculating this indicator, we use an average of 0 for a student who hasn't yet submitted anything.

2. When providing tutoring support, we classify the "complexity" of each student issue, and then one of our indicators involves the average complexity of questions in a particular slice of time and the programme (this is particularly interesting around changes to the content). For this as well, a slice of time/programme/tutor during which there were no issues would be considered as having a complexity of 0.

Again, not disputing the decision to close, just adding these examples for future reference.
Thanks
History
Date User Action Args
2019-12-20 13:32:13 Yoni Lavi set messages: + msg358708
2019-12-20 10:02:55 steven.daprano set status: open -> closed
resolution: rejected
messages: + msg358700
stage: patch review -> resolved
2019-12-20 05:48:45 rhettinger set messages: + msg358696
2019-12-19 11:21:33 steven.daprano set messages: + msg358674
2019-12-19 11:12:06 vstinner set nosy: + vstinner
messages: + msg358672
2019-12-19 10:54:46 mark.dickinson set nosy: + mark.dickinson
messages: + msg358671
2019-12-19 07:49:18 taleinat set messages: + msg358659
2019-12-19 06:55:44 rhettinger set messages: + msg358658
2019-12-19 03:10:37 Yoni Lavi set keywords: + patch
stage: patch review
pull_requests: + pull_request17124
2019-12-19 03:07:43 xtreak set nosy: + rhettinger, taleinat, steven.daprano
2019-12-19 03:06:41 Yoni Lavi create