I really don't like the current inconsistent behavior. I can't help but see it as a bug, and I am surprised that folks think it is OK that harmonic_mean([0, "foo"]) returns 0. This is the *statistics* module, and the first line of https://docs.python.org/3/library/statistics.html says "This module provides functions for calculating mathematical statistics of numeric (Real-valued) data". I understand the desire to maintain the performance benefit of the early exit, but I would rather disallow values that are zero than accept the nonsensical result of harmonic_mean([0, "foo"]) and the inconsistency between that and the result of harmonic_mean(["foo", 0]). I spend most of my time working on numerical-ish code, so my world view might be different than that of y'all.
Here's a new proposal. To continue to accept zero values, and disallow negative values consistently (as the function is documented to do), *and* preserve the early exit, I propose that the try/except statement in harmonic_mean in master be changed to
try:
T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
except ZeroDivisionError:
if any(x < 0 for x in data):
raise StatisticsError(errmsg) from None
return 0
The change adds just the two lines in the except block. This doesn't change the "happy path" that is followed when all the values are strictly greater than 0. The cost of the additional check for any negative values in the input only occurs in the (presumably uncommon) case where an input value is 0. (This means, however, that any values before the first 0 are checked twice.)
> In the original design, I prohibited negative numbers because that seems to be what everyone says you should do, but as far as I can tell it is mathematically well defined so long as the sum of positive values doesn't equal the sum of negative values. And in that case, we could return a NAN.
Allowing negative values introduces some mathematical issues. Most importantly, the limit of the harmonic mean as two or more values approach zero (in the multivariable limit sense) no longer exists. Even in the simple case of computing the harmonic mean H(x, y) of two values, the limit of H(x, y) as (x, y) -> (0, 0) does not exist. H(x, y) is one of those funky functions where there are different paths to the origin in the second and fourth quadrants that lead to different values of the limit. Of course, practicality beats purity, so the harmonic mean could simply be defined to be 0 whenever any one of the inputs is 0. Then the only issue is the one you mentioned: the handling of the case where the denominator of the harmonic mean formula is zero. In that case, the code would return NAN as you suggested, or raise an exception. Overall, I don't think handling negative values is very important, and I'm not sure such a change to the function is worthwhile. On the other hand, I think it would be better than the current behavior.
> By the way, I've just noticed that harmonic_mean([INF, INF]) raises. Should it return INF instead?
Yes, I expect INF in that case. Currently it only raises when all the inputs are INF. Include one finite value, and there is no error:
In [67]: harmonic_mean([math.inf, math.inf, 5])
Out[67]: 15.0
The exception could be fixed by checking if total is zero before computing n/total (currently computed in the return statement of the function). |