classification
Title: statistics.harmonic_mean fails to raise error with negative input that follows a 0
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: WarrenWeckesser, mark.dickinson, rhettinger, steven.daprano, taleinat, tim.peters
Priority: normal Keywords: patch

Created on 2019-10-06 01:46 by WarrenWeckesser, last changed 2019-10-11 21:36 by taleinat.

Pull Requests
URL Status Linked Edit
PR 16601 closed WarrenWeckesser, 2019-10-06 03:12
Messages (12)
msg354028 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-06 01:46
The function statistics.harmonic_mean is supposed to raise a StatisticsError if any element in the input is negative.  It fails to do so if the negative element is preceded by a zero.  When there is a zero before the negative element, the function returns 0:

>>> from statistics import harmonic_mean
>>> harmonic_mean([0, -1, 3])
0

If the zero occurs after the negative value, the StatisticsError exception is correctly raised:

>>> harmonic_mean([-1, 0, 3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in harmonic_mean
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 164, in _sum
    for typ, values in groupby(data, type):
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in <genexpr>
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 289, in _fail_neg
    raise StatisticsError(errmsg)
statistics.StatisticsError: harmonic mean does not support negative values
>>> 

The problem is in this code in the function harmonic_mean:

    try:
        T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
    except ZeroDivisionError:
        return 0

When the zero occurs before the negative value, the ZeroDivisionError is raised before the negative value has been seen by the generator _fail_neg, and the function returns 0.
msg354030 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-06 02:40
The premature return on ZeroDivisionError also causes an inconsistency in how types are handled.  For example,

>>> harmonic_mean([1, 0, "foo"])
0

We get an error as expected if "foo" occurs before 0:

>>> harmonic_mean([1, "foo", 0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in harmonic_mean
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 166, in _sum
    for n,d in map(_exact_ratio, values):
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in <genexpr>
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 288, in _fail_neg
    if x < 0:
TypeError: '<' not supported between instances of 'str' and 'int'
msg354042 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-10-06 19:06
Normally early-out behaviors are allowed to skip any checks for subsequent inputs (for example, the any() and all() builtins stop consuming inputs one a final value is found).

However, that reasoning might not apply to a zero input for harmonic_mean().  Technically, the definition of harmonic mean doesn't allow for a zero input value at all (accordingly, MS Excel gives an error for a zero input).  Presumably, the reason that Python's harmonic_mean() returns zero is so that the function degrades gracefully as one of the inputs gets closer to zero.

Options at this point:

* Just document that there is an early-out for zero.

* Stick with the strategy of treating zero as if it were just a very, very small positive input.  The would imply that all of the inputs should be considered and that TypeErrors or StatisticErrors should be raised for later data points (though I'm unclear what this world view implies when the inputs have both a zero and a positive infinity).

* Treat a zero input as an error.  This matches the usual specification of harmonic mean being defined only for non-negative real inputs.
msg354050 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-06 21:31
I find it hard to accept the first option.  It seems to let a detail of the current implementation take precedence over API design.  I don't see why we would want an API in which harmonic_mean([1, 0, -1]) returns 0 but harmonic_mean([-1, 0, 1]) raises an error.  The harmonic mean should be invariant to permutations of the input (well, at least to within normal floating point precision when the input is floating point).  I wouldn't expect such a radical change in behavior based on how I ordered my input to harmonic_mean.

The second option appears to be the API that was intended when the function was implemented.  Returning 0 when one or more inputs are 0 (and the rest are positive numbers) is justified by taking a limit.  Just like we *define* the sinc function (https://en.wikipedia.org/wiki/Sinc_function) to be 1 at x=0 and sin(x)/x when x != 0 (because the limit as x approaches 0 of sin(x)/x is 1), we can define the value of the harmonic mean with zeros in the input as the limiting value as one (or more) positive values goes to 0.  That limit is 0.  (Also note that, in the case where just one input is 0, the expression for the harmonic mean can be manipulated into a form that gives 0 without requiring a limit. E.g. for three values, 1/(1/x0 + 1/x1 + 1/x2) = x0*x1*x2/(x1*x2 + x0*x2 + x0*x1).  If just one of those values is 0, the denominator is nonzero, so the result is simply 0.)

There is a nice analogy from electrical circuit theory.  Given, say, three resistances R1, R2 and R3 wired in parallel, the total resistance R is

   R = 1/(1/R1 + 1/R2 + 1/R3) = harmonic_mean([R1, R2, R3])/3

(https://en.wikipedia.org/wiki/Series_and_parallel_circuits#Resistance_units_2). Intuitively, we know that if any of those resistances is 0 (i.e. there is a short circuit), the total resistance is also 0.

The resistance analogy also gives the correct interpretation for the case where the input includes both 0 and +inf.  An infinite resistance is an open circuit (i.e. no connection), so putting that in parallel with a short circuit still results in a total resistance of 0.

The third option, in which any zero in the input is treated as an error, is a change in behavior. Given the justification for returning 0, and given that currently a call such as harmonic_mean([1, 0, 2]) *does* return 0, making that an error seems like an undesirable change.
msg354051 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-10-06 22:29
The resistor example is persuasive :-)
msg354056 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-07 00:22
I don't have a problem with the current behavior (early out on zero, even if later arguments are senseless).  So:

> * Just document that there is an early-out for zero.
msg354058 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-10-07 00:42
> I don't have a problem with the current behavior (early out on zero,
> even if later arguments are senseless).  So:
>
> > * Just document that there is an early-out for zero.

That make the most sense to me as well (even though the OP thinks otherwise).
msg354059 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-10-07 01:41
Thanks Warren, the resistance example is excellent, would you like to write up a PR to add it to the docs?

In the original design, I prohibited negative numbers because that seems to be what everyone says you should do, but as far as I can tell it is mathematically well defined so long as the sum of positive values doesn't equal the sum of negative values. And in that case, we could return a NAN.

I don't know that there's any physical meaning to the harmonic mean of negative values (circuits with negative resistances?), but I'm almost tempted to make this "consenting adults" and remove the restriction altogether. That would simplify the implementation and remove any arguments about early-out on zero.

I acknowledge Warren's point about permutations of data, but that applies to anything with an early-out. I don't think that's quite persuasive enough to justify losing the early exit on zero.

Warren, if you feel really strongly about this, feel free to try to change my mind. But for now, I think that documenting that zero triggers an early-out is the right solution.

By the way, I've just noticed that harmonic_mean([INF, INF]) raises. Should it return INF instead?
msg354065 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-07 07:25
I really don't like the current inconsistent behavior.  I can't help but see it as a bug, and I am surprised that folks think it is OK that harmonic_mean([0, "foo"]) returns 0.  This is the *statistics* module, and the first line of https://docs.python.org/3/library/statistics.html says "This module provides functions for calculating mathematical statistics of numeric (Real-valued) data".  I understand the desire to maintain the performance benefit of the early exit, but I would rather disallow values that are zero than accept the nonsensical result of harmonic_mean([0, "foo"]) and the inconsistency between that and the result of harmonic_mean(["foo", 0]).  I spend most of my time working on numerical-ish code, so my world view might be different than that of y'all.

Here's a new proposal.  To continue to accept zero values, and disallow negative values consistently (as the function is documented to do), *and* preserve the early exit, I propose that the try/except statement in harmonic_mean in master be changed to

    try:
        T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
    except ZeroDivisionError:
        if any(x < 0 for x in data):
            raise StatisticsError(errmsg) from None
        return 0

The change adds just the two lines in the except block.  This doesn't change the "happy path" that is followed when all the values are strictly greater than 0.  The cost of the additional check for any negative values in the input only occurs in the (presumably uncommon) case where an input value is 0.  (This means, however, that any values before the first 0 are checked twice.)

> In the original design, I prohibited negative numbers because that seems to be what everyone says you should do, but as far as I can tell it is mathematically well defined so long as the sum of positive values doesn't equal the sum of negative values. And in that case, we could return a NAN.

Allowing negative values introduces some mathematical issues.  Most importantly, the limit of the harmonic mean as two or more values approach zero (in the multivariable limit sense) no longer exists.  Even in the simple case of computing the harmonic mean H(x, y) of two values, the limit of H(x, y) as (x, y) -> (0, 0) does not exist. H(x, y) is one of those funky functions where there are different paths to the origin in the second and fourth quadrants that lead to different values of the limit.  Of course, practicality beats purity, so the harmonic mean could simply be defined to be 0 whenever any one of the inputs is 0.  Then the only issue is the one you mentioned: the handling of the case where the denominator of the harmonic mean formula is zero.  In that case, the code would return NAN as you suggested, or raise an exception.  Overall, I don't think handling negative values is very important, and I'm not sure such a change to the function is worthwhile.  On the other hand, I think it would be better than the current behavior.

> By the way, I've just noticed that harmonic_mean([INF, INF]) raises. Should it return INF instead?

Yes, I expect INF in that case.  Currently it only raises when all the inputs are INF.  Include one finite value, and there is no error:

In [67]: harmonic_mean([math.inf, math.inf, 5])
Out[67]: 15.0

The exception could be fixed by checking if total is zero before computing n/total (currently computed in the return statement of the function).
msg354241 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-09 00:32
One more case where the "early out" produces a result that seems wrong:

In [20]: harmonic_mean([math.nan, 0])                                           
Out[20]: 0

Anyone who didn't know about the "early out" behavior would probably expect the result to be NAN.
msg354262 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-10-09 10:27
> In [20]: harmonic_mean([math.nan, 0])                                           
> Out[20]: 0

That one seems excusable, for the same sort of reasons that IEEE 754 specifies that hypot(nan, inf) is inf rather than nan. Similarly, sumSquare and sumAbs from IEEE 754-2008 specify that infinities take precedence over NaNs, on the basis that the result doesn't change if the nan is replaced with any non-nan value.
msg354504 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2019-10-11 21:36
I'd certainly be surprised to have the order of zero and negative values in the input have such a dramatic effect on the result. It would make debugging these cases unnecessarily confusing and difficult. And I think it is indeed worth fixing, as these edge-cases are likely to come up of time when dealing with large amounts of data.

I'm +1 on Warren's latest suggestion for a fix.

P.S If we agree on this, either Warren should make the fix himself or we should mark it "newcomer friendly" and let a newcomer handle it.
History
Date User Action Args
2019-10-11 21:36:49taleinatsetnosy: + taleinat
messages: + msg354504
2019-10-09 10:27:20mark.dickinsonsetmessages: + msg354262
2019-10-09 02:45:31rhettingersetassignee: rhettinger -> steven.daprano
2019-10-09 00:32:44WarrenWeckessersetmessages: + msg354241
2019-10-07 07:25:36WarrenWeckessersetmessages: + msg354065
2019-10-07 01:41:18steven.dapranosetmessages: + msg354059
2019-10-07 00:42:25rhettingersetmessages: + msg354058
2019-10-07 00:22:37tim.peterssetmessages: + msg354056
2019-10-06 22:29:52rhettingersetmessages: + msg354051
2019-10-06 21:31:16WarrenWeckessersetmessages: + msg354050
2019-10-06 19:06:06rhettingersetnosy: + tim.peters, mark.dickinson
messages: + msg354042
2019-10-06 18:33:52rhettingersetassignee: rhettinger

nosy: + rhettinger, steven.daprano
versions: + Python 3.7, Python 3.8
2019-10-06 03:12:28WarrenWeckessersetkeywords: + patch
stage: patch review
pull_requests: + pull_request16189
2019-10-06 02:40:20WarrenWeckessersetmessages: + msg354030
2019-10-06 01:46:18WarrenWeckessercreate