This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: statistics.harmonic_mean fails to raise error with negative input that follows a 0
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: WarrenWeckesser, gaoxinge, mark.dickinson, rhettinger, steven.daprano, taleinat, tim.peters
Priority: normal Keywords: patch

Created on 2019-10-06 01:46 by WarrenWeckesser, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 16601 closed WarrenWeckesser, 2019-10-06 03:12
PR 17037 merged rhettinger, 2019-11-03 21:09
PR 17078 merged miss-islington, 2019-11-07 05:50
Messages (23)
msg354028 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-06 01:46
The function statistics.harmonic_mean is supposed to raise a StatisticsError if any element in the input is negative.  It fails to do so if the negative element is preceded by a zero.  When there is a zero before the negative element, the function returns 0:

>>> from statistics import harmonic_mean
>>> harmonic_mean([0, -1, 3])
0

If the zero occurs after the negative value, the StatisticsError exception is correctly raised:

>>> harmonic_mean([-1, 0, 3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in harmonic_mean
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 164, in _sum
    for typ, values in groupby(data, type):
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in <genexpr>
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 289, in _fail_neg
    raise StatisticsError(errmsg)
statistics.StatisticsError: harmonic mean does not support negative values
>>> 

The problem is in this code in the function harmonic_mean:

    try:
        T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
    except ZeroDivisionError:
        return 0

When the zero occurs before the negative value, the ZeroDivisionError is raised before the negative value has been seen by the generator _fail_neg, and the function returns 0.
msg354030 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-06 02:40
The premature return on ZeroDivisionError also causes an inconsistency in how types are handled.  For example,

>>> harmonic_mean([1, 0, "foo"])
0

We get an error as expected if "foo" occurs before 0:

>>> harmonic_mean([1, "foo", 0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in harmonic_mean
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 166, in _sum
    for n,d in map(_exact_ratio, values):
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in <genexpr>
    T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
  File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 288, in _fail_neg
    if x < 0:
TypeError: '<' not supported between instances of 'str' and 'int'
msg354042 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-10-06 19:06
Normally early-out behaviors are allowed to skip any checks for subsequent inputs (for example, the any() and all() builtins stop consuming inputs one a final value is found).

However, that reasoning might not apply to a zero input for harmonic_mean().  Technically, the definition of harmonic mean doesn't allow for a zero input value at all (accordingly, MS Excel gives an error for a zero input).  Presumably, the reason that Python's harmonic_mean() returns zero is so that the function degrades gracefully as one of the inputs gets closer to zero.

Options at this point:

* Just document that there is an early-out for zero.

* Stick with the strategy of treating zero as if it were just a very, very small positive input.  The would imply that all of the inputs should be considered and that TypeErrors or StatisticErrors should be raised for later data points (though I'm unclear what this world view implies when the inputs have both a zero and a positive infinity).

* Treat a zero input as an error.  This matches the usual specification of harmonic mean being defined only for non-negative real inputs.
msg354050 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-06 21:31
I find it hard to accept the first option.  It seems to let a detail of the current implementation take precedence over API design.  I don't see why we would want an API in which harmonic_mean([1, 0, -1]) returns 0 but harmonic_mean([-1, 0, 1]) raises an error.  The harmonic mean should be invariant to permutations of the input (well, at least to within normal floating point precision when the input is floating point).  I wouldn't expect such a radical change in behavior based on how I ordered my input to harmonic_mean.

The second option appears to be the API that was intended when the function was implemented.  Returning 0 when one or more inputs are 0 (and the rest are positive numbers) is justified by taking a limit.  Just like we *define* the sinc function (https://en.wikipedia.org/wiki/Sinc_function) to be 1 at x=0 and sin(x)/x when x != 0 (because the limit as x approaches 0 of sin(x)/x is 1), we can define the value of the harmonic mean with zeros in the input as the limiting value as one (or more) positive values goes to 0.  That limit is 0.  (Also note that, in the case where just one input is 0, the expression for the harmonic mean can be manipulated into a form that gives 0 without requiring a limit. E.g. for three values, 1/(1/x0 + 1/x1 + 1/x2) = x0*x1*x2/(x1*x2 + x0*x2 + x0*x1).  If just one of those values is 0, the denominator is nonzero, so the result is simply 0.)

There is a nice analogy from electrical circuit theory.  Given, say, three resistances R1, R2 and R3 wired in parallel, the total resistance R is

   R = 1/(1/R1 + 1/R2 + 1/R3) = harmonic_mean([R1, R2, R3])/3

(https://en.wikipedia.org/wiki/Series_and_parallel_circuits#Resistance_units_2). Intuitively, we know that if any of those resistances is 0 (i.e. there is a short circuit), the total resistance is also 0.

The resistance analogy also gives the correct interpretation for the case where the input includes both 0 and +inf.  An infinite resistance is an open circuit (i.e. no connection), so putting that in parallel with a short circuit still results in a total resistance of 0.

The third option, in which any zero in the input is treated as an error, is a change in behavior. Given the justification for returning 0, and given that currently a call such as harmonic_mean([1, 0, 2]) *does* return 0, making that an error seems like an undesirable change.
msg354051 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-10-06 22:29
The resistor example is persuasive :-)
msg354056 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-10-07 00:22
I don't have a problem with the current behavior (early out on zero, even if later arguments are senseless).  So:

> * Just document that there is an early-out for zero.
msg354058 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-10-07 00:42
> I don't have a problem with the current behavior (early out on zero,
> even if later arguments are senseless).  So:
>
> > * Just document that there is an early-out for zero.

That make the most sense to me as well (even though the OP thinks otherwise).
msg354059 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-10-07 01:41
Thanks Warren, the resistance example is excellent, would you like to write up a PR to add it to the docs?

In the original design, I prohibited negative numbers because that seems to be what everyone says you should do, but as far as I can tell it is mathematically well defined so long as the sum of positive values doesn't equal the sum of negative values. And in that case, we could return a NAN.

I don't know that there's any physical meaning to the harmonic mean of negative values (circuits with negative resistances?), but I'm almost tempted to make this "consenting adults" and remove the restriction altogether. That would simplify the implementation and remove any arguments about early-out on zero.

I acknowledge Warren's point about permutations of data, but that applies to anything with an early-out. I don't think that's quite persuasive enough to justify losing the early exit on zero.

Warren, if you feel really strongly about this, feel free to try to change my mind. But for now, I think that documenting that zero triggers an early-out is the right solution.

By the way, I've just noticed that harmonic_mean([INF, INF]) raises. Should it return INF instead?
msg354065 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-07 07:25
I really don't like the current inconsistent behavior.  I can't help but see it as a bug, and I am surprised that folks think it is OK that harmonic_mean([0, "foo"]) returns 0.  This is the *statistics* module, and the first line of https://docs.python.org/3/library/statistics.html says "This module provides functions for calculating mathematical statistics of numeric (Real-valued) data".  I understand the desire to maintain the performance benefit of the early exit, but I would rather disallow values that are zero than accept the nonsensical result of harmonic_mean([0, "foo"]) and the inconsistency between that and the result of harmonic_mean(["foo", 0]).  I spend most of my time working on numerical-ish code, so my world view might be different than that of y'all.

Here's a new proposal.  To continue to accept zero values, and disallow negative values consistently (as the function is documented to do), *and* preserve the early exit, I propose that the try/except statement in harmonic_mean in master be changed to

    try:
        T, total, count = _sum(1/x for x in _fail_neg(data, errmsg))
    except ZeroDivisionError:
        if any(x < 0 for x in data):
            raise StatisticsError(errmsg) from None
        return 0

The change adds just the two lines in the except block.  This doesn't change the "happy path" that is followed when all the values are strictly greater than 0.  The cost of the additional check for any negative values in the input only occurs in the (presumably uncommon) case where an input value is 0.  (This means, however, that any values before the first 0 are checked twice.)

> In the original design, I prohibited negative numbers because that seems to be what everyone says you should do, but as far as I can tell it is mathematically well defined so long as the sum of positive values doesn't equal the sum of negative values. And in that case, we could return a NAN.

Allowing negative values introduces some mathematical issues.  Most importantly, the limit of the harmonic mean as two or more values approach zero (in the multivariable limit sense) no longer exists.  Even in the simple case of computing the harmonic mean H(x, y) of two values, the limit of H(x, y) as (x, y) -> (0, 0) does not exist. H(x, y) is one of those funky functions where there are different paths to the origin in the second and fourth quadrants that lead to different values of the limit.  Of course, practicality beats purity, so the harmonic mean could simply be defined to be 0 whenever any one of the inputs is 0.  Then the only issue is the one you mentioned: the handling of the case where the denominator of the harmonic mean formula is zero.  In that case, the code would return NAN as you suggested, or raise an exception.  Overall, I don't think handling negative values is very important, and I'm not sure such a change to the function is worthwhile.  On the other hand, I think it would be better than the current behavior.

> By the way, I've just noticed that harmonic_mean([INF, INF]) raises. Should it return INF instead?

Yes, I expect INF in that case.  Currently it only raises when all the inputs are INF.  Include one finite value, and there is no error:

In [67]: harmonic_mean([math.inf, math.inf, 5])
Out[67]: 15.0

The exception could be fixed by checking if total is zero before computing n/total (currently computed in the return statement of the function).
msg354241 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-10-09 00:32
One more case where the "early out" produces a result that seems wrong:

In [20]: harmonic_mean([math.nan, 0])                                           
Out[20]: 0

Anyone who didn't know about the "early out" behavior would probably expect the result to be NAN.
msg354262 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-10-09 10:27
> In [20]: harmonic_mean([math.nan, 0])                                           
> Out[20]: 0

That one seems excusable, for the same sort of reasons that IEEE 754 specifies that hypot(nan, inf) is inf rather than nan. Similarly, sumSquare and sumAbs from IEEE 754-2008 specify that infinities take precedence over NaNs, on the basis that the result doesn't change if the nan is replaced with any non-nan value.
msg354504 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2019-10-11 21:36
I'd certainly be surprised to have the order of zero and negative values in the input have such a dramatic effect on the result. It would make debugging these cases unnecessarily confusing and difficult. And I think it is indeed worth fixing, as these edge-cases are likely to come up of time when dealing with large amounts of data.

I'm +1 on Warren's latest suggestion for a fix.

P.S If we agree on this, either Warren should make the fix himself or we should mark it "newcomer friendly" and let a newcomer handle it.
msg355687 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-10-30 00:42
Maybe we should hold-off on adding negative number support unless a compelling use case arises.  It is telling that scipy.stats.hmean() and that MS Excel's harmean() haven't found justification to add negative number support.

In all likelihood, a negative input is an error in the input and users would be better-off seeing an exception raised.  In a number of contexts, the negative input wouldn't make sense all:

* a 10 ohm resistor wired in parallel with a negative 5 ohm resistor
* travel 1 km at 5 kph then another 1 km at -10 kph

We really don't have to go down this route just because we can.  The usual rule is to keep the API minimal unless real use cases arise to inform the design.
msg355690 - (view) Author: gaoxinge (gaoxinge) Date: 2019-10-30 03:06
I think that three kinds of mean should behave like:

- `geometric mean`: input should be float, finite and positive
- `harmonic mean`: input should be float, finite and nonzero
- `mean` or `fmean`: input should be float, finite

Otherwise these mean functions should raise a `StatisticsError`. 

Two points that we should pay attention to:

- Why every input should be float? Because it is a statstics library, we should focus on the float instead of complex number or string or `math.nan`.

- Why every input shoud be finite? Becuse we should get rid of `math.inf`.
msg355691 - (view) Author: gaoxinge (gaoxinge) Date: 2019-10-30 03:10
The document https://github.com/WarrenWeckesser/cpython/blob/master/Lib/statistics.py#L389 in `harmonic_mean` is out-dated. I think we can simply follow the wiki:

- [arithemic mean](https://en.wikipedia.org/wiki/Arithmetic_mean)
- [geometric mean](https://en.wikipedia.org/wiki/Geometric_mean)
- [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean)
msg355692 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-10-30 05:21
Thank you for your comments, but accepting non-floats like Decimal and 
Fraction (and int, of course!) is a hard requirement for the statistics 
module. fmean will naturally convert any data to float for speed, but 
the other means will attempt to keep the original data type if possible, 
so that the mean() of Decimals will be a Decimal, the mean of Fractions 
will be a Fraction, etc.

The median_low and median_high functions don't require numbers at all, 
the data points just need to be ordinal data and support sorting. The 
mode function doesn't even require ordinal data, it can work with 
nominal data. That is part of the definition of median and mode, and 
will not change.

http://intellspot.com/nominal-vs-ordinal-data/

So long as Python support IEEE-754 floating point arithmetic, which is 
the most common standard for floating point, there will be NANs and 
INFs. We should not reject them unnecessarily. You can always pre-filter 
your data before passing it the statistics module.

Why do you object to float INF?
msg355694 - (view) Author: gaoxinge (gaoxinge) Date: 2019-10-30 06:06
Your point makes sense. We should still also consider the `int`, `Decimal`, `Fraction`, `math.nan`, `math.inf`, `np.int` and so on. So

- `geometric mean`: input should be positive (>= 0)
- `harmonic mean`: input should be nonzero (!= 0)
- `mean` or `fmean`: no input restriction
msg355696 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-10-30 08:09
Have you read the rest of the thread? There is a compelling reason to 
support harmonic mean including zero (resistors in parallel) but not yet 
any compelling reason to support negative values.

If you have a good use-case for harmonic mean of negative values, please 
tell us. Until then, I agree with Raymond: negative values should be 
treated as an error, as documented.

Can we go back to the original issue? Should we just document the 
"early out" behaviour on hitting zero, or should we follow the 
suggestion to check for bad data and raise?

I'm currently undecided which I would prefer.
msg355705 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-10-30 10:10
> Maybe we should hold-off on adding negative number support unless a compelling use case arises.

+100. It makes little sense mathematically, and I'm sceptical that such use cases exist in real life.
msg355879 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-11-02 18:29
> Can we go back to the original issue? Should we just 
> document the "early out" behaviour on hitting zero,
> or should we follow the suggestion to check for bad 
> data and raise?

I would follow Tim's recommendation to document it and be done with it.
msg355915 - (view) Author: Warren Weckesser (WarrenWeckesser) * Date: 2019-11-04 00:29
On 10/9/19, Mark Dickinson <report@bugs.python.org> wrote:
>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
>> In [20]: harmonic_mean([math.nan, 0])
>>
>> Out[20]: 0
>
> That one seems excusable, for the same sort of reasons that IEEE 754
> specifies that hypot(nan, inf) is inf rather than nan. Similarly, sumSquare
> and sumAbs from IEEE 754-2008 specify that infinities take precedence over
> NaNs, on the basis that the result doesn't change if the nan is replaced
> with any non-nan value.
>

A belated "thanks", Mark.  I wasn't aware of that, but it makes sense.

Warren

> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue38382>
> _______________________________________
>
msg356170 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-11-07 05:50
New changeset 7f460494d2309ace004a400bae8fc59134dc325c by Raymond Hettinger in branch 'master':
bpo-38382: Document the early-out behavior for a zero (GH-17037)
https://github.com/python/cpython/commit/7f460494d2309ace004a400bae8fc59134dc325c
msg356171 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-11-07 05:58
New changeset 6b66dc387935df432cdbc01397ddef534a84ade9 by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
bpo-38382: Document the early-out behavior for a zero (GH-17037) (GH-17078)
https://github.com/python/cpython/commit/6b66dc387935df432cdbc01397ddef534a84ade9
History
Date User Action Args
2022-04-11 14:59:21adminsetgithub: 82563
2019-11-07 05:58:50rhettingersetstatus: open -> closed
resolution: not a bug
stage: patch review -> resolved
2019-11-07 05:58:16rhettingersetmessages: + msg356171
2019-11-07 05:50:57miss-islingtonsetpull_requests: + pull_request16588
2019-11-07 05:50:51rhettingersetmessages: + msg356170
2019-11-04 00:29:20WarrenWeckessersetmessages: + msg355915
2019-11-03 21:09:23rhettingersetpull_requests: + pull_request16550
2019-11-02 18:29:01rhettingersetmessages: + msg355879
2019-10-30 10:10:22mark.dickinsonsetmessages: + msg355705
2019-10-30 08:09:46steven.dapranosetmessages: + msg355696
2019-10-30 06:06:20gaoxingesetmessages: + msg355694
2019-10-30 05:21:25steven.dapranosetmessages: + msg355692
2019-10-30 03:10:26gaoxingesetmessages: + msg355691
2019-10-30 03:06:29gaoxingesetnosy: + gaoxinge
messages: + msg355690
2019-10-30 00:42:20rhettingersetmessages: + msg355687
2019-10-11 21:36:49taleinatsetnosy: + taleinat
messages: + msg354504
2019-10-09 10:27:20mark.dickinsonsetmessages: + msg354262
2019-10-09 02:45:31rhettingersetassignee: rhettinger -> steven.daprano
2019-10-09 00:32:44WarrenWeckessersetmessages: + msg354241
2019-10-07 07:25:36WarrenWeckessersetmessages: + msg354065
2019-10-07 01:41:18steven.dapranosetmessages: + msg354059
2019-10-07 00:42:25rhettingersetmessages: + msg354058
2019-10-07 00:22:37tim.peterssetmessages: + msg354056
2019-10-06 22:29:52rhettingersetmessages: + msg354051
2019-10-06 21:31:16WarrenWeckessersetmessages: + msg354050
2019-10-06 19:06:06rhettingersetnosy: + tim.peters, mark.dickinson
messages: + msg354042
2019-10-06 18:33:52rhettingersetassignee: rhettinger

nosy: + rhettinger, steven.daprano
versions: + Python 3.7, Python 3.8
2019-10-06 03:12:28WarrenWeckessersetkeywords: + patch
stage: patch review
pull_requests: + pull_request16189
2019-10-06 02:40:20WarrenWeckessersetmessages: + msg354030
2019-10-06 01:46:18WarrenWeckessercreate