Issue38382

Created on **2019-10-06 01:46** by **WarrenWeckesser**, last changed **2019-11-07 05:58** by **rhettinger**. This issue is now **closed**.

Pull Requests | |||
---|---|---|---|

URL | Status | Linked | Edit |

PR 16601 | closed | WarrenWeckesser, 2019-10-06 03:12 | |

PR 17037 | merged | rhettinger, 2019-11-03 21:09 | |

PR 17078 | merged | miss-islington, 2019-11-07 05:50 |

Messages (23) | |||
---|---|---|---|

msg354028 - (view) | Author: Warren Weckesser (WarrenWeckesser) * | Date: 2019-10-06 01:46 | |

The function statistics.harmonic_mean is supposed to raise a StatisticsError if any element in the input is negative. It fails to do so if the negative element is preceded by a zero. When there is a zero before the negative element, the function returns 0: >>> from statistics import harmonic_mean >>> harmonic_mean([0, -1, 3]) 0 If the zero occurs after the negative value, the StatisticsError exception is correctly raised: >>> harmonic_mean([-1, 0, 3]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in harmonic_mean T, total, count = _sum(1/x for x in _fail_neg(data, errmsg)) File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 164, in _sum for typ, values in groupby(data, type): File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in <genexpr> T, total, count = _sum(1/x for x in _fail_neg(data, errmsg)) File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 289, in _fail_neg raise StatisticsError(errmsg) statistics.StatisticsError: harmonic mean does not support negative values >>> The problem is in this code in the function harmonic_mean: try: T, total, count = _sum(1/x for x in _fail_neg(data, errmsg)) except ZeroDivisionError: return 0 When the zero occurs before the negative value, the ZeroDivisionError is raised before the negative value has been seen by the generator _fail_neg, and the function returns 0. |
|||

msg354030 - (view) | Author: Warren Weckesser (WarrenWeckesser) * | Date: 2019-10-06 02:40 | |

The premature return on ZeroDivisionError also causes an inconsistency in how types are handled. For example, >>> harmonic_mean([1, 0, "foo"]) 0 We get an error as expected if "foo" occurs before 0: >>> harmonic_mean([1, "foo", 0]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in harmonic_mean T, total, count = _sum(1/x for x in _fail_neg(data, errmsg)) File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 166, in _sum for n,d in map(_exact_ratio, values): File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 406, in <genexpr> T, total, count = _sum(1/x for x in _fail_neg(data, errmsg)) File "/Users/warren/repos/git/forks/cpython/Lib/statistics.py", line 288, in _fail_neg if x < 0: TypeError: '<' not supported between instances of 'str' and 'int' |
|||

msg354042 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-10-06 19:06 | |

Normally early-out behaviors are allowed to skip any checks for subsequent inputs (for example, the any() and all() builtins stop consuming inputs one a final value is found). However, that reasoning might not apply to a zero input for harmonic_mean(). Technically, the definition of harmonic mean doesn't allow for a zero input value at all (accordingly, MS Excel gives an error for a zero input). Presumably, the reason that Python's harmonic_mean() returns zero is so that the function degrades gracefully as one of the inputs gets closer to zero. Options at this point: * Just document that there is an early-out for zero. * Stick with the strategy of treating zero as if it were just a very, very small positive input. The would imply that all of the inputs should be considered and that TypeErrors or StatisticErrors should be raised for later data points (though I'm unclear what this world view implies when the inputs have both a zero and a positive infinity). * Treat a zero input as an error. This matches the usual specification of harmonic mean being defined only for non-negative real inputs. |
|||

msg354050 - (view) | Author: Warren Weckesser (WarrenWeckesser) * | Date: 2019-10-06 21:31 | |

I find it hard to accept the first option. It seems to let a detail of the current implementation take precedence over API design. I don't see why we would want an API in which harmonic_mean([1, 0, -1]) returns 0 but harmonic_mean([-1, 0, 1]) raises an error. The harmonic mean should be invariant to permutations of the input (well, at least to within normal floating point precision when the input is floating point). I wouldn't expect such a radical change in behavior based on how I ordered my input to harmonic_mean. The second option appears to be the API that was intended when the function was implemented. Returning 0 when one or more inputs are 0 (and the rest are positive numbers) is justified by taking a limit. Just like we *define* the sinc function (https://en.wikipedia.org/wiki/Sinc_function) to be 1 at x=0 and sin(x)/x when x != 0 (because the limit as x approaches 0 of sin(x)/x is 1), we can define the value of the harmonic mean with zeros in the input as the limiting value as one (or more) positive values goes to 0. That limit is 0. (Also note that, in the case where just one input is 0, the expression for the harmonic mean can be manipulated into a form that gives 0 without requiring a limit. E.g. for three values, 1/(1/x0 + 1/x1 + 1/x2) = x0*x1*x2/(x1*x2 + x0*x2 + x0*x1). If just one of those values is 0, the denominator is nonzero, so the result is simply 0.) There is a nice analogy from electrical circuit theory. Given, say, three resistances R1, R2 and R3 wired in parallel, the total resistance R is R = 1/(1/R1 + 1/R2 + 1/R3) = harmonic_mean([R1, R2, R3])/3 (https://en.wikipedia.org/wiki/Series_and_parallel_circuits#Resistance_units_2). Intuitively, we know that if any of those resistances is 0 (i.e. there is a short circuit), the total resistance is also 0. The resistance analogy also gives the correct interpretation for the case where the input includes both 0 and +inf. An infinite resistance is an open circuit (i.e. no connection), so putting that in parallel with a short circuit still results in a total resistance of 0. The third option, in which any zero in the input is treated as an error, is a change in behavior. Given the justification for returning 0, and given that currently a call such as harmonic_mean([1, 0, 2]) *does* return 0, making that an error seems like an undesirable change. |
|||

msg354051 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-10-06 22:29 | |

The resistor example is persuasive :-) |
|||

msg354056 - (view) | Author: Tim Peters (tim.peters) * | Date: 2019-10-07 00:22 | |

I don't have a problem with the current behavior (early out on zero, even if later arguments are senseless). So: > * Just document that there is an early-out for zero. |
|||

msg354058 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-10-07 00:42 | |

> I don't have a problem with the current behavior (early out on zero, > even if later arguments are senseless). So: > > > * Just document that there is an early-out for zero. That make the most sense to me as well (even though the OP thinks otherwise). |
|||

msg354059 - (view) | Author: Steven D'Aprano (steven.daprano) * | Date: 2019-10-07 01:41 | |

Thanks Warren, the resistance example is excellent, would you like to write up a PR to add it to the docs? In the original design, I prohibited negative numbers because that seems to be what everyone says you should do, but as far as I can tell it is mathematically well defined so long as the sum of positive values doesn't equal the sum of negative values. And in that case, we could return a NAN. I don't know that there's any physical meaning to the harmonic mean of negative values (circuits with negative resistances?), but I'm almost tempted to make this "consenting adults" and remove the restriction altogether. That would simplify the implementation and remove any arguments about early-out on zero. I acknowledge Warren's point about permutations of data, but that applies to anything with an early-out. I don't think that's quite persuasive enough to justify losing the early exit on zero. Warren, if you feel really strongly about this, feel free to try to change my mind. But for now, I think that documenting that zero triggers an early-out is the right solution. By the way, I've just noticed that harmonic_mean([INF, INF]) raises. Should it return INF instead? |
|||

msg354065 - (view) | Author: Warren Weckesser (WarrenWeckesser) * | Date: 2019-10-07 07:25 | |

I really don't like the current inconsistent behavior. I can't help but see it as a bug, and I am surprised that folks think it is OK that harmonic_mean([0, "foo"]) returns 0. This is the *statistics* module, and the first line of https://docs.python.org/3/library/statistics.html says "This module provides functions for calculating mathematical statistics of numeric (Real-valued) data". I understand the desire to maintain the performance benefit of the early exit, but I would rather disallow values that are zero than accept the nonsensical result of harmonic_mean([0, "foo"]) and the inconsistency between that and the result of harmonic_mean(["foo", 0]). I spend most of my time working on numerical-ish code, so my world view might be different than that of y'all. Here's a new proposal. To continue to accept zero values, and disallow negative values consistently (as the function is documented to do), *and* preserve the early exit, I propose that the try/except statement in harmonic_mean in master be changed to try: T, total, count = _sum(1/x for x in _fail_neg(data, errmsg)) except ZeroDivisionError: if any(x < 0 for x in data): raise StatisticsError(errmsg) from None return 0 The change adds just the two lines in the except block. This doesn't change the "happy path" that is followed when all the values are strictly greater than 0. The cost of the additional check for any negative values in the input only occurs in the (presumably uncommon) case where an input value is 0. (This means, however, that any values before the first 0 are checked twice.) > In the original design, I prohibited negative numbers because that seems to be what everyone says you should do, but as far as I can tell it is mathematically well defined so long as the sum of positive values doesn't equal the sum of negative values. And in that case, we could return a NAN. Allowing negative values introduces some mathematical issues. Most importantly, the limit of the harmonic mean as two or more values approach zero (in the multivariable limit sense) no longer exists. Even in the simple case of computing the harmonic mean H(x, y) of two values, the limit of H(x, y) as (x, y) -> (0, 0) does not exist. H(x, y) is one of those funky functions where there are different paths to the origin in the second and fourth quadrants that lead to different values of the limit. Of course, practicality beats purity, so the harmonic mean could simply be defined to be 0 whenever any one of the inputs is 0. Then the only issue is the one you mentioned: the handling of the case where the denominator of the harmonic mean formula is zero. In that case, the code would return NAN as you suggested, or raise an exception. Overall, I don't think handling negative values is very important, and I'm not sure such a change to the function is worthwhile. On the other hand, I think it would be better than the current behavior. > By the way, I've just noticed that harmonic_mean([INF, INF]) raises. Should it return INF instead? Yes, I expect INF in that case. Currently it only raises when all the inputs are INF. Include one finite value, and there is no error: In [67]: harmonic_mean([math.inf, math.inf, 5]) Out[67]: 15.0 The exception could be fixed by checking if total is zero before computing n/total (currently computed in the return statement of the function). |
|||

msg354241 - (view) | Author: Warren Weckesser (WarrenWeckesser) * | Date: 2019-10-09 00:32 | |

One more case where the "early out" produces a result that seems wrong: In [20]: harmonic_mean([math.nan, 0]) Out[20]: 0 Anyone who didn't know about the "early out" behavior would probably expect the result to be NAN. |
|||

msg354262 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2019-10-09 10:27 | |

> In [20]: harmonic_mean([math.nan, 0]) > Out[20]: 0 That one seems excusable, for the same sort of reasons that IEEE 754 specifies that hypot(nan, inf) is inf rather than nan. Similarly, sumSquare and sumAbs from IEEE 754-2008 specify that infinities take precedence over NaNs, on the basis that the result doesn't change if the nan is replaced with any non-nan value. |
|||

msg354504 - (view) | Author: Tal Einat (taleinat) * | Date: 2019-10-11 21:36 | |

I'd certainly be surprised to have the order of zero and negative values in the input have such a dramatic effect on the result. It would make debugging these cases unnecessarily confusing and difficult. And I think it is indeed worth fixing, as these edge-cases are likely to come up of time when dealing with large amounts of data. I'm +1 on Warren's latest suggestion for a fix. P.S If we agree on this, either Warren should make the fix himself or we should mark it "newcomer friendly" and let a newcomer handle it. |
|||

msg355687 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-10-30 00:42 | |

Maybe we should hold-off on adding negative number support unless a compelling use case arises. It is telling that scipy.stats.hmean() and that MS Excel's harmean() haven't found justification to add negative number support. In all likelihood, a negative input is an error in the input and users would be better-off seeing an exception raised. In a number of contexts, the negative input wouldn't make sense all: * a 10 ohm resistor wired in parallel with a negative 5 ohm resistor * travel 1 km at 5 kph then another 1 km at -10 kph We really don't have to go down this route just because we can. The usual rule is to keep the API minimal unless real use cases arise to inform the design. |
|||

msg355690 - (view) | Author: gaoxinge (gaoxinge) | Date: 2019-10-30 03:06 | |

I think that three kinds of mean should behave like: - `geometric mean`: input should be float, finite and positive - `harmonic mean`: input should be float, finite and nonzero - `mean` or `fmean`: input should be float, finite Otherwise these mean functions should raise a `StatisticsError`. Two points that we should pay attention to: - Why every input should be float? Because it is a statstics library, we should focus on the float instead of complex number or string or `math.nan`. - Why every input shoud be finite? Becuse we should get rid of `math.inf`. |
|||

msg355691 - (view) | Author: gaoxinge (gaoxinge) | Date: 2019-10-30 03:10 | |

The document https://github.com/WarrenWeckesser/cpython/blob/master/Lib/statistics.py#L389 in `harmonic_mean` is out-dated. I think we can simply follow the wiki: - [arithemic mean](https://en.wikipedia.org/wiki/Arithmetic_mean) - [geometric mean](https://en.wikipedia.org/wiki/Geometric_mean) - [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) |
|||

msg355692 - (view) | Author: Steven D'Aprano (steven.daprano) * | Date: 2019-10-30 05:21 | |

Thank you for your comments, but accepting non-floats like Decimal and Fraction (and int, of course!) is a hard requirement for the statistics module. fmean will naturally convert any data to float for speed, but the other means will attempt to keep the original data type if possible, so that the mean() of Decimals will be a Decimal, the mean of Fractions will be a Fraction, etc. The median_low and median_high functions don't require numbers at all, the data points just need to be ordinal data and support sorting. The mode function doesn't even require ordinal data, it can work with nominal data. That is part of the definition of median and mode, and will not change. http://intellspot.com/nominal-vs-ordinal-data/ So long as Python support IEEE-754 floating point arithmetic, which is the most common standard for floating point, there will be NANs and INFs. We should not reject them unnecessarily. You can always pre-filter your data before passing it the statistics module. Why do you object to float INF? |
|||

msg355694 - (view) | Author: gaoxinge (gaoxinge) | Date: 2019-10-30 06:06 | |

Your point makes sense. We should still also consider the `int`, `Decimal`, `Fraction`, `math.nan`, `math.inf`, `np.int` and so on. So - `geometric mean`: input should be positive (>= 0) - `harmonic mean`: input should be nonzero (!= 0) - `mean` or `fmean`: no input restriction |
|||

msg355696 - (view) | Author: Steven D'Aprano (steven.daprano) * | Date: 2019-10-30 08:09 | |

Have you read the rest of the thread? There is a compelling reason to support harmonic mean including zero (resistors in parallel) but not yet any compelling reason to support negative values. If you have a good use-case for harmonic mean of negative values, please tell us. Until then, I agree with Raymond: negative values should be treated as an error, as documented. Can we go back to the original issue? Should we just document the "early out" behaviour on hitting zero, or should we follow the suggestion to check for bad data and raise? I'm currently undecided which I would prefer. |
|||

msg355705 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2019-10-30 10:10 | |

> Maybe we should hold-off on adding negative number support unless a compelling use case arises. +100. It makes little sense mathematically, and I'm sceptical that such use cases exist in real life. |
|||

msg355879 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-11-02 18:29 | |

> Can we go back to the original issue? Should we just > document the "early out" behaviour on hitting zero, > or should we follow the suggestion to check for bad > data and raise? I would follow Tim's recommendation to document it and be done with it. |
|||

msg355915 - (view) | Author: Warren Weckesser (WarrenWeckesser) * | Date: 2019-11-04 00:29 | |

On 10/9/19, Mark Dickinson <report@bugs.python.org> wrote: > > Mark Dickinson <dickinsm@gmail.com> added the comment: > >> In [20]: harmonic_mean([math.nan, 0]) >> >> Out[20]: 0 > > That one seems excusable, for the same sort of reasons that IEEE 754 > specifies that hypot(nan, inf) is inf rather than nan. Similarly, sumSquare > and sumAbs from IEEE 754-2008 specify that infinities take precedence over > NaNs, on the basis that the result doesn't change if the nan is replaced > with any non-nan value. > A belated "thanks", Mark. I wasn't aware of that, but it makes sense. Warren > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue38382> > _______________________________________ > |
|||

msg356170 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-11-07 05:50 | |

New changeset 7f460494d2309ace004a400bae8fc59134dc325c by Raymond Hettinger in branch 'master': bpo-38382: Document the early-out behavior for a zero (GH-17037) https://github.com/python/cpython/commit/7f460494d2309ace004a400bae8fc59134dc325c |
|||

msg356171 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-11-07 05:58 | |

New changeset 6b66dc387935df432cdbc01397ddef534a84ade9 by Raymond Hettinger (Miss Islington (bot)) in branch '3.8': bpo-38382: Document the early-out behavior for a zero (GH-17037) (GH-17078) https://github.com/python/cpython/commit/6b66dc387935df432cdbc01397ddef534a84ade9 |

History | |||
---|---|---|---|

Date | User | Action | Args |

2019-11-07 05:58:50 | rhettinger | set | status: open -> closed resolution: not a bug stage: patch review -> resolved |

2019-11-07 05:58:16 | rhettinger | set | messages: + msg356171 |

2019-11-07 05:50:57 | miss-islington | set | pull_requests: + pull_request16588 |

2019-11-07 05:50:51 | rhettinger | set | messages: + msg356170 |

2019-11-04 00:29:20 | WarrenWeckesser | set | messages: + msg355915 |

2019-11-03 21:09:23 | rhettinger | set | pull_requests: + pull_request16550 |

2019-11-02 18:29:01 | rhettinger | set | messages: + msg355879 |

2019-10-30 10:10:22 | mark.dickinson | set | messages: + msg355705 |

2019-10-30 08:09:46 | steven.daprano | set | messages: + msg355696 |

2019-10-30 06:06:20 | gaoxinge | set | messages: + msg355694 |

2019-10-30 05:21:25 | steven.daprano | set | messages: + msg355692 |

2019-10-30 03:10:26 | gaoxinge | set | messages: + msg355691 |

2019-10-30 03:06:29 | gaoxinge | set | nosy:
+ gaoxinge messages: + msg355690 |

2019-10-30 00:42:20 | rhettinger | set | messages: + msg355687 |

2019-10-11 21:36:49 | taleinat | set | nosy:
+ taleinat messages: + msg354504 |

2019-10-09 10:27:20 | mark.dickinson | set | messages: + msg354262 |

2019-10-09 02:45:31 | rhettinger | set | assignee: rhettinger -> steven.daprano |

2019-10-09 00:32:44 | WarrenWeckesser | set | messages: + msg354241 |

2019-10-07 07:25:36 | WarrenWeckesser | set | messages: + msg354065 |

2019-10-07 01:41:18 | steven.daprano | set | messages: + msg354059 |

2019-10-07 00:42:25 | rhettinger | set | messages: + msg354058 |

2019-10-07 00:22:37 | tim.peters | set | messages: + msg354056 |

2019-10-06 22:29:52 | rhettinger | set | messages: + msg354051 |

2019-10-06 21:31:16 | WarrenWeckesser | set | messages: + msg354050 |

2019-10-06 19:06:06 | rhettinger | set | nosy:
+ tim.peters, mark.dickinson messages: + msg354042 |

2019-10-06 18:33:52 | rhettinger | set | assignee: rhettinger nosy: + rhettinger, steven.daprano versions: + Python 3.7, Python 3.8 |

2019-10-06 03:12:28 | WarrenWeckesser | set | keywords:
+ patch stage: patch review pull_requests: + pull_request16189 |

2019-10-06 02:40:20 | WarrenWeckesser | set | messages: + msg354030 |

2019-10-06 01:46:18 | WarrenWeckesser | create |