classification
Title: Add geometric mean to `statistics` module
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: cool-RR, haypo, koobs, mark.dickinson, martin.panter, ned.deily, python-dev, rhettinger, steven.daprano
Priority: Keywords:

Created on 2016-06-02 12:24 by cool-RR, last changed 2016-10-04 18:56 by ned.deily.

Messages (36)
msg266948 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-06-02 21:04
Steven, this seems like a reasonable suggestion (though I would expect someone else will immediately suggest a harmonic mean as well).   Is this within the scope of what you were trying to do with the statistics module?
msg266969 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-06-02 22:04
On Thu, Jun 02, 2016 at 09:04:54PM +0000, Raymond Hettinger wrote:
> Steven, this seems like a reasonable suggestion (though I would expect 
> someone else will immediately suggest a harmonic mean as well).  Is 
> this within the scope of what you were trying to do with the 
> statistics module?

Yes, I think it is reasonable too. I'll aim to get this in to 3.6.
msg267051 - (view) Author: Ram Rachum (cool-RR) * Date: 2016-06-03 06:00
To complicate things further...

I implemented a geometric mean on my own, and then I figured out what I really want is a *weighted* geometric mean, so I implemented that for myself. If you'd want to include that, that'll be cool. Actually I'm not sure if the goal of the `statistics` module is to be comprehensive or minimal. I'm hoping it's meant to be comprehensive. But then I'd guess there would be a lot of things you'd want to add except my little feature.
msg267253 - (view) Author: Ram Rachum (cool-RR) * Date: 2016-06-04 14:14
And of course, if the goal of the `statistics` module is to be comprehensive, one should ask himself what should be the difference between this new module and a mature statistics module like `scipy.stats`, and whether we should try to copy the features of off `scipy.stats`.
msg267990 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-06-09 08:23
Choice of algorithm is a bit tricky here. There are a couple of obvious algorithms that work mathematically but result in significant accuracy loss in an IEEE 754 floating-point implementation: one is `exp(mean(map(log, my_numbers)))`, where the log calls can introduce significant loss of information, and the other is `prod(x**(1./len(my_numbers)) for x in my_numbers)`, where the `**(1./n)` operation similarly discards information. A better algorithm numerically is `prod(my_numbers)**(1./len(my_numbers))`, but that's likely to overflow quickly for large datasets (and/or datasets containing large values).

I'd suggest something along the lines of `prod(my_numbers)**(1./len(my_numbers))`, but keeping track of the exponent of the product separately and renormalizing where necessary to avoid overflow.

There are also algorithms for improved accuracy in a product, along the same lines as the algorithm used in fsum. See e.g., the paper "Accurate Floating-Point Product and Exponentiation" by Stef Graillat. [1] (I didn't know about this paper: I just saw a reference to it in a StackOverflow comment [2], which reminded me of this issue.)

[1] http://www-pequan.lip6.fr/~graillat/papers/IEEE-TC-prod.pdf
[2] http://stackoverflow.com/questions/37715250/safe-computation-of-geometric-mean
msg268008 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-06-09 09:24
On the other hand, apparently `exp(mean(log(...)))` is good enough for SciPy: its current implementation looks like this:

def gmean(a, axis=0):
    a, axis = _chk_asarray(a, axis)
    log_a = ma.log(a)
    return ma.exp(log_a.mean(axis=axis))
msg268020 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-06-09 11:59
On Thu, Jun 09, 2016 at 09:24:04AM +0000, Mark Dickinson wrote:

> On the other hand, apparently `exp(mean(log(...)))` is good enough for SciPy:

Hmm, well, I don't have SciPy installed, but I've found that despite 
their (well-deserved) reputation, numpy (and presumably scipy) often 
have rather naive algorithms that can lose accuracy rather 
spectacularly.

py> statistics.mean([1e50, 2e-50, -1e50, 2e-50])
1e-50
py> np.mean(np.array([1e50, 2e-50, -1e50, 2e-50]))
5e-51

py> statistics.mean([1e50, 2e-50, -1e50, 2e-50]*1000)
1e-50
py> np.mean(np.array([1e50, 2e-50, -1e50, 2e-50]*1000))
5.0000000000000002e-54

On the other hand, np is probably a hundred times (or more) faster, so I 
suppose accuracy/speed makes a good trade off.
msg268022 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-06-09 12:55
> Hmm, well, I don't have SciPy installed, but I've found that despite 
> their (well-deserved) reputation, numpy (and presumably scipy) often 
> have rather naive algorithms that can lose accuracy rather 
> spectacularly.

Agreed. And as Ram Rachum hinted, there seems little point aiming to duplicate things that already exist in the de facto standard scientific libraries. So I think there's a place for a non-naive carefully computed geometric mean in the std. lib. statistics module, but I wouldn't see the point of simply adding an exp-mean-log implementation (not that anyone is advocating that).
msg270025 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-07-09 05:49
Does anyone have any strong feeling about the name for these functions?

gmean and hmean;

geometric_mean and harmonic_mean

And "subcontrary_mean" is not an option :-)
msg270026 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-07-09 05:57
I would like to see them spelled-out:  geometric_mean and harmonic_mean
msg270033 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-07-09 09:15
> I would like to see them spelled-out:  geometric_mean and harmonic_mean

+1
msg272214 - (view) Author: Roundup Robot (python-dev) Date: 2016-08-09 04:18
New changeset 9eb5edfcf604 by Steven D'Aprano in branch 'default':
Issue27181 add geometric mean.
https://hg.python.org/cpython/rev/9eb5edfcf604
msg272217 - (view) Author: Ram Rachum (cool-RR) * Date: 2016-08-09 06:40
Thanks for the patch Steven! I won't comment about the code because I don't know enough about these algorithms, but I'm thinking, since you also did a refactoring of the statistics module, maybe these should be two separate patches/commits so it'll be easy to see which part is the new feature and which part is moving existing code around?
msg272218 - (view) Author: Ram Rachum (cool-RR) * Date: 2016-08-09 06:44
Also... I like the detailed docstrings with the real-life examples! That stuff helps when coding and using an unfamiliar function (since I see the docs in a panel of my IDE), so I wish I'd see more detailed docstrings like these ones in the standard library. For `geometric_mean`, maybe I'd add one sentence that describes how the geometric mean is calculated.
msg272224 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-08-09 08:00
On Tue, Aug 09, 2016 at 06:44:22AM +0000, Ram Rachum wrote:
> For `geometric_mean`, maybe I'd add one sentence that describes
> how the geometric mean is calculated.

What do you mean? As in, the mathematical definition of geometric mean?

Or do you mean a one sentence description of the algorithm?
msg272225 - (view) Author: Ram Rachum (cool-RR) * Date: 2016-08-09 08:44
I meant the mathematical definition.
msg272488 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-12 01:04
Tests fail on a Power PC buildbot:

http://buildbot.python.org/all/builders/PPC64LE%20Fedora%203.x/builds/1476/steps/test/logs/stdio
======================================================================
FAIL: testExactPowers (test.test_statistics.Test_Nth_Root) (i=29, n=11)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64le/build/Lib/test/test_statistics.py", line 1216, in testExactPowers
    self.assertEqual(self.nroot(x, n), i)
AssertionError: 29.000000000000004 != 29

======================================================================
FAIL: testExactPowersNegatives (test.test_statistics.Test_Nth_Root) (i=-29, n=11)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64le/build/Lib/test/test_statistics.py", line 1228, in testExactPowersNegatives
    self.assertEqual(self.nroot(x, n), i)
AssertionError: -29.000000000000004 != -29
msg272504 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-08-12 07:44
What no patch for pre-commit review?!

For computing nth roots, it may be worth special-casing the case n=2: for floats, `math.sqrt` is likely to be faster and more precise than an ad-hoc algorithm. (Indeed, I'd expect it to be perfectly correctly rounded on the vast majority of current machines.)
msg272525 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-08-12 11:07
I thought about special-casing n=2 to math.sqrt, but as that's an 
implementation detail I can make that change at any time. According 
to my testing, math.pow(x, 0.5) is no worse than sqrt, so I'm not 
sure if there's any advantage to having yet another branch.

I'd be interested in special-casing n=3 to math.cbrt (if and when it exists) 
now that its a standard C99 function.
msg272526 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-08-12 11:45
> According to my testing, math.pow(x, 0.5) is no worse than sqrt.

It certainly is worse than sqrt, both in terms of speed and accuracy. Whether the difference is enough to make it worth special-casing is another question, of course, and as you say, that can happen later.
msg272640 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-08-14 02:31
I've created a new issue to track the loss of accuracy on PowerPC: http://bugs.python.org/issue27761
msg272659 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-08-14 09:16
A failing case:

>>> statistics.geometric_mean([0.7 for _ in range(5000)])
Traceback (most recent call last):
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 362, in float_nroot
    isinfinity = math.isinf(x)
OverflowError: int too large to convert to float

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 595, in geometric_mean
    s = 2**p * _nth_root(2**q, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 346, in nth_root
    return _nroot_NS.float_nroot(x, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 364, in float_nroot
    return _nroot_NS.bignum_nroot(x, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 489, in bignum_nroot
    b = 2**q * _nroot_NS.nroot(2**r, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 384, in nroot
    r1 = math.pow(x, 1.0/n)
OverflowError: int too large to convert to float
msg272807 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-08-15 23:05
FTR, multiple platforms are failing in various ways, not just PPC64, so Issue27761 was expanded to cover them and has been marked as a "release blocker".
msg272880 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-08-16 20:22
Failure on s390x Debian 3.x:

http://buildbot.python.org/all/builders/s390x%20Debian%203.x/builds/1455/steps/test/logs/stdio

======================================================================
FAIL: testExactPowers (test.test_statistics.Test_Nth_Root) (i=29, n=11)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/test_statistics.py", line 1216, in testExactPowers
    self.assertEqual(self.nroot(x, n), i)
AssertionError: 29.000000000000004 != 29

======================================================================
FAIL: testExactPowersNegatives (test.test_statistics.Test_Nth_Root) (i=-29, n=11)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/test_statistics.py", line 1228, in testExactPowersNegatives
    self.assertEqual(self.nroot(x, n), i)
AssertionError: -29.000000000000004 != -29
msg272881 - (view) Author: Roundup Robot (python-dev) Date: 2016-08-16 20:22
New changeset 54288b160243 by Victor Stinner in branch 'default':
Issue #27181: Skip tests known to fail until a fix is found
https://hg.python.org/cpython/rev/54288b160243
msg272882 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-08-16 20:24
I would like to use buildbots to check for regressions, but I see a lot of red buildbots, so buildbots became useless :-/

I skipped failing test_statistics tests, since failures are known.

I put the priority to "release blocker".

I suggest to either revert the change or find a fix before 3.6b1.
msg272913 - (view) Author: Kubilay Kocak (koobs) Date: 2016-08-17 08:41
For posterity, the following failure was observed on all (9/10/11(current) FreeBSD buildbots:

======================================================================
FAIL: testFraction (test.test_statistics.Test_Nth_Root)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_statistics.py", line 1247, in testFraction
    self.assertEqual(self.nroot(x**12, 12), float(x))
AssertionError: 1.1866666666666665 != 1.1866666666666668
msg272970 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-08-17 17:18
>     self.assertEqual(self.nroot(x**12, 12), float(x))
> AssertionError: 1.1866666666666665 != 1.1866666666666668

That looks like a case where the test should simply be weakened to an `assertAlmostEqual` with a suitable tolerance; there's no strong reason to expect that `nroot` will give a faithfully rounded result in this case or any other.
msg275927 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-09-12 02:50
As discussed with Ned by email, I'm currently unable to build 3.6 and won't have time to work on this before b1. As discussed on #27761 my tests here are too strict and should be loosened, e.g. from assertEqual to assertAlmostEqual. Ned wrote:

"If you are only planning to make changes to the tests themselves, I think that can wait for b2."

I have no plans to change the publicly visible interface of geometric_mean.
msg276057 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-09-12 15:35
Steven: any thoughts about the

statistics.geometric_mean(0.7 for _ in range(5000))

failure? Should I open a separate bug report for that, or would you rather address it as part of this issue?
msg276060 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-09-12 15:39
>>> statistics.geometric_mean([0.7 for _ in range(5000)])
Traceback (most recent call last):
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 362, in float_nroot
    isinfinity = math.isinf(x)
OverflowError: int too large to convert to float

=> see also issue #27975
msg276151 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-09-13 02:28
On Mon, Sep 12, 2016 at 03:35:14PM +0000, Mark Dickinson wrote:
> statistics.geometric_mean(0.7 for _ in range(5000))

I've raised a new ticket #28111
msg278056 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-10-04 16:14
I'm sorry to say that due to technical difficulties, geometric mean is not going to be in a fit state for beta 2 of 3.6, and so is going to be removed and delayed until 3.7.
msg278059 - (view) Author: Roundup Robot (python-dev) Date: 2016-10-04 16:25
New changeset 9dce0e41bedd by Steven D'Aprano in branch 'default':
Issue #27181 remove geometric_mean and defer for 3.7.
https://hg.python.org/cpython/rev/9dce0e41bedd
msg278076 - (view) Author: Roundup Robot (python-dev) Date: 2016-10-04 18:52
New changeset de0fa478c22e by Steven D'Aprano in branch '3.6':
Issue #27181 remove geometric_mean and defer for 3.7.
https://hg.python.org/cpython/rev/de0fa478c22e
msg278078 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-10-04 18:56
Thanks, Steven.  Actually, we needed to remove geometric_mean from the 3.6 branch, not the default branch (which will become 3.7).  I backported your removal patch to 3.6.  Feel free to reapply geometric_mean to the default branch at your leisure.
History
Date User Action Args
2016-10-04 18:56:20ned.deilysetmessages: + msg278078
stage: patch review
2016-10-04 18:52:26python-devsetmessages: + msg278076
2016-10-04 16:25:29python-devsetmessages: + msg278059
2016-10-04 16:14:43steven.dapranosetpriority: release blocker ->

messages: + msg278056
versions: + Python 3.7, - Python 3.6
2016-09-13 02:28:03steven.dapranosetmessages: + msg276151
2016-09-12 15:39:40hayposetmessages: + msg276060
2016-09-12 15:35:13mark.dickinsonsetmessages: + msg276057
2016-09-12 02:50:47steven.dapranosetmessages: + msg275927
2016-08-17 17:18:15mark.dickinsonsetmessages: + msg272970
2016-08-17 08:41:31koobssetnosy: + koobs
messages: + msg272913
2016-08-16 20:24:35hayposetpriority: normal -> release blocker

messages: + msg272882
2016-08-16 20:22:57python-devsetmessages: + msg272881
2016-08-16 20:22:07hayposetnosy: + haypo
messages: + msg272880
2016-08-15 23:05:40ned.deilysetnosy: + ned.deily
messages: + msg272807
2016-08-14 09:16:11mark.dickinsonsetmessages: + msg272659
2016-08-14 02:31:20steven.dapranosetmessages: + msg272640
2016-08-12 11:45:25mark.dickinsonsetmessages: + msg272526
2016-08-12 11:07:42steven.dapranosetmessages: + msg272525
2016-08-12 07:44:40mark.dickinsonsetmessages: + msg272504
2016-08-12 01:04:21martin.pantersetnosy: + martin.panter
messages: + msg272488
2016-08-09 08:44:08cool-RRsetmessages: + msg272225
2016-08-09 08:00:07steven.dapranosetmessages: + msg272224
2016-08-09 06:44:22cool-RRsetmessages: + msg272218
2016-08-09 06:40:53cool-RRsetmessages: + msg272217
2016-08-09 04:18:51python-devsetnosy: + python-dev
messages: + msg272214
2016-07-09 09:15:14mark.dickinsonsetmessages: + msg270033
2016-07-09 05:57:46rhettingersetmessages: + msg270026
2016-07-09 05:49:17steven.dapranosetmessages: + msg270025
2016-06-09 12:55:29mark.dickinsonsetmessages: + msg268022
2016-06-09 11:59:05steven.dapranosetmessages: + msg268020
2016-06-09 09:24:04mark.dickinsonsetmessages: + msg268008
2016-06-09 08:23:34mark.dickinsonsetnosy: + mark.dickinson
messages: + msg267990
2016-06-04 14:14:26cool-RRsetmessages: + msg267253
2016-06-03 06:00:58cool-RRsetmessages: + msg267051
2016-06-02 22:04:12steven.dapranosetmessages: + msg266969
2016-06-02 21:04:54rhettingersetassignee: steven.daprano

messages: + msg266948
nosy: + rhettinger
2016-06-02 12:38:33xiang.zhangsetnosy: + steven.daprano
2016-06-02 12:24:07cool-RRcreate