Add geometric mean to `statistics` module #71368

cool-RR · 2016-06-02T12:24:08Z

BPO	27181
Nosy	@rhettinger, @mdickinson, @stevendaprano, @cool-RR, @vadmium, @koobs, @csabella
PRs	bpo-27181: Add statistics.geometric_mean() #12638

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/stevendaprano'
closed_at = <Date 2019-04-07.16:21:07.856>
created_at = <Date 2016-06-02.12:24:07.764>
labels = ['3.8', 'type-feature', 'library']
title = 'Add geometric mean to `statistics` module'
updated_at = <Date 2019-04-07.16:21:07.854>
user = 'https://github.com/cool-RR'

bugs.python.org fields:

activity = <Date 2019-04-07.16:21:07.854>
actor = 'rhettinger'
assignee = 'steven.daprano'
closed = True
closed_date = <Date 2019-04-07.16:21:07.856>
closer = 'rhettinger'
components = ['Library (Lib)']
creation = <Date 2016-06-02.12:24:07.764>
creator = 'cool-RR'
dependencies = []
files = []
hgrepos = []
issue_num = 27181
keywords = ['patch']
message_count = 45.0
messages = ['266948', '266969', '267051', '267253', '267990', '268008', '268020', '268022', '270025', '270026', '270033', '272214', '272217', '272218', '272224', '272225', '272488', '272504', '272525', '272526', '272640', '272659', '272807', '272880', '272881', '272882', '272913', '272970', '275927', '276057', '276060', '276151', '278056', '278059', '278076', '278078', '300918', '335662', '335720', '338742', '339082', '339094', '339319', '339579', '339580']
nosy_count = 8.0
nosy_names = ['rhettinger', 'mark.dickinson', 'steven.daprano', 'cool-RR', 'python-dev', 'martin.panter', 'koobs', 'cheryl.sabella']
pr_nums = ['12638']
priority = None
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue27181'
versions = ['Python 3.8']

rhettinger · 2016-06-02T21:04:54Z

Steven, this seems like a reasonable suggestion (though I would expect someone else will immediately suggest a harmonic mean as well). Is this within the scope of what you were trying to do with the statistics module?

stevendaprano · 2016-06-02T22:04:13Z

On Thu, Jun 02, 2016 at 09:04:54PM +0000, Raymond Hettinger wrote:

Steven, this seems like a reasonable suggestion (though I would expect
someone else will immediately suggest a harmonic mean as well). Is
this within the scope of what you were trying to do with the
statistics module?

Yes, I think it is reasonable too. I'll aim to get this in to 3.6.

cool-RR · 2016-06-03T06:00:58Z

To complicate things further...

I implemented a geometric mean on my own, and then I figured out what I really want is a weighted geometric mean, so I implemented that for myself. If you'd want to include that, that'll be cool. Actually I'm not sure if the goal of the statistics module is to be comprehensive or minimal. I'm hoping it's meant to be comprehensive. But then I'd guess there would be a lot of things you'd want to add except my little feature.

cool-RR · 2016-06-04T14:14:26Z

And of course, if the goal of the statistics module is to be comprehensive, one should ask himself what should be the difference between this new module and a mature statistics module like scipy.stats, and whether we should try to copy the features of off scipy.stats.

mdickinson · 2016-06-09T08:23:34Z

Choice of algorithm is a bit tricky here. There are a couple of obvious algorithms that work mathematically but result in significant accuracy loss in an IEEE 754 floating-point implementation: one is exp(mean(map(log, my_numbers))), where the log calls can introduce significant loss of information, and the other is prod(x**(1./len(my_numbers)) for x in my_numbers), where the **(1./n) operation similarly discards information. A better algorithm numerically is prod(my_numbers)**(1./len(my_numbers)), but that's likely to overflow quickly for large datasets (and/or datasets containing large values).

I'd suggest something along the lines of prod(my_numbers)**(1./len(my_numbers)), but keeping track of the exponent of the product separately and renormalizing where necessary to avoid overflow.

There are also algorithms for improved accuracy in a product, along the same lines as the algorithm used in fsum. See e.g., the paper "Accurate Floating-Point Product and Exponentiation" by Stef Graillat. [1] (I didn't know about this paper: I just saw a reference to it in a StackOverflow comment [2], which reminded me of this issue.)

[1] http://www-pequan.lip6.fr/~graillat/papers/IEEE-TC-prod.pdf
[2] http://stackoverflow.com/questions/37715250/safe-computation-of-geometric-mean

mdickinson · 2016-06-09T09:24:04Z

On the other hand, apparently exp(mean(log(...))) is good enough for SciPy: its current implementation looks like this:

def gmean(a, axis=0):
    a, axis = _chk_asarray(a, axis)
    log_a = ma.log(a)
    return ma.exp(log_a.mean(axis=axis))

stevendaprano · 2016-06-09T11:59:05Z

On Thu, Jun 09, 2016 at 09:24:04AM +0000, Mark Dickinson wrote:

On the other hand, apparently exp(mean(log(...))) is good enough for SciPy:

Hmm, well, I don't have SciPy installed, but I've found that despite
their (well-deserved) reputation, numpy (and presumably scipy) often
have rather naive algorithms that can lose accuracy rather
spectacularly.

py> statistics.mean([1e50, 2e-50, -1e50, 2e-50])
1e-50
py> np.mean(np.array([1e50, 2e-50, -1e50, 2e-50]))
5e-51

py> statistics.mean([1e50, 2e-50, -1e50, 2e-50]*1000)
1e-50
py> np.mean(np.array([1e50, 2e-50, -1e50, 2e-50]*1000))
5.0000000000000002e-54

On the other hand, np is probably a hundred times (or more) faster, so I
suppose accuracy/speed makes a good trade off.

mdickinson · 2016-06-09T12:55:30Z

Hmm, well, I don't have SciPy installed, but I've found that despite
their (well-deserved) reputation, numpy (and presumably scipy) often
have rather naive algorithms that can lose accuracy rather
spectacularly.

Agreed. And as Ram Rachum hinted, there seems little point aiming to duplicate things that already exist in the de facto standard scientific libraries. So I think there's a place for a non-naive carefully computed geometric mean in the std. lib. statistics module, but I wouldn't see the point of simply adding an exp-mean-log implementation (not that anyone is advocating that).

stevendaprano · 2016-07-09T05:49:18Z

Does anyone have any strong feeling about the name for these functions?

gmean and hmean;

geometric_mean and harmonic_mean

And "subcontrary_mean" is not an option :-)

rhettinger · 2016-07-09T05:57:47Z

I would like to see them spelled-out: geometric_mean and harmonic_mean

mdickinson · 2016-07-09T09:15:15Z

I would like to see them spelled-out: geometric_mean and harmonic_mean

+1

python-dev · 2016-08-09T04:18:52Z

New changeset 9eb5edfcf604 by Steven D'Aprano in branch 'default':
bpo-27181 add geometric mean.
https://hg.python.org/cpython/rev/9eb5edfcf604

cool-RR · 2016-08-09T06:40:53Z

Thanks for the patch Steven! I won't comment about the code because I don't know enough about these algorithms, but I'm thinking, since you also did a refactoring of the statistics module, maybe these should be two separate patches/commits so it'll be easy to see which part is the new feature and which part is moving existing code around?

cool-RR · 2016-08-09T06:44:22Z

Also... I like the detailed docstrings with the real-life examples! That stuff helps when coding and using an unfamiliar function (since I see the docs in a panel of my IDE), so I wish I'd see more detailed docstrings like these ones in the standard library. For geometric_mean, maybe I'd add one sentence that describes how the geometric mean is calculated.

stevendaprano · 2016-08-09T08:00:08Z

On Tue, Aug 09, 2016 at 06:44:22AM +0000, Ram Rachum wrote:

For geometric_mean, maybe I'd add one sentence that describes
how the geometric mean is calculated.

What do you mean? As in, the mathematical definition of geometric mean?

Or do you mean a one sentence description of the algorithm?

cool-RR · 2016-08-09T08:44:08Z

I meant the mathematical definition.

vadmium · 2016-08-12T01:04:20Z

Tests fail on a Power PC buildbot:

http://buildbot.python.org/all/builders/PPC64LE%20Fedora%203.x/builds/1476/steps/test/logs/stdio
======================================================================
FAIL: testExactPowers (test.test_statistics.Test_Nth_Root) (i=29, n=11)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64le/build/Lib/test/test_statistics.py", line 1216, in testExactPowers
    self.assertEqual(self.nroot(x, n), i)
AssertionError: 29.000000000000004 != 29

======================================================================
FAIL: testExactPowersNegatives (test.test_statistics.Test_Nth_Root) (i=-29, n=11)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64le/build/Lib/test/test_statistics.py", line 1228, in testExactPowersNegatives
    self.assertEqual(self.nroot(x, n), i)
AssertionError: -29.000000000000004 != -29

mdickinson · 2016-08-12T07:44:40Z

What no patch for pre-commit review?!

For computing nth roots, it may be worth special-casing the case n=2: for floats, math.sqrt is likely to be faster and more precise than an ad-hoc algorithm. (Indeed, I'd expect it to be perfectly correctly rounded on the vast majority of current machines.)

stevendaprano · 2016-08-12T11:07:43Z

I thought about special-casing n=2 to math.sqrt, but as that's an
implementation detail I can make that change at any time. According
to my testing, math.pow(x, 0.5) is no worse than sqrt, so I'm not
sure if there's any advantage to having yet another branch.

I'd be interested in special-casing n=3 to math.cbrt (if and when it exists)
now that its a standard C99 function.

mdickinson · 2016-08-12T11:45:26Z

According to my testing, math.pow(x, 0.5) is no worse than sqrt.

It certainly is worse than sqrt, both in terms of speed and accuracy. Whether the difference is enough to make it worth special-casing is another question, of course, and as you say, that can happen later.

stevendaprano · 2016-08-14T02:31:20Z

I've created a new issue to track the loss of accuracy on PowerPC: http://bugs.python.org/issue27761

mdickinson · 2016-08-14T09:16:12Z

A failing case:

>>> statistics.geometric_mean([0.7 for _ in range(5000)])
Traceback (most recent call last):
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 362, in float_nroot
    isinfinity = math.isinf(x)
OverflowError: int too large to convert to float

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 595, in geometric_mean
    s = 2**p * _nth_root(2**q, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 346, in nth_root
    return _nroot_NS.float_nroot(x, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 364, in float_nroot
    return _nroot_NS.bignum_nroot(x, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 489, in bignum_nroot
    b = 2**q * _nroot_NS.nroot(2**r, n)
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 384, in nroot
    r1 = math.pow(x, 1.0/n)
OverflowError: int too large to convert to float

ned-deily · 2016-08-15T23:05:40Z

FTR, multiple platforms are failing in various ways, not just PPC64, so bpo-27761 was expanded to cover them and has been marked as a "release blocker".

vstinner · 2016-08-16T20:22:07Z

Failure on s390x Debian 3.x:

http://buildbot.python.org/all/builders/s390x%20Debian%203.x/builds/1455/steps/test/logs/stdio

======================================================================
FAIL: testExactPowers (test.test_statistics.Test_Nth_Root) (i=29, n=11)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/test_statistics.py", line 1216, in testExactPowers
    self.assertEqual(self.nroot(x, n), i)
AssertionError: 29.000000000000004 != 29

======================================================================
FAIL: testExactPowersNegatives (test.test_statistics.Test_Nth_Root) (i=-29, n=11)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-debian-z/build/Lib/test/test_statistics.py", line 1228, in testExactPowersNegatives
    self.assertEqual(self.nroot(x, n), i)
AssertionError: -29.000000000000004 != -29

python-dev · 2016-08-16T20:22:58Z

New changeset 54288b160243 by Victor Stinner in branch 'default':
Issue bpo-27181: Skip tests known to fail until a fix is found
https://hg.python.org/cpython/rev/54288b160243

vstinner · 2016-08-16T20:24:35Z

I would like to use buildbots to check for regressions, but I see a lot of red buildbots, so buildbots became useless :-/

I skipped failing test_statistics tests, since failures are known.

I put the priority to "release blocker".

I suggest to either revert the change or find a fix before 3.6b1.

koobs · 2016-08-17T08:41:31Z

For posterity, the following failure was observed on all (9/10/11(current) FreeBSD buildbots:

======================================================================
FAIL: testFraction (test.test_statistics.Test_Nth_Root)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_statistics.py", line 1247, in testFraction
    self.assertEqual(self.nroot(x**12, 12), float(x))
AssertionError: 1.1866666666666665 != 1.1866666666666668

mdickinson · 2016-08-17T17:18:16Z

self.assertEqual(self.nroot(x\*\*12, 12), float(x))
AssertionError: 1.1866666666666665 != 1.1866666666666668

That looks like a case where the test should simply be weakened to an assertAlmostEqual with a suitable tolerance; there's no strong reason to expect that nroot will give a faithfully rounded result in this case or any other.

stevendaprano · 2016-09-12T02:50:47Z

As discussed with Ned by email, I'm currently unable to build 3.6 and won't have time to work on this before b1. As discussed on bpo-27761 my tests here are too strict and should be loosened, e.g. from assertEqual to assertAlmostEqual. Ned wrote:

"If you are only planning to make changes to the tests themselves, I think that can wait for b2."

I have no plans to change the publicly visible interface of geometric_mean.

mdickinson · 2016-09-12T15:35:14Z

Steven: any thoughts about the

statistics.geometric_mean(0.7 for _ in range(5000))

failure? Should I open a separate bug report for that, or would you rather address it as part of this issue?

vstinner · 2016-09-12T15:39:40Z

>>> statistics.geometric_mean([0.7 for _ in range(5000)])
Traceback (most recent call last):
  File "/Users/mdickinson/Python/cpython-git/Lib/statistics.py", line 362, in float_nroot
    isinfinity = math.isinf(x)
OverflowError: int too large to convert to float

=> see also issue bpo-27975

stevendaprano · 2016-09-13T02:28:03Z

On Mon, Sep 12, 2016 at 03:35:14PM +0000, Mark Dickinson wrote:

statistics.geometric_mean(0.7 for _ in range(5000))

I've raised a new ticket bpo-28111

stevendaprano · 2016-10-04T16:14:44Z

I'm sorry to say that due to technical difficulties, geometric mean is not going to be in a fit state for beta 2 of 3.6, and so is going to be removed and delayed until 3.7.

python-dev · 2016-10-04T16:25:30Z

New changeset 9dce0e41bedd by Steven D'Aprano in branch 'default':
Issue bpo-27181 remove geometric_mean and defer for 3.7.
https://hg.python.org/cpython/rev/9dce0e41bedd

python-dev · 2016-10-04T18:52:26Z

New changeset de0fa478c22e by Steven D'Aprano in branch '3.6':
Issue bpo-27181 remove geometric_mean and defer for 3.7.
https://hg.python.org/cpython/rev/de0fa478c22e

ned-deily · 2016-10-04T18:56:21Z

Thanks, Steven. Actually, we needed to remove geometric_mean from the 3.6 branch, not the default branch (which will become 3.7). I backported your removal patch to 3.6. Feel free to reapply geometric_mean to the default branch at your leisure.

csabella · 2017-08-27T14:05:18Z

I was wondering if this has been taken up again for 3.7? Thanks!

csabella · 2019-02-16T00:25:41Z

Updating the version in case this wanted to be considered for 3.8.

rhettinger · 2019-02-16T20:55:21Z

Updating the version in case this wanted to be considered for 3.8.

Yes. It would be nice to get this wrapped-up.

rhettinger · 2019-03-24T18:24:21Z

Almost three years have passed.

In the spirit of "perfect is the enemy of good", would it be reasonable to start with a simple, fast implementation using exp-mean-log? Then if someone wants to make it more accurate later, they can do so.

In some quick tests, I don't see much of an accuracy loss. It looks to be plenty good enough to use as a starting point:

--- Accuracy experiments ---

>> from decimal import Decimal
>> from functools import reduce
>> from operator import mul
>> from random import expovariate, triangular
>> from statistics import fmean

>>> # https://www.wolframalpha.com/input/?i=geometric+mean+12,+17,+13,+5,+120,+7
>>> data = [12, 17, 13, 5, 120, 7]
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
14.94412420173971227234687688
>>> exp(fmean(map(log, map(fabs, data))))
14.944124201739715

>>> data = [expovariate(50.0) for i in range(1_000)]
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
0.01140902688569587677205587938
>>> exp(fmean(map(log, map(fabs, data))))
0.011409026885695879

>>> data = [triangular(2000.0, 3000.0, 2200.0) for i in range(10_000)]
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
2388.381301718524160840023868
>>> exp(fmean(map(log, map(fabs, data))))
2388.3813017185225

>>> data = [lognormvariate(20.0, 3.0) for i in range(100_000)]
>>> min(data), max(data)
(2421.506538652375, 137887726484094.5)
>>> print(reduce(mul, map(Decimal, data)) ** (Decimal(1) / len(data)))
484709306.8805352290183838500
>>> exp(fmean(map(log, map(fabs, data))))
484709306.8805349

stevendaprano · 2019-03-28T23:37:04Z

In the spirit of "perfect is the enemy of good", would it be
reasonable to start with a simple, fast implementation using
exp-mean-log? Then if someone wants to make it more accurate later,
they can do so.

I think that is a reasonable idea. On the basis that something is better
than nothing, go ahead. We can discuss accuracy and speed issues later.

Getting some tricky cases down for reference:

# older (removed) implementation
py> geometric_mean([7]*2)
7.0
py> geometric_mean([7]*15)
7.0

# Raymond's newer (faster) implementation
py> exp(fmean(map(log, [7]*2)))
6.999999999999999
py> exp(fmean(map(log, [7]*15)))
6.999999999999999

py> geometric_mean([3,27])
9.0
py> geometric_mean([3,27]*5)
9.0

py> exp(fmean(map(log, [3,27])))
9.000000000000002
py> exp(fmean(map(log, [3,27]*5)))
8.999999999999998

py> x = 2.5e15
py> geometric_mean([x]*100)
2500000000000000.0
py> exp(fmean(map(log, [x]*100)))
2499999999999999.5

On the other hand, sometimes rounding errors work in our favour:

py> geometric_mean([1e50, 1e-50]) # people might expect 1.0
0.9999999999999998
py> 1e-50 == 1/(1e50) # even though they aren't quite inverses
False

py> exp(fmean(map(log, [1e50, 1e-50])))
1.0

rhettinger · 2019-03-29T03:55:44Z

On the basis that something is better than nothing, go ahead.
We can discuss accuracy and speed issues later.

Thanks. I'll put together a PR for your consideration.

rhettinger · 2019-04-02T07:13:37Z

Steven, how does this look?

https://patch-diff.githubusercontent.com/raw/python/cpython/pull/12638.diff

rhettinger · 2019-04-07T16:20:06Z

New changeset 6463ba3 by Raymond Hettinger in branch 'master':
bpo-27181: Add statistics.geometric_mean() (GH-12638)
6463ba3

rhettinger · 2019-04-07T16:21:08Z

Feel free to reopen this if something further needed to be changed or discussed.

cool-RR mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jun 2, 2016

rhettinger assigned stevendaprano Jun 2, 2016

vstinner added the release-blocker label Aug 16, 2016

stevendaprano added 3.7 (EOL) end of life and removed release-blocker labels Oct 4, 2016

csabella added 3.8 only security fixes and removed 3.7 (EOL) end of life labels Feb 16, 2019

rhettinger closed this as completed Apr 7, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

Add geometric mean to statistics module #71368

Add geometric mean to statistics module #71368

Comments

cool-RR mannequin commented Jun 2, 2016

rhettinger commented Jun 2, 2016

stevendaprano commented Jun 2, 2016

cool-RR mannequin commented Jun 3, 2016

cool-RR mannequin commented Jun 4, 2016

mdickinson commented Jun 9, 2016

mdickinson commented Jun 9, 2016

stevendaprano commented Jun 9, 2016

mdickinson commented Jun 9, 2016

stevendaprano commented Jul 9, 2016

rhettinger commented Jul 9, 2016

mdickinson commented Jul 9, 2016

python-dev mannequin commented Aug 9, 2016

cool-RR mannequin commented Aug 9, 2016

cool-RR mannequin commented Aug 9, 2016

stevendaprano commented Aug 9, 2016

cool-RR mannequin commented Aug 9, 2016

vadmium commented Aug 12, 2016

mdickinson commented Aug 12, 2016

stevendaprano commented Aug 12, 2016

mdickinson commented Aug 12, 2016

stevendaprano commented Aug 14, 2016

mdickinson commented Aug 14, 2016

ned-deily commented Aug 15, 2016

vstinner commented Aug 16, 2016

python-dev mannequin commented Aug 16, 2016

vstinner commented Aug 16, 2016

koobs commented Aug 17, 2016

mdickinson commented Aug 17, 2016

stevendaprano commented Sep 12, 2016

mdickinson commented Sep 12, 2016

vstinner commented Sep 12, 2016

stevendaprano commented Sep 13, 2016

stevendaprano commented Oct 4, 2016

python-dev mannequin commented Oct 4, 2016

python-dev mannequin commented Oct 4, 2016

ned-deily commented Oct 4, 2016

csabella commented Aug 27, 2017

csabella commented Feb 16, 2019

rhettinger commented Feb 16, 2019

rhettinger commented Mar 24, 2019

stevendaprano commented Mar 28, 2019

rhettinger commented Mar 29, 2019

rhettinger commented Apr 2, 2019

rhettinger commented Apr 7, 2019

rhettinger commented Apr 7, 2019

Add geometric mean to `statistics` module #71368

Add geometric mean to `statistics` module #71368