This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

Title: statistics.pvariance with known mean does not work as expected
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.5
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: iritkatriel, rhettinger, steven.daprano, vstinner, wolma
Priority: normal Keywords:

Created on 2014-04-09 03:14 by steven.daprano, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg215802 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-04-09 03:14
If you know the population mean mu, you should use it when calculating the population variance, by passing mu as an explicit argument to statistics.pvariance. Unfortunately, this doesn't work as expected:

py> data = [1, 2, 2, 2, 3, 4]  # sample from a population with mu=2.5
py> statistics.pvariance(data)  # uses the sample mean 2.3333...
0.8888888888888888
py> statistics.pvariance(data, 2.5)  # using known population mean
0.8888888888888888

The second calculation ought to be 0.91666... not 0.88888...

The problem lies with the private function _ss, which calculates the sum of squared deviations. Unfortunately it is too clever: it includes an error adjustment term

ss -= _sum((x-c) for x in data)**2/len(data)

which is mathematically zero when c is calculated as the mean of the data, but due to rounding may not be quite zero in floating point. However, when c is given explicitly, as happens when the caller passes an explicit mu argument to pvariance, the adjustment neutralizes the explicit mu: subtracting sum(x-c)**2/n algebraically re-centres the sum of squares around the sample mean, no matter what c was.

The obvious fix is to just skip the adjustment in _ss when c is explicitly given, but I'm not sure if that's the best approach.
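For concreteness, both quantities can be reproduced exactly with fractions.Fraction. This is a sketch of the arithmetic, not the module's internal code; the point is that the adjustment term maps the explicitly-centred sum of squares back onto the sample-mean-centred one:

```python
from fractions import Fraction

data = [1, 2, 2, 2, 3, 4]
n = len(data)

def ss(data, c):
    # Sum of squared deviations about the centre c, computed exactly.
    return sum((Fraction(x) - Fraction(c)) ** 2 for x in data)

sample_mean = Fraction(sum(data), n)          # 7/3 = 2.3333...
pvar_sample_mean = ss(data, sample_mean) / n  # 8/9 = 0.8888...
pvar_known_mu = ss(data, Fraction(5, 2)) / n  # 11/12 = 0.9166...

# The error adjustment term neutralizes the explicit centre:
# subtracting sum(x - c)**2 / n re-centres the sum of squares
# around the sample mean, whatever c was.
adjusted = (ss(data, Fraction(5, 2))
            - sum(Fraction(x) - Fraction(5, 2) for x in data) ** 2 / n)
assert adjusted / n == pvar_sample_mean  # 8/9 again, mu is lost
```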
msg215806 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-09 08:20
I do not think this is a bug in the module, but rather incorrect usage.

From your own docs:
    data should be an iterable of Real-valued numbers, with at least one
    value. The optional argument mu, if given, should be the mean of
    the data. If it is missing or None, the mean is automatically calculated.

Nowhere does it say that mu should be the known population mean, and rightly so!
The definition of pvariance is that it is the variance of the data assuming that the data *is* the whole population (so the correct mean can be calculated from it).
variance, on the other hand, should give an estimate of the population variance under the assumption that the data is a random sample of the population, but its formula _ss/(n-1) is derived under the assumption that mu is the sample mean, not the population mean.
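That distinction can be checked directly against the module, using the same data as the report (a small sanity check, not part of the original message):

```python
import statistics

data = [1, 2, 2, 2, 3, 4]
n = len(data)
m = statistics.mean(data)
ss = sum((x - m) ** 2 for x in data)  # sum of squared deviations

# pvariance treats the data as the whole population: denominator n.
assert abs(statistics.pvariance(data) - ss / n) < 1e-12
# variance estimates from a sample: denominator n - 1 (Bessel's correction).
assert abs(statistics.variance(data) - ss / (n - 1)) < 1e-12
```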

So everything's fine and there is nothing to fix really!
msg215816 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-04-09 13:29
On Wed, Apr 09, 2014 at 08:20:42AM +0000, Wolfgang Maier wrote:
> I do not think this is a bug in the module, but rather incorrect usage.

No, it is legitimate usage. See, for example, "Numerical Recipes in 
Pascal" by Press et al. When you know the population mean independently 
from the sample you're using, you should not apply Bessel's Correction 
(that is, you should use a denominator of n rather than n-1, which 
is equivalent to using the population variance).

I don't think it is appropriate to include too much of the complexity 
about variance in the docs. (They should document the module, not teach 
all the odd corners of statistics theory.) I've tried to clarify the 
different uses of (p)variance here:

If you're still not convinced, this usage is equivalent to the 
gsl_stats_variance_with_fixed_mean function from the GNU Scientific Library.
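In Python, that fixed-mean estimator is a one-liner. The helper below is a hypothetical sketch mirroring the GSL function's behaviour, not code from the statistics module:

```python
def variance_with_fixed_mean(data, mu):
    # mu is known independently of the sample, so no Bessel's
    # correction is applied: divide by n, not n - 1.
    return sum((x - mu) ** 2 for x in data) / len(data)

# The example from the report: known population mean 2.5.
result = variance_with_fixed_mean([1, 2, 2, 2, 3, 4], 2.5)  # 0.91666...
```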
msg215820 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-09 14:11
OK, there may be use cases for calculating a variance estimate in such situations, but IMHO what you are trying to do is abuse a function for a purpose it is not documented to serve, and then complain that it does not behave correctly.

The *documented* use of the mu argument is to avoid redundant calculations of the mean of data!
With just one argument, how would you know whether the user wants this documented functionality or the undocumented one?

Your suggestion of just omitting the correction means that every user who wants the documented functionality gets a potentially imprecise result.
Another potential approach may be to correct the correction term based on the mean calculated from data, but such a calculation would be absurd given the documented functionality.

If the statistics module switches to exact representations of internal results in 3.5, the error adjustment would become obsolete anyway, and pvariance could be abused just as you suggest.
In that case, this usage could be sanctioned in the form of a recipe?
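The point about exact internal results can be illustrated with a sketch (assuming, for the sake of the example, that fractions.Fraction is the internal representation):

```python
from fractions import Fraction

data = [1, 2, 2, 2, 3, 4]
mean = Fraction(sum(data), len(data))  # exactly 7/3

# With exact arithmetic the residual about the computed mean is
# exactly zero, so the error adjustment term in _ss contributes
# nothing when c is the mean of the data.
residual = sum(Fraction(x) - mean for x in data)
assert residual == 0
```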
msg399967 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-20 13:11
I can't reproduce this on 3.11, was it fixed?

>>> import statistics
>>> data = [1, 2, 2, 2, 3, 4]
>>> statistics.pvariance(data)
0.8888888888888888
>>> statistics.pvariance(data, 2.5)
0.9166666666666666
msg400009 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-08-21 01:52
See commit d71ab4f73887a6e2b380ddbbfe35b600d236fd4a for bpo-40855.
Date User Action Args
2022-04-11 14:58:01 admin set github: 65383
2021-08-21 01:52:03 rhettinger set status: open -> closed; nosy: + rhettinger; messages: + msg400009; resolution: fixed; stage: needs patch -> resolved
2021-08-20 13:11:54 iritkatriel set nosy: + iritkatriel; messages: + msg399967
2014-04-09 14:11:01 wolma set messages: + msg215820
2014-04-09 13:32:56 vstinner set nosy: + vstinner
2014-04-09 13:29:31 steven.daprano set messages: + msg215816
2014-04-09 08:20:42 wolma set nosy: + wolma; messages: + msg215806
2014-04-09 03:14:30 steven.daprano create