Issue 21184: statistics.pvariance with known mean does not work as expected

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65383

classification

Title:	statistics.pvariance with known mean does not work as expected
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4, Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	steven.daprano	Nosy List:	iritkatriel, rhettinger, steven.daprano, vstinner, wolma
Priority:	normal	Keywords:

Created on 2014-04-09 03:14 by steven.daprano, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg215802 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2014-04-09 03:14
If you know the population mean mu, you should calculate the sample variance by passing mu as an explicit argument to statistics.pvariance. Unfortunately, it doesn't work as designed:

py> data = [1, 2, 2, 2, 3, 4]  # sample from a population with mu=2.5
py> statistics.pvariance(data)  # uses the sample mean 2.3333...
0.8888888888888888
py> statistics.pvariance(data, 2.5)  # using known population mean
0.8888888888888888

The second calculation ought to be 0.91666... not 0.88888...

The problem lies with the _ss private function which calculates the sum of square deviations. Unfortunately it is too clever: it includes an error adjustment term

    ss -= _sum((x-c) for x in data)**2/len(data)

which mathematically is expected to be zero when c is calculated as the mean of data, but due to rounding may not be quite zero. But when c is given explicitly, as happens if the caller provides an explicit mu argument to pvariance, then the error adjustment has the effect of neutralizing the explicit mu.

The obvious fix is to just skip the adjustment in _ss when c is explicitly given, but I'm not sure if that's the best approach.
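To make the cancellation concrete, here is a minimal sketch of the behaviour described in this message. The helper name ss_with_adjustment is hypothetical: it is a simplified stand-in for the private _ss function, not the module's actual source. Fraction is used so that the arithmetic is exact.

```python
from fractions import Fraction


def ss_with_adjustment(data, c):
    """Sum of squared deviations about c, minus the adjustment term.

    Hypothetical simplification of the private _ss helper discussed
    in this issue; not the real implementation.
    """
    n = len(data)
    ss = sum((x - c) ** 2 for x in data)
    # The error adjustment term quoted in the report.
    ss -= sum((x - c) for x in data) ** 2 / n
    return ss


data = [Fraction(x) for x in [1, 2, 2, 2, 3, 4]]
n = len(data)
mu = Fraction(5, 2)   # known population mean, 2.5
xbar = sum(data) / n  # sample mean, 7/3

plain = sum((x - mu) ** 2 for x in data) / n   # no adjustment
adjusted = ss_with_adjustment(data, mu) / n    # adjustment applied
about_sample_mean = ss_with_adjustment(data, xbar) / n

print(plain)              # 11/12
print(adjusted)           # 8/9
print(about_sample_mean)  # 8/9
```

The adjusted result matches the sample-mean result because, for any c, sum((x-c)**2) - sum(x-c)**2/n equals the sum of squared deviations about the sample mean, so the explicit c drops out of the calculation entirely.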
msg215806 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-09 08:20
I do not think this is a bug in the module, but rather incorrect usage. From your own docs:

    data should be an iterable of Real-valued numbers, with at least one value. The optional argument mu, if given, should be the mean of the data. If it is missing or None, the mean is automatically calculated.

Nowhere does it say that mu should be the known population mean, and rightly so!

The definition of p_variance is that it is the variance of the data assuming that data is the whole population (so the correct mean can be calculated from it).

s_variance on the other hand should give an estimate of the population variance under the assumption that data is a random sample of the population, but its formula _ss/(n-1) is derived under the assumption that mu is the sample mean, not the population mean.

So everything's fine and there is nothing to fix really!
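The two definitions in this message can be checked against the public API. The sketch below uses only statistics.mean, statistics.pvariance and statistics.variance; the variable names are illustrative, and the comparisons use math.isclose because the inputs are converted to floats.

```python
import math
import statistics

data = [1, 2, 2, 2, 3, 4]
n = len(data)
xbar = statistics.mean(data)

# Sum of squared deviations about the sample mean.
ss = sum((x - xbar) ** 2 for x in data)

pop_var = statistics.pvariance(data)   # treats data as the whole population
samp_var = statistics.variance(data)   # treats data as a sample

print(math.isclose(pop_var, ss / n))
print(math.isclose(samp_var, ss / (n - 1)))

# The documented use of mu: pass a mean that was already computed
# from the same data, to avoid calculating it a second time.
print(math.isclose(statistics.pvariance(data, xbar), pop_var))
```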
msg215816 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2014-04-09 13:29
On Wed, Apr 09, 2014 at 08:20:42AM +0000, Wolfgang Maier wrote:
> I do not think this is a bug in the module, but rather incorrect usage.
[...]

No, it is legitimate usage. See, for example, "Numerical Recipes in Pascal" by Press et al. When you know the population mean independently from the sample you're using, you should not apply Bessel's Correction (that is, you should use a denominator of n rather than n-1, which is equivalent to using the population variance).

I don't think it is appropriate to include too much of the complexity about variance in the docs. (They should document the module, not teach all the odd corners of statistics theory.) I've tried to clarify the different uses of (p)variance here:

http://import-that.dreamwidth.org/2291.html

If you're still not convinced, this usage is equivalent to the gsl_stats_variance_with_fixed_mean function from the GNU Scientific Library:

https://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
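The estimator being argued for here can be written as a short standalone recipe. This is only a sketch: variance_with_fixed_mean is a hypothetical helper named after the GSL function mentioned in this message, not part of the statistics module.

```python
from fractions import Fraction


def variance_with_fixed_mean(data, mu):
    """Variance about an independently known mean mu.

    Hypothetical recipe: divides by n, not n - 1, because mu was not
    estimated from the data, so Bessel's correction does not apply.
    """
    data = list(data)
    if not data:
        raise ValueError("at least one data point is required")
    return sum((x - mu) ** 2 for x in data) / len(data)


data = [1, 2, 2, 2, 3, 4]

# Float arithmetic with the known mean 2.5.
print(variance_with_fixed_mean(data, 2.5))

# Exact arithmetic with the same mean.
print(variance_with_fixed_mean(data, Fraction(5, 2)))  # 11/12
```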
msg215820 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-09 14:11
OK, there may be use cases for calculating a variance estimate in such situations, but IMHO what you are trying to do is to abuse a function which is not documented to be made for the purpose and then complain that it does not behave correctly. The documented use of the mu argument is to avoid redundant calculations of the mean of data!

With just one argument, how would you know whether the user wants this documented functionality or the undocumented one? Your suggestion of just omitting the correction means that every user who wants the documented functionality gets a potentially imprecise result. Another potential approach may be to correct the correction term based on the mean calculated from data, but such a calculation would be absurd given the documented functionality.

In case the statistics module is going to use exact representations of internal results in 3.5, the error adjustment would become obsolete anyway (see http://bugs.python.org/issue20499) and pvariance could be abused just as you suggest. In this case, this usage could be sanctioned in the form of a recipe?
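The point about exact representations can be illustrated with a small sketch. It assumes Fraction as the exact representation, which is an illustrative choice here and not a statement about how the module works internally.

```python
from fractions import Fraction

ints = [1, 2, 2, 2, 3, 4]
n = len(ints)

# Exact mean: 7/3 is stored without rounding.
exact_data = [Fraction(x) for x in ints]
exact_mean = sum(exact_data) / n

# With an exact mean, the adjustment term is exactly zero,
# so subtracting it changes nothing.
adjustment = sum(x - exact_mean for x in exact_data) ** 2 / n
print(adjustment)  # 0

# A float mean is only the nearest binary approximation of 7/3,
# so the deviations about it do not sum to exactly zero.
float_mean = sum(ints) / n
residual = sum(Fraction(x) - Fraction(float_mean) for x in ints)
print(residual == 0)  # False
```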
msg399967 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-08-20 13:11
I can't reproduce this on 3.11, was it fixed?

>>> import statistics
>>> data = [1, 2, 2, 2, 3, 4]
>>> statistics.pvariance(data)
0.8888888888888888
>>> statistics.pvariance(data, 2.5)
0.9166666666666666
msg400009 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-08-21 01:52
See commit d71ab4f73887a6e2b380ddbbfe35b600d236fd4a for bpo-40855.

History
Date	User	Action	Args
2022-04-11 14:58:01	admin	set	github: 65383
2021-08-21 01:52:03	rhettinger	set	status: open -> closed nosy: + rhettinger messages: + msg400009 resolution: fixed stage: needs patch -> resolved
2021-08-20 13:11:54	iritkatriel	set	nosy: + iritkatriel messages: + msg399967
2014-04-09 14:11:01	wolma	set	messages: + msg215820
2014-04-09 13:32:56	vstinner	set	nosy: + vstinner
2014-04-09 13:29:31	steven.daprano	set	messages: + msg215816
2014-04-09 08:20:42	wolma	set	nosy: + wolma messages: + msg215806
2014-04-09 03:14:30	steven.daprano	create