Message 215802 - Python tracker



Author	steven.daprano
Recipients	steven.daprano
Date	2014-04-09.03:14:30
Message-id	<1397013270.92.0.0743559681991.issue21184@psf.upfronthosting.co.za>
In-reply-to

Content
If you know the population mean mu, you should calculate the variance of a sample by passing mu as an explicit argument to statistics.pvariance. Unfortunately, that doesn't work as designed:

py> data = [1, 2, 2, 2, 3, 4]  # sample from a population with mu=2.5
py> statistics.pvariance(data)  # uses the sample mean 2.3333...
0.8888888888888888
py> statistics.pvariance(data, 2.5)  # using known population mean
0.8888888888888888

The second calculation ought to be 0.91666... not 0.88888...
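
For reference, here is one way to check the expected figure by hand. It is only a sketch: it applies the textbook definition (the mean of the squared deviations from mu) with exact fractions, and does not call the statistics module at all.

```python
from fractions import Fraction

data = [1, 2, 2, 2, 3, 4]
mu = Fraction(5, 2)  # the known population mean, 2.5

# Variance about a known mean: the mean of the squared deviations from mu.
expected = sum((x - mu) ** 2 for x in data) / len(data)

print(expected)         # exact value, 11/12
print(float(expected))  # roughly 0.91666...
```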

The problem lies with the private _ss function, which calculates the sum of squared deviations. Unfortunately it is too clever: it includes an error adjustment term

ss -= _sum((x-c) for x in data)**2/len(data)

which mathematically is expected to be zero when c is calculated as the mean of data, but due to rounding may not be quite zero. But when c is given explicitly, as happens if the caller provides an explicit mu argument to pvariance, the error adjustment has the effect of neutralizing the explicit mu: the adjusted total comes out equal to the sum of squared deviations about the sample mean, whatever value of c was passed.
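
The cancellation can be seen in a small stand-alone sketch. This is not the module's code: adjusted_ss is a hypothetical name, and the built-in sum over Fractions stands in for the private _sum helper, so that rounding plays no part in the result.

```python
from fractions import Fraction

def adjusted_ss(data, c):
    """Sum of squared deviations about c, minus the adjustment term.

    Hypothetical stand-in for illustration; it only mirrors the
    adjustment line quoted in this message.
    """
    n = len(data)
    ss = sum((x - c) ** 2 for x in data)
    ss -= sum((x - c) for x in data) ** 2 / n
    return ss

data = [Fraction(x) for x in [1, 2, 2, 2, 3, 4]]

# Try the sample mean, the known mu, and a wildly wrong centre.
for c in (Fraction(7, 3), Fraction(5, 2), Fraction(100)):
    print(c, adjusted_ss(data, c) / len(data))
```

Each line prints the same variance, 8/9, because algebraically sum((x-c)**2) - sum(x-c)**2/n equals the sum of squared deviations about the sample mean for any c.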

The obvious fix is to just skip the adjustment in _ss when c is explicitly given, but I'm not sure if that's the best approach.
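
A rough sketch of that obvious fix, for discussion only. The names ss_sketch and pvariance_sketch are made up for illustration, the built-in sum stands in for the private _sum, and whether this is the right change for the module is exactly the open question.

```python
from fractions import Fraction

def ss_sketch(data, c=None):
    """Sum of squared deviations about c (hypothetical, not the stdlib code)."""
    data = list(data)
    n = len(data)
    if c is not None:
        # The caller supplied the centre, so use it as given and
        # skip the adjustment that would otherwise cancel it out.
        return sum((x - c) ** 2 for x in data)
    c = sum(data) / n
    ss = sum((x - c) ** 2 for x in data)
    # The adjustment only compensates for rounding in the computed mean.
    ss -= sum((x - c) for x in data) ** 2 / n
    return ss

def pvariance_sketch(data, mu=None):
    """Population variance, optionally about a known mean mu."""
    data = list(data)
    return ss_sketch(data, mu) / len(data)

data = [Fraction(x) for x in [1, 2, 2, 2, 3, 4]]
print(pvariance_sketch(data))                  # about the sample mean: 8/9
print(pvariance_sketch(data, Fraction(5, 2)))  # about the known mu: 11/12
```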

History
Date	User	Action	Args
2014-04-09 03:14:30	steven.daprano	set	recipients: + steven.daprano
2014-04-09 03:14:30	steven.daprano	set	messageid: <1397013270.92.0.0743559681991.issue21184@psf.upfronthosting.co.za>
2014-04-09 03:14:30	steven.daprano	link	issue21184 messages
2014-04-09 03:14:30	steven.daprano	create