
classification
Title: Document formulas used in statistics
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.4, Python 3.5
process
Status: closed Resolution: works for me
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: Alextp, BreamoreBoy, ezio.melotti, rhettinger, steven.daprano
Priority: normal Keywords:

Created on 2014-03-24 03:10 by zach.ware, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (14)
msg214666 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2014-03-24 03:10
From docs@:

On Sun, Mar 23, 2014 at 5:55 PM, Alex <aaa5500 at ya.ru> wrote:
> http://docs.python.org/dev/library/statistics.html
>  
> I know math. I graduated from an institute, but in Russia. The doc doesn't show me WHAT
> FORMULAS are used for mean, median, median_low, etc. I cannot understand
> the doc. Please write formulas:
>  
> e.g. mean = sum(x[i] from i=1 to N) / N
>  
>  
> Regards
> Alex
msg214667 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-03-24 04:19
At the top of the documentation page is a link to the pure Python source code for the statistics functions.  The source for the main functions is short, readable, and clear about exactly what is being done.  The code for the helper functions like _sum() is a bit convoluted, but the basic idea is to perform basic math in a way that doesn't lose precision.
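
To illustrate the precision point, here is a minimal sketch. It does not reproduce the module's private _sum() helper; it only contrasts the built-in sum() with math.fsum() and statistics.mean() on floats that cannot be represented exactly in binary.

```python
import math
import statistics

data = [0.1] * 10

naive_total = sum(data)        # plain left-to-right float addition
exact_total = math.fsum(data)  # tracks the rounding error while summing

print(naive_total)             # 0.9999999999999999
print(exact_total)             # 1.0
print(statistics.mean(data))   # 0.1
```

The naive total drifts because each addition rounds; the other two avoid that accumulated error.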
msg214684 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-03-24 11:37
If any of the docs are unclear, I would be very happy to take suggestions to improve them. But I'm not entirely sure that the docs are the right place to show the equations. You should be able to look them up on Wikipedia or Wolfram Mathworld if you have doubts about them. Some of the functions, like mode() and the median functions, don't have equations.

I am willing to include equations where appropriate, if others think that the documentation will be enhanced by them. But so far, I am unconvinced of the need.
msg214692 - (view) Author: Alextp (Alextp) Date: 2014-03-24 16:11
I'm the author of the topic.

I suggest giving simple formulas, for example:

1) mean. 
Calculates sum of all values in iterable, divided by number of elements. 
E.g. 
mean([x1, x2, ..., xN]) = (x1 + x2 + ... + xN) / N

2) median.
Calculates the value at the middle index of the sorted iterable.
If the number of elements is even, i.e. no strict middle index exists, then the function takes the average of the two values at the two indexes nearest the middle.

E.g.
median([x1, x2, x3, x4, x5]) = x3
median([x1, x2, x3, x4, x5, x6]) = (x3 + x4) / 2

3) median_low.
Calculates the value at the middle index of the sorted iterable.
If the number of elements is even, i.e. no strict middle index exists, then the function takes the value at the nearest index below the middle.


4) median_high.
Calculates the value at the middle index of the sorted iterable.
If the number of elements is even, i.e. no strict middle index exists, then the function takes the value at the nearest index above the middle.

5) median_grouped.
(((NOTE!! I may not understand median_grouped OK)))
Calculates average of values of iterable at L given middle indexes.

E.g.
median_grouped([x1, x2, x3, x4, x5], L=3) = (x2+x3+x4)/3

NOTE: pls check this!
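
As a check on the descriptions above, here is a hedged sketch in Python. The names median_sketch and median_grouped_sketch are hypothetical, written only for this illustration. The grouped version follows the usual textbook interpolation formula (each value is treated as the midpoint of a class of width `interval`), which is an assumption about what median_grouped() is meant to compute. It is not an average of the L middle values.

```python
import statistics


def median_sketch(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)  # input does not have to arrive sorted
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_grouped_sketch(values, interval=1):
    """Median interpolated inside the class that contains the middle value."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    x = ordered[n // 2]                       # midpoint of the median class
    lower = x - interval / 2                  # lower limit of that class
    below = sum(1 for v in ordered if v < x)  # count of values below the class
    freq = ordered.count(x)                   # count of values in the class
    return lower + interval * (n / 2 - below) / freq


print(median_sketch([5, 1, 3]))                   # 3
print(median_sketch([1, 3, 5, 7]))                # 4.0
print(statistics.median_low([1, 3, 5, 7]))        # 3
print(statistics.median_high([1, 3, 5, 7]))       # 5
print(median_grouped_sketch([1, 3, 3, 5, 7]))     # 3.25
```

For odd-length data all three plain medians agree; they differ only when the number of elements is even.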
msg214695 - (view) Author: Alextp (Alextp) Date: 2014-03-24 16:30
The formula I wrote for median_grouped is not OK, but I can't get the idea from the source. THIS SHOWS that the source code is NOT OK as documentation; even a student can't understand it.
 
e.g. pvariance.
Calculates population variance of iterable. It's given by formula:

pvariance([x1, x2, ..., xN]) = ((x1 - M)**2 + ... + (xN - M)**2) / N,
where M is mean of all values:
M = (x1 + ... + xN) / N
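
The formula above translates almost directly into Python. This is a minimal sketch with a hypothetical helper name (pvariance_sketch); it uses plain float arithmetic, so it does not take the precision precautions that the statistics module does.

```python
import statistics


def pvariance_sketch(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("pvariance needs at least one value")
    m = sum(values) / n                           # M = (x1 + ... + xN) / N
    return sum((x - m) ** 2 for x in values) / n  # ((x1 - M)**2 + ...) / N


data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(pvariance_sketch(data))       # 1.25
print(statistics.pvariance(data))   # 1.25
```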
msg214705 - (view) Author: Alextp (Alextp) Date: 2014-03-24 18:23
5) pvariance.
Calculates the "population variance" of the iterable by this formula:

pvariance([x1, x2, ..., xN], M) = ((x1 - M)**2 + ... + (xN - M)**2) / N

M is an optional argument which should be the value of mean([x1, ... xN]) calculated beforehand. If the M parameter is omitted from the call, it is calculated automatically:
M = (x1 + ... + xN) / N

6) variance.
(NOTE: pls check this.)
Calculates "sample variance" from iterable. It's given by the same formula as pvariance, but not for entire iterable value set. Only subset of iterable is used for calculation. .......... (write here how this subset is taken, randomly or what..... i didn't get it from Wikipedia.)

Ok?
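
On the open question in item 6: as far as the usual definition goes, sample variance does not pick a subset. It uses every value passed in, and differs from population variance only in dividing by N - 1 instead of N, on the assumption that the data is itself a sample drawn from a larger population. A minimal sketch, with a hypothetical helper name (variance_sketch) and plain float arithmetic:

```python
import statistics


def variance_sketch(values, xbar=None):
    """Sample variance: squared deviations from the mean, divided by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    if xbar is None:
        xbar = sum(values) / n  # computed automatically when not supplied
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)

print(variance_sketch(data))
print(statistics.variance(data))
# The two variances differ only by the factor N / (N - 1).
print(statistics.pvariance(data) * n / (n - 1))
```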
msg214706 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-03-24 18:27
IMHO the docs shouldn't be cluttered with details such as this.
msg214711 - (view) Author: Alextp (Alextp) Date: 2014-03-24 19:08
Without details like these, there must be URLs to Wikipedia or Wolfram.
Ordinary users don't know how to search Wolfram.
msg218523 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-05-14 12:04
> E.g.
> median([x1, x2, x3, x4, x5]) = x3
> median([x1, x2, x3, x4, x5, x6]) = (x3 + x4) / 2

The docs seem to already contain similar examples for some of the functions (e.g. median()), but not for others (e.g. mean()).
For these, if the formula can be expressed with a simple Python equivalent (e.g. sum(values) / len(values)), I think it would be reasonable to add it.
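
A short sketch of what such a "simple Python equivalent" might look like for mean(). The equivalence is only approximate: the expression below needs a sequence with a length, while statistics.mean() also accepts an iterator.

```python
import statistics

values = [1, 2, 3, 4, 4]

simple_mean = sum(values) / len(values)

print(simple_mean)                     # 2.8
print(statistics.mean(values))         # 2.8
print(statistics.mean(iter(values)))   # 2.8, works without len()
```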
msg218639 - (view) Author: Alextp (Alextp) Date: 2014-05-16 01:41
@Ezio:
Of course, many of these functions CANNOT be expressed as simple formulas, only with some text. I showed example descriptions for almost all of them above.
msg218646 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-05-16 07:50
Do you want to propose a patch?
msg218651 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-05-16 10:17
On Fri, May 16, 2014 at 07:50:16AM +0000, Ezio Melotti wrote:

> Do you want to propose a patch?

I'm really not sure that I agree with this request. I'm currently 
sitting on the fence, undecided, about 60% against and 40% in favour of 
explicitly documenting the formulae. This is not Mathworld or Wikipedia, 
and it is easy to google for "variance" to find out what it means.

This request originally came from somebody who claimed he didn't know 
what the functions were from the names (mean, median, variance) but 
would recognise them from the formulae. Given how hard it is to 
accurately portray mathematical formulae in plain text, and how many 
different versions of the mathematical formulae there are, I don't think 
that will apply to very many people.

There's no good way to write mathematical functions *accurately* in 
ASCII text. I can write mean(L) = sum(L)/len(L), for example, that's 
quite trivial. But it's not the usual mathematical formula. If the OP 
doesn't recognise the name "mean", will he recognise that non-standard 
formula? Should the docs include μ = ∑x÷n? But even that's not quite 
accurate -- where's the subscript on the x? The reader needs to 
understand the formula, and they aren't going to get that here. They 
probably have to go read Mathworld or Wikipedia regardless.

The problem is compounded with variance. Which of these should we write?

    σ² = ∑(x - μ)² ÷ n
    s² = ∑x² ÷ n - μ²
    s[n]² = ∑(x - a)² ÷ n
    Var(X) = E[(X-μ)²]
    Var(X) = E[X²] - (E[X])²

or something else?

What do other statistics packages do? I wouldn't want to do *less* -- if 
it is common for other stats packages to show the formula, then I would 
agree we should do the same. R doesn't seem to do so:

http://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html
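
For what it is worth, the first two population-variance forms listed above (mean squared deviation, and mean of squares minus the square of the mean) are algebraically the same quantity. A small sketch that checks this with exact rational arithmetic; the sample data is made up for illustration.

```python
from fractions import Fraction
import statistics

data = [Fraction(x) for x in (1, 2, 4, 7)]  # illustrative values only
n = len(data)
mu = sum(data) / n

two_pass = sum((x - mu) ** 2 for x in data) / n     # ∑(x - μ)² ÷ n
shortcut = sum(x * x for x in data) / n - mu ** 2   # ∑x² ÷ n - μ²

print(two_pass)                    # 21/4
print(shortcut)                    # 21/4
print(statistics.pvariance(data))  # 21/4
```

With floats the shortcut form can lose accuracy when the mean is large compared with the spread, so agreement on paper does not mean the two are interchangeable in code.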
msg218652 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-05-16 10:44
From msg214692 it seems to me that Alex wants "Python-friendly" formulas or examples, rather than mathematical formulas.  Most functions seem to already have them, so I was asking for a patch to get a better idea of which functions he thinks should be improved and how.

As an example, itertools docs have simple "formulas" explaining what the function does and an example in the table at the top, and (possibly approximate) Python equivalents for most of the functions: https://docs.python.org/dev/library/itertools.html
While the Python equivalents are probably not needed here, some simple formulas/examples might be OK, but I would have to see what exactly Alex is proposing.
msg221625 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-26 17:36
Three months gone and still no patch, not that I believe one is needed.  I'm inclined to close as "won't fix", there's nothing to stop it being reopened if needed.
History
Date                 User            Action  Args
2022-04-11 14:58:00  admin           set     github: 65245
2014-06-28 01:37:32  ezio.melotti    set     status: open -> closed; resolution: works for me; stage: needs patch -> resolved
2014-06-26 18:04:52  zach.ware       set     nosy: - zach.ware
2014-06-26 17:36:13  BreamoreBoy     set     messages: + msg221625
2014-05-16 10:44:40  ezio.melotti    set     messages: + msg218652
2014-05-16 10:17:09  steven.daprano  set     messages: + msg218651
2014-05-16 07:50:16  ezio.melotti    set     messages: + msg218646
2014-05-16 01:41:58  Alextp          set     messages: + msg218639
2014-05-14 12:04:57  ezio.melotti    set     nosy: + ezio.melotti; messages: + msg218523
2014-03-24 19:08:59  Alextp          set     messages: + msg214711
2014-03-24 18:27:29  BreamoreBoy     set     nosy: + BreamoreBoy; messages: + msg214706
2014-03-24 18:23:53  Alextp          set     messages: + msg214705
2014-03-24 16:30:08  Alextp          set     messages: + msg214695
2014-03-24 16:11:36  Alextp          set     nosy: + Alextp; messages: + msg214692
2014-03-24 11:37:53  steven.daprano  set     messages: + msg214684
2014-03-24 04:19:47  rhettinger      set     nosy: + rhettinger; messages: + msg214667
2014-03-24 03:10:19  zach.ware       create