classification
Title: Improve docs for NormalDist
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Christoph.Deil, mark.dickinson, rhettinger, steven.daprano
Priority: normal Keywords: patch

Created on 2019-08-21 12:38 by Christoph.Deil, last changed 2019-08-27 07:23 by Christoph.Deil. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15486 merged rhettinger, 2019-08-25 07:50
PR 15487 merged miss-islington, 2019-08-25 07:57
Messages (9)
msg350076 - (view) Author: Christoph Deil (Christoph.Deil) Date: 2019-08-21 12:38
I saw that Python 3.8 will add a NormalDist class:
https://docs.python.org/3.8/library/statistics.html#normaldist-objects

Personally I don't see the value of adding this to the Python standard lib. The natural progression would be to extend and extend, but in the end only duplicate what already exists in scientific Python packages.
But Ok, I guess this is not up for debate any more?

I'd like to make a specific comment on NormalDist.overlap.
The rest of NormalDist is very standard, but that method is an oddball.
My suggestion is to remove it or to improve the documentation.

Current docstring: https://github.com/python/cpython/blob/44f2c096804e8e3adc09400a59ef9c9ae843f339/Lib/statistics.py#L959-L991

And this docs example:
https://github.com/python/cpython/commit/318d537daabf2bd5f781255c7e25bfce260cf227#diff-d436928bc44b5d7c40a8047840f55d35R620-R629


> What percentage of men and women will have the same height in `two normally
distributed populations with known means and standard deviations
<http://www.usablestats.com/lessons/normal>`_?

50.3%

This statement doesn't make sense to me. No two people have exactly the same height, so I think the answer to this question should be 0%.

Using

n = 100_000; sum(m > w for m, w in zip(men.samples(n), women.samples(n))) / n

I see that for 82% of random (man, woman) pairs the man will be taller. That's another measure, but still, stating that 50% of men and women have the same height is confusing.
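
Both figures quoted above can be reproduced with a short sketch. The parameters `NormalDist(70, 4)` for men and `NormalDist(65, 3.5)` for women are an assumption here (this message only links to the docs example); they are used because they give roughly 50.3% for the overlap and roughly 82% for the pairwise comparison.

```python
from statistics import NormalDist

# Assumed parameters (heights in inches); not stated in this message.
men = NormalDist(70, 4)
women = NormalDist(65, 3.5)

# Overlapping coefficient: the area under min(pdf_men, pdf_women).
ovl = men.overlap(women)

# Probability that a randomly chosen man is taller than a randomly chosen
# woman. The difference of two independent normals is itself normal.
diff = men - women
p_man_taller = 1 - diff.cdf(0)

print(f"overlap       = {ovl:.1%}")
print(f"P(man taller) = {p_man_taller:.1%}")
```

The two numbers answer different questions: the first is a shared area between two density curves, the second is a probability about random pairs. Neither is the fraction of people with "the same height".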

Note that there is a multitude of PDF overlap measures different from this min(pdf1, pdf2) that I think are much more common in statistics and the physical sciences:
- https://en.wikipedia.org/wiki/Hellinger_distance
- https://arxiv.org/pdf/1407.7172.pdf
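
For comparison, the Hellinger distance between two univariate normal distributions has a closed form. The helper name `hellinger` and the example parameters below are hypothetical; this is only a sketch of the alternative measure, not something the statistics module provides.

```python
from math import exp, sqrt
from statistics import NormalDist

def hellinger(a: NormalDist, b: NormalDist) -> float:
    """Hellinger distance between two univariate normal distributions."""
    var_sum = a.variance + b.variance
    # Bhattacharyya coefficient for two normal densities.
    bc = sqrt(2 * a.stdev * b.stdev / var_sum) * exp(
        -((a.mean - b.mean) ** 2) / (4 * var_sum)
    )
    # Clamp tiny negative rounding errors before taking the square root.
    return sqrt(max(0.0, 1.0 - bc))

# Hypothetical example distributions.
print(hellinger(NormalDist(70, 4), NormalDist(65, 3.5)))
```

The result is 0 for identical distributions and approaches 1 as they separate, so it runs in the opposite direction to an overlap measure.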

And note that the references that are given currently are weird (basic statistics textbooks would be appropriate references IMO, or open references like Wikipedia)
- slides: http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf
- implementation code comment points to http://dx.doi.org/10.1080/03610928908830127 which is behind a paywall

Why add this one overlap measure and expose it under the "overlap" method name?

My suggestion would be to be conservative and remove that method again before releasing it in 3.8. The docs could instead reference existing third-party packages (e.g. scipy or the uncertainties package) that offer further functionality, such as handling correlations or multi-dimensional distributions. For this change I'd be happy to send a PR any time.

Raymond and others interested in this topic - thoughts?

(note: I wrote a MultiNorm class prototype last year at https://github.com/cdeil/multinorm/blob/master/multinorm.py and now wanted to rewrite it and try to find a good API and thus was interested in this NormalDist class and what functionality it offers)
msg350094 - (view) Author: Christoph Deil (Christoph.Deil) Date: 2019-08-21 16:36
The Monte Carlo example here has completely unstable results:

https://github.com/python/cpython/commit/cc353a0cd95d9b0c93ed0b60ba762427a94c790d#diff-d436928bc44b5d7c40a8047840f55d35R633

If you run it multiple times, you will see that `mean` is relatively stable, but `stddev` varies from 10 to 50 to 100. The reason is that in the model there's a division by z, and the z distribution used has values arbitrarily close to zero:

>>> NormalDist(5, 1.25).cdf(0) * 100_000
3.16

I suggest changing to a Monte Carlo sampling example that is less pathological and doesn't involve division by values near zero, e.g. by changing the mean of z to 50, reducing the stddev to 0.125, or some similar change in parameters.
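
The instability can be illustrated with a sketch like the following. The `model` function and the X and Y parameters are assumed from the docs example being discussed; the helper `stdev_per_seed`, the seed values, and the reduced sample size are hypothetical choices made to keep the run short.

```python
from statistics import NormalDist

def model(x, y, z):
    # Assumed model from the docs example; note the division by z.
    return (3*x + 7*x*y - 5*y) / (11 * z)

def stdev_per_seed(z_dist, n=20_000, seeds=range(5)):
    """Return the sample stddev of the model output, one value per seed."""
    results = []
    for seed in seeds:
        X = NormalDist(10, 2.5).samples(n, seed=seed)
        Y = NormalDist(15, 1.75).samples(n, seed=seed + 1000)
        Z = z_dist.samples(n, seed=seed + 2000)
        results.append(NormalDist.from_samples(map(model, X, Y, Z)).stdev)
    return results

# Expected number of z samples at or below zero in a run of 100_000.
expected_nonpositive = NormalDist(5, 1.25).cdf(0) * 100_000

unstable = stdev_per_seed(NormalDist(5, 1.25))   # z can come close to zero
stable = stdev_per_seed(NormalDist(50, 1.25))    # z stays far from zero

print(f"expected z <= 0 per 100_000 samples: {expected_nonpositive:.2f}")
print("stddev with z ~ N(5, 1.25): ", [round(s, 2) for s in unstable])
print("stddev with z ~ N(50, 1.25):", [round(s, 4) for s in stable])
```

Comparing the largest and smallest stddev in each list shows how much the estimate depends on the seed when z is allowed near zero.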

In stats or machine learning books and docs (e.g. for statsmodels or scikit-learn), methods that involve random numbers usually have the seed set to a fixed value so that results and docs are reproducible. I suggest making that change here as well.
msg350100 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-21 17:39
Several thoughts:

* OVL was used often in the finance firm where I worked.
* It provides a simple, easy to understand point estimate
  of the similarity or overlap between two PDFs.
* It was far easier to use than a Student's t-test to answer
  the question of how similar two normal distributions
  are and it is more precise than the more common technique
  of just running overlapping plots and doing it by eye.
* It isn't easy for end-users to do this themselves
  without running an integration.
* It is well defined and well motivated:  

  See: "The overlapping coefficient as a measure of agreement
  between probability distributions and point estimation of the
  overlap of two normal densities" -- Henry F. Inman and 
  Edwin L. Bradley Jr
  http://dx.doi.org/10.1080/03610928908830127

  See also: https://www.rasch.org/rmt/rmt101r.htm

  And: http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf
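
To make the point about integration concrete, here is a rough sketch of what a user would otherwise have to write. The function `overlap_by_integration`, the integration bounds, the step count, and the example parameters are all hypothetical; it only shows that the area under min(pdf1, pdf2) agrees with `overlap()`.

```python
from statistics import NormalDist

def overlap_by_integration(a, b, steps=20_000):
    """Approximate the area under min(a.pdf, b.pdf) with the midpoint rule."""
    # Integrate over ten standard deviations on either side of both means.
    lo = min(a.mean - 10 * a.stdev, b.mean - 10 * b.stdev)
    hi = max(a.mean + 10 * a.stdev, b.mean + 10 * b.stdev)
    dx = (hi - lo) / steps
    return sum(
        min(a.pdf(lo + (i + 0.5) * dx), b.pdf(lo + (i + 0.5) * dx)) * dx
        for i in range(steps)
    )

n1 = NormalDist(2.4, 1.6)
n2 = NormalDist(3.2, 2.0)
print(overlap_by_integration(n1, n2))  # numeric approximation
print(n1.overlap(n2))                  # closed-form result from the method
```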

Perhaps the wording can be improved on the male/female height example.  Measured to finite precision, perhaps to the nearest centimeter, there will be overlaps.  This is the same kind of binning done with chi-square tests to compare how well two distributions match.

AFAICT, this tool is well-defined, tested, and has legitimate use cases that are not easy to achieve in other ways using only standard library tools.
msg350104 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-21 19:45
BTW, I get your concern about the statistics module as a whole.  From the point of view of an expert numpy/scipy user, the whole module seems pointless.  However, the purpose of the module is to put a useful subset of statistical tools into the hands of everyday Python users who aren't part of that numeric ecosystem (think of the same people who use MS Excel as part of this group).  The module doesn't require extra pip installation, an Anaconda distribution, or even knowledge of array broadcasting and whatnot.

For the past few months, I've been user testing the new components of the statistics module and have had good success.  Some of the examples in the docs were born from those interactions.

I also get your concern about what is usually found in statistics textbooks; however, those books tend to cover a wide range of distributions, include proofs, and heavily weight hypothesis testing.  Typically, little space is given to descriptive statistics, q-q plots, or other things that are handy in day-to-day practice.

The NormalDist class encapsulates a lot of knowledge that is easily forgotten: that variances are additive, how to translate and rescale, and that a constant divided by a normal distribution doesn't give another normal distribution.  I've tried this out on users who are not otherwise mathematically inclined, and they've found it useful and intuitive.  In contrast, the scipy ecosystem presumes much more sophistication.
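
A small sketch of the arithmetic referred to above. The parameter values are illustrative assumptions, not taken from this thread.

```python
from statistics import NormalDist

# Sums of independent normal variables: means add and variances add,
# so the standard deviations combine as sqrt(3**2 + 4**2) == 5.
a = NormalDist(10, 3)
b = NormalDist(20, 4)
total = a + b

# Translating and rescaling: converting Celsius to Fahrenheit scales both
# the mean and the standard deviation, then shifts only the mean.
celsius = NormalDist(5, 2.5)
fahrenheit = celsius * (9 / 5) + 32

# A constant divided by a normal distribution is not normal, and
# NormalDist does not define that operation.
try:
    1 / celsius
    reciprocal_supported = True
except TypeError:
    reciprocal_supported = False

print(total)
print(fahrenheit)
print(reciprocal_supported)
```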
msg350105 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-21 19:52
> Raymond and others interested in this topic - thoughts?

Please do submit a PR with an improved example for the Monte Carlo simulation.  I'm not fond of that example at all.  It should be as short as possible while getting the core idea across.  But it should be something that doesn't have a simple analytic solution so as to motivate the concept.  Go ahead and use a fixed numeric seed as well.
msg350437 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-25 07:57
New changeset 8371799e300475c8f9f967e900816218d3500e5d by Raymond Hettinger in branch 'master':
bpo-37905: Improve docs for NormalDist (GH-15486)
https://github.com/python/cpython/commit/8371799e300475c8f9f967e900816218d3500e5d
msg350438 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-25 08:04
New changeset 970548c00b366dcb8eb0c2bec0ffcab30ba03aee by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
bpo-37905: Improve docs for NormalDist (GH-15486) (GH-15487)
https://github.com/python/cpython/commit/970548c00b366dcb8eb0c2bec0ffcab30ba03aee
msg350439 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-25 08:06
* Removed external links in overlap() docs
* Removed "same heights" example
* Chose more stable parameters for the Monte Carlo model
* Made the example reproducible by recording a seed value
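
As a minimal sketch of the last point: passing `seed` to `samples()` makes repeated runs return the same values. The distribution parameters and the seed value below are assumptions for illustration, not the ones recorded in the merged docs.

```python
from statistics import NormalDist

z = NormalDist(50, 1.25)  # assumed "more stable" parameters

# The same seed yields the same samples on the same Python version.
first = z.samples(5, seed=12345)    # arbitrary illustrative seed
second = z.samples(5, seed=12345)

# Without a seed, each call draws fresh values.
unseeded = z.samples(5)

print(first == second)
print(unseeded)
```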
msg350612 - (view) Author: Christoph Deil (Christoph.Deil) Date: 2019-08-27 07:23
Thank you, Raymond!
History
Date User Action Args
2019-08-27 07:23:09 Christoph.Deil set messages: + msg350612
2019-08-25 08:06:03 rhettinger set status: open -> closed; resolution: fixed; messages: + msg350439; stage: patch review -> resolved
2019-08-25 08:04:28 rhettinger set messages: + msg350438
2019-08-25 07:57:38 miss-islington set pull_requests: + pull_request15174
2019-08-25 07:57:30 rhettinger set messages: + msg350437
2019-08-25 07:50:39 rhettinger set keywords: + patch; stage: patch review; pull_requests: + pull_request15173
2019-08-21 20:14:30 rhettinger set title: Improve docs for NormalDist.overlap() -> Improve docs for NormalDist
2019-08-21 19:52:16 rhettinger set messages: + msg350105; components: + Documentation, - Library (Lib); title: Remove NormalDist.overlap() or improve documentation? -> Improve docs for NormalDist.overlap()
2019-08-21 19:45:07 rhettinger set messages: + msg350104
2019-08-21 17:39:39 rhettinger set messages: + msg350100
2019-08-21 17:17:36 rhettinger set assignee: rhettinger
2019-08-21 16:36:38 Christoph.Deil set messages: + msg350094
2019-08-21 14:04:16 mark.dickinson set nosy: + mark.dickinson
2019-08-21 14:00:06 xtreak set nosy: + steven.daprano
2019-08-21 12:38:54 Christoph.Deil create