Issue 37905: Improve docs for NormalDist

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/82086

classification

Title:	Improve docs for NormalDist
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 3.8

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	Christoph.Deil, mark.dickinson, rhettinger, steven.daprano
Priority:	normal	Keywords:	patch

Created on 2019-08-21 12:38 by Christoph.Deil, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked
PR 15486	merged	rhettinger, 2019-08-25 07:50
PR 15487	merged	miss-islington, 2019-08-25 07:57

Messages (9)
msg350076 - (view)	Author: Christoph Deil (Christoph.Deil)	Date: 2019-08-21 12:38
I saw that Python 3.8 will add a NormalDist class: https://docs.python.org/3.8/library/statistics.html#normaldist-objects

Personally I don't see the value of adding this to the Python standard library. The natural progression would be to keep extending it, which in the end only duplicates what already exists in scientific Python packages. But OK, I guess this is not up for debate any more?

I'd like to make a specific comment on NormalDist.overlap. The rest of NormalDist is very standard, but that method is an oddball. My suggestion is to remove it or to improve the documentation.

Current docstring: https://github.com/python/cpython/blob/44f2c096804e8e3adc09400a59ef9c9ae843f339/Lib/statistics.py#L959-L991

And this docs example: https://github.com/python/cpython/commit/318d537daabf2bd5f781255c7e25bfce260cf227#diff-d436928bc44b5d7c40a8047840f55d35R620-R629

> What percentage of men and women will have the same height in `two normally distributed populations with known means and standard deviations <http://www.usablestats.com/lessons/normal>`_? 50.3%

This statement doesn't make sense to me. No two people have exactly the same height, so I think the answer to this question should be 0%. Using

    n = 100_000; sum(m > w for m, w in zip(men.samples(n), women.samples(n))) / n

I see that in 82% of random (man, woman) pairings the man is taller. That's a different measure, but still, stating that 50% of men and women have the same height is confusing.

Note that there is a multitude of PDF overlap measures different from this min(pdf1, pdf2) that I think are much more common in statistics and the physical sciences:
- https://en.wikipedia.org/wiki/Hellinger_distance
- https://arxiv.org/pdf/1407.7172.pdf

Also note that the references currently given are odd (basic statistics textbooks would be appropriate references IMO, or open references like Wikipedia):
- slides: http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf
- the implementation code comment points to http://dx.doi.org/10.1080/03610928908830127, which is behind a paywall

Why add this one overlap measure and expose it under the "overlap" method name? My suggestion would be to be conservative and remove that method again before releasing it in 3.8. The docs could instead reference existing third-party packages (e.g. scipy or the uncertainties package) that offer further functionality, such as handling correlations or multi-dimensional distributions. I'd be happy to send a PR for this change any time.

Raymond and others interested in this topic - thoughts?

(Note: I wrote a MultiNorm class prototype last year at https://github.com/cdeil/multinorm/blob/master/multinorm.py, and I now want to rewrite it and find a good API, which is why I was interested in this NormalDist class and the functionality it offers.)
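
The two measures contrasted in the message above can be computed side by side without sampling. This is a minimal sketch, not part of the original thread: the message never defines `men` and `women`, so the parameters below (heights in inches) are an assumption chosen to be consistent with the 50.3% and 82% figures quoted there.

```python
from statistics import NormalDist

# Assumed parameters (heights in inches); the thread does not state them.
men = NormalDist(70, 4)
women = NormalDist(65, 3.5)

# Overlapping coefficient: the area under min(pdf_men, pdf_women).
ovl = men.overlap(women)

# Probability that a randomly chosen man is taller than a randomly chosen
# woman. The difference of two independent normal variables is itself
# normal, so the answer comes straight from its cdf.
diff = men - women
p_man_taller = 1.0 - diff.cdf(0.0)

print(f"overlap coefficient: {ovl:.3f}")
print(f"P(man taller):       {p_man_taller:.3f}")
```

The overlap coefficient describes how much the two density curves share, while the second number answers a question about paired individuals, which is why the two values differ.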
msg350094 - (view)	Author: Christoph Deil (Christoph.Deil)	Date: 2019-08-21 16:36
The Monte Carlo example here has completely unstable results: https://github.com/python/cpython/commit/cc353a0cd95d9b0c93ed0b60ba762427a94c790d#diff-d436928bc44b5d7c40a8047840f55d35R633

If you run it multiple times, you will see that `mean` is relatively stable, but `stddev` varies from 10 to 50 to 100. The reason is that the model divides by z, and the z distribution used has values arbitrarily close to zero:

    >>> NormalDist(5, 1.25).cdf(0) * 100_000
    3.16

I suggest changing to a Monte Carlo sampling example that isn't as pathological and doesn't divide by values near zero, e.g. change the mean of z to 50, or reduce the stddev to 0.125, or make some similar change in parameters.

In stats or machine learning books and docs (e.g. statsmodels or scikit-learn), examples that involve random numbers usually set the seed to a fixed value so that the results and the docs are reproducible. I suggest making that change here as well.
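
The instability described above can be reproduced with a much smaller model. The sketch below is an illustration only: `model` and `simulated_stdev` are hypothetical helpers, not the functions from the docs example, and the distribution of x is an assumption. Only the z parameters (mean 5, stddev 1.25, and the suggested mean of 50) come from the message.

```python
from statistics import NormalDist, stdev

def model(x, z):
    # Hypothetical model: any expression that divides by z is sensitive
    # to samples of z that land close to zero.
    return x / z

def simulated_stdev(z_dist, seed, n=100_000):
    # Fixed seeds make every run reproducible. x and z use different
    # seeds so that they do not share one random stream.
    xs = NormalDist(10, 2.5).samples(n, seed=seed)
    zs = z_dist.samples(n, seed=seed + 1)
    return stdev([model(x, z) for x, z in zip(xs, zs)])

risky = NormalDist(5, 1.25)    # z as in the message: some mass near zero
safer = NormalDist(50, 1.25)   # suggested fix: move the mean away from zero

# Expected number of draws at or below zero in 100,000 samples of z.
expected_at_or_below_zero = risky.cdf(0) * 100_000

results = {}
for seed in (10, 20, 30):
    results[seed] = (simulated_stdev(risky, seed), simulated_stdev(safer, seed))
    print(seed, results[seed])
```

With the safer parameters the estimated standard deviation should barely move between seeds, whereas the risky parameters leave it at the mercy of the few samples where z is tiny.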
msg350100 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-21 17:39
Several thoughts:

* OVL was used often in the finance firm where I worked.
* It provides a simple, easy to understand point estimate of the similarity or overlap between two PDFs.
* It was far easier to use than a Student's t-test to answer the question of how similar two normal distributions are, and it is more precise than the more common technique of just drawing overlapping plots and judging by eye.
* It isn't easy for end-users to do this themselves without running an integration.
* It is well defined and well motivated.

See: "The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities" by Henry F. Inman and Edwin L. Bradley Jr, http://dx.doi.org/10.1080/03610928908830127

See also: https://www.rasch.org/rmt/rmt101r.htm

And: http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf

Perhaps the wording can be improved on the male/female height example. Measured to finite precision, perhaps to the nearest centimeter, there will be overlaps. This is the same kind of binning done with chi-square tests to compare how well two distributions match.

AFAICT, this tool is well-defined, tested, and has legitimate use cases that are not easy to achieve in other ways using only standard library tools.
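
To illustrate the point about integration, here is a rough sketch of what a user would otherwise have to write by hand, compared against `NormalDist.overlap()`. The helper `overlap_by_integration`, its step count, its integration range, and the example parameters are all assumptions made for this illustration.

```python
from statistics import NormalDist

def overlap_by_integration(a, b, steps=20_000):
    # Midpoint-rule estimate of the area under min(pdf_a, pdf_b).
    # The range spans 8 standard deviations on each side, which holds
    # essentially all of the probability mass of both curves.
    lo = min(a.mean - 8 * a.stdev, b.mean - 8 * b.stdev)
    hi = max(a.mean + 8 * a.stdev, b.mean + 8 * b.stdev)
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += min(a.pdf(x), b.pdf(x))
    return total * width

# Arbitrary example parameters, chosen only for illustration.
first = NormalDist(2.4, 1.6)
second = NormalDist(3.2, 2.0)

numeric = overlap_by_integration(first, second)
closed_form = first.overlap(second)

print(f"numerical integration: {numeric:.5f}")
print(f"NormalDist.overlap():  {closed_form:.5f}")
```

The two results should agree closely, with the method call replacing the loop, the choice of range, and the choice of step size.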
msg350104 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-21 19:45
BTW, I get your concern about the statistics module as a whole. From the point of view of an expert numpy/scipy user, the whole module seems pointless. However, the purpose of the module is to put a useful subset of statistical tools into the hands of everyday Python users who aren't part of that numeric ecosystem (think of the people who use MS Excel as part of this group). The module doesn't require extra pip installation, an Anaconda distribution, or even knowledge of array broadcasting and whatnot.

For the past few months, I've been user testing the new components of the statistics module and have had good success. Some of the examples in the docs were born from those interactions.

I also get your concern about what is usually found in statistics textbooks; however, those books tend to cover a wide range of distributions, include proofs, and heavily weight hypothesis testing. Typically, little space is given to descriptive statistics, q-q plots, or other things that are handy in day-to-day practice.

The NormalDist class encapsulates a lot of knowledge that is easily forgotten: that variances are additive, how to translate and rescale, and that a constant divided by a normal distribution doesn't give another normal distribution. I've tried this out on users who are otherwise not mathematically inclined, and they've found it to be useful and intuitive. In contrast, the scipy ecosystem presumes much more sophistication.
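
The "easily forgotten" rules mentioned above look like this in practice. This is a short sketch with made-up parameters; it shows only the addition of independent normal variables and translation/rescaling by constants.

```python
from statistics import NormalDist

# Sum of two independent normal variables: the means add and the
# variances add, so the standard deviations combine as a hypotenuse.
a = NormalDist(10, 3)
b = NormalDist(20, 4)
total = a + b                      # stdev is sqrt(3**2 + 4**2) = 5

# Translating and rescaling by constants, here Celsius to Fahrenheit:
# the mean is scaled and shifted, the stdev is only scaled.
celsius = NormalDist(5, 2.5)
fahrenheit = celsius * (9 / 5) + 32

print(total)
print(fahrenheit)
```

The standard deviations 3 and 4 do not simply add to 7. It is the variances, 9 and 16, that add to 25.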
msg350105 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-21 19:52
> Raymond and others interested in this topic - thoughts?

Please do submit a PR with an improved example for the Monte Carlo simulation. I'm not fond of that example at all. It should be as short as possible while getting the core idea across, but it should be something that doesn't have a simple analytic solution, so as to motivate the concept. Go ahead and use a fixed numeric seed as well.
msg350437 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-25 07:57
New changeset 8371799e300475c8f9f967e900816218d3500e5d by Raymond Hettinger in branch 'master': bpo-37905: Improve docs for NormalDist (GH-15486) https://github.com/python/cpython/commit/8371799e300475c8f9f967e900816218d3500e5d
msg350438 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-25 08:04
New changeset 970548c00b366dcb8eb0c2bec0ffcab30ba03aee by Raymond Hettinger (Miss Islington (bot)) in branch '3.8': bpo-37905: Improve docs for NormalDist (GH-15486) (GH-15487) https://github.com/python/cpython/commit/970548c00b366dcb8eb0c2bec0ffcab30ba03aee
msg350439 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-25 08:06
* Removed external links in overlap() docs
* Removed "same heights" example
* Chose more stable parameters for the Monte Carlo model
* Made the example reproducible by recording a seed value
msg350612 - (view)	Author: Christoph Deil (Christoph.Deil)	Date: 2019-08-27 07:23
Thank you, Raymond!

History
Date	User	Action	Args
2022-04-11 14:59:19	admin	set	github: 82086
2019-08-27 07:23:09	Christoph.Deil	set	messages: + msg350612
2019-08-25 08:06:03	rhettinger	set	status: open -> closed resolution: fixed messages: + msg350439 stage: patch review -> resolved
2019-08-25 08:04:28	rhettinger	set	messages: + msg350438
2019-08-25 07:57:38	miss-islington	set	pull_requests: + pull_request15174
2019-08-25 07:57:30	rhettinger	set	messages: + msg350437
2019-08-25 07:50:39	rhettinger	set	keywords: + patch stage: patch review pull_requests: + pull_request15173
2019-08-21 20:14:30	rhettinger	set	title: Improve docs for NormalDist.overlap() -> Improve docs for NormalDist
2019-08-21 19:52:16	rhettinger	set	messages: + msg350105 components: + Documentation, - Library (Lib) title: Remove NormalDist.overlap() or improve documentation? -> Improve docs for NormalDist.overlap()
2019-08-21 19:45:07	rhettinger	set	messages: + msg350104
2019-08-21 17:39:39	rhettinger	set	messages: + msg350100
2019-08-21 17:17:36	rhettinger	set	assignee: rhettinger
2019-08-21 16:36:38	Christoph.Deil	set	messages: + msg350094
2019-08-21 14:04:16	mark.dickinson	set	nosy: + mark.dickinson
2019-08-21 14:00:06	xtreak	set	nosy: + steven.daprano
2019-08-21 12:38:54	Christoph.Deil	create