classification
Title: Add a Normal Distribution class to the statistics module
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: davin, mark.dickinson, miss-islington, rhettinger, selik, steven.daprano, tim.peters, xtreak
Priority: normal Keywords: patch

Created on 2019-02-18 00:00 by rhettinger, last changed 2019-07-22 00:23 by steven.daprano. This issue is now closed.

Files
File name Uploaded Description Edit
gauss.py rhettinger, 2019-02-18 00:00 NormalDist class
gauss_demo.py rhettinger, 2019-02-18 00:52 Examples
normdist_22feb2019.diff rhettinger, 2019-02-22 16:42 Full diff as of 22 Feb 2019
Pull Requests
URL Status Linked Edit
PR 11973 merged rhettinger, 2019-02-21 10:57
PR 12009 merged rhettinger, 2019-02-24 05:59
PR 12022 merged rhettinger, 2019-02-24 19:25
PR 12096 merged rhettinger, 2019-02-28 16:54
PR 12114 merged rhettinger, 2019-03-01 05:42
PR 12921 merged rhettinger, 2019-04-23 08:27
PR 13021 merged rhettinger, 2019-04-30 06:07
PR 13047 merged rhettinger, 2019-05-02 00:29
PR 14898 merged rhettinger, 2019-07-21 23:20
Messages (29)
msg335792 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-18 00:00
Attached is a class that I've found useful for doing practical statistics work with normal distributions.  It provides a nice, high-level API that makes short-work of everyday statistical problems.

------ Examples --------

# Simple scaling and translation
temperature_february = NormalDist(5, 2.5)            # Celsius
print(temperature_february * (9/5) + 32)             # Fahrenheit


# Classic probability problems
# https://blog.prepscholar.com/sat-standard-deviation
# The mean score on a SAT exam is 1060 with a standard deviation of 195
# What percentage of students score between 1100 and 1200?
sat = NormalDist(1060, 195)
fraction = sat.cdf(1200) - sat.cdf(1100)
print(f'{fraction * 100 :.1f}% score between 1100 and 1200')


# Combination of normal distributions by summing variances
birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
drug_effects = NormalDist(0.4, 0.15)
print(birth_weights + drug_effects)


# Statistical calculation estimates using simulations
# Estimate the distribution of X * Y / Z
n = 100_000
X = NormalDist(350, 15).examples(n)
Y = NormalDist(47, 17).examples(n)
Z = NormalDist(62, 6).examples(n)
print(NormalDist.from_samples(x * y / z for x, y, z in zip(X, Y, Z)))


# Naive Bayesian Classifier
# https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification

height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
weight_male = NormalDist.from_samples([180, 190, 170, 165])
weight_female = NormalDist.from_samples([100, 150, 130, 150])
foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

prior_male = 0.5
prior_female = 0.5
posterior_male = prior_male * height_male.pdf(6) * weight_male.pdf(130) * foot_size_male.pdf(8)
posterior_female = prior_female * height_female.pdf(6) * weight_female.pdf(130) * foot_size_female.pdf(8)
print('Predict', 'male' if posterior_male > posterior_female else 'female')
msg335876 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-18 23:42
I like this idea!

Should the "examples" method be re-named "samples"? That's the word used in the docstring, and it matches the from_samples method.
msg336008 - (view) Author: Michael Selik (selik) * Date: 2019-02-19 19:38
+1, This would be useful for quick analyses, avoiding the overhead of installing scipy and looking through its documentation.

Given that it's in the statistics namespace, I think the name can be simply ``Normal`` rather than ``NormalDist``.  Also, instead of ``.from_examples`` consider naming the classmethod ``.fit``.
msg336029 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-20 00:33
I'll work up a PR for this.  

We can continue to tease-out the best method names. I've has success with "examples" and "from_samples" when developing this code in the classroom.  Both names had the virtue of being easily understood and never being misunderstood.

Intellectually, the name fit() makes sense because we are using data to create best fit model parameters.  So, technically this is probably the most accurate terminology.  However, it doesn't match how I think about the problem though -- that is more along the lines of "use sampling data to make a random variable with a normal distribution".  Another minor issue is that class methods are typically (but not always) recognizable by their from-prefix (e.g. dict.fromkeys, datetime.fromtimestamp, etc).

"NormalDist" seems more self explanatory to me that just "Normal".  Also, the noun form seems "more complete" than a dangling adjective (reading "normal" immediately raises the question "normal what?").  FWIW, MS Excel also calls their variant NORM.DIST (formerly spelled without the dot).
msg336285 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-22 08:36
Okay the PR is ready.

If you all are mostly comfortable with it, it would great to get this in for the second alpha so that people have a chance to work with it.
msg336295 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-22 11:39
Thanks Raymond.

Apologies for commenting here instead of at the PR.

While I've been fighting with more intermittedly broken than usual 
internet access, Github has stopped supporting my browser. I can't 
upgrade the browser without upgrading the OS, and I can't upgrade the OS 
without new hardware, and that will take money I don't have at the moment.

So the bottom line is that while I can read *part* of the diffs on 
Github, that's about all I can do. I can't comment there, I can't fork, 
I can't make push requests, half the pages don't load for me and the 
other half don't work properly when they do load. I can't even do a git 
clone.

So right now, the only thing I can do is comment on your extensive 
documentation in statistics.rst. That's very nicely done.

The only thing that strikes me as problematic is the default value for 
sigma, namely 0.0. The PDF for normal curve divides by sigma, so if 
that's zero, things are undefined. So I think that sigma ought to be 
strictly positive.

I also think it would be nice to default to the standard normal curve, 
with mu=0.0 and sigma=1.0. That will make it easy to work with Z scores.

Thanks again for this class, and my apologies for my inability to 
follow the preferred workflow.
msg336319 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-22 14:57
I've made both suggested changes, "examples"->"samples" and set the defaults to the standard normal distribution.

To bypass Github, I've attached a diff to this tracker issue.  Let me know what you think :-)
msg336337 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2019-02-22 17:44
@steven.daprano Bit off topic but you can also append .patch in the PR URL to generate patch file with all the commits made in the PR up to latest commit and .diff provides the current diff against master. They are plain text and can be downloaded through wget and viewed with an editor in case if it helps.

https://github.com/python/cpython/pull/11973.patch
https://github.com/python/cpython/pull/11973.diff
msg336376 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-23 09:42
Thanks for all the positive feedback.  If there are no objections, I would like to push this so it will be in the second alpha release so that it can get exercised.  We can still make adjustments afterwards.
msg336413 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-23 22:44
New changeset 11c79531655a4aa3f82c20ff562ac571f40040cc by Raymond Hettinger in branch 'master':
bpo-36018: Add the NormalDist class to the statistics module (GH-11973)
https://github.com/python/cpython/commit/11c79531655a4aa3f82c20ff562ac571f40040cc
msg336414 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-23 22:46
Okay, it's in for the second alpha.  Please continue to make API or implementation suggestions.  Nothing is set in stone.
msg336417 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2019-02-23 23:37
There is an inconsistency worth paying attention to in the choice of names of the input parameters.

Currently in the statistics module, pvariance() accepts a parameter named "mu" and pstdev() and variance() each accept a parameter named "xbar".  The docs describe both "mu" and "xbar" as "it should be the mean of data".  I suggest it is worth rationalizing the names used within the statistics module for consistency before reusing "mu" or "xbar" or anything else in NormalDist.

Using the names of mathematical symbols that are commonly used to represent a concept is potentially confusing because those symbols are not always *universally* used.  For example, students are often introduced to new concepts in introductory mathematics texts where concepts such as "mean" appear in formulas and equations not as "mu" but as "xbar" or simply "m" or other simple (and hopefully "friendly") names/symbols.  As a mathematician, if I am told a variable is named, "mu", I still feel the need to ask what it represents.  Sure, I can try guessing based upon context but I will usually have more than one guess that I could make.

Rather than continue down a path of using various mathematical-symbols-written-out-in-English-spelling, one alternative would be to use less ambiguous, more informative variable names such as "mean".  It might be worth considering a change to the parameter names of "mu" and "sigma" in NormalDist to names like "mean" and "stddev", respectively.  Or perhaps "mean" and "standard_deviation".  Or perhaps "mean" and "variance" would be easier still (recognizing that variance can be readily computed from standard deviation in this particular context).  In terms of consistency with other packages that users are likely to also use, scipy.stats functions/objects commonly refer to these concepts as "mean" and "var".

I like the idea of making NormalDist readily approachable for students as well as those more familiar with these concepts.  The offerings in scipy.stats are excellent but they are not always the most approachable things for new students of statistics.
msg336422 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-24 03:17
Karthikeyan: thanks for the hint about Github.

Raymond: thanks for the diff. Some comments:

Why use object.__setattr__(self, 'mu', mu) instead of self.mu = mu in the __init__ method?

Should __pos__ return a copy rather than the instance itself?

The rest looks good to me, and I look forward to using it.
msg336425 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-24 03:41
Davin: the chice of using mu versus xbar was deliberate, as they represent different quantities: the population mean versus a sample mean. But reading over the docs with fresh eyes, I can now see that the distinction is not as clear as I intended.

I think that changing the names now would be a breaking change, but even if it wasn't, I don't want to change the names. The distinction between population parameters (mu) and sample statistics (xbar) is important and I think the function parameters should reflect that.

As for the new NormalDist class, we aren't limited by backwards compatibility, but I would still argue for the current names mu and sigma. As well as matching the population parameters of the distribution, they also matches the names used in calculators such as the TI Nspire and Casio Classpad (two very popular CAS calculators used by secondary school students).

See #36099. If you would like to suggest some doc changes, please feel free to do so.
msg336433 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2019-02-24 05:26
Steven: Your point about population versus sample makes sense and your point that altering their names would be a breaking change is especially important.  I think that pretty well puts an end to my suggestion of alternative names and says the current pattern should be kept with NormalDist.

I particularly like the idea of using the TI Nspire and Casio Classpad to guide or help confirm what symbols might be recognizable to secondary students or 1st year university students.


Raymond: As an idea for examples demonstrating the code, what about an example where a plot of pdf is created, possibly for comparison with cdf?  This would require something like matplotlib but would help to visually communicate the concepts of pdf, perhaps with different sigma values?
msg336435 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-24 05:56
> Why use object.__setattr__(self, 'mu', mu) instead of 
> self.mu = mu in the __init__ method?

The idea was the instances should be immutable and hashable, but this added unnecessary complexity, so I took this out prior to the check in.

> Should __pos__ return a copy rather than the instance itself?

Yes.  I'll fix that straight-way.

^ The chice of using mu versus xbar was deliberate

I concur with that choice and also prefer to stick with mu and sigma:

1) It's too late to change it elsewhere in statistics and the random modules. 2) Having attribute names the same as function names in the same module is confusing. 3) I had already user tested this API in some Python courses.  4) The variable names match the various external sources I've linked to in the docs.  5)  Python historically hasn't shied from greek letter names (math: pi tau gamma random: alpha, better, lambd, mu, sigma).
msg336436 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-24 06:00
Steven, Davin, Michael:  Thanks for the encouragement and taking the time to review this code.
msg336439 - (view) Author: miss-islington (miss-islington) Date: 2019-02-24 06:19
New changeset 79fbcc597dfd039d3261fffcb519b5ec5a18df9d by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Make __pos__ return a distinct instance of NormDist (GH-12009)
https://github.com/python/cpython/commit/79fbcc597dfd039d3261fffcb519b5ec5a18df9d
msg336481 - (view) Author: miss-islington (miss-islington) Date: 2019-02-24 19:44
New changeset 9e456bc70e7bc9ee9726d356d7167457e585fd4c by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Add properties for mean and stdev (GH-12022)
https://github.com/python/cpython/commit/9e456bc70e7bc9ee9726d356d7167457e585fd4c
msg336852 - (view) Author: miss-islington (miss-islington) Date: 2019-02-28 17:16
New changeset ef17fdbc1c274dc84c2f611c40449ab84824607e by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Add special value tests and make minor tweaks to the docs (GH-12096)
https://github.com/python/cpython/commit/ef17fdbc1c274dc84c2f611c40449ab84824607e
msg336895 - (view) Author: miss-islington (miss-islington) Date: 2019-03-01 05:47
New changeset 9add4b3317629933d88cf206a24b15e922fa8941 by Miss Islington (bot) (Raymond Hettinger) in branch 'master':
bpo-36018: Add documentation link to "random variable" (GH-12114)
https://github.com/python/cpython/commit/9add4b3317629933d88cf206a24b15e922fa8941
msg337658 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-03-11 11:31
I've done some spot checks of NormDist.pdf and .cdf and compared the results to those returned by my TI Nspire calculator.

So far, the PDF has matched that of the Nspire to 12 decimal places (the limit the calculator will show), but the CDF differs on or about the 8th decimal place:


py> x = statistics.NormalDist(2, 1.3)
py> x.cdf(5.374)
0.9952757439207682
# Nspire normCdf(-∞, 5.372, 2, 1.3) returns 0.995275710979
# difference of 3.294176820212158e-08


py> x.cdf(-0.23)
0.04313736707891003
# Nspire normCdf(-∞, -0.23, 2, 1.3) returns 0.043137332077
# difference of 3.500191003008579e-08

Wolfram Alpha doesn't help me decide which is correct, as it doesn't show enough decimal places.

https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+5.374+]

https://www.wolframalpha.com/input/?i=CDF[+NormalDistribution[2,+1.3],+-0.23+]


Do we care about this difference? Should I raise a new ticket for it?
msg337660 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-03-11 12:12
According to GP/Pari, the correctly value for the first result, to the first few dozen places, is:

0.995275743920768157605659214368609706759611629000344854339231928536087783251913252354...

I'm assuming you meant 5.374 rather than 5.372 in the first Nspire result.
msg337662 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-03-11 12:44
Below is the full transcript from Pari/GP: note that I converted the float inputs to exact Decimal equivalents, assuming IEEE 754 binary64. Summary: both Python results look fine; it's Nspire that's inaccurate here.


mirzakhani:~ mdickinson$ /opt/local/bin/gp
                                            GP/PARI CALCULATOR Version 2.11.1 (released)
                                    i386 running darwin (x86-64/GMP-6.1.2 kernel) 64-bit version
                               compiled: Jan 24 2019, Apple LLVM version 10.0.0 (clang-1000.11.45.5)
                                                      threading engine: single
                                           (readline v8.0 enabled, extended help enabled)

                                               Copyright (C) 2000-2018 The PARI Group

PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER.

Type ? for help, \q to quit.
Type ?17 for how to get moral (and possibly technical) support.

parisize = 8000000, primelimit = 500000
? \p 200
   realprecision = 211 significant digits (200 digits displayed)
? ncdf(x, mu, sig) = (2 - erfc((x - mu) / sig / sqrt(2))) / 2
%1 = (x,mu,sig)->(2-erfc((x-mu)/sig/sqrt(2)))/2
? ncdf(5.37399999999999966604491419275291264057159423828125, 2, 1.3000000000000000444089209850062616169452667236328125)
%2 = 0.99527574392076815760565921436860970675961162900034485433923192853608778325191325235412640687571628164064779657215907190523884572141701976336760387216713270956350229484865180142256611330976179584951493
? ncdf(-0.2300000000000000099920072216264088638126850128173828125, 2, 1.3000000000000000444089209850062616169452667236328125)
%3 = 0.043137367078910025352120502108682523151629166877357644882244088336773338416883044522024586619860574718679715351558322591944140762629090301623352497457372937783778706411712862062109829239761761597057063
msg337686 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-03-11 16:36
> I'm assuming you meant 5.374 rather than 5.372 in the first Nspire result.

Yes, that was a typo, sorry.

Thanks for checking into the results.
msg340703 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-04-23 08:46
New changeset fb8c7d53326d137785ca311bfc48c8284da46770 by Raymond Hettinger in branch 'master':
bpo-36018: Make "seed" into a keyword only argument (GH-12921)
https://github.com/python/cpython/commit/fb8c7d53326d137785ca311bfc48c8284da46770
msg341137 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-04-30 06:47
New changeset b0a2c0fa83f9b79616ccf451687096542de1e6f8 by Raymond Hettinger in branch 'master':
bpo-36018: Test idempotence. Test two methods against one-another. (GH-13021)
https://github.com/python/cpython/commit/b0a2c0fa83f9b79616ccf451687096542de1e6f8
msg341240 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-05-02 00:49
New changeset 671d782f8dc52942dc8c48a513bf24ff8465b112 by Raymond Hettinger in branch 'master':
bpo-36018: Update example to show mean and stdev (GH-13047)
https://github.com/python/cpython/commit/671d782f8dc52942dc8c48a513bf24ff8465b112
msg348273 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-07-22 00:23
I have a query about the documentation:

   The default *method* is "exclusive" and is used for data sampled
   from a population that can have more extreme values than found 
   in the samples. ...
   Setting the *method* to "inclusive" is used for describing 
   population data or for samples that include the extreme points.

In all my reading about quantile calculation methods, this is the first time I've come across this recommendation. Do you have a source for it or a justification?

Thanks.
History
Date User Action Args
2019-07-22 00:23:43steven.dapranosetmessages: + msg348273
2019-07-21 23:20:40rhettingersetpull_requests: + pull_request14678
2019-05-02 00:49:16rhettingersetmessages: + msg341240
2019-05-02 00:29:46rhettingersetpull_requests: + pull_request12966
2019-04-30 06:47:37rhettingersetmessages: + msg341137
2019-04-30 06:07:00rhettingersetpull_requests: + pull_request12943
2019-04-23 08:46:24rhettingersetmessages: + msg340703
2019-04-23 08:27:27rhettingersetpull_requests: + pull_request12849
2019-03-11 16:36:22steven.dapranosetmessages: + msg337686
2019-03-11 12:44:52mark.dickinsonsetmessages: + msg337662
2019-03-11 12:12:46mark.dickinsonsetmessages: + msg337660
2019-03-11 11:31:31steven.dapranosetmessages: + msg337658
2019-03-01 05:47:28miss-islingtonsetmessages: + msg336895
2019-03-01 05:42:10rhettingersetpull_requests: + pull_request12120
2019-02-28 17:16:27miss-islingtonsetmessages: + msg336852
2019-02-28 16:54:52rhettingersetpull_requests: + pull_request12103
2019-02-24 21:46:21rhettingersetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-02-24 19:44:59miss-islingtonsetmessages: + msg336481
2019-02-24 19:25:45rhettingersetpull_requests: + pull_request12054
2019-02-24 06:19:08miss-islingtonsetnosy: + miss-islington
messages: + msg336439
2019-02-24 06:00:41rhettingersetmessages: + msg336436
2019-02-24 05:59:22rhettingersetpull_requests: + pull_request12040
2019-02-24 05:56:41rhettingersetmessages: + msg336435
2019-02-24 05:26:13davinsetmessages: + msg336433
2019-02-24 03:41:08steven.dapranosetmessages: + msg336425
2019-02-24 03:17:37steven.dapranosetmessages: + msg336422
2019-02-23 23:37:55davinsetmessages: + msg336417
2019-02-23 22:46:55rhettingersetmessages: + msg336414
2019-02-23 22:44:11rhettingersetmessages: + msg336413
2019-02-23 09:42:35rhettingersetmessages: + msg336376
2019-02-22 17:44:09xtreaksetnosy: + xtreak
messages: + msg336337
2019-02-22 16:42:26rhettingersetfiles: + normdist_22feb2019.diff
2019-02-22 16:42:11rhettingersetfiles: - normdist_22feb2019.diff
2019-02-22 15:54:15rhettingersetfiles: + normdist_22feb2019.diff
2019-02-22 15:53:17rhettingersetfiles: - normdist_22feb2019.diff
2019-02-22 14:57:28rhettingersetfiles: + normdist_22feb2019.diff

messages: + msg336319
2019-02-22 11:39:49steven.dapranosetmessages: + msg336295
2019-02-22 08:36:14rhettingersetnosy: + davin
messages: + msg336285
2019-02-21 10:57:22rhettingersetkeywords: + patch
stage: patch review
pull_requests: + pull_request11999
2019-02-20 00:33:27rhettingersetmessages: + msg336029
2019-02-19 19:38:31seliksetnosy: + selik
messages: + msg336008
2019-02-18 23:42:51steven.dapranosetmessages: + msg335876
2019-02-18 00:52:41rhettingersetfiles: + gauss_demo.py
2019-02-18 00:50:38rhettingersetnosy: + tim.peters, mark.dickinson
2019-02-18 00:00:59rhettingercreate