
Add quantiles() to the statistics module #80727

Closed
rhettinger opened this issue Apr 6, 2019 · 20 comments
Labels
3.8 (only security fixes), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@rhettinger
Contributor

BPO 36546
Nosy @rhettinger, @mdickinson, @stevendaprano
PRs
  • bpo-36546: Add statistics.quantiles() #12710
  • bpo-36546: More tests: type preservation and equal inputs #13000
  • bpo-36546: Fix typo quaatile to quantile #13001
  • bpo-36546: Add more tests and expand docs #13406
  • bpo-36546: Add design notes to aid future discussions #13769
  • bpo-36546: Mark first argument as position only #14363
  • [3.8] bpo-36546: Mark first argument as position only (GH-14363) #14364
  • bpo-36546: Clean-up comments #14857
  • [3.8] bpo-36546: Clean-up comments (GH-14857) #14859
  • [3.8] bpo-36546: Add examples to elucidate the formulas (GH-14898) #14899
  • bpo-36546: No longer have a need to make "data" positional only #16252
  • [3.8] bpo-36546: No longer a need to make "data" positional only (GH-16252) #16253
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-04-23.07:15:59.221>
    created_at = <Date 2019-04-06.21:22:16.575>
    labels = ['3.8', 'type-feature', 'library']
    title = 'Add quantiles() to the statistics module'
    updated_at = <Date 2019-09-18.04:06:56.932>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2019-09-18.04:06:56.932>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-04-23.07:15:59.221>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2019-04-06.21:22:16.575>
    creator = 'rhettinger'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36546
    keywords = ['patch']
    message_count = 20.0
    messages = ['339544', '339548', '339584', '339593', '339596', '340639', '340641', '340693', '340695', '340936', '340943', '341042', '342803', '344377', '346478', '346481', '348155', '348157', '352686', '352687']
    nosy_count = 3.0
    nosy_names = ['rhettinger', 'mark.dickinson', 'steven.daprano']
    pr_nums = ['12710', '13000', '13001', '13406', '13769', '14363', '14364', '14857', '14859', '14899', '16252', '16253']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue36546'
    versions = ['Python 3.8']

    @rhettinger
    Contributor Author

    It is a common and useful data analysis technique to examine quartiles, deciles, and percentiles. It is especially helpful for comparing distinct datasets (heights of boys versus heights of girls) or for comparing against a reference distribution (empirical data versus a normal distribution for example).

    --- sample session ---

>>> from statistics import NormalDist, quantiles
>>> from pylab import plot

    # SAT exam scores
    >>> sat = NormalDist(1060, 195)
    >>> list(map(round, quantiles(sat, n=4)))       # quartiles
    [928, 1060, 1192]
    >>> list(map(round, quantiles(sat, n=10)))      # deciles
    [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
    
    # Summarize a dataset
    >>> data = [110, 96, 155, 87, 98, 82, 156, 88, 172, 102, 91, 184, 105, 114, 104]
    >>> quantiles(data, n=2)                        # median
    [104.0]
    >>> quantiles(data, n=4)                        # quartiles
    [91.0, 104.0, 155.0]
    >>> quantiles(data, n=10)                       # deciles
    [85.0, 88.6, 95.0, 99.6, 104.0, 108.0, 122.2, 155.8, 176.8]
    
    
    # Assess when data is normally distributed by comparing quantiles
    >>> reference_dist = NormalDist.from_samples(data)
    >>> quantiles(reference_dist, n=4)
    [93.81594518619364, 116.26666666666667, 138.71738814713967]

    # Make a QQ plot to visualize how well the data matches a normal distribution
    # plot(quantiles(data, n=7), quantiles(reference_dist, n=7))

    @rhettinger added the 3.8, stdlib, and type-feature labels Apr 6, 2019
    @rhettinger
    Contributor Author

    @stevendaprano
    Member

    I think quantiles (sometimes called fractiles) are a good feature to add, and I have some use-cases for quartiles in particular. I especially like that it delegates to the inv_cdf() method when available, and I'm happy with the API you suggested.

    Forgive me if you're already aware of this, but the calculation of quantiles is unfortunately complicated by the fact that there are so many different ways to calculate them. (I see you have mentioned a potential future API for interp_method.)

    See, for example:

    http://jse.amstat.org/v14n3/langford.html

    for a discussion. My own incomplete survey of statistics software has found about 20 distinct methods for calculating quantiles in use. I'm very happy to see this function added, but I'm not happy to commit to a specific calculation method without discussion.

    If you agree that we can change the implementation later (and hence the specific cut points returned) then I see no reason why we can't get this in before feature freeze, and then do a review to find the "best" default implementation later. I already have three candidates:

    1. Langford suggests his "CDF method 4", which is equivalent to Hyndman and Fan's Definition 2; it is also the default method used by SAS.

    2. Hyndman and Fan themselves recommend their Definition 8:

    https://robjhyndman.com/publications/quantiles/

    3. R's default is H&F's Definition 7. (I suggest this only to ease comparisons with results from R, not because it has any statistical advantage.)

    Do we agree that there is to be no backwards-compatibility guarantee made on the implementation and the specific cut-points returned? (Or at least not yet.)

    @mdickinson
    Member

    Related previous discussion on python-ideas: https://mail.python.org/pipermail/python-ideas/2018-March/049327.html

    @rhettinger
    Contributor Author

    Thanks for taking a detailed look. I'll explore the links you provided shortly.

    The API is designed to be extendable so that we don't get trapped by the choice of computation method. If needed, any or all of the following extensions are possible without breaking backward compatibility:

      quantiles(data, n=4, already_sorted=True) # Skip resorting
      quantiles(data, cut_points=[0.02, 0.25, 0.50, 0.75, 0.98]) # box-and-whiskers
      quantiles(data, interp_method='nearest') # also: "low", "high", "midpoint"
      quantiles(data, inclusive=True)    # For description of a complete population

    The default approach used in the PR matches what is used by MS Excel's PERCENTILE.EXC function¹. That has several virtues. It is easy to explain. It allows two unequal-sized datasets to be compared (perhaps with a QQ plot) to explore whether they are drawn from the same distribution. For sampled data, the quantiles tend to remain stable as more samples are added. For samples from a known distribution (e.g., normal variates), it tends to give the same results as inv_cdf():

        >>> iq = NormalDist(100, 15)
        >>> cohort = iq.samples(10_000)
        >>> for ref, est in zip(quantiles(iq, n=10), quantiles(cohort, n=10)):
        ...     print(f'{ref:5.1f}\t{est:5.1f}')
        ...
         80.8	 81.0
         87.4	 87.8
         92.1	 92.3
         96.2	 96.3
        100.0	100.1
        103.8	104.0
        107.9	108.0
        112.6	112.9
        119.2	119.3

    My thought was to start with something like this and only add options if they get requested (the most likely request is an inclusive=True option to emulate MS Excel's PERCENTILE.INC).

    If we need to leave the exact method unguaranteed, that's fine. But I think it would be better to guarantee the match to PERCENTILE.EXC and then handle other requests through API extensions rather than revisions.

    ¹ https://exceljet.net/excel-functions/excel-percentile.exc-function
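    For concreteness, the exclusive interpolation described above can be sketched as follows. This is an illustrative reimplementation assuming the PERCENTILE.EXC (R type 6) definition, not the code from the PR itself:

```python
def quantiles_exc(data, n=4):
    """Illustrative sketch of exclusive (R type 6) quantiles.

    Assumes the PERCENTILE.EXC definition, where the i-th cut point
    sits at position i*(len+1)/n in the sorted data (1-based).
    """
    data = sorted(data)
    ld = len(data)
    if ld < 2:
        raise ValueError("need at least two data points")
    result = []
    for i in range(1, n):
        j = i * (ld + 1) // n          # integer part of the position
        j = min(max(j, 1), ld - 1)     # clamp so interpolation stays in range
        delta = i * (ld + 1) - j * n   # fractional part, scaled by n
        result.append((data[j - 1] * (n - delta) + data[j] * delta) / n)
    return result

data = [110, 96, 155, 87, 98, 82, 156, 88, 172, 102, 91, 184, 105, 114, 104]
print(quantiles_exc(data, n=4))   # quartiles: [91.0, 104.0, 155.0]
```

    This reproduces the quartiles from the sample session earlier in the thread.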

    @rhettinger
    Contributor Author

    Steven, as requested I added a documentation note reserving the right to add other interpolation methods. We can make the note stronger if you like. Otherwise, I think we're good to go now.

    Mark, thanks for the link. I've read all the posts and agree that we might consider changing the default interpolation method prior to the release. For now, matching what Excel and SciPy does seems like a reasonable starting point.

    @rhettinger
    Contributor Author

    One other thought: The PR implements both methods 6 and 7, which means that we can reproduce the results of every major stats tool except for Mathematica. The summary-of-properties chart in Hyndman & Fan lists our default as satisfying 5 of 6 desirable properties. This is the same as for method 8, which almost no one uses by default.

    @rhettinger
    Contributor Author

    New changeset 9013ccf by Raymond Hettinger in branch 'master':
    bpo-36546: Add statistics.quantiles() (GH-12710)
    9013ccf

    @rhettinger
    Contributor Author

    As requested, I've made the API easily extendable to handle alternative methods if those are deemed necessary. Also, it is easy to change the default.

    That said, it is my hope that these two methods and the current default get kept. They have several virtues: they are easy to explain, unsurprising, and widely adopted (most of the other packages offer at least one of these two methods), and they have most of the desirable properties listed in Hyndman & Fan. They are a good practical choice.

    @stevendaprano
    Member

    Hi Raymond,

    Thanks for working on this, I'm really keen to see this happen and I
    appreciate your efforts so far.

    Your arguments have also convinced me that the default calculation
    type you chose (PERCENTILE.EXC or R type=6) is suitable.

    But I have a couple of concerns about the API:

    1. You call the parameter to choose a calculation definition "method";
      I'm mildly concerned that might lead to some confusion or ambiguity with
      methods on a class. Do you think I'm worrying about nothing?

    R calls the same parameter "type", which I don't like either since that
    also has another meaning in Python. Numpy calls an equivalent parameter
    "interpolation", which I think is not only excessively long to type, but
    also misleading since at least two of the calculation methods used by
    numpy don't perform any interpolation at all.

    Octave and Maple call their parameter "method", so if we stick with
    "method" we're in good company.

    2. I'm more concerned about the values taken by the method parameter.
      "Inclusive" and "Exclusive" have a related but distinct meaning when it
      comes to quartiles which is different from the Excel usage:

    Using range(1, 9) as the data:

    • standard "inclusive" or "exclusive" quartiles: 2.5, 4.5, 6.5
    • Excel's "exclusive" quartiles (R type 6): 2.25, 4.5, 6.75
    • Excel's "inclusive" quartiles (R type 7): 2.75, 4.5, 6.25

    I'd prefer to keep inclusive/exclusive for such quartiles.
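    For reference, the distinction drawn above can be reproduced with the function as it eventually shipped in Python 3.8, which exposes exactly these two Excel-style definitions through the method parameter:

```python
from statistics import quantiles  # available since Python 3.8

data = range(1, 9)
print(quantiles(data, n=4, method='exclusive'))  # Excel PERCENTILE.EXC / R type 6: [2.25, 4.5, 6.75]
print(quantiles(data, n=4, method='inclusive'))  # Excel PERCENTILE.INC / R type 7: [2.75, 4.5, 6.25]
```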

    I have a working version of quantiles() which supports cutpoints and all
    nine calculation methods supported by R. Following your lead, I've kept
    type=6 as the default. I need to tidy up the tests and inline docs,
    which time permitting I will do over the next 48 hours, and then it will
    be ready for initial review.

    @rhettinger
    Contributor Author

    Thanks for propelling this forward :-) I'm really happy to have an easy to reach tool that readily summarizes the shape of data and that can be used to compare how distributions differ.

    Octave and Maple call their parameter "method", so if we
    stick with "method" we're in good company.

    The Langford paper also uses the word "method", so that is likely just the right word.

    I'm more concerned about the values taken by the method parameter.
    "Inclusive" and "Exclusive" have a related but distinct meaning
    when it comes to quartiles which is different from the Excel usage

    Feel free to change it to whatever communicates best. The meaning I was going for is closer to the notion of an open versus a closed interval. In terms of use cases, one is for describing population data, where the minimum input really is the 0th percentile and the maximum is the 100th percentile. The other is for sample data, where the underlying population will have values outside the range of the empirical samples. I'm not sure what words best describe the distinction. The words "inclusive" and "exclusive" approximated that idea, but maybe you can do better.

    I have a working version of quantiles() which supports cutpoints
    and all nine calculation methods supported by R.

    My recommendation is to not do this. Usually, it's better to start simple, focusing on core use cases (i.e. sample and population), then let users teach us what additions they really need (this is a YAGNI argument). Once a feature is offered, it can never be taken away even if it proves to be not helpful in most situations or is mostly unused.

    In his 20 year retrospective, Hyndman expressed dismay that his paper had the opposite effect of what was intended (hoping for a standardization on a single approach rather than a proliferation of all nine methods). My experience in API design is that offering users too many choices will complicate their lives, leading to suboptimal and incorrect choices and creating confusion. That is likely why most software packages other than R only offer one or two options.

    If you hold off, you can always add these options later. We might just find that what we've got suffices for most everyday uses. Also, I thought the spirit of the statistics module was to offer a few core statistical tools aimed at non-experts, deferring to external packages for richer collections of optimized, expert tools that cover every option. For me, the best analogy is my two cameras. One is a point-and-shoot that is easy to use and does a reasonable job. The other is a professional SLR with hundreds of settings that I had to go to photography school to learn to use.

    FWIW, I held off on adding "cut_points" because the normal use case is to get equally spaced quantiles. It would be unusual to want 0.25 and 0.50 but not 0.75. The other reason is that user-provided cut-points conflict with the core concept of "Divide *dist* into *n* continuous intervals with equal probability." User-provided cut-points open other ways to go wrong as well (not being sorted, 0.0 or 1.0 not being valid for some methods, values outside the range 0.0 to 1.0). The need for cut_points makes more sense for numpy or scipy, where it is common to pass around a linspace. Everyday Python isn't like that.
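    As a side note on the cut_points question: for distribution objects, arbitrary cut points are already reachable through the existing inv_cdf() method without any new parameter. A sketch (the 0.02/0.98 whisker positions are just the box-and-whiskers example from earlier in the thread):

```python
from statistics import NormalDist

iq = NormalDist(100, 15)

# Box-and-whiskers style cut points, taken directly from the inverse CDF
# rather than from a hypothetical cut_points= parameter on quantiles().
cut_points = [0.02, 0.25, 0.50, 0.75, 0.98]
values = [iq.inv_cdf(p) for p in cut_points]
print([round(v, 1) for v in values])
```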

    @rhettinger
    Contributor Author

    New changeset db81ba1 by Raymond Hettinger in branch 'master':
    bpo-36546: More tests: type preservation and equal inputs (GH-13000)
    db81ba1

    @rhettinger
    Contributor Author

    New changeset e917f2e by Raymond Hettinger in branch 'master':
    bpo-36546: Add more tests and expand docs (GH-13406)
    e917f2e

    @rhettinger
    Contributor Author

    New changeset cba9f84 by Raymond Hettinger in branch 'master':
    bpo-36546: Add design notes to aid future discussions (GH-13769)
    cba9f84

    @rhettinger
    Contributor Author

    New changeset 1791128 by Raymond Hettinger in branch 'master':
    bpo-36546: Mark first argument as position only (GH-14363)
    1791128

    @rhettinger
    Contributor Author

    New changeset 210358b by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
    bpo-36546: Mark first argument as position only (GH-14363) (GH-14364)
    210358b

    @rhettinger
    Contributor Author

    New changeset eed5e9a by Raymond Hettinger in branch 'master':
    bpo-36546: Clean-up comments (GH-14857)
    eed5e9a

    @rhettinger
    Contributor Author

    New changeset e5bfd1c by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
    bpo-36546: Clean-up comments (GH-14857) (GH-14859)
    e5bfd1c

    @rhettinger
    Contributor Author

    New changeset 272d0d0 by Raymond Hettinger in branch 'master':
    bpo-36546: No longer a need to make "data" positional only (GH-16252)
    272d0d0

    @rhettinger
    Contributor Author

    New changeset 31af1cc by Raymond Hettinger (Miss Islington (bot)) in branch '3.8':
    bpo-36546: No longer a need to make "data" positional only (GH-16252) (GH-16253)
    31af1cc

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022