Issue 35775: Add a general selection function to statistics

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79956

classification

Title:	Add a general selection function to statistics
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	mark.dickinson, remi.lapeyre, rhettinger, steven.daprano
Priority:	normal	Keywords:	patch, patch

Created on 2019-01-18 14:04 by remi.lapeyre, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 11609	closed	remi.lapeyre, 2019-01-18 14:05
PR 11609	closed	remi.lapeyre, 2019-01-18 14:05

Messages (7)
msg333964 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2019-01-18 14:04
Like discussed in #30999, the attached PR adds a general selection function to the statistics module. This allows to simply get the element at a given quantile of a collection. https://www.cs.rochester.edu/~gildea/csc282/slides/C09-median.pdf
msg334019 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2019-01-18 22:34
I'm very interested in adding quartiles and general quantiles/fractiles, but I'm not so sure that this select(data, index) function would be useful. Can you explain how you would use this?
msg334022 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2019-01-18 23:13
Wouldn't be the 5-th percentile be select(data, round(len(data)/20)?
msg334027 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2019-01-19 03:22
On Fri, Jan 18, 2019 at 11:13:41PM +0000, Rémi Lapeyre wrote: > Wouldn't be the 5-th percentile be select(data, round(len(data)/20)? Oh if only it were that simple! Using the method you suggest, the 50th percentile is not the same as the median unless the length of the list is three more than a multiple of four. It also runs into problems for small lists where the index rounds down to zero. Langford (2006) does a literature review and finds fifteen methods for calculating the quartiles (Q1, Q2, Q3), of which twelve are distinct and incompatible; Hyndman & Fan (1996) did similar for general quantiles and came up with nine, of which seven match Langford's. I know of at least six other methods, which gives a total of 20 distinct ways of calculating quartiles or quantiles. http://jse.amstat.org/v14n3/langford.html https://robjhyndman.com/publications/quantiles/ I stress that these are not merely different algorithms which give the same answer, but different methods which sometimes disagree on their answers. So whichever method you use, some people are going to be annoyed or confused or both. http://mathforum.org/library/drmath/view/60969.html Other statistics libraries provide a choice, e.g.: - R and Octave provide the same 9 as H&F. - Maple provides 6 of those, plus 2 others. - Wessa provides 5 that match H&F, plus another 3. - SAS provides 5. - even Excel provides 2 different ways. Statisticians don't even agree on which is the "best" method. H&F recommend their method number 8. Langford recommends his method 4. I think that your suggestion matches Langford's method 14, which is H&F's method 3. Selecting the i-th item from a list is the easy part. Turning that into meaningful quantiles, percentiles etc is where it gets really hairy. My favourite quote on this comes from J Nash on the Gnumeric mailing list: Ultimately, this question boils down to where to cut to divide 4 candies among 5 children. No matter what you do, things get ugly.
msg334036 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-01-19 06:01
> I'm very interested in adding quartiles and > general quantiles/fractiles, but I'm not > so sure that this select(data, index) function would be useful. I concur with Steven.
msg334075 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2019-01-19 23:27
Rémi. I've read over your patch and have some comments: (1) You call sorted() to produce a list, but then instead of retrieving the item using ``data[i-1]`` you use ``itertools.islice``. That seems unnecessary to me. Do you have a reason for using ``islice``? (2) select is not very useful on its own, we actually want it so we can calculate quantiles, e.g. percentiles, deciles, quartiles. If we want the k-quantile (e.g. k=100 for percentiles) then there are k+1 k-quantiles in total, including the minimum and maximum. E.g quartiles divide the data set into four equal sections, so there are five boundary values including the min and max. So the caller is likely to be calling select repeatedly on the same data set, and hence making a copy of that data and sorting it repeatedly. If the data set is small, repeatedly making sorted copies is still cheap enough, but for large data sets, that will be expensive. Do you have any thoughts on how to deal with that?
msg343398 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2019-05-24 15:35
Hi Steven, thanks for taking the time to reviewing my patch. Regarding the relevance of add select(), I was looking for work to do in the bug tracker and found some references to it (https://bugs.python.org/issue21592#msg219934 for example). I knew that there is multiples definition of the percentiles but got sloppy in my previous response by wanting to answer quickly. I will try not to do this again. Regarding the use of sorting, I thought that sorting would be quicker than doing the other linear-time algorithm in Python given the general performance of Tim sort, some tests in https://bugs.python.org/issue21592 agreed with that. For the iterator, I was thinking about how to implement percentiles when writing select() and thought that by writing: def _select(data, i, key=None): if not len(data): raise StatisticsError("select requires at least one data point") if not (1 <= i <= len(data)): raise StatisticsError(f"The index looked for must be between 1 and {len(data)}") data = sorted(data, key=key) return islice(data, i-1, None) def select(data, i, key=None): return next(_select(data, y, key=key)) and then doing some variant of: it = _select(data, i, key=key) left, right = next(it), next(it) # compute percentile with left and right to implement the quantiles without sorting multiple time the list. Now that quantiles() has been implement by Raymond Hettinger, this is moot anyway. Since its probably not useful, feel free to disregard my PR.

History
Date	User	Action	Args
2022-04-11 14:59:10	admin	set	github: 79956
2020-03-09 15:30:54	remi.lapeyre	set	status: open -> closed stage: patch review -> resolved
2019-05-24 15:35:28	remi.lapeyre	set	messages: + msg343398
2019-01-19 23:27:32	steven.daprano	set	keywords: patch, patch messages: + msg334075
2019-01-19 06:01:30	rhettinger	set	keywords: patch, patch messages: + msg334036
2019-01-19 03:22:05	steven.daprano	set	messages: + msg334027
2019-01-18 23:13:35	remi.lapeyre	set	messages: + msg334022
2019-01-18 22:34:06	steven.daprano	set	keywords: patch, patch messages: + msg334019
2019-01-18 16:23:43	mark.dickinson	set	keywords: patch, patch nosy: + mark.dickinson
2019-01-18 14:05:38	remi.lapeyre	set	nosy: + rhettinger, steven.daprano
2019-01-18 14:05:10	remi.lapeyre	set	keywords: + patch stage: patch review pull_requests: + pull_request11338
2019-01-18 14:05:07	remi.lapeyre	set	keywords: + patch stage: (no value) pull_requests: + pull_request11337
2019-01-18 14:04:49	remi.lapeyre	create