Issue 20479: Efficiently support weight/frequency mappings in the statistics module



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64678

classification

Title:	Efficiently support weight/frequency mappings in the statistics module
Type:	enhancement	Stage:	needs patch
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:	20478	Superseder:
Assigned To:	steven.daprano	Nosy List:	gregory.p.smith, ncoghlan, oscarbenjamin, remi.lapeyre, rhettinger, steven.daprano, wolma
Priority:	normal	Keywords:

Created on 2014-02-02 01:21 by ncoghlan, last changed 2022-04-11 14:57 by admin.

Messages (15)
msg209931 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-02-02 01:21
Issue 20478 suggests ensuring that even weight/frequency mappings like collections.Counter are consistently handled as iterables in the current statistics module API. However, it likely makes sense to provide public APIs that support efficiently working with such weight/frequency mappings directly, rather than requiring that they be expanded to a full iterable all the time. One possibility would be to provide parallel APIs with the _map suffix, similar to the format() vs format_map() distinction in the string formatting APIs.
msg209973 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2014-02-02 11:55
Off the top of my head, I can think of three APIs:

(1) separate functions, as Nick suggests: mean vs weighted_mean, stdev vs weighted_stdev
(2) treat mappings as implied (value, frequency) pairs
(3) take an additional argument to switch between unweighted and weighted modes.

I dislike #3, but will consider the others for 3.5.
msg209975 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2014-02-02 12:12
On 2 February 2014 11:55, Steven D'Aprano <report@bugs.python.org> wrote:
>
> (1) separate functions, as Nick suggests:
> mean vs weighted_mean, stdev vs weighted_stdev

This would be my preferred approach. It makes it very clear which functions are available for working with map style data. It will be clear from both the module documentation and a casual introspection of the module that those APIs are present for those who might want them.

Also apart from mode() the implementation of each function on map-format data will be completely different from the iterable version so you'd want to have it as a separate function at least internally anyway.
msg209985 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-02-02 13:55
See also issue18844.
msg210038 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-02 22:27
> -----Original Message-----
> From: Steven D'Aprano [mailto:report@bugs.python.org]
> Sent: Sunday, 2 February 2014 12:55
> To: wolfgang.maier@biologie.uni-freiburg.de
> Subject: [issue20479] Efficiently support weight/frequency mappings in the
> statistics module
>
> Steven D'Aprano added the comment:
>
> Off the top of my head, I can think of three APIs:
>
> (1) separate functions, as Nick suggests:
> mean vs weighted_mean, stdev vs weighted_stdev
>
> (2) treat mappings as an implied (value, frequency) pairs
>

(2) is clearly my favourite. (1) may work well if you have a module in which only a small fraction of the functions need an alternate API. In the statistics module, however, almost all of the current functions could profit from having a way to treat mappings specially. In such a case, (1) is prone to create lots of redundancies.

I do not share Oscar's opinion that

> apart from mode() the implementation of each function on
> map-format data will be completely different from the iterable version
> so you'd want to have it as a separate function at least internally
> anyway.

Consider _sum's current code (docstring omitted for brevity):

    def _sum(data, start=0):
        n, d = _exact_ratio(start)
        T = type(start)
        partials = {d: n}  # map {denominator: sum of numerators}
        # Micro-optimizations.
        coerce_types = _coerce_types
        exact_ratio = _exact_ratio
        partials_get = partials.get
        # Add numerators for each denominator, and track the "current" type.
        for x in data:
            T = _coerce_types(T, type(x))
            n, d = exact_ratio(x)
            partials[d] = partials_get(d, 0) + n
        if None in partials:
            assert issubclass(T, (float, Decimal))
            assert not math.isfinite(partials[None])
            return T(partials[None])
        total = Fraction()
        for d, n in sorted(partials.items()):
            total += Fraction(n, d)
        if issubclass(T, int):
            assert total.denominator == 1
            return T(total.numerator)
        if issubclass(T, Decimal):
            return T(total.numerator)/total.denominator
        return T(total)

All you'd have to do to treat mappings as proposed here is to add a check whether we are dealing with a mapping, then in this case, instead of the for loop:

    for x in data:
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n

use this:

    for x,m in data.items():
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n*m

and no other changes (though I haven't tested this carefully).

Wolfgang
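As a rough, self-contained illustration of the idea in this message (not the module's actual code), the sketch below sums a value -> frequency mapping exactly by multiplying each numerator by its integer frequency. The name `exact_sum_of_mapping` is hypothetical, and `Fraction` is used as a stand-in for the private `_exact_ratio` helper; type coercion and non-finite values are deliberately left out.

```python
from collections import Counter
from fractions import Fraction


def exact_sum_of_mapping(freq_map):
    """Return the exact sum of value * frequency over a mapping.

    Assumes integer frequencies and values that Fraction() can
    represent exactly (int, Fraction, finite float).
    """
    partials = {}  # map {denominator: sum of numerators}
    for x, m in freq_map.items():
        ratio = Fraction(x)  # stand-in for the private _exact_ratio helper
        n, d = ratio.numerator, ratio.denominator
        # Multiply the numerator by the frequency instead of looping m times.
        partials[d] = partials.get(d, 0) + n * m
    return sum((Fraction(n, d) for d, n in partials.items()), Fraction(0))


counts = Counter({1: 3, 2: 1, Fraction(1, 2): 4})
# The result matches summing the fully expanded iterable.
assert exact_sum_of_mapping(counts) == sum(counts.elements())
```

The point of the comparison against `counts.elements()` is only that the mapping-based loop gives the same total without materialising the repeated values.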
msg210107 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-03 10:09
Well, I was thinking about frequencies (ints) when suggesting

    for x,m in data.items():
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n*m

in my previous message. To support weights (float or Rational) this would have to be more sophisticated.

Wolfgang
msg210108 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2014-02-03 10:33
> in my previous message. To support weights (float or Rational) this would have to be more sophisticated.

I guess you'd do:

    for x,w in data.items():
        T = _coerce_types(T, type(x))
        xn, xd = exact_ratio(x)
        wn, wd = exact_ratio(w)
        partials[xd * wd] = partials_get(xd * wd, 0) + xn * wn

Variance is only slightly trickier. Median would be more complicated.

I just think that I prefer to know when I look at code that something is being treated as a mapping or as an iterable. So when I look at

    d = f(x, y, z)
    v = variance_map(d)

it's immediately obvious what d is and how the function variance_map is using it.

As well as the benefit of readability, there's also the fact that accepting different kinds of input puts strain on any attempt to modify your code in the future. Auditing the code requires understanding at all times that the name "data" is bound to a quantum superposition of different types of object.

Either every function would have to have the same "iterable or mapping" interface or there would have to be some other convention for making it clear which ones do. Perhaps the functions that don't make sense for a mapping could explicitly reject them rather than treating them as an iterable.

I just think it's simpler to have a different function name for each type of input. Then it's clear what functions are available for working with mappings.

If you were going for something completely different then you could have an object-oriented interface where there are classes for the different types of data and methods that do the right thing in each case. Then you would do

    v = WeightedData(d).variance()

The ordinary variance() function could just become a shortcut for

    def variance(data):
        return SequenceData(data).variance()
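The two ideas in this message can be sketched together, under some assumptions: `weighted_exact_sum` is a hypothetical helper that keys the partial sums on the product of the value and weight denominators, and `WeightedData` is a minimal guess at what the object-oriented interface might look like (only `mean()` is shown; the thread does not define how `variance()` should treat weights).

```python
from fractions import Fraction


def weighted_exact_sum(weighted):
    """Exact sum of value * weight for a value -> weight mapping.

    Weights may be int, Fraction or finite float (all converted exactly).
    """
    partials = {}  # map {xd * wd: sum of xn * wn}
    for x, w in weighted.items():
        xr, wr = Fraction(x), Fraction(w)
        key = xr.denominator * wr.denominator
        partials[key] = partials.get(key, 0) + xr.numerator * wr.numerator
    return sum((Fraction(n, d) for d, n in partials.items()), Fraction(0))


class WeightedData:
    """Hypothetical wrapper making the 'this is a mapping' intent explicit."""

    def __init__(self, mapping):
        self._mapping = dict(mapping)

    def mean(self):
        # Raises ZeroDivisionError if the weights sum to zero.
        total_weight = sum(Fraction(w) for w in self._mapping.values())
        return weighted_exact_sum(self._mapping) / total_weight


print(WeightedData({1: 1, 3: 3}).mean())
```

With weights 1 and 3 on the values 1 and 3, the weighted sum is 10 and the total weight is 4, so the call prints `5/2`.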
msg305166 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2017-10-28 20:01
My recommendation is to have weights as an optional argument:

    statistics.mean(values, weights=None)

While it is tempting to special case dicts and counters, I got feedback from Jake Vanderplas and Wes McKinney that in practice it is more common to have the weights as a separate list/array/vector.

That API has other advantages as well. For starters, it is a simple extension of the existing API, so it isn't a disruptive change. Also, it works well with mapping views:

    statistics.mean(vehicle_sales.keys(), vehicle_sales.values())

And the API also helps support use cases where different weightings are being explored for the same population:

    statistics.mean(salary, years_of_service)
    statistics.mean(salary, education)
    statistics.mean(salary, age)
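A minimal sketch of what such a signature could look like, written as a standalone function rather than a change to `statistics.mean` itself. The name `weighted_mean`, the error handling, and the `vehicle_sales` data are all assumptions made for illustration; the thread does not settle on validation rules for weights.

```python
import statistics


def weighted_mean(values, weights=None):
    """Mean of values, optionally weighted by a parallel iterable."""
    if weights is None:
        # Unweighted case falls through to the existing function.
        return statistics.mean(values)
    values = list(values)
    weights = list(weights)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data: number of doors -> units sold.
vehicle_sales = {2: 10, 4: 30}
print(weighted_mean(vehicle_sales.keys(), vehicle_sales.values()))
```

Because `keys()` and `values()` of the same dict iterate in matching order, the mapping case needs no special handling; the call above prints `3.5`.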
msg305171 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-10-29 02:23
Thinking back to my signal processing days, I have to agree that our weightings (filter definitions) were usually separate from our data (live signals). Similarly, systems engineering trade studies all maintained feature weights separately from the assessments of the individual options.

The comment from my original RFE about avoiding expanding value -> weight/frequency mappings "to a full iterable all the time" doesn't actually make much sense in 3.x, since m.keys() and m.values() are now typically able to avoid data copying.

So +1 from me for the separate "weights" parameter, with the m.keys()/m.values() idiom used to handle mappings like Counter.

As another point in favour of that approach, it's trivial to build zero-copy weighted variants on top of it for mappings with cheap key and value views:

    def weighted_mean(mapping):
        return statistics.mean(mapping.keys(), mapping.values())

By contrast, if the lowest level primitive provided is a mapping based API, then when you do have separate values-and-weights iterables, you're going to have a much harder time avoiding building a completely new container.
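The last point can be made concrete with a small sketch. If only a mapping-based API existed, separate values and weights would first have to be merged into a new dict, and repeated values would need their weights combined rather than overwritten. The helper name `to_weight_mapping` is hypothetical, and summing the weights of duplicate values is one possible convention, not something the thread specifies.

```python
from collections import defaultdict


def to_weight_mapping(values, weights):
    """Build a value -> total weight mapping from parallel iterables."""
    merged = defaultdict(int)
    for v, w in zip(values, weights):
        merged[v] += w  # combine weights of repeated values
    return dict(merged)


values = [10, 20, 10]
weights = [1, 2, 3]

# The naive conversion silently keeps only the last weight for 10.
assert dict(zip(values, weights)) == {10: 3, 20: 2}
# Merging needs an extra pass and a freshly allocated container.
assert to_weight_mapping(values, weights) == {10: 4, 20: 2}
```

Going the other way (mapping to parallel views) needs no such step, which is the asymmetry the message describes.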
msg333978 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2019-01-18 15:27
Is this proposal still relevant? If so, I would like to work on its implementation. I think the third proposal, changing the API to add a new `weights` parameter, is the best, as it does not blindly assume that a tuple is a (value, weight) pair, which might not always be true.
msg334037 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-01-19 06:04
> Is this proposal still relevant? If so, I would
> like to work on its implementation.

The first question is the important one. Writing implementations is usually the easy part. Deciding on whether there is a real need and creating a usable, extendable, and clean API is often the hard part. So please don't run to a PR, that just complicates the important part.
msg334039 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2019-01-19 06:30
> Is this proposal still relevant?

Yes. As Raymond says, deciding on a good API is the hard part. It's relatively simple to change a poor implementation for a better one, but backwards compatibility means that changing the API is very difficult.

I would find it very helpful if somebody has time to do a survey of other statistics libraries or languages (e.g. numpy, R, Octave, Matlab, SAS etc) and see how they handle data with weights.

- what APIs do they provide?
- do they require weights to be positive integers, or do they support arbitrary float weights?
- including negative weights? (what physical meaning does a negative weight have?)

At the moment, a simple helper function seems to do the trick for non-negative integer weights:

    def flatten(items):
        for item in items:
            yield from item

    py> data = [1, 2, 3, 4]
    py> weights = [1, 4, 1, 2]
    py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
    2.5

In principle, the implementation could be as simple as a single recursive call:

    def mean(data, weights=None):
        if weights is not None:
            return mean(flatten([x]*w for x, w in zip(data, weights)))
        # base case without weights is unchanged

or perhaps it could be just a recipe in the docs.
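For reference, the same recipe can be written with itertools so that no intermediate `[x]*w` lists are built. This is only a variation on the helper above, with `expand` as a hypothetical name; like the original, it is limited to integer weights (a float weight makes `repeat` raise TypeError, and a negative weight yields nothing).

```python
import statistics
from itertools import chain, repeat


def expand(data, weights):
    """Yield each value repeated `weight` times (integer weights only)."""
    return chain.from_iterable(repeat(x, w) for x, w in zip(data, weights))


data = [1, 2, 3, 4]
weights = [1, 4, 1, 2]
print(statistics.mean(expand(data, weights)))
```

The expanded data is 1, 2, 2, 2, 2, 3, 4, 4, whose sum is 20 over 8 items, so this prints `2.5`, matching the session shown in the message.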
msg334101 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2019-01-20 21:11
> I would find it very helpful if somebody has time to do a survey of
> other statistics libraries or languages (e.g. numpy, R, Octave, Matlab,
> SAS etc) and see how they handle data with weights.

Numpy has only sporadic support for this. The standard mean function does not have any way to provide weights, but there is an alternative called average that computes the mean and has an optional weights argument. I had never heard of average before searching for "numpy weighted mean" just now. Numpy's API often has bits of old cruft from where various numerical packages were joined together, so I'm not sure they would recommend their current approach. I don't think there are any other numpy functions for providing weighted statistics.

Statsmodels does provide an API for this as explained here:
https://stackoverflow.com/a/36464881/9450991
Their API is that you create an object with data and weights and can then call methods/attributes for statistics.

Matlab doesn't support even weighted mean as far as I can tell. There is wmean on the matlab file exchange:

>
> - what APIs do they provide?
> - do they require weights to be positive integers, or do they
>   support arbitrary float weights?
> - including negative weights?
>   (what physical meaning does a negative weight have?)
>
> At the moment, a simple helper function seems to do the trick for
> non-negative integer weights:
>
>     def flatten(items):
>         for item in items:
>             yield from item
>
>     py> data = [1, 2, 3, 4]
>     py> weights = [1, 4, 1, 2]
>     py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
>     2.5
>
> In principle, the implementation could be as simple as a single
> recursive call:
>
>     def mean(data, weights=None):
>         if weights is not None:
>             return mean(flatten([x]*w for x, w in zip(data, weights)))
>         # base case without weights is unchanged
>
> or perhaps it could be just a recipe in the docs.
msg334102 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2019-01-20 21:17
Sorry, sent too soon...

> Matlab doesn't support even weighted mean as far as I can tell. There
> is wmean on the matlab file exchange:

https://stackoverflow.com/a/36464881/9450991

This is a separate function `wmean(data, weights)`. It has to be a separate function, though, because it's third-party code, so the author couldn't change the main mean function.

R ships with a weighted.mean function, but I think for standard deviation you need third-party libs.

This was only a quick survey, but the main impression I get is that providing an API for this is not that common. The only good-looking API is the statsmodels one.
msg334197 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2019-01-22 01:12
Here is some further information on weights in statistics in general, and SAS and Stata specifically:

https://blogs.sas.com/content/iml/2017/10/02/weight-variables-in-statistics-sas.html

Quote:

    use the FREQ statement to specify integer frequencies for repeated observations. Use the WEIGHT statement when you want to decrease the influence that certain observations have on the parameter estimates.

http://support.sas.com/kb/22/600.html
https://www.stata.com/manuals13/u20.pdf#u20.23

Executive summary:

- Stata defines four different kinds of weights;
- SAS defines two, WEIGHT and FREQ (frequency);
- SAS truncates FREQ values to integers, with zero or negative meaning that the data point is to be ignored;
- Using FREQ is equivalent to repeating the data points. In Python terms, mean([1, 2, 3, 4], freq=[1, 0, 3, 1]) would be equivalent to mean([1, 3, 3, 3, 4]).
- Weights in SAS are implicitly normalised to sum to 1, but some functions allow you to normalise to sum to the number of data points, because it sometimes makes a difference.
- It isn't clear to me what the physical meaning of weights in SAS actually is. The documentation is unclear; it could be as simple as the definition of weighted mean here: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition but how that extends to more complex SAS functions is unclear to me.

(And for what it's worth, I don't think SAS's MEAN function supports weights at all. Any SAS users here that could comment?)
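The FREQ semantics summarised above can be sketched in a few lines, assuming the description in this message is accurate: frequencies are truncated to integers, and zero or negative frequencies drop the data point. The function name `mean_with_freq` and the `freq` parameter are hypothetical and only mirror the pseudo-call in the message; they are not part of the statistics module.

```python
import math
import statistics


def mean_with_freq(data, freq):
    """Mean of data where each point is repeated trunc(freq) times."""
    expanded = []
    for x, f in zip(data, freq):
        count = math.trunc(f)  # e.g. 2.9 -> 2, as described for SAS FREQ
        if count > 0:          # zero or negative: the point is ignored
            expanded.extend([x] * count)
    # statistics.mean raises StatisticsError if every point was ignored.
    return statistics.mean(expanded)


# Equivalent to repeating the data points, as in the message's example.
assert mean_with_freq([1, 2, 3, 4], freq=[1, 0, 3, 1]) == statistics.mean([1, 3, 3, 3, 4])
```

This models only FREQ; the WEIGHT statement's normalisation behaviour is not reproduced here, since the message itself notes that its meaning is unclear.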

History
Date	User	Action	Args
2022-04-11 14:57:57	admin	set	github: 64678
2019-01-22 01:12:21	steven.daprano	set	messages: + msg334197
2019-01-20 21:17:11	oscarbenjamin	set	messages: + msg334102
2019-01-20 21:11:48	oscarbenjamin	set	messages: + msg334101
2019-01-19 06:30:43	steven.daprano	set	messages: + msg334039
2019-01-19 06:04:40	rhettinger	set	messages: + msg334037
2019-01-18 15:27:31	remi.lapeyre	set	versions: + Python 3.8, - Python 3.7
2019-01-18 15:27:05	remi.lapeyre	set	nosy: + remi.lapeyre messages: + msg333978
2017-10-29 02:23:26	ncoghlan	set	messages: + msg305171
2017-10-28 20:01:32	rhettinger	set	nosy: + rhettinger messages: + msg305166 versions: + Python 3.7, - Python 3.5
2017-10-28 13:30:56	serhiy.storchaka	set	nosy: - serhiy.storchaka
2014-02-03 10:33:54	oscarbenjamin	set	messages: + msg210108
2014-02-03 10:09:31	wolma	set	messages: + msg210107
2014-02-02 22:27:09	wolma	set	messages: + msg210038
2014-02-02 13:55:05	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg209985
2014-02-02 12:12:32	oscarbenjamin	set	messages: + msg209975
2014-02-02 11:55:18	steven.daprano	set	assignee: steven.daprano messages: + msg209973
2014-02-02 01:22:52	ncoghlan	set	dependencies: + Avoid inadvertently special casing Counter in statistics module versions: + Python 3.5
2014-02-02 01:21:11	ncoghlan	create