classification
Title: Efficiently support weight/frequency mappings in the statistics module
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: 20478 Superseder:
Assigned To: steven.daprano Nosy List: gregory.p.smith, ncoghlan, oscarbenjamin, remi.lapeyre, rhettinger, steven.daprano, wolma
Priority: normal Keywords:

Created on 2014-02-02 01:21 by ncoghlan, last changed 2019-01-22 01:12 by steven.daprano.

Messages (15)
msg209931 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-02 01:21
Issue 20478 suggests ensuring that even weight/frequency mappings like collections.Counter are consistently handled as iterables in the current statistics module API.

However, it likely makes sense to provide public APIs that support efficiently working with such weight/frequency mappings directly, rather than requiring that they be expanded to a full iterable all the time.

One possibility would be to provide parallel APIs with the _map suffix, similar to the format() vs format_map() distinction in the string formatting APIs.
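
For illustration, a parallel `_map` function might look roughly like the sketch below. The name `mean_map` and its behaviour are assumptions made for this example, not an existing statistics API, and frequencies are assumed to be non-negative integers.

```python
from collections import Counter
from fractions import Fraction


def mean_map(freqs):
    """Hypothetical: mean of a {value: frequency} mapping without expanding it."""
    total = Fraction(0)
    count = 0
    for value, freq in freqs.items():
        total += Fraction(value) * freq   # each value contributes freq times
        count += freq
    if count == 0:
        raise ValueError("mean_map requires at least one data point")
    return total / count


rolls = Counter({1: 2, 2: 1, 6: 1})   # same data as [1, 1, 2, 6]
print(mean_map(rolls))                # 5/2
```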
msg209973 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-02-02 11:55
Off the top of my head, I can think of three APIs:

(1) separate functions, as Nick suggests:
mean vs weighted_mean, stdev vs weighted_stdev

(2) treat mappings as implied (value, frequency) pairs

(3) take an additional argument to switch between unweighted and weighted modes.

I dislike #3, but will consider the others for 3.5.
msg209975 - (view) Author: Oscar Benjamin (oscarbenjamin) * Date: 2014-02-02 12:12
On 2 February 2014 11:55, Steven D'Aprano <report@bugs.python.org> wrote:
>
> (1) separate functions, as Nick suggests:
> mean vs weighted_mean, stdev vs weighted_stdev

This would be my preferred approach. It makes it very clear which
functions are available for working with map style data. It will be
clear from both the module documentation and a casual introspection of
the module that those APIs are present for those who might want them.
Also apart from mode() the implementation of each function on
map-format data will be completely different from the iterable version
so you'd want to have it as a separate function at least internally
anyway.
msg209985 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-02 13:55
See also issue18844.
msg210038 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-02-02 22:27
> -----Original Message-----
> From: Steven D'Aprano [mailto:report@bugs.python.org]
> Sent: Sunday, 2 February 2014 12:55
> To: wolfgang.maier@biologie.uni-freiburg.de
> Subject: [issue20479] Efficiently support weight/frequency mappings in the
> statistics module
> 
> 
> Steven D'Aprano added the comment:
> 
> Off the top of my head, I can think of three APIs:
> 
> (1) separate functions, as Nick suggests:
> mean vs weighted_mean, stdev vs weighted_stdev
> 
> (2) treat mappings as implied (value, frequency) pairs
> 

(2) is clearly my favourite. (1) may work well when only a small fraction of a module's functions need an alternate API.
In the statistics module, however, almost all of its current functions could profit from having a way to treat mappings specially.
In such a case, (1) is prone to create lots of redundancies.

I do not share Oscar's opinion that

> apart from mode() the implementation of each function on
> map-format data will be completely different from the iterable version
> so you'd want to have it as a separate function at least internally
> anyway.

Consider _sum's current code (docstring omitted for brevity):
def _sum(data, start=0):
    n, d = _exact_ratio(start)
    T = type(start)
    partials = {d: n}  # map {denominator: sum of numerators}
    # Micro-optimizations.
    coerce_types = _coerce_types
    exact_ratio = _exact_ratio
    partials_get = partials.get
    # Add numerators for each denominator, and track the "current" type.
    for x in data:
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total)

All you'd have to do to treat mappings as proposed here is add a check for whether we are dealing with a mapping and then, in that case, instead of the for loop:

    for x in data:
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n

use this:

    for x,m in data.items():
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n*m

and no other changes (though I haven't tested this carefully).
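
A self-contained sketch of the same idea, using Fraction arithmetic in place of the module's private `_sum` helpers. The name `freq_sum` is hypothetical, and integer frequencies are assumed.

```python
from collections import Counter
from collections.abc import Mapping
from fractions import Fraction


def freq_sum(data):
    """Hypothetical: exact sum that treats a mapping as {value: frequency}."""
    if isinstance(data, Mapping):
        pairs = data.items()
    else:
        pairs = ((x, 1) for x in data)   # plain iterable: each value counts once
    total = Fraction(0)
    for x, m in pairs:
        total += Fraction(x) * m         # multiply by the frequency instead of repeating x
    return total


# The mapping form gives the same result as the expanded iterable.
assert freq_sum([0.5, 0.5, 2]) == freq_sum(Counter({0.5: 2, 2: 1})) == 3
```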

Wolfgang
msg210107 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-02-03 10:09
Well, I was thinking about frequencies (ints) when suggesting

    for x,m in data.items():
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n*m

in my previous message. To support weights (float or Rational) this would have to be more sophisticated.

Wolfgang
msg210108 - (view) Author: Oscar Benjamin (oscarbenjamin) * Date: 2014-02-03 10:33
> in my previous message. To support weights (float or Rational) this would have to be more sophisticated.

I guess you'd do:

     for x,w in data.items():
         T = _coerce_types(T, type(x))
         xn, xd = exact_ratio(x)
         wn, wd = exact_ratio(w)
         partials[xd * wd] = partials_get(xd * wd, 0) + xn * wn

Variance is only slightly trickier. Median would be more complicated.
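
As a rough, self-contained sketch of the weighted case, the following computes an exact weighted mean from (value, weight) pairs with Fraction arithmetic rather than the private `_exact_ratio` machinery. The name `exact_weighted_mean` is an assumption for this example, and non-finite values are not handled.

```python
from fractions import Fraction


def exact_weighted_mean(pairs):
    """Hypothetical: exact weighted mean of (value, weight) pairs."""
    num = Fraction(0)   # running sum of w * x
    den = Fraction(0)   # running sum of w
    for x, w in pairs:
        w = Fraction(w)
        num += Fraction(x) * w
        den += w
    if den == 0:
        raise ValueError("weights must not sum to zero")
    return num / den


data = {1: 0.5, 2: 0.25, 4: 0.25}
print(float(exact_weighted_mean(data.items())))   # 2.0
```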

I just think that I prefer to know when I look at code that something is being
treated as a mapping or as an iterable. So when I look at

    d = f(x, y, z)
    v = variance_map(d)

It's immediately obvious what d is and how the function variance_map is using
it.

As well as the benefit of readability there's also the fact that accepting
different kinds of input puts strain on any attempt to modify your code in the
future. Auditing the code requires understanding at all times that the name
"data" is bound to a quantum superposition of different types of object.

Either every function would have to have the same "iterable or mapping"
interface or there would have to be some other convention for making it clear
which ones do. Perhaps the functions that don't make sense for a mapping could
explicitly reject them rather than treating them as an iterable.

I just think it's simpler to have a different function name for each type of
input. Then it's clear what functions are available for working with mappings.

If you were going for something completely different then you could have an
object-oriented interface where there are classes for the different types of
data and methods that do the right thing in each case.

Then you would do

    v = WeightedData(d).variance()

The ordinary variance() function could just become a shortcut for

    def variance(data):
        return SequenceData(data).variance()
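
A minimal sketch of what such an object-oriented interface could look like. The class names follow the ones used above, but the bodies are assumptions, and `mean()` stands in for `variance()` to keep the example short.

```python
import statistics
from fractions import Fraction


class SequenceData:
    """Hypothetical wrapper for a plain iterable of values."""

    def __init__(self, data):
        self.data = list(data)

    def mean(self):
        return statistics.mean(self.data)


class WeightedData:
    """Hypothetical wrapper for a {value: weight} mapping."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def mean(self):
        total = sum(Fraction(w) for w in self.mapping.values())
        weighted = sum(Fraction(x) * Fraction(w) for x, w in self.mapping.items())
        return weighted / total


print(SequenceData([1, 2, 2, 3]).mean())          # 2
print(WeightedData({1: 1, 2: 2, 3: 1}).mean())    # 2
```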
msg305166 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-10-28 20:01
My recommendation is to have *weights* as an optional argument:

    statistics.mean(values, weights=None)

While it is tempting to special case dicts and counters, I got feedback from Jake Vanderplas and Wes McKinney that in practice it is more common to have the weights as a separate list/array/vector.

That API has other advantages as well.  For starters, it is a simple extension of the existing API, so it isn't a disruptive change.  Also, it works well with mapping views: 
   
   statistics.mean(vehicle_sales.keys(), vehicle_sales.values())

And the API also helps support use cases where different weightings are being explored for the same population:

   statistics.mean(salary, years_of_service)
   statistics.mean(salary, education)
   statistics.mean(salary, age)
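
A minimal sketch of how such an optional argument could behave, written as a standalone wrapper because `statistics.mean` itself takes no `weights` argument. The wrapper, its error handling, and the sample figures are assumptions for illustration only.

```python
import statistics
from fractions import Fraction


def mean(values, weights=None):
    """Hypothetical wrapper: statistics.mean with an optional weights argument."""
    if weights is None:
        return statistics.mean(values)   # unweighted path is unchanged
    values, weights = list(values), list(weights)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(Fraction(w) for w in weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(Fraction(x) * Fraction(w) for x, w in zip(values, weights))
    return float(weighted / total)


salary = [30_000, 50_000, 100_000]
years_of_service = [1, 3, 6]
print(mean(salary))                      # 60000
print(mean(salary, years_of_service))    # 78000.0
```

Later Python releases (3.11 onwards) added a `weights` argument to `statistics.fmean`, which has the same separate-sequence shape.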
msg305171 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-10-29 02:23
Thinking back to my signal processing days, I have to agree that our weightings (filter definitions) were usually separate from our data (live signals). Similarly, systems engineering trade studies all maintained feature weights separately from the assessments of the individual options.

The comment from my original RFE about avoiding expanding value -> weight/frequency mappings "to a full iterable all the time" doesn't actually make much sense in 3.x, since m.keys() and m.values() are now typically able to avoid data copying.

So +1 from me for the separate "weights" parameter, with the m.keys()/m.values() idiom used to handle mappings like Counter.

As another point in favour of that approach, it's trivial to build zero-copy weighted variants on top of it for mappings with cheap key and value views:

    def weighted_mean(mapping):
        return statistics.mean(mapping.keys(), mapping.values())
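
A runnable version of that layering, assuming a hypothetical two-argument primitive `weighted_mean(values, weights)` in place of the proposed `statistics.mean` signature:

```python
from collections import Counter


def weighted_mean(values, weights):
    """Hypothetical primitive taking separate values and weights iterables."""
    num = den = 0
    for x, w in zip(values, weights):
        num += x * w
        den += w
    return num / den


def weighted_mean_map(mapping):
    # keys() and values() are views that iterate in matching order,
    # so the mapping is never expanded or copied.
    return weighted_mean(mapping.keys(), mapping.values())


vehicle_sales = Counter({2: 3, 4: 1})
print(weighted_mean_map(vehicle_sales))   # 2.5
```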

By contrast, if the lowest level primitive provided is a mapping based API, then when you do have separate values-and-weights iterables, you're going to have a much harder time avoiding building a completely new container.
msg333978 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2019-01-18 15:27
Is this proposal still relevant? If so, I would like to work on its implementation.

I think the third proposition, changing the API to add a new `weights` parameter, is the best, as it does not blindly assume that a tuple is a (value, weight) pair, which may not always be true.
msg334037 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-01-19 06:04
> Is this proposal still relevant? If so, I would 
> like to work on its implementation.

The first question is the important one.  Writing implementations is usually the easy part.  Deciding on whether there is a real need and creating a usable, extendable, and clean API is often the hard part.  So please don't rush to a PR; that just complicates the important part.
msg334039 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-01-19 06:30
> Is this proposal still relevant?

Yes.

As Raymond says, deciding on a good API is the hard part. It's relatively 
simple to change a poor implementation for a better one, but backwards 
compatibility means that changing the API is very difficult.

I would find it very helpful if somebody has time to do a survey of 
other statistics libraries or languages (e.g. numpy, R, Octave, Matlab, 
SAS etc) and see how they handle data with weights.

- what APIs do they provide?
- do they require weights to be positive integers, or do they 
  support arbitrary float weights?
- including negative weights? 
  (what physical meaning does a negative weight have?)

At the moment, a simple helper function seems to do the trick for 
non-negative integer weights:

def flatten(items):
    for item in items:
        yield from item

py> data = [1, 2, 3, 4]
py> weights = [1, 4, 1, 2]
py> statistics.mean(flatten([x]*w for x, w in zip(data, weights)))
2.5

In principle, the implementation could be as simple as a single 
recursive call:

def mean(data, weights=None):
    if weights is not None:
        return mean(flatten([x]*w for x, w in zip(data, weights)))
    # base case without weights is unchanged

or perhaps it could be just a recipe in the docs.
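
As a possible form for such a recipe, the same expansion can be written lazily with itertools. This is only a sketch; the helper name `expand` is an assumption, and frequencies are assumed to be non-negative integers.

```python
import statistics
from itertools import chain, repeat


def expand(data, freqs):
    """Lazily repeat each value according to its integer frequency."""
    return chain.from_iterable(repeat(x, f) for x, f in zip(data, freqs))


data = [1, 2, 3, 4]
weights = [1, 4, 1, 2]
print(statistics.mean(expand(data, weights)))   # 2.5
```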
msg334101 - (view) Author: Oscar Benjamin (oscarbenjamin) * Date: 2019-01-20 21:11
> I would find it very helpful if somebody has time to do a survey of
> other statistics libraries or languages (e.g. numpy, R, Octave, Matlab,
> SAS etc) and see how they handle data with weights.

Numpy has only sporadic support for this. The standard mean function
does not have any way to provide weights but there is an alternative
called average that computes the mean and has an optional weights
argument. I've never heard of average before searching for "numpy
weighted mean" just now. Numpy's API often has bits of old cruft from
where various numerical packages were joined together so I'm not sure
they would recommend their current approach. I don't think there are
any other numpy functions for providing weighted statistics.

Statsmodels does provide an API for this as explained here:
https://stackoverflow.com/a/36464881/9450991
Their API is that you create an object with data and weights and can
then call methods/attributes for statistics.

Matlab doesn't support even weighted mean as far as I can tell. There
is wmean on the matlab file exchange:

msg334102 - (view) Author: Oscar Benjamin (oscarbenjamin) * Date: 2019-01-20 21:17
Sorry, sent too soon...

> Matlab doesn't support even weighted mean as far as I can tell. There
> is wmean on the matlab file exchange:
https://stackoverflow.com/a/36464881/9450991

This is a separate function `wmean(data, weights)`. It has to be a
separate function though because it's third party code so the author
couldn't change the main mean function.

R ships with a weighted.mean function but I think for standard
deviation you need third party libs.

A quick survey, but the main impression I get is that providing an API for
this is not that common. The only good-looking API is the statsmodels
one.
msg334197 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-01-22 01:12
Here is some further information on weights in statistics in general, 
and SAS and Stata specifically:

https://blogs.sas.com/content/iml/2017/10/02/weight-variables-in-statistics-sas.html

Quote:

    use the FREQ statement to specify integer frequencies for 
    repeated observations. Use the WEIGHT statement when you 
    want to decrease the influence that certain observations 
    have on the parameter estimates. 

http://support.sas.com/kb/22/600.html

https://www.stata.com/manuals13/u20.pdf#u20.23

Executive summary:

- Stata defines four different kinds of weights;

- SAS defines two, WEIGHT and FREQ (frequency);

- SAS truncates FREQ values to integers, with zero or 
  negative meaning that the data point is to be ignored;

- Using FREQ is equivalent to repeating the data points.
  In Python terms:

  mean([1, 2, 3, 4], freq=[1, 0, 3, 1])

  would be equivalent to mean([1, 3, 3, 3, 4]).

- Weights in SAS are implicitly normalised to sum to 1, 
  but some functions allow you to normalise to sum to the
  number of data points, because it sometimes makes a 
  difference.

- It isn't clear to me what the physical meaning of weights
  in SAS actually is. The documentation is unclear; it *could*
  be as simple as the definition of weighted mean here:

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition

but how that extends to more complex SAS functions is unclear to me.
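
To make the FREQ semantics summarised above concrete, here is a small sketch in Python. The function name `mean_with_freq` and the `freq` keyword are hypothetical; the truncation and the dropping of zero or negative frequencies follow the summary above, not any existing statistics API.

```python
import math
import statistics


def mean_with_freq(data, freq):
    """Hypothetical FREQ-style mean: truncate frequencies, drop points with freq <= 0."""
    expanded = []
    for x, f in zip(data, freq):
        f = math.trunc(f)           # FREQ values are truncated to integers
        if f > 0:                   # zero or negative: ignore the data point
            expanded.extend([x] * f)
    return statistics.mean(expanded)


print(mean_with_freq([1, 2, 3, 4], freq=[1, 0, 3, 1]))   # 2.8
print(statistics.mean([1, 3, 3, 3, 4]))                  # 2.8
```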

(And for what it's worth, I don't think SAS's MEAN function supports 
weights at all. Any SAS users here that could comment?)
History
Date User Action Args
2019-01-22 01:12:21 steven.daprano set messages: + msg334197
2019-01-20 21:17:11 oscarbenjamin set messages: + msg334102
2019-01-20 21:11:48 oscarbenjamin set messages: + msg334101
2019-01-19 06:30:43 steven.daprano set messages: + msg334039
2019-01-19 06:04:40 rhettinger set messages: + msg334037
2019-01-18 15:27:31 remi.lapeyre set versions: + Python 3.8, - Python 3.7
2019-01-18 15:27:05 remi.lapeyre set nosy: + remi.lapeyre; messages: + msg333978
2017-10-29 02:23:26 ncoghlan set messages: + msg305171
2017-10-28 20:01:32 rhettinger set nosy: + rhettinger; messages: + msg305166; versions: + Python 3.7, - Python 3.5
2017-10-28 13:30:56 serhiy.storchaka set nosy: - serhiy.storchaka
2014-02-03 10:33:54 oscarbenjamin set messages: + msg210108
2014-02-03 10:09:31 wolma set messages: + msg210107
2014-02-02 22:27:09 wolma set messages: + msg210038
2014-02-02 13:55:05 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg209985
2014-02-02 12:12:32 oscarbenjamin set messages: + msg209975
2014-02-02 11:55:18 steven.daprano set assignee: steven.daprano; messages: + msg209973
2014-02-02 01:22:52 ncoghlan set dependencies: + Avoid inadvertently special casing Counter in statistics module; versions: + Python 3.5
2014-02-02 01:21:11 ncoghlan create