classification
Title: Consider adding a normalize() method to collections.Counter()
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.7

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: David Mertz, josh.r, mark.dickinson, pitrou, rhettinger, steven.daprano, veky, wolma
Priority: low
Keywords:

Created on 2015-10-26 02:24 by rhettinger, last changed 2017-03-15 17:25 by steven.daprano.

Messages (10)
msg253452 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-10-26 02:24
Allen Downey suggested this at PyCon in Montreal and said it would be useful in his Bayesian statistics courses.  Separately, Peter Norvig created a normalize() function in his probability tutorial at In[45] in http://nbviewer.ipython.org/url/norvig.com/ipython/Probability.ipynb .

I'm creating this tracker item to record thoughts about the idea.  Right now, it isn't clear whether Counter is the right place to support this operation, how it should be designed, whether it should be an in-place operation or one that creates a new counter, whether it should round so that the result sums to exactly 1.0, and whether it should use math.fsum() for float inputs.

Should it support other target totals besides 1.0?

  >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage
  Counter(red=55, green=25, blue=20)

Also would it make sense to support something like this?

  sampled_gender_dist = Counter(male=405, female=421)
  world_gender_dist = Counter(male=0.51, female=0.50)
  cs = world_gender_dist.chi_squared(observed=sampled_gender_dist)

Would it be better to just have a general multiply-by-scalar operation for scaling?

  c = Counter(observations)
  c.scale_by(1.0 / sum(c.values()))

Perhaps use an operator?

  c /= sum(c.values())
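
For concreteness, here is a minimal sketch of the non-mutating variant, assuming a target-total parameter (normalize() does not exist on Counter; the name and signature here are hypothetical):

    from collections import Counter

    def normalize(counter, target=1.0):
        # Return a NEW Counter scaled so its values sum to `target`;
        # the input counter is left untouched.
        total = sum(counter.values())
        return Counter({key: value * target / total
                        for key, value in counter.items()})

With target=100 this reproduces the percentage example above, except that the results come out as floats (55.0, 25.0, 20.0) rather than ints.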
msg253512 - (view) Author: Josh Rosenberg (josh.r) * Date: 2015-10-27 02:39
Counter is documented as being primarily intended for integer counts. While you can use them with floats, I'm not sure they're the right data type for this use case. Having some methods that make sense only with floats, and others (like elements()) that make sense only with integers, is just confusing.
msg276491 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-09-14 21:41
The pitfall I imagine here is that if you continue adding elements after normalize() is called, the results will be nonsensical.
msg277008 - (view) Author: Vedran Čačić (veky) * Date: 2016-09-20 05:23
Operator seems OK. After all, we can currently do c+c, which is kinda like c*2 (sequences generally behave this way, and it is a usual convention in mathematics too). And division by a number is just multiplication by its reciprocal. But a dedicated normalize method? No. As Josh said, that would be forking the API.

The correct way is probably to have a "normalized view" of a Counter. But I don't know the best way to calculate it fast. I mean, I know it mathematically (cache the sum of values and update it on every Counter update) but I don't know whether it's Pythonic enough.
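
A rough sketch of the cached-sum idea, as a subclass rather than a true view (hypothetical code; note that Counter.update() has fast paths that bypass __setitem__, so a complete implementation would need to override update() and subtract() as well):

    from collections import Counter

    class TotalCounter(Counter):
        # Hypothetical subclass: keep a running total so relative
        # frequencies can be computed without re-summing every value.
        def __init__(self, *args, **kwargs):
            self._total = 0                   # must exist before update() runs
            super().__init__(*args, **kwargs)
            self._total = sum(self.values())  # resync: some init paths bypass __setitem__

        def __setitem__(self, key, value):
            self._total += value - self.get(key, 0)
            super().__setitem__(key, value)

        def __delitem__(self, key):
            self._total -= self.get(key, 0)
            super().__delitem__(key)

        def normalized(self):
            # A plain-dict snapshot of the relative frequencies.
            return {key: value / self._total for key, value in self.items()}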
msg289588 - (view) Author: Wolfgang Maier (wolma) * Date: 2017-03-14 14:17
>   >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage
>  Counter(red=55, green=25, blue=20)

I like this example, where the normalize method of a Counter returns a new Counter, but I think the new Counter should only ever have integer counts. More specifically, it should be the closest approximation of the original Counter achievable with integers summing to the method's argument or, statistically speaking, it should represent the expected number of observations of each outcome for the given sample size.
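
One concrete reading of "closest approximation" is largest-remainder rounding: floor each scaled count, then hand the leftover units to the entries with the largest fractional parts (the specific rounding rule is an assumption here; the proposal above does not pin one down):

    import math
    from collections import Counter

    def normalize_int(counter, target):
        # Largest-remainder rounding: integer results that sum
        # exactly to `target` (a hypothetical helper).
        total = sum(counter.values())
        scaled = {key: value * target / total for key, value in counter.items()}
        result = {key: math.floor(s) for key, s in scaled.items()}
        leftover = target - sum(result.values())
        by_remainder = sorted(scaled, key=lambda k: scaled[k] - result[k], reverse=True)
        for key in by_remainder[:leftover]:
            result[key] += 1
        return Counter(result)

For Counter(a=5, b=5) and target 5, this has to break a tie arbitrarily, yielding counts of 3 and 2, which is exactly the inconsistency raised in the next message.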
msg289596 - (view) Author: Vedran Čačić (veky) * Date: 2017-03-14 15:10
That seems horribly arbitrary to me, not to mention inviting another intdiv fiasco (from sanity import division :). If Counter had been committed to working only with integer values from the start, it might be acceptable, but since the Counter implementation has always been careful not to preclude non-integer values, it wouldn't make sense.

Also, there is an interesting inconsistency then, in the form of

    c = Counter(a=5, b=5).normalize(5)

Presumably c['a'] and c['b'] would be equal integers, and their sum equal to 5. That is unfortunately not possible. :-o
msg289600 - (view) Author: David Mertz (David Mertz) Date: 2017-03-14 16:17
I definitely wouldn't want a mutator that "normalized" counts, for the reason Antoine mentions.  It would be a common error to normalize and then continue counting, now meaninglessly.

One could write a `Frequency` subclass easily enough.  The essential feature in my mind would be to keep an attribute `Counter.total` around to perform the normalization.  I'm +1 on adding that to `collections.Counter` itself.

I'm not sure if this would be better as an attribute kept directly or as a property that called `sum(self.values())` when accessed.  I believe that having `mycounter.total` would provide the right normalization in a clean API, and also expose easy access to other questions one would naturally ask (e.g. "How many observations were made?")
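
A property-based version of that suggestion might look like this (`Frequency` and `total` are names from this discussion, not existing API):

    from collections import Counter

    class Frequency(Counter):
        @property
        def total(self):
            # Recomputed on each access, so it can never go stale.
            return sum(self.values())

        def frequencies(self):
            total = self.total
            return {key: value / total for key, value in self.items()}

The property form pays O(n) per access but never has to be kept in sync with mutations, which sidesteps the bookkeeping concern from msg277008.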
msg289637 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-15 03:43
The idea is that the method would return a new counter instance and leave the existing instance untouched.
msg289640 - (view) Author: David Mertz (David Mertz) Date: 2017-03-15 04:39
Raymond wrote:
> The idea is that the method would return a new counter instance
> and leave the existing instance untouched.

Your own first example suggested:

    c /= sum(c.values())

That suggests an in-place modification.  But even if it instead creates a new object, it makes little difference to the end user who has rebound the name `c`.

Likewise, I think users would be somewhat tempted by:

    c = c.scale_by(1.0/c.total)  # My property/attribute suggestion

This would present the same attractive nuisance.  If the interface were the slightly less friendly:

    freqs = {k: v / c.total for k, v in c.items()}

I think there would be far less temptation to rebind the same name unintentionally.
msg289683 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-03-15 17:25
It seems to me that the basic Counter class should be left as-is, and if there are specialized methods used for statistics (such as normalize), they should go into a subclass in the statistics module.

The statistics module already uses Counter internally to calculate the mode.

It makes some sense to me for statistics to have a FrequencyTable (and CumulativeFrequencyTable?) class built on top of Counter. I don't think it makes sense to overload the collections.Counter type with these sorts of specialised methods.
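
A minimal sketch of that layering (FrequencyTable is a hypothetical name; no such class exists in the statistics module):

    from collections import Counter

    class FrequencyTable(Counter):
        # Hypothetical statistics-module class: Counter plus
        # frequency-oriented methods, leaving Counter itself alone.
        def normalize(self, target=1.0):
            total = sum(self.values())
            return FrequencyTable({key: value * target / total
                                   for key, value in self.items()})

        def cumulative(self):
            # Running totals in insertion order, the raw material
            # for a CumulativeFrequencyTable-style view.
            running, table = 0, {}
            for key, value in self.items():
                running += value
                table[key] = running
            return table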
History
Date                 User            Action  Args
2017-03-15 17:25:38  steven.daprano  set     nosy: + steven.daprano; messages: + msg289683
2017-03-15 04:39:11  David Mertz     set     messages: + msg289640
2017-03-15 03:43:57  rhettinger      set     messages: + msg289637
2017-03-14 16:17:46  David Mertz     set     nosy: + David Mertz; messages: + msg289600
2017-03-14 15:10:39  veky            set     messages: + msg289596
2017-03-14 14:17:33  wolma           set     nosy: + wolma; messages: + msg289588
2016-09-20 05:23:22  veky            set     nosy: + veky; messages: + msg277008
2016-09-14 21:41:00  pitrou          set     nosy: + pitrou; messages: + msg276491
2016-09-12 09:01:04  SilentGhost     set     nosy: - SilentGhost
2016-09-12 09:00:49  SilentGhost     set     messages: - msg275998
2016-09-12 08:36:08  SilentGhost     set     nosy: + SilentGhost; messages: + msg275998
2016-09-12 07:29:02  rhettinger      set     versions: + Python 3.7, - Python 3.6
2015-10-27 02:39:28  josh.r          set     nosy: + josh.r; messages: + msg253512
2015-10-26 02:29:05  rhettinger      set     nosy: + mark.dickinson
2015-10-26 02:24:40  rhettinger      create