Consider adding a normalize() method to collections.Counter() #69664

Closed
rhettinger opened this issue Oct 26, 2015 · 22 comments
Assignees: rhettinger
Labels: 3.10 (only security fixes), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@rhettinger
Contributor

BPO 25478
Nosy @rhettinger, @mdickinson, @pitrou, @stevendaprano, @serhiy-storchaka, @wm75, @MojoVampire, @vedgar, @DavidMertz, @AllenDowney
PRs
  • bpo-25478: Add scalar multiplication and division to Counter #6574
  • bpo-25478: Add total() method to collections.Counter #25829
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2021-12-31.20:57:57.439>
    created_at = <Date 2015-10-26.02:24:39.945>
    labels = ['type-feature', 'library', '3.10']
    title = 'Consider adding a normalize() method to collections.Counter()'
    updated_at = <Date 2021-12-31.20:57:57.438>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2021-12-31.20:57:57.438>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2021-12-31.20:57:57.439>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2015-10-26.02:24:39.945>
    creator = 'rhettinger'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 25478
    keywords = ['patch']
    message_count = 22.0
    messages = ['253452', '253512', '276491', '277008', '289588', '289596', '289600', '289637', '289640', '289683', '315665', '316978', '316980', '317005', '317007', '317008', '317016', '320359', '383447', '383450', '392758', '409435']
    nosy_count = 11.0
    nosy_names = ['rhettinger', 'mark.dickinson', 'pitrou', 'steven.daprano', 'serhiy.storchaka', 'wolma', 'josh.r', 'veky', 'Allen Downey', 'DavidMertz', 'AllenDowney']
    pr_nums = ['6574', '25829']
    priority = 'low'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue25478'
    versions = ['Python 3.10']

    @rhettinger
    Contributor Author

    Allen Downey suggested this at PyCon in Montreal and said it would be useful in his Bayesian statistics courses. Separately, Peter Norvig created a normalize() function in his probability tutorial (cell In[45]) at http://nbviewer.ipython.org/url/norvig.com/ipython/Probability.ipynb .

    I'm creating this tracker item to record thoughts about the idea. Right now, it isn't clear whether Counter is the right place to support this operation, how it should be designed, whether to use an in-place operation or one that creates a new counter, whether it should round so the result sums to exactly 1.0, or whether it should use math.fsum() for float inputs.

    Should it support other target totals besides 1.0?

      >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage
      Counter(red=55, green=25, blue=20)

    Also would it make sense to support something like this?

      sampled_gender_dist = Counter(male=405, female=421)
      world_gender_dist = Counter(male=0.51, female=0.50)
      cs = world_gender_dist.chi_squared(observed=sampled_gender_dist)

    Would it be better to just have a general multiply-by-scalar operation for scaling?

      c = Counter(observations)
      c.scale_by(1.0 / sum(c.values()))

    Perhaps use an operator?

    c /= sum(c.values())
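
    A rough sketch of what such operator support could look like, written here as a hypothetical subclass rather than a change to Counter itself (PR #6574 later explored adding scalar multiplication and division directly to Counter, but those operators were never added):

        from collections import Counter

        class ScalableCounter(Counter):
            """Hypothetical Counter subclass supporting scalar multiplication and division."""

            def __mul__(self, factor):
                # New counter with every count multiplied by the scalar factor.
                return type(self)({elem: count * factor for elem, count in self.items()})

            __rmul__ = __mul__

            def __truediv__(self, divisor):
                # New counter with every count divided by the scalar divisor.
                return type(self)({elem: count / divisor for elem, count in self.items()})

        # c = ScalableCounter(red=11, green=5, blue=4)
        # c / sum(c.values())   -> frequencies summing to 1.0
        # c * 5                 -> ScalableCounter({'red': 55, 'green': 25, 'blue': 20})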

    @rhettinger rhettinger self-assigned this Oct 26, 2015
    @rhettinger rhettinger added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Oct 26, 2015
    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Oct 27, 2015

    Counter is documented as being primarily intended for integer counts. While you can use them with floats, I'm not sure they're the right data type for this use case. Having some methods that only make sense with floats, and others (like elements()) that only make sense with integers, is just confusing.

    @rhettinger rhettinger added the 3.7 (EOL) end of life label Sep 12, 2016
    @pitrou
    Member

    pitrou commented Sep 14, 2016

    The pitfall I imagine here is that if you continue adding elements after normalize() is called, the results will be nonsensical.

    @vedgar
    Mannequin

    vedgar mannequin commented Sep 20, 2016

    An operator seems OK. After all, we can currently do c + c, which is kinda like c * 2 (sequences behave this way generally, and it is a common convention in mathematics too). And division by a number is just multiplication by its reciprocal. But a dedicated normalize method? No. As Josh said, then you're forking the API.

    The correct way is probably to have a "normalized view" of a Counter. But I don't know the best way to calculate it fast. I mean, I know it mathematically (cache the sum of values and update it on every Counter update) but I don't know whether it's Pythonic enough.

    @wm75
    Mannequin

    wm75 mannequin commented Mar 14, 2017

    >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage
    Counter(red=55, green=25, blue=20)

    I like this example, where the normalize method of a Counter returns a new Counter, but I think the new Counter should always have only integer counts. More specifically, it should be the closest approximation of the original Counter that is possible with integers adding up to the method's argument; statistically speaking, it should represent the expected number of observations of each outcome for a given sample size.
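
    A minimal sketch of that idea, assuming non-negative counts and using the largest-remainder method to keep the results integral (normalized_to_int is an illustrative helper, not part of any proposal in this thread):

        from collections import Counter

        def normalized_to_int(counter, target_total):
            """Rescale non-negative counts to integers that sum exactly to target_total."""
            total = sum(counter.values())
            scaled = {elem: count * target_total / total for elem, count in counter.items()}
            result = {elem: int(value) for elem, value in scaled.items()}  # floor each scaled count
            # Hand the remaining units to the entries with the largest fractional parts.
            leftover = target_total - sum(result.values())
            for elem in sorted(scaled, key=lambda e: scaled[e] - result[e], reverse=True)[:leftover]:
                result[elem] += 1
            return Counter(result)

        # normalized_to_int(Counter(red=11, green=5, blue=4), 100)
        # -> Counter({'red': 55, 'green': 25, 'blue': 20})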

    @vedgar
    Mannequin

    vedgar mannequin commented Mar 14, 2017

    That seems horribly arbitrary to me, not to mention inviting another intdiv fiasco (from sanity import division :). If Counter had been committed to working only with integer values from the start, it might be acceptable, but since the Counter implementation was always careful not to preclude non-int values, it wouldn't make sense.

    Also, there is an interesting inconsistency then, in the form of

        c = Counter(a=5,b=5).normalize(5)

    Presumably c['a'] and c['b'] would be equal integers, and their sum would equal 5. That is unfortunately not possible. :-o

    @DavidMertz
    Mannequin

    DavidMertz mannequin commented Mar 14, 2017

    I definitely wouldn't want a mutator that "normalized" counts, for the reason Antoine mentions. It would be a common error to normalize and then continue counting, which would be meaningless.

    One could write a Frequency subclass easily enough. The essential feature in my mind would be to keep an attribute Counter.total around to perform the normalization. I'm +1 on adding that to collections.Counter itself.

    I'm not sure if this would be better as an attribute kept directly or as a property that called sum(self.values()) when accessed. I believe that having mycounter.total would provide the right normalization in a clean API, and also expose easy access to other questions one would naturally ask (e.g. "How many observations were made?")
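
    A minimal sketch of that kind of subclass, using a property rather than a stored attribute (the Frequency name and the normalized() helper are illustrative only, not part of any accepted design):

        from collections import Counter

        class Frequency(Counter):
            """Counter with a convenience property for the total of its counts."""

            @property
            def total(self):
                # Recomputed on access; a cached variant is discussed later in this thread.
                return sum(self.values())

            def normalized(self):
                """Return a plain dict of frequencies summing to (approximately) 1.0."""
                total = self.total
                return {elem: count / total for elem, count in self.items()}

        # f = Frequency(heads=7, tails=3)
        # f.total          -> 10
        # f.normalized()   -> {'heads': 0.7, 'tails': 0.3}

    (Note that since Python 3.10, Counter itself provides a total() method, so a real subclass would want a non-conflicting name.)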

    @rhettinger
    Contributor Author

    The idea is that the method would return a new counter instance and leave the existing instance untouched.

    @DavidMertz
    Mannequin

    DavidMertz mannequin commented Mar 15, 2017

    Raymond wrote:

    The idea is that the method would return a new counter instance
    and leave the existing instance untouched.

    Your own first example suggested:

    c /= sum(c.values())
    

    That would suggest an in-place modification. But even if it instead creates a new object, that makes little difference to an end user who has rebound the name c.

    Likewise, I think users would be somewhat tempted by:

        c = c.scale_by(1.0/c.total)  # My property/attribute suggestion

    This would present the same attractive nuisance. If the interface was the slightly less friendly:

        freqs = {k:v/c.total for k, v in c.items()}

    I think there would be far less temptation to rebind the same name unintentionally.

    @stevendaprano
    Member

    It seems to me that the basic Counter class should be left as-is, and if there are specialized methods used for statistics (such as normalize) it should go into a subclass in the statistics module.

    The statistics module already uses Counter internally to calculate the mode.

    It makes some sense to me for statistics to have a FrequencyTable (and CumulativeFrequencyTable?) class built on top of Counter. I don't think it makes sense to overload the collections.Counter type with these sorts of specialised methods.

    @rhettinger rhettinger added 3.8 only security fixes and removed 3.7 (EOL) end of life labels Jan 29, 2018
    @serhiy-storchaka
    Member

    Was __rtruediv__ discussed before? What is the use case for it?

    Besides __rtruediv__, all LGTM.

    @AllenDowney
    Mannequin

    AllenDowney mannequin commented May 17, 2018

    I'd like to second Raymond's suggestion. With just a few additional methods, you could support a useful set of operations. One possible API:

    def scaled(self, factor):
        """Returns a new Counter with all values multiplied by factor."""

    def normalized(self, total=1):
        """Returns a new Counter with values normalized so their sum is total."""

    def total(self):
        """Returns the sum of the values in the Counter."""

    These operations would make it easier to use a Counter as a probability mass function (PMF) without subclassing.

    I understand two arguments against this proposal:

    1. If you modify the Counter after normalizing, the result is probably nonsense.

    That's true, but it is already the case that some Counter methods don't make sense for some use cases, depending on how you are using the Counter (as a bag, multiset, etc.).

    So the new features would come with caveats, but I don't think that's fatal.

    2. PMF operations are not general enough for core Python; they should be in a stats module.

    I think PMFs are used (or would be used) for lots of quick computations that don't require full-fledged stats.

    Also, stats libraries tend to focus on analytic distributions; they don't really provide this kind of light-weight empirical PMF.

    I think the proposed features have a high ratio of usefulness to implementation effort, without expanding the API unacceptably.

    Two thoughts for alternatives/extensions:

    1. It might be good to make scaled() available as __mul__, as Peter Norvig suggests.

    2. If the argument of scaled() is a mapping type, it might be good to support elementwise scaling (see the sketch after this comment). That would provide an elegant implementation of Raymond's chi-squared example and my inspection paradox example (http://greenteapress.com/thinkstats2/html/thinkstats2004.html#sec33)

    Thank you!
    Allen
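
    A rough sketch of the elementwise variant mentioned in point 2 above (the free-standing scaled() function is illustrative only, not an accepted API):

        from collections import Counter
        from numbers import Number

        def scaled(counter, factor):
            """Scale by a scalar, or elementwise by another mapping (missing keys scale to 0)."""
            if isinstance(factor, Number):
                return Counter({elem: count * factor for elem, count in counter.items()})
            return Counter({elem: count * factor.get(elem, 0) for elem, count in counter.items()})

        # scaled(Counter(red=11, green=5, blue=4), 5)
        # -> Counter({'red': 55, 'green': 25, 'blue': 20})
        # scaled(Counter(male=405, female=421), {'male': 0.5, 'female': 2.0})
        # -> Counter({'female': 842.0, 'male': 202.5})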

    @vedgar
    Mannequin

    vedgar mannequin commented May 17, 2018

    As I said above, if we're going to go down that route, it seems much more reasonable to me that total should be a cached property, that's updated on every Counter update (in __setitem__, adjusted by the difference between the new value and the old value for that key).

    And normalization should just provide a view over the Counter that passes the values through division by that cached total. The view should of course be immutable itself, but should reflect changes to the underlying counter, just as the existing views (e.g. dict_values) do.
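
    A rough sketch of that design, under the assumption that all mutation goes through __setitem__ (TotalledCounter and NormalizedView are illustrative names; methods such as clear() and pop() would also need overriding in a complete implementation):

        from collections import Counter
        from collections.abc import Mapping

        class TotalledCounter(Counter):
            """Counter that maintains a running total of its values."""

            def __init__(self, *args, **kwargs):
                self._total = 0
                super().__init__(*args, **kwargs)
                self._total = sum(self.values())  # recompute in case init bypassed __setitem__

            def __setitem__(self, key, value):
                # Adjust the running total by the difference between the new and old values.
                self._total += value - self.get(key, 0)
                super().__setitem__(key, value)

            def __delitem__(self, key):
                self._total -= self.get(key, 0)
                super().__delitem__(key)

            @property
            def total(self):
                return self._total

        class NormalizedView(Mapping):
            """Read-only mapping view: each value is the counter's value divided by its total."""

            def __init__(self, counter):
                self._counter = counter

            def __getitem__(self, key):
                return self._counter[key] / self._counter.total

            def __iter__(self):
                return iter(self._counter)

            def __len__(self):
                return len(self._counter)

    As the next comments point out, a running float total can drift away from sum(self.values()), which is one reason this kind of caching was never adopted.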

    @mdickinson
    Member

    total should be a cached property, that's updated on every Counter update

    That would run into difficulties for Counters with float values: e.g., after

    >>> c = Counter()
    >>> c['spam'] = 1e100
    >>> c['eggs'] = 1
    >>> c['spam'] = 0

    the cached total would likely be 0.0, because that's what the sum of the (new-old) values gives:

    >>> (1e100 - 0) + (1 - 0) + (0 - 1e100)
    0.0

    @vedgar
    Mannequin

    vedgar mannequin commented May 18, 2018

    Well, yes, floats are inaccurate. Do you really expect to normalize Counters containing values like a googol, or was that just a straw man? For me, it is much more imaginable* that the total would be zero because you have a negative value (e.g. {'spam': -1, 'eggs': 1}) than because you had a googol in your Counter at some time in the past.

    (*) Note that the documentation says

    Counts are allowed to be any integer value including zero or _negative_ counts. (emphasis mine)

    ... and floats are only mentioned at the bottom, in a Note. Besides, floats have that problem already, even with the existing API:

    >>> from collections import Counter as C
    >>> big = C(spam=1e100)
    >>> c = C(spam=1)
    >>> not +c
    False
    >>> c.update(big)
    >>> c.subtract(big)
    >>> not +c
    True

    @mdickinson
    Member

    The point is that if you cache the total and update on each operation, you end up with a total that depends not just on the current contents of the Counter, but also on the history of operations. That seems like a bad idea: you could have two Counters with exactly the same counts in them (so that they compare equal), but with different cached totals.

    So if floats (or Decimal instances) are permitted as Counter values, your suggested caching approach is not viable.

    Of course, if Counter values are restricted to be plain old integers then it's fine, but that's not the current state of affairs.

    @vedgar
    Mannequin

    vedgar mannequin commented May 18, 2018

    My reading of the documentation says floats are only tentatively supported. The main text of the documentation says the values are supposed to be integers. Even the note mostly talks about negative values; floats are mentioned in only one item. (And in that item, it says the value type should support addition and subtraction. I think it's not too big a stretch to stipulate that it should support them accurately. :)

    But whatever you do about total (cached property was just my idea to enable implementation of a probability distribution as a view on a Counter), it's obvious from the documentation that the output of normalize is _not_ a Counter. It might be a subclass, but I'd rather it be a completely separate class. The API intersection is not really that fat.

    @rhettinger
    Contributor Author

    Vedran, Counters do explicitly support floats, decimals, ints, fractions, etc.

    Also, total() needs to be a method rather than a property to be consistent with the existing API and for clarity that a computation is being performed (as opposed to looking up a running total or other cheap operation).

    @rhettinger
    Contributor Author

    Here's what I propose to add:

        def total(self):
            return sum(self.values())
    
        def scaled_by(self, factor):
            return Counter({elem : count * factor for elem, count in self.items()})
    
        def scaled_to(self, target_total=1.0):
            ratio = target_total / self.total()
            return self.scaled_by(ratio)

    These cover the common cases and they don't mutate the counter.
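
    For reference, usage under this proposal would look roughly like the following (only total() was ultimately added, in Python 3.10; scaled_by() and scaled_to() were later withdrawn):

        >>> c = Counter(red=11, green=5, blue=4)
        >>> c.total()
        20
        >>> c.scaled_by(5)
        Counter({'red': 55, 'green': 25, 'blue': 20})
        >>> c.scaled_to(100)
        Counter({'red': 55.0, 'green': 25.0, 'blue': 20.0})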

    @rhettinger rhettinger added 3.10 only security fixes and removed 3.8 only security fixes labels Dec 20, 2020
    @AllenDowney
    Mannequin

    AllenDowney mannequin commented Dec 20, 2020

    This API would work well for my use cases.

    And looking back at previous comments in this thread, I think this proposal avoids the most objectionable pitfalls.

    @rhettinger
    Contributor Author

    New changeset 8c598db by Raymond Hettinger in branch 'master':
    bpo-25478: Add total() method to collections.Counter (GH-25829)

    @rhettinger
    Contributor Author

    Withdrawing the suggestions for scaled_to() and scaled_by(). Am thinking that people are mostly better off with a dict comprehension where they can control the details of rounding and type conversions.
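
    For example, with the total() method added above, the comprehension approach gives full control over rounding and result type (the percentage rounding here is just illustrative):

        >>> c = Counter(red=11, green=5, blue=4)
        >>> total = c.total()                      # new in Python 3.10
        >>> {elem: count / total for elem, count in c.items()}
        {'red': 0.55, 'green': 0.25, 'blue': 0.2}
        >>> {elem: round(100 * count / total) for elem, count in c.items()}
        {'red': 55, 'green': 25, 'blue': 20}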

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022