Issue 25478: Consider adding a normalize() method to collections.Counter()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69664

classification

Title:	Consider adding a normalize() method to collections.Counter()
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.10

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	Allen Downey, AllenDowney, DavidMertz, josh.r, mark.dickinson, pitrou, rhettinger, serhiy.storchaka, steven.daprano, veky, wolma
Priority:	low	Keywords:	patch

Created on 2015-10-26 02:24 by rhettinger, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 6574	closed	rhettinger, 2018-04-23 02:41
PR 25829	merged	rhettinger, 2021-05-03 02:21

Messages (22)
msg253452 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-10-26 02:24
Allen Downey suggested this at PyCon in Montreal and said it would be useful in his bayesian statistics courses. Separately, Peter Norvig created a normalize() function in his probablity tutorial at In[45] in http://nbviewer.ipython.org/url/norvig.com/ipython/Probability.ipynb . I'm creating this tracker item to record thoughts about the idea. Right now, it isn't clear whether Counter is the right place to support this operation, how it should be designed, whether to use an in-place operation or an operation that creates a new counter, should it have rounding to make the result exactly equal to 1.0, should it use math.fsum() for float inputs? Should it support other target totals besides 1.0? >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage Counter(red=55, green=25, blue=20) Also would it make sense to support something like this? sampled_gender_dist = Counter(male=405, female=421) world_gender_dist = Counter(male=0.51, female=0.50) cs = world_gender_dist.chi_squared(observed=sampled_gender_dist) Would it be better to just have a general multiply-by-scalar operation for scaling? c = Counter(observations) c.scale_by(1.0 / sum(c.values()) Perhaps use an operator? c /= sum(c.values())
msg253512 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2015-10-27 02:39
Counter is documented as being primarily intended for integer counts. While you can use them with floats, I'm not sure they're the right data type for this use case. Having some methods that only make sense with floats, and others (like elements) that only make sense with integers is just confusing.
msg276491 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2016-09-14 21:41
The pitfall I imagine here is that if you continue adding elements after normalize() is called, the results will be nonsensical.
msg277008 - (view)	Author: Vedran Čačić (veky) *	Date: 2016-09-20 05:23
Operator seems OK. After all, we can currently do c+c, which is kinda like c*2 (sequences behave this way generally, and it is a usual convention in mathematics too). And division by a number is just a multiplication by its reciprocal. But a dedicated normalize method? No. As Josh said, then you're forking the API. The correct way is probably to have a "normalized view" of a Counter. But I don't know the best way to calculate it fast. I mean, I know it mathematically (cache the sum of values and update it on every Counter update) but I don't know whether it's Pythonic enough.
msg289588 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2017-03-14 14:17
> >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage > Counter(red=55, green=25, blue=20) I like this example, where the normalize method of a Counter returns a new Counter, but I think the new Counter should always only have integer counts. More specifically, it should be the closest approximation of the original Counter that is possible with integers adding up to the argument to the method or, statistically speaking, it should represent the expected number of observations of each outcome for a given sample size.
msg289596 - (view)	Author: Vedran Čačić (veky) *	Date: 2017-03-14 15:10
That seems horribly arbitrary to me, not to mention inviting another intdiv fiasco (from sanity import division:). If only Counter was committed to only working with integer values from start, it might be acceptable, but since Counter implementation was always careful not to preclude using Counter with nonint values, it wouldn't make sense. Also, there is an interesting inconsistency then, in the form of c = Counter(a=5,b=5).normalize(5) Presumably c.a and c.b would be equal integers, and their sum equal to 5. That is unfortunately not possible. :-o
msg289600 - (view)	Author: David Mertz (DavidMertz) *	Date: 2017-03-14 16:17
I definitely wouldn't want a mutator that "normalized" counts for the reason Antoine mentions. It would be a common error to normalize then continue meaningless counting. One could write a `Frequency` subclass easily enough. The essential feature in my mind would be to keep an attribute `Counter.total` around to perform the normalization. I'm +1 on adding that to `collections.Counter` itself. I'm not sure if this would be better as an attribute kept directly or as a property that called `sum(self.values())` when accessed. I believe that having `mycounter.total` would provide the right normalization in a clean API, and also expose easy access to other questions one would naturally ask (e.g. "How many observations were made?")
msg289637 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2017-03-15 03:43
The idea is that the method would return a new counter instance and leave the existing instance untouched.
msg289640 - (view)	Author: David Mertz (DavidMertz) *	Date: 2017-03-15 04:39
Raymond wrote: > The idea is that the method would return a new counter instance > and leave the existing instance untouched. Your own first example suggested: c /= sum(c.values()) That would suggest an inplace modification. But even if it's not that, but creating a new object, that doesn't make much difference to the end user who has rebound the name `c`. Likewise, I think users would be somewhat tempted by: c = c.scale_by(1.0/c.total) # My property/attribute suggestion This would present the same attractive nuisance. If the interface was the slightly less friendly: freqs = {k:v/c.total for k, v in c.items()} I think there would be far less temptation to rebind the same name unintentionally.
msg289683 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2017-03-15 17:25
It seems to me that the basic Counter class should be left as-is, and if there are specialized methods used for statistics (such as normalize) it should go into a subclass in the statistics module. The statistics module already uses Counter internally to calculate the mode. It makes some sense to me for statistics to have a FrequencyTable (and CumulativeFrequencyTable?) class built on top of Counter. I don't think it makes sense to overload the collections.Counter type with these sorts of specialised methods.
msg315665 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-04-23 12:34
Was __rtruediv__ discussed before? What is the use case for it? Besides __rtruediv__ all LGTM.
msg316978 - (view)	Author: Allen Downey (Allen Downey)	Date: 2018-05-17 19:06
I'd like to second Raymond's suggestion. With just a few additional methods, you could support a useful set of operations. One possible API: def scaled(self, factor) """Returns a new Counter with all values multiplied by factor.""" def normalized(self, total=1) """Returns a new Counter with values normalized so their sum is total.""" def total(self) """Returns the sum of the values in the Counter.""" These operations would make it easier to use a Counter as a PMF without subclassing. I understand two arguments against this proposal 1) If you modify the Counter after normalizing, the result is probably nonsense. That's true, but it is already the case that some Counter methods don't make sense for some use cases, depending on how you are using the Counter (as a bag, multiset, etc) So the new features would come with caveats, but I don't think that's fatal. 2) PMF operations are not general enough for core Python; they should be in a stats module. I think PMFs are used (or would be used) for lots of quick computations that don't require full-fledged stats. Also, stats libraries tend to focus on analytic distributions; they don't really provide this kind of light-weight empirical PMF. I think the proposed features have a high ratio of usefulness to implementation effort, without expanding the API unacceptably. Two thoughts for alternatives/extensions: 1) It might be good to make scaled() available as __mul__, as Peter Norvig suggests. 2) If the argument of scaled() is a mapping type, it might be good to support elementwise scaling. That would provide an elegant implementation of Raymond's chi-squared example and my inspection paradox example (http://greenteapress.com/thinkstats2/html/thinkstats2004.html#sec33) Thank you! Allen
msg316980 - (view)	Author: Vedran Čačić (veky) *	Date: 2018-05-17 20:02
As I said above, if we're going to go down that route, it seems much more reasonable to me that total should be a cached property, that's updated on every Counter update (in __setitem__, increased by a difference of a new value and an old one for that key). And normalization should just provide a view over the Counter, that just passes the values through division with the above cached property. The view should of course be immutable by itself, but should reflect the changes of the underlying counter, just as already existing views (e.g. dict_values) do.
msg317005 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2018-05-18 07:05
> total should be a cached property, that's updated on every Counter update That would run into difficulties for Counters with float values: e.g., after >>> c = Counter() >>> c['spam'] = 1e100 >>> c['eggs'] = 1 >>> c['spam'] = 0 the cached total would likely be 0.0, because that's what the sum of the (new-old) values gives: >>> (1e100 - 0) + (1 - 0) + (0 - 1e100) 0.0
msg317007 - (view)	Author: Vedran Čačić (veky) *	Date: 2018-05-18 07:28
Well, yes, floats are innacurate. Do you really expect to normalize Counters containing values like a googol, or was it just a strawman? For me, it is much more imaginable* that total be zero because you have a negative value (e.g. {'spam': -1, 'eggs': 1}) than because you had a googol in your Counter at some time in the past. (*) Note that the documentation says > Counts are allowed to be any integer value including zero or _negative_ counts. (emphasis mine) ... and floats are only mentioned at the bottom, in a Note. Besides, floats have that problem already, even with an existing API: >>> from collections import Counter as C >>> big = C(spam=1e100) >>> c = C(spam=1) >>> not +c False >>> c.update(big) >>> c.subtract(big) >>> not +c True
msg317008 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2018-05-18 07:40
The point is that if you cache the total and update on each operation, you end up with a total that depends not just on the current contents of the Counter, but also on the history of operations. That seems like a bad idea: you could have two Counters with exactly the same counts in them (so that they compare equal), but with different cached totals. So if floats (or Decimal instances) are permitted as Counter values, your suggested caching approach is not viable. Of course, if Counter values are restricted to be plain old integers then it's fine, but that's not the current state of affairs.
msg317016 - (view)	Author: Vedran Čačić (veky) *	Date: 2018-05-18 08:17
My reading of the documentation says floats are only tentatively supported. The main text of the documentation says the values are supposed to be integers. Even the note mostly talks about negative values, the floats are mentioned in only one item. (And in that item, it says the value type should support addition and subtraction. I think it's not too big a stretch to stipulate it should support them accurately.:) But whatever you do about total (cached property was just my idea to enable implementation of a probability distribution as a view on a Counter), it's obvious from the documentation that the output of normalize is _not_ a Counter. It might be a subclass, but I'd rather it be a completely separate class. The API intersection is not really that fat.
msg320359 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-06-24 06:47
Vedran, Counters do explicitly support floats, decimals, ints, fractions, etc. Also, total() needs to be a method rather than a property to be consistent with the existing API and for clarity that a computation is being performed (as opposed to looking up a running total or other cheap operation).
msg383447 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2020-12-20 20:44
Here's what I propose to add: def total(self): return sum(self.values()) def scaled_by(self, factor): return Counter({elem : count * factor for elem, count in self.items()}) def scaled_to(self, target_total=1.0): ratio = target_total / self.total() return self.scaled_by(ratio) These cover the common cases and they don't mutate the counter.
msg383450 - (view)	Author: Allen Downey (AllenDowney)	Date: 2020-12-20 20:54
This API would work well for my use cases. And looking back at previous comments in this thread, I think this proposal avoids the most objectionable pitfalls.
msg392758 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-05-03 03:19
New changeset 8c598dbb9483bcfcb88fc579ebf27821d8861465 by Raymond Hettinger in branch 'master': bpo-25478: Add total() method to collections.Counter (GH-25829) https://github.com/python/cpython/commit/8c598dbb9483bcfcb88fc579ebf27821d8861465
msg409435 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-12-31 20:57
Withdrawing the suggestions for scaled_to() and scaled_by(). Am thinking that people are mostly better off with a dict comprehension where they can control the details of rounding and type conversions.

History
Date	User	Action	Args
2022-04-11 14:58:23	admin	set	github: 69664
2021-12-31 20:57:57	rhettinger	set	status: open -> closed resolution: rejected messages: + msg409435 stage: patch review -> resolved
2021-05-03 03:19:57	rhettinger	set	messages: + msg392758
2021-05-03 02:21:57	rhettinger	set	pull_requests: + pull_request24516
2020-12-20 20:54:10	AllenDowney	set	nosy: + AllenDowney messages: + msg383450
2020-12-20 20:44:05	rhettinger	set	messages: + msg383447 versions: + Python 3.10, - Python 3.8
2020-01-23 20:21:35	ammar2	link	issue39437 superseder
2018-06-24 06:47:09	rhettinger	set	messages: + msg320359
2018-05-18 08:17:25	veky	set	messages: + msg317016
2018-05-18 07:40:17	mark.dickinson	set	messages: + msg317008
2018-05-18 07:28:09	veky	set	messages: + msg317007
2018-05-18 07:05:30	mark.dickinson	set	messages: + msg317005
2018-05-17 20:02:08	veky	set	messages: + msg316980
2018-05-17 19:06:00	Allen Downey	set	nosy: + Allen Downey messages: + msg316978
2018-04-23 12:34:51	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg315665
2018-04-23 02:41:59	rhettinger	set	keywords: + patch stage: patch review pull_requests: + pull_request6275
2018-01-29 20:46:58	rhettinger	set	versions: + Python 3.8, - Python 3.7
2017-03-15 17:25:38	steven.daprano	set	nosy: + steven.daprano messages: + msg289683
2017-03-15 04:39:11	DavidMertz	set	messages: + msg289640
2017-03-15 03:43:57	rhettinger	set	messages: + msg289637
2017-03-14 16:17:46	DavidMertz	set	nosy: + DavidMertz messages: + msg289600
2017-03-14 15:10:39	veky	set	messages: + msg289596
2017-03-14 14:17:33	wolma	set	nosy: + wolma messages: + msg289588
2016-09-20 05:23:22	veky	set	nosy: + veky messages: + msg277008
2016-09-14 21:41:00	pitrou	set	nosy: + pitrou messages: + msg276491
2016-09-12 09:01:04	SilentGhost	set	nosy: - SilentGhost
2016-09-12 09:00:49	SilentGhost	set	messages: - msg275998
2016-09-12 08:36:08	SilentGhost	set	nosy: + SilentGhost messages: + msg275998
2016-09-12 07:29:02	rhettinger	set	versions: + Python 3.7, - Python 3.6
2015-10-27 02:39:28	josh.r	set	nosy: + josh.r messages: + msg253512
2015-10-26 02:29:05	rhettinger	set	nosy: + mark.dickinson
2015-10-26 02:24:40	rhettinger	create