Consider adding a normalize() method to collections.Counter() #69664

Closed
rhettinger opened this issue Oct 26, 2015 · 22 comments
Assignees: rhettinger
Labels: 3.10 (only security fixes), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@rhettinger
Contributor

BPO 25478
Nosy @rhettinger, @mdickinson, @pitrou, @stevendaprano, @serhiy-storchaka, @wm75, @MojoVampire, @vedgar, @DavidMertz, @AllenDowney
PRs
  • bpo-25478: Add scalar multiplication and division to Counter #6574
  • bpo-25478: Add total() method to collections.Counter #25829
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2021-12-31.20:57:57.439>
    created_at = <Date 2015-10-26.02:24:39.945>
    labels = ['type-feature', 'library', '3.10']
    title = 'Consider adding a normalize() method to collections.Counter()'
    updated_at = <Date 2021-12-31.20:57:57.438>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2021-12-31.20:57:57.438>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2021-12-31.20:57:57.439>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2015-10-26.02:24:39.945>
    creator = 'rhettinger'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 25478
    keywords = ['patch']
    message_count = 22.0
    messages = ['253452', '253512', '276491', '277008', '289588', '289596', '289600', '289637', '289640', '289683', '315665', '316978', '316980', '317005', '317007', '317008', '317016', '320359', '383447', '383450', '392758', '409435']
    nosy_count = 11.0
    nosy_names = ['rhettinger', 'mark.dickinson', 'pitrou', 'steven.daprano', 'serhiy.storchaka', 'wolma', 'josh.r', 'veky', 'Allen Downey', 'DavidMertz', 'AllenDowney']
    pr_nums = ['6574', '25829']
    priority = 'low'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue25478'
    versions = ['Python 3.10']

    @rhettinger
    Contributor Author

    Allen Downey suggested this at PyCon in Montreal and said it would be useful in his Bayesian statistics courses. Separately, Peter Norvig created a normalize() function in his probability tutorial (cell In[45]) at http://nbviewer.ipython.org/url/norvig.com/ipython/Probability.ipynb .

    I'm creating this tracker item to record thoughts about the idea. Right now, it isn't clear whether Counter is the right place to support this operation, how it should be designed, whether to use an in-place operation or one that creates a new counter, whether it should round so the result sums to exactly 1.0, or whether it should use math.fsum() for float inputs.

    Should it support other target totals besides 1.0?

      >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage
      Counter(red=55, green=25, blue=20)

    Also would it make sense to support something like this?

      sampled_gender_dist = Counter(male=405, female=421)
      world_gender_dist = Counter(male=0.51, female=0.50)
      cs = world_gender_dist.chi_squared(observed=sampled_gender_dist)

    Would it be better to just have a general multiply-by-scalar operation for scaling?

      c = Counter(observations)
      c.scale_by(1.0 / sum(c.values()))

    Perhaps use an operator?

    c /= sum(c.values())
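
    A rough sketch of what such operator support could look like, written here as a hypothetical subclass rather than a change to Counter itself (PR #6574 later explored adding scalar multiplication and division directly to Counter, but those operators were never added):

        from collections import Counter

        class ScalableCounter(Counter):
            """Hypothetical Counter subclass supporting scalar multiplication and division."""

            def __mul__(self, factor):
                # New counter with every count multiplied by the scalar factor.
                return type(self)({elem: count * factor for elem, count in self.items()})

            __rmul__ = __mul__

            def __truediv__(self, divisor):
                # New counter with every count divided by the scalar divisor.
                return type(self)({elem: count / divisor for elem, count in self.items()})

        # c = ScalableCounter(red=11, green=5, blue=4)
        # c / sum(c.values())   -> frequencies summing to 1.0
        # c * 5                 -> ScalableCounter({'red': 55, 'green': 25, 'blue': 20})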

    @rhettinger rhettinger self-assigned this Oct 26, 2015
    @rhettinger rhettinger added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Oct 26, 2015
    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Oct 27, 2015

    Counter is documented as being primarily intended for integer counts. While you can use them with floats, I'm not sure they're the right data type for this use case. Having some methods that only make sense with floats, and others (like elements()) that only make sense with integers, is just confusing.

    @rhettinger rhettinger added the 3.7 (EOL) end of life label Sep 12, 2016
    @pitrou
    Member

    pitrou commented Sep 14, 2016

    The pitfall I imagine here is that if you continue adding elements after normalize() is called, the results will be nonsensical.

    @vedgar
    Mannequin

    vedgar mannequin commented Sep 20, 2016

    An operator seems OK. After all, we can currently do c + c, which is kinda like c * 2 (sequences behave this way generally, and it is a common convention in mathematics too). And division by a number is just multiplication by its reciprocal. But a dedicated normalize method? No. As Josh said, then you're forking the API.

    The correct way is probably to have a "normalized view" of a Counter. But I don't know the best way to calculate it fast. I mean, I know it mathematically (cache the sum of values and update it on every Counter update) but I don't know whether it's Pythonic enough.

    @wm75
    Mannequin

    wm75 mannequin commented Mar 14, 2017

    >>> Counter(red=11, green=5, blue=4).normalize(100) # percentage
    Counter(red=55, green=25, blue=20)

    I like this example, where the normalize method of a Counter returns a new Counter, but I think the new Counter should always have only integer counts. More specifically, it should be the closest approximation of the original Counter that is possible with integers adding up to the method's argument; statistically speaking, it should represent the expected number of observations of each outcome for a given sample size.
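
    A minimal sketch of that idea, assuming non-negative counts and using the largest-remainder method to keep the results integral (normalized_to_int is an illustrative helper, not part of any proposal in this thread):

        from collections import Counter

        def normalized_to_int(counter, target_total):
            """Rescale non-negative counts to integers that sum exactly to target_total."""
            total = sum(counter.values())
            scaled = {elem: count * target_total / total for elem, count in counter.items()}
            result = {elem: int(value) for elem, value in scaled.items()}  # floor each scaled count
            # Hand the remaining units to the entries with the largest fractional parts.
            leftover = target_total - sum(result.values())
            for elem in sorted(scaled, key=lambda e: scaled[e] - result[e], reverse=True)[:leftover]:
                result[elem] += 1
            return Counter(result)

        # normalized_to_int(Counter(red=11, green=5, blue=4), 100)
        # -> Counter({'red': 55, 'green': 25, 'blue': 20})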

    @vedgar
    Mannequin

    vedgar mannequin commented Mar 14, 2017

    That seems horribly arbitrary to me, not to mention inviting another intdiv fiasco (from sanity import division :). If Counter had been committed to working only with integer values from the start, it might be acceptable, but since the Counter implementation was always careful not to preclude non-int values, it wouldn't make sense.

    Also, there is an interesting inconsistency then, in the form of

        c = Counter(a=5,b=5).normalize(5)

    Presumably c['a'] and c['b'] would be equal integers, and their sum would equal 5. That is unfortunately not possible. :-o

    @DavidMertz
    Mannequin

    DavidMertz mannequin commented Mar 14, 2017

    I definitely wouldn't want a mutator that "normalized" counts, for the reason Antoine mentions. It would be a common error to normalize and then continue counting, which would be meaningless.

    One could write a Frequency subclass easily enough. The essential feature in my mind would be to keep an attribute Counter.total around to perform the normalization. I'm +1 on adding that to collections.Counter itself.

    I'm not sure if this would be better as an attribute kept directly or as a property that called sum(self.values()) when accessed. I believe that having mycounter.total would provide the right normalization in a clean API, and also expose easy access to other questions one would naturally ask (e.g. "How many observations were made?")
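
    A minimal sketch of that kind of subclass, using a property rather than a stored attribute (the Frequency name and the normalized() helper are illustrative only, not part of any accepted design):

        from collections import Counter

        class Frequency(Counter):
            """Counter with a convenience property for the total of its counts."""

            @property
            def total(self):
                # Recomputed on access; a cached variant is discussed later in this thread.
                return sum(self.values())

            def normalized(self):
                """Return a plain dict of frequencies summing to (approximately) 1.0."""
                total = self.total
                return {elem: count / total for elem, count in self.items()}

        # f = Frequency(heads=7, tails=3)
        # f.total          -> 10
        # f.normalized()   -> {'heads': 0.7, 'tails': 0.3}

    (Note that since Python 3.10, Counter itself provides a total() method, so a real subclass would want a non-conflicting name.)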

    @rhettinger
    Contributor Author

    The idea is that the method would return a new counter instance and leave the existing instance untouched.

    @DavidMertz
    Mannequin

    DavidMertz mannequin commented Mar 15, 2017

    Raymond wrote:

    The idea is that the method would return a new counter instance
    and leave the existing instance untouched.

    Your own first example suggested:

    c /= sum(c.values())
    

    That would suggest an in-place modification. But even if it instead creates a new object, that makes little difference to an end user who has rebound the name c.

    Likewise, I think users would be somewhat tempted by:

        c = c.scale_by(1.0/c.total)  # My property/attribute suggestion

    This would present the same attractive nuisance. If the interface was the slightly less friendly:

        freqs = {k:v/c.total for k, v in c.items()}

    I think there would be far less temptation to rebind the same name unintentionally.

    @stevendaprano
    Member

    It seems to me that the basic Counter class should be left as-is, and if there are specialized methods used for statistics (such as normalize) it should go into a subclass in the statistics module.

    The statistics module already uses Counter internally to calculate the mode.

    It makes some sense to me for statistics to have a FrequencyTable (and CumulativeFrequencyTable?) class built on top of Counter. I don't think it makes sense to overload the collections.Counter type with these sorts of specialised methods.

    @rhettinger rhettinger added 3.8 only security fixes and removed 3.7 (EOL) end of life labels Jan 29, 2018
    @serhiy-storchaka
    Member

    Was __rtruediv__ discussed before? What is the use case for it?

    Besides __rtruediv__, all LGTM.

    @AllenDowney
    Mannequin

    AllenDowney mannequin commented May 17, 2018

    I'd like to second Raymond's suggestion. With just a few additional methods, you could support a useful set of operations. One possible API:

    def scaled(self, factor):
        """Returns a new Counter with all values multiplied by factor."""

    def normalized(self, total=1):
        """Returns a new Counter with values normalized so their sum is total."""

    def total(self):
        """Returns the sum of the values in the Counter."""

    These operations would make it easier to use a Counter as a probability mass function (PMF) without subclassing.

    I understand two arguments against this proposal:

    1. If you modify the Counter after normalizing, the result is probably nonsense.

    That's true, but it is already the case that some Counter methods don't make sense for some use cases, depending on how you are using the Counter (as a bag, multiset, etc.).

    So the new features would come with caveats, but I don't think that's fatal.

    2. PMF operations are not general enough for core Python; they should be in a stats module.

    I think PMFs are used (or would be used) for lots of quick computations that don't require full-fledged stats.

    Also, stats libraries tend to focus on analytic distributions; they don't really provide this kind of light-weight empirical PMF.

    I think the proposed features have a high ratio of usefulness to implementation effort, without expanding the API unacceptably.

    Two thoughts for alternatives/extensions:

    1. It might be good to make scaled() available as __mul__, as Peter Norvig suggests.

    2. If the argument of scaled() is a mapping type, it might be good to support elementwise scaling (see the sketch after this comment). That would provide an elegant implementation of Raymond's chi-squared example and my inspection paradox example (http://greenteapress.com/thinkstats2/html/thinkstats2004.html#sec33)

    Thank you!
    Allen
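
    A rough sketch of the elementwise variant mentioned in point 2 above (the free-standing scaled() function is illustrative only, not an accepted API):

        from collections import Counter
        from numbers import Number

        def scaled(counter, factor):
            """Scale by a scalar, or elementwise by another mapping (missing keys scale to 0)."""
            if isinstance(factor, Number):
                return Counter({elem: count * factor for elem, count in counter.items()})
            return Counter({elem: count * factor.get(elem, 0) for elem, count in counter.items()})

        # scaled(Counter(red=11, green=5, blue=4), 5)
        # -> Counter({'red': 55, 'green': 25, 'blue': 20})
        # scaled(Counter(male=405, female=421), {'male': 0.5, 'female': 2.0})
        # -> Counter({'female': 842.0, 'male': 202.5})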

    @vedgar
    Mannequin

    vedgar mannequin commented May 17, 2018

    As I said above, if we're going to go down that route, it seems much more reasonable to me that total should be a cached property, that's updated on every Counter update (in __setitem__, adjusted by the difference between the new value and the old value for that key).

    And normalization should just provide a view over the Counter that passes the values through division by that cached total. The view should of course be immutable itself, but should reflect changes to the underlying counter, just as the existing views (e.g. dict_values) do.
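
    A rough sketch of that design, under the assumption that all mutation goes through __setitem__ (TotalledCounter and NormalizedView are illustrative names; methods such as clear() and pop() would also need overriding in a complete implementation):

        from collections import Counter
        from collections.abc import Mapping

        class TotalledCounter(Counter):
            """Counter that maintains a running total of its values."""

            def __init__(self, *args, **kwargs):
                self._total = 0
                super().__init__(*args, **kwargs)
                self._total = sum(self.values())  # recompute in case init bypassed __setitem__

            def __setitem__(self, key, value):
                # Adjust the running total by the difference between the new and old values.
                self._total += value - self.get(key, 0)
                super().__setitem__(key, value)

            def __delitem__(self, key):
                self._total -= self.get(key, 0)
                super().__delitem__(key)

            @property
            def total(self):
                return self._total

        class NormalizedView(Mapping):
            """Read-only mapping view: each value is the counter's value divided by its total."""

            def __init__(self, counter):
                self._counter = counter

            def __getitem__(self, key):
                return self._counter[key] / self._counter.total

            def __iter__(self):
                return iter(self._counter)

            def __len__(self):
                return len(self._counter)

    As the next comments point out, a running float total can drift away from sum(self.values()), which is one reason this kind of caching was never adopted.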

    @mdickinson
    Member

    total should be a cached property, that's updated on every Counter update

    That would run into difficulties for Counters with float values: e.g., after

    >>> c = Counter()
    >>> c['spam'] = 1e100
    >>> c['eggs'] = 1
    >>> c['spam'] = 0

    the cached total would likely be 0.0, because that's what the sum of the (new-old) values gives:

    >>> (1e100 - 0) + (1 - 0) + (0 - 1e100)
    0.0

    @vedgar
    Mannequin

    vedgar mannequin commented May 18, 2018

    Well, yes, floats are inaccurate. Do you really expect to normalize Counters containing values like a googol, or was that just a straw man? For me, it is much more imaginable* that the total would be zero because you have a negative value (e.g. {'spam': -1, 'eggs': 1}) than because you had a googol in your Counter at some time in the past.

    (*) Note that the documentation says

    Counts are allowed to be any integer value including zero or _negative_ counts. (emphasis mine)

    ... and floats are only mentioned at the bottom, in a Note. Besides, floats have that problem already, even with the existing API:

    >>> from collections import Counter as C
    >>> big = C(spam=1e100)
    >>> c = C(spam=1)
    >>> not +c
    False
    >>> c.update(big)
    >>> c.subtract(big)
    >>> not +c
    True

    @mdickinson
    Member

    The point is that if you cache the total and update on each operation, you end up with a total that depends not just on the current contents of the Counter, but also on the history of operations. That seems like a bad idea: you could have two Counters with exactly the same counts in them (so that they compare equal), but with different cached totals.

    So if floats (or Decimal instances) are permitted as Counter values, your suggested caching approach is not viable.

    Of course, if Counter values are restricted to be plain old integers then it's fine, but that's not the current state of affairs.

    @vedgar
    Mannequin

    vedgar mannequin commented May 18, 2018

    My reading of the documentation says floats are only tentatively supported. The main text of the documentation says the values are supposed to be integers. Even the note mostly talks about negative values; floats are mentioned in only one item. (And in that item, it says the value type should support addition and subtraction. I think it's not too big a stretch to stipulate that it should support them accurately. :)

    But whatever you do about total (cached property was just my idea to enable implementation of a probability distribution as a view on a Counter), it's obvious from the documentation that the output of normalize is _not_ a Counter. It might be a subclass, but I'd rather it be a completely separate class. The API intersection is not really that fat.

    @rhettinger
    Contributor Author

    Vedran, Counters do explicitly support floats, decimals, ints, fractions, etc.

    Also, total() needs to be a method rather than a property to be consistent with the existing API and for clarity that a computation is being performed (as opposed to looking up a running total or other cheap operation).

    @rhettinger
    Contributor Author

    Here's what I propose to add:

        def total(self):
            return sum(self.values())
    
        def scaled_by(self, factor):
            return Counter({elem : count * factor for elem, count in self.items()})
    
        def scaled_to(self, target_total=1.0):
            ratio = target_total / self.total()
            return self.scaled_by(ratio)

    These cover the common cases and they don't mutate the counter.
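
    For reference, usage under this proposal would look roughly like the following (only total() was ultimately added, in Python 3.10; scaled_by() and scaled_to() were later withdrawn):

        >>> c = Counter(red=11, green=5, blue=4)
        >>> c.total()
        20
        >>> c.scaled_by(5)
        Counter({'red': 55, 'green': 25, 'blue': 20})
        >>> c.scaled_to(100)
        Counter({'red': 55.0, 'green': 25.0, 'blue': 20.0})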

    @rhettinger rhettinger added 3.10 only security fixes and removed 3.8 only security fixes labels Dec 20, 2020
    @AllenDowney
    Mannequin

    AllenDowney mannequin commented Dec 20, 2020

    This API would work well for my use cases.

    And looking back at previous comments in this thread, I think this proposal avoids the most objectionable pitfalls.

    @rhettinger
    Contributor Author

    New changeset 8c598db by Raymond Hettinger in branch 'master':
    bpo-25478: Add total() method to collections.Counter (GH-25829)

    @rhettinger
    Contributor Author

    Withdrawing the suggestions for scaled_to() and scaled_by(). Am thinking that people are mostly better off with a dict comprehension where they can control the details of rounding and type conversions.
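
    For example, with the total() method added above, the comprehension approach gives full control over rounding and result type (the percentage rounding here is just illustrative):

        >>> c = Counter(red=11, green=5, blue=4)
        >>> total = c.total()                      # new in Python 3.10
        >>> {elem: count / total for elem, count in c.items()}
        {'red': 0.55, 'green': 0.25, 'blue': 0.2}
        >>> {elem: round(100 * count / total) for elem, count in c.items()}
        {'red': 55, 'green': 25, 'blue': 20}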

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022