Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add collections.counts() #44815

Closed
bethard mannequin opened this issue Apr 7, 2007 · 17 comments
Closed

Add collections.counts() #44815

bethard mannequin opened this issue Apr 7, 2007 · 17 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@bethard
Copy link
Mannequin

bethard mannequin commented Apr 7, 2007

BPO 1696199
Nosy @birkenfeld, @rhettinger, @amauryfa, @pitrou, @ned-deily
Files
  • collections_counts.diff: Patch adding collections.counts()
  • counter6.diff: Full patch with tests and improved docs
  • counter7.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2009-01-12.23:05:13.896>
    created_at = <Date 2007-04-07.21:38:28.000>
    labels = ['type-feature', 'library']
    title = 'Add collections.counts()'
    updated_at = <Date 2009-01-12.23:05:13.895>
    user = 'https://bugs.python.org/bethard'

    bugs.python.org fields:

    activity = <Date 2009-01-12.23:05:13.895>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2009-01-12.23:05:13.896>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2007-04-07.21:38:28.000>
    creator = 'bethard'
    dependencies = []
    files = ['2345', '12702', '12706']
    hgrepos = []
    issue_num = 1696199
    keywords = ['patch']
    message_count = 17.0
    messages = ['31726', '31727', '31728', '31729', '31730', '79461', '79467', '79471', '79512', '79650', '79661', '79663', '79690', '79694', '79696', '79698', '79707']
    nosy_count = 6.0
    nosy_names = ['georg.brandl', 'rhettinger', 'amaury.forgeotdarc', 'pitrou', 'bethard', 'ned.deily']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1696199'
    versions = ['Python 3.1', 'Python 2.7']

    @bethard
    Copy link
    Mannequin Author

    bethard mannequin commented Apr 7, 2007

    As suggested in http://mail.python.org/pipermail/python-list/2007-April/433986.html this is a patch to add a counts() function to the collections module. Usage looks like:

    >>> items = 'acabbacba'
    >>> item_counts = counts(items)
    >>> for item in 'abcd':
    ...     print item, item_counts[item]
    ...
    a 4
    b 3
    c 2
    d 0

    Yes, it's only a 4-line function, but it's a frequently re-written 4-line function.

    @bethard bethard mannequin assigned rhettinger Apr 7, 2007
    @bethard bethard mannequin added the stdlib Python modules in the Lib dir label Apr 7, 2007
    @bethard bethard mannequin assigned rhettinger Apr 7, 2007
    @bethard bethard mannequin added the stdlib Python modules in the Lib dir label Apr 7, 2007
    @gvanrossum
    Copy link
    Member

    Does it have to be a defaultdict? I.e. is it important that item_counts['d'] not raise KeyError?

    @bethard
    Copy link
    Mannequin Author

    bethard mannequin commented Apr 7, 2007

    I think it's okay if it's not a defaultdict. That was the simplest implementation, but I certainly have no problem calling d.get() when necessary. Should I change the implementation to use a dict()?

    @gvanrossum
    Copy link
    Member

    I guess it's simplicity of implementation vs. simplicity of use. And I'm not even sure which is easier to use. It's just that defaultdicts are a very new thing and still feel "weird" -- even though I pushed for the implementation based on popular demand I'm not a user myself. Perhaps ask around on python-dev?

    @bethard
    Copy link
    Mannequin Author

    bethard mannequin commented Apr 19, 2007

    A summary of the python-dev thread (http://mail.python.org/pipermail/python-dev/2007-April/072502.html)

    Since the number of times an unseen item was seen is 0, most people felt returning 0 was more natural behavior than raising KeyError.

    There was some discussion of alternate names, but most people were fine with counts().

    Raymond suggested making it a classmethod of dict, but people were a little concerned about adding to dict's already complex API, and since the result of counts() needed to return 0s instead of raising KeyErrors, it wouldn't really have the same behavior as a plain dict() anyway.

    @rhettinger
    Copy link
    Contributor

    Attaching an updated patch for Py2.7.

    • Kept OP's simple constructor call but renamed it from counts() to
      Counter():
      item_counts = Counter('acabbacba')
    • Followed Guido's advice and avoided subclassing from defaultdict().
      Instead, subclassed from dict, keeping its familiar API.
    • Avoided KeyError issue by defining __missing__().
    • Added most_common() method to address a principal use case --
      patterned after Smalltalk's sortedByCount() method.
    • Added elements() method requested by Alex Martelli -- patterned after
      C++, Smalltalk, and Knuth's examples.
    • Made necessary subclass overrides to dict methods: __repr__, update,
      fromkeys, and copy.
    • Wrote docstrings, doctests, and motivating examples.
    • Verified that instances are copyable, deepcopyable, and picklable.

    Working on docs and unittests.

    Nice example (most common words in a text file):

    >>> import re
    >>> words = re.findall('\w+', open('hamlet.txt').read().lower())
    >>> Counter(words).most_common(10)
    [('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
    ('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

    @rhettinger rhettinger added type-feature A feature request or enhancement labels Jan 9, 2009
    @rhettinger
    Copy link
    Contributor

    Added a few more in-module tests.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 9, 2009

    Some comments:

    • Counter sounds like a really strange name for a container. Why not
      call it "Bag" or "Multiset" (or "CountingSet"?)
    • why are the unittests inline rather than in Lib/test? inline tests
      don't get executed by the buildbots, nor by people running the test
      suite at home

    @rhettinger
    Copy link
    Contributor

    The counts/counter moniker emerged from the python-dev discussion and
    I'm basically happy with it since the typical usage is c=Counter(myseq)
    with no other non-dict accesses (mostly just c[elem]+=1 and print
    c[elem]). It's a simple counter with a dict interface so the name
    shouldn't suggest anything more complicated than that.

    To me, MultiSet or CountingSet is too offputtingly computer-sciency and
    misleadingly suggests a set-like API instead of a dict interface. I know
    several programmers who don't know the terms, bag or multiset, but they
    intuitively understand what a counter does. Am open to calling it a Bag
    but I rather like the self-descriptiveness and simplicity of Counter.

    As noted previously, standalone unittests are forthcoming (and a doc
    patch). Wanted to get the API and sample use cases worked-out first.

    Thanks for looking at the initial patch.

    @rhettinger
    Copy link
    Contributor

    Georg, could you give this a once over before I commit? Thanks.

    @rhettinger rhettinger assigned birkenfeld and unassigned rhettinger Jan 12, 2009
    @birkenfeld
    Copy link
    Member

    Yes, I'll have a look this evening.

    @rhettinger
    Copy link
    Contributor

    Attaching an update with improved docs. Thanks for looking at this.

    @amauryfa
    Copy link
    Member

    the typical usage is c=Counter(myseq) with no other non-dict accesses
    (mostly just c[elem]+=1 and print c[elem])

    Isn't collections.defaultdict(lambda:0) enough for this purpose?

    @bethard
    Copy link
    Mannequin Author

    bethard mannequin commented Jan 12, 2009

    The whole point was to have a function (or class) that accumulates a
    sequence and counts it. collections.defaultdict(lambda: 0) doesn't
    achieve this on its own because it only knows how to handle sequences of
    (key, value):

    >>> d = collections.defaultdict(lambda: 0)
    >>> d.update('aaabbac')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: dictionary update sequence element #0 has length 1; 2 is
    required

    The feature request here was mainly a request to provide an abbreviation
    of a very common 4-line function.

    @ned-deily
    Copy link
    Member

    In counter6.diff line 56

    "Assigning a count of zero or reducing the count to the zero leaves the"

    suggest s/the zero/zero/

    @birkenfeld
    Copy link
    Member

    Attaching new patch with small changes:

    • Don't describe a class with "Returns ..." as if it was a function.
    • Indent interposed paragraphs so that the method descriptions still
      belong to the .. class directive.
    • Fixed Ned's typo.
    • Note that elements() and most_common() order elements arbitrarily.
    • Reorder the seealso's a bit; the first line is bold.

    Questions:

    • Steven's counter.update('abcdee') wouldn't work -- I guess one is
      supposed to use counter.update(Counter('abcdee')). Should that be noted
      in the docs?
    • Is it needed to sort the items for the __repr__?

    @rhettinger
    Copy link
    Contributor

    Thanks for the review comments. Incorporated all suggested changes and
    did some other minor tidying-up. Extended the update example to include
    c.update(Counter('abcdee')). Committed as r68559 .

    Decided to leave __repr__() with a sort. Though it's not strictly
    necessary, it's a nice convenience saving a user from doing the same
    thing mentally. Also, it helps with doctests by giving a reliable
    ordering when element counts are non-equal. For datasets small enough
    that someone would want to display their repr, the time spent sorting is
    negligible.

    @rhettinger rhettinger assigned rhettinger and unassigned birkenfeld Jan 12, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants