Issue 1696199: Add collections.counts()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/44815

classification

Title:	Add collections.counts()
Type:	enhancement	Stage:	test needed
Components:	Library (Lib)	Versions:	Python 3.1, Python 2.7

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	amaury.forgeotdarc, bethard, georg.brandl, ned.deily, pitrou, rhettinger
Priority:	normal	Keywords:	patch

Created on 2007-04-07 21:38 by bethard, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
collections_counts.diff	bethard, 2007-04-07 21:38	Patch adding collections.counts()
counter6.diff	rhettinger, 2009-01-12 11:16	Full patch with tests and improved docs
counter7.diff	georg.brandl, 2009-01-12 19:40

Messages (17)
msg31726 - (view)	Author: Steven Bethard (bethard) *	Date: 2007-04-07 21:38
As suggested in http://mail.python.org/pipermail/python-list/2007-April/433986.html this is a patch to add a counts() function to the collections module. Usage looks like: >>> items = 'acabbacba' >>> item_counts = counts(items) >>> for item in 'abcd': ... print item, item_counts[item] ... a 4 b 3 c 2 d 0 Yes, it's only a 4-line function, but it's a frequently re-written 4-line function.
msg31727 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2007-04-07 23:39
Does it have to be a defaultdict? I.e. is it important that item_counts['d'] not raise KeyError?
msg31728 - (view)	Author: Steven Bethard (bethard) *	Date: 2007-04-07 23:48
I think it's okay if it's not a defaultdict. That was the simplest implementation, but I certainly have no problem calling d.get() when necessary. Should I change the implementation to use a dict()?
msg31729 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2007-04-08 00:23
I guess it's simplicity of implementation vs. simplicity of use. And I'm not even sure which is easier to use. It's just that defaultdicts are a very new thing and still feel "weird" -- even though I pushed for the implementation based on popular demand I'm not a user myself. Perhaps ask around on python-dev?
msg31730 - (view)	Author: Steven Bethard (bethard) *	Date: 2007-04-19 15:15
A summary of the python-dev thread (http://mail.python.org/pipermail/python-dev/2007-April/072502.html) Since the number of times an unseen item was seen is 0, most people felt returning 0 was more natural behavior than raising KeyError. There was some discussion of alternate names, but most people were fine with counts(). Raymond suggested making it a classmethod of dict, but people were a little concerned about adding to dict's already complex API, and since the result of counts() needed to return 0s instead of raising KeyErrors, it wouldn't really have the same behavior as a plain dict() anyway.
msg79461 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-01-09 11:04
Attaching an updated patch for Py2.7. * Kept OP's simple constructor call but renamed it from counts() to Counter(): item_counts = Counter('acabbacba') * Followed Guido's advice and avoided subclassing from defaultdict(). Instead, subclassed from dict, keeping its familiar API. * Avoided KeyError issue by defining __missing__(). * Added most_common() method to address a principal use case -- patterned after Smalltalk's sortedByCount() method. * Added elements() method requested by Alex Martelli -- patterned after C++, Smalltalk, and Knuth's examples. * Made necessary subclass overrides to dict methods: __repr__, update, fromkeys, and copy. * Wrote docstrings, doctests, and motivating examples. * Verified that instances are copyable, deepcopyable, and picklable. Working on docs and unittests. Nice example (most common words in a text file): >>> import re >>> words = re.findall('\w+', open('hamlet.txt').read().lower()) >>> Counter(words).most_common(10) [('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631), ('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
msg79467 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-01-09 12:13
Added a few more in-module tests.
msg79471 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-01-09 14:36
Some comments: - Counter sounds like a really strange name for a container. Why not call it "Bag" or "Multiset" (or "CountingSet"?) - why are the unittests inline rather than in Lib/test? inline tests don't get executed by the buildbots, nor by people running the test suite at home
msg79512 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-01-09 21:36
The counts/counter moniker emerged from the python-dev discussion and I'm basically happy with it since the typical usage is c=Counter(myseq) with no other non-dict accesses (mostly just c[elem]+=1 and print c[elem]). It's a simple counter with a dict interface so the name shouldn't suggest anything more complicated than that. To me, MultiSet or CountingSet is too offputtingly computer-sciency and misleadingly suggests a set-like API instead of a dict interface. I know several programmers who don't know the terms, bag or multiset, but they intuitively understand what a counter does. Am open to calling it a Bag but I rather like the self-descriptiveness and simplicity of Counter. As noted previously, standalone unittests are forthcoming (and a doc patch). Wanted to get the API and sample use cases worked-out first. Thanks for looking at the initial patch.
msg79650 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-01-12 08:01
Georg, could you give this a once over before I commit? Thanks.
msg79661 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2009-01-12 10:40
Yes, I'll have a look this evening.
msg79663 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-01-12 11:16
Attaching an update with improved docs. Thanks for looking at this.
msg79690 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-01-12 18:20
> the typical usage is c=Counter(myseq) with no other non-dict accesses > (mostly just c[elem]+=1 and print c[elem]) Isn't collections.defaultdict(lambda:0) enough for this purpose?
msg79694 - (view)	Author: Steven Bethard (bethard) *	Date: 2009-01-12 18:55
The whole point was to have a function (or class) that accumulates a sequence and counts it. collections.defaultdict(lambda: 0) doesn't achieve this on its own because it only knows how to handle sequences of (key, value): >>> d = collections.defaultdict(lambda: 0) >>> d.update('aaabbac') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: dictionary update sequence element #0 has length 1; 2 is required The feature request here was mainly a request to provide an abbreviation of a very common 4-line function.
msg79696 - (view)	Author: Ned Deily (ned.deily) *	Date: 2009-01-12 19:25
In counter6.diff line 56 "Assigning a count of zero or reducing the count to the zero leaves the" suggest s/the zero/zero/
msg79698 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2009-01-12 19:40
Attaching new patch with small changes: * Don't describe a class with "Returns ..." as if it was a function. * Indent interposed paragraphs so that the method descriptions still belong to the .. class directive. * Fixed Ned's typo. * Note that elements() and most_common() order elements arbitrarily. * Reorder the seealso's a bit; the first line is bold. Questions: * Steven's counter.update('abcdee') wouldn't work -- I guess one is supposed to use counter.update(Counter('abcdee')). Should that be noted in the docs? * Is it needed to sort the items for the __repr__?
msg79707 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-01-12 23:05
Thanks for the review comments. Incorporated all suggested changes and did some other minor tidying-up. Extended the update example to include c.update(Counter('abcdee')). Committed as r68559 . Decided to leave __repr__() with a sort. Though it's not strictly necessary, it's a nice convenience saving a user from doing the same thing mentally. Also, it helps with doctests by giving a reliable ordering when element counts are non-equal. For datasets small enough that someone would want to display their repr, the time spent sorting is negligible.

History
Date	User	Action	Args
2022-04-11 14:56:23	admin	set	github: 44815
2009-01-12 23:05:13	rhettinger	set	status: open -> closed assignee: georg.brandl -> rhettinger resolution: accepted messages: + msg79707
2009-01-12 19:56:35	gvanrossum	set	nosy: - gvanrossum
2009-01-12 19:40:24	georg.brandl	set	files: + counter7.diff messages: + msg79698
2009-01-12 19:25:09	ned.deily	set	nosy: + ned.deily messages: + msg79696
2009-01-12 18:55:10	bethard	set	messages: + msg79694
2009-01-12 18:20:57	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg79690
2009-01-12 11:16:40	rhettinger	set	files: - counter5.diff
2009-01-12 11:16:34	rhettinger	set	files: - counter4.diff
2009-01-12 11:16:20	rhettinger	set	files: + counter6.diff messages: + msg79663
2009-01-12 10:40:29	georg.brandl	set	messages: + msg79661
2009-01-12 08:01:16	rhettinger	set	files: + counter5.diff assignee: rhettinger -> georg.brandl messages: + msg79650 nosy: + georg.brandl
2009-01-09 23:04:38	rhettinger	set	files: - counter3.diff
2009-01-09 23:04:24	rhettinger	set	files: + counter4.diff
2009-01-09 21:36:45	rhettinger	set	messages: + msg79512
2009-01-09 14:36:53	pitrou	set	nosy: + pitrou messages: + msg79471
2009-01-09 13:02:42	rhettinger	set	files: - counter2.diff
2009-01-09 13:01:35	rhettinger	set	files: + counter3.diff
2009-01-09 12:14:26	rhettinger	set	files: - counter.diff
2009-01-09 12:13:54	rhettinger	set	files: + counter2.diff messages: + msg79467
2009-01-09 11:04:38	rhettinger	set	files: + counter.diff versions: + Python 3.1, Python 2.7, - Python 2.6 messages: + msg79461 keywords: + patch type: enhancement stage: test needed
2007-04-07 21:38:28	bethard	create