Issue 10667: collections.Counter object in C

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54876

classification

Title:	collections.Counter object in C
Type:	performance	Stage:	patch review
Components:	Extension Modules, Library (Lib)	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	asvetlov, daniel.urban, gregory.p.smith, jpeel, pitrou, rhettinger
Priority:	low	Keywords:	patch

Created on 2010-12-09 22:58 by jpeel, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
counterpatch.diff	jpeel, 2010-12-09 22:58	Patch for the collections.Counter object
time_counter.py	rhettinger, 2010-12-09 23:09	Timing script
counterpatch2.diff	jpeel, 2010-12-11 00:26	Patch for collections that makes a C-enhanced Counter class and a pure Python PyCounter class
fastcount.patch	rhettinger, 2010-12-14 18:32	fast helper function for counting elements

Messages (9)
msg123706 - (view)	Author: Justin Peel (jpeel)	Date: 2010-12-09 22:58
I put the Counter's update and __missing__ methods into C code. The rest of the existing methods remain the same. The new Counter is at least 2x-10x (or more) faster for updating depending on the input. I also added a new method, update_fromsubs, which is documented below the timings for updating a Counter. Here are a few examples to demonstrate the speed of the new update function: Counting list(range(100))100 - Old Counter: 18.6 msec New Counter: 1.43 13x speed-up Counting range(10000) - Old Counter: 21.5 msec New Counter: 3.84 msec 5.6x speed-up Counter generator i % 100 for i in range(10000) - Old Counter: 36.7 msec New Counter: 17.5 msec 2.1x speed-up and the __missing__ method being in C makes getting missing keys about twice as fast. Also, I added a new method, update_fromsubs, which counts subarrays/substrings of a sequence. This method only works with immutable sequences like tuples, bytes, and unicode objects. The method is of the form update_fromsubs(seq, frame[, step[, lo[, hi]]]) where frame is the size of the subarrays, step is how far to move forward after each subarray, and lo and hi are the respective starting and ending indices to count subarrays within the sequence. For instance, c = Counter() c.update_fromsubs('abcd', 2) yields Counter({'cd': 1, 'ab': 1, 'bc': 1}) and d = Counter() d.update_fromsubs('abcd', 2, 2) yields Counter({'ab': 1, 'cd: 1}) These sorts of operations could be done with generators in a manner like Counter(s[i:i+2] for i in range(0, len(s)-1)) but it can be about twice as fast by using update_fromsubs. Here's an example Counting subarrays of length 2 of "ab"10000: update_fromsubs: 30.8 ms New Counter w/ generator: 54.3 ms Old Counter w/ generator: 98.8 ms This method would probably be most useful in processing strings, but there may be other uses. If it is accepted, please feel to change the method name and parameters. I especially wasn't sure what to call this method (update_fromsubsequences seemed rather long).
msg123707 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-12-09 23:07
Thanks for the patch. FWIW, I'm attaching some timing code that I've used in the past.
msg123710 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-12-09 23:24
When adding some C accelerations, we often try to keep a pure Python alternative in the stdlib, since it can then benefit other Python implementations (and easy prototyping). If you move some of the methods inside a mixin and use multiple inheritance with the base C impl, this should be easy. Also, the unit tests should then test both the C impl and the pure Python impl.
msg123759 - (view)	Author: Justin Peel (jpeel)	Date: 2010-12-11 00:26
I've done as Antoine asked and made a pure Python PyCounter class and a C-enhanced Counter class that both use the mixin CounterBase class. I also added to the tests so that both PyCounter and Counter are tested. I left the update_fromsubs() method in Counter for now, but haven't made a Python equivalent function for PyCounter in case the method is rejected. Also, I realized that my build was still in debug mode so those benchmark weren't quite accurate (Oops). The Counter class is only 4-6x faster for most cases. That's still good, but not quite as good as I'd previously reported. Lastly, I thought of one more method that could be quite useful. Say that you have a list like [('apples',4),('oranges',5),('apples',8)] where the second element in each tuple indicates how many of them there are. The resultant Counter should have {'apples': 12, 'oranges':5}. We don't really have a good way to count this, but we could add a small method like the following: def fromitems(self, items): for key, value in items: self[key] += value and I could easily make a C function for it. I think that this simple method could be very useful to some people. What do you all think of this addition?
msg123761 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-12-11 01:01
I would like this API to "sit and cook" for a good while. There are many possible ways to add more methods and most be end-up being YAGNI. Also, my experience with dict.fromkeys() is that a fair number of people get confused by alternative constructors. So, at this point I have zero interest in adding from_items(). Am taking the C optimizations under advisement. In the meantime, there is no need to keep submitting patch variations. The question is less how-to-do-it and more a question of whether-to-do-it. We try not to add an alternate C paths unless a feature is an important building block. There is an on-going maintenance cost and a risk of slightly different behaviors (at a minimum, the tracebacks look different and the code can't be traced through pdb). That being said, there is a nice speed-up to be had, so I'll give that some weight.
msg123768 - (view)	Author: Justin Peel (jpeel)	Date: 2010-12-11 02:41
Okay, I was done submitting. I thought that Antoine was asking for a version that kept a pure Python version so I did that.
msg123968 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-12-14 18:32
Attaching a new patch for high-speed update() and __init__(). I also tried a C version of __missing__() but the speed-up was too small to be worth it.
msg123987 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2010-12-14 22:40
rhettinger's fastcount.patch looks good to me. a couple style nits but they're minor. no space between if and (? yuck. short variable names like "it"? yuck. but the code looks good otherwise. i'm all for it.
msg124027 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-12-15 16:31
Fixed "if(" spacing and applied in r87265.

History
Date	User	Action	Args
2022-04-11 14:57:10	admin	set	github: 54876
2010-12-15 16:31:54	rhettinger	set	status: open -> closed messages: + msg124027 resolution: fixed nosy: rhettinger, gregory.p.smith, pitrou, asvetlov, daniel.urban, jpeel
2010-12-14 22:40:45	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg123987
2010-12-14 18:32:31	rhettinger	set	files: + fastcount.patch versions: + Python 3.2, - Python 3.3 messages: + msg123968 resolution: later -> (no value) stage: patch review
2010-12-11 02:41:12	jpeel	set	messages: + msg123768
2010-12-11 01:01:43	rhettinger	set	priority: normal -> low resolution: later messages: + msg123761
2010-12-11 00:26:43	jpeel	set	files: + counterpatch2.diff messages: + msg123759
2010-12-10 20:59:26	asvetlov	set	nosy: + asvetlov
2010-12-10 08:52:06	daniel.urban	set	nosy: + daniel.urban
2010-12-09 23:24:02	pitrou	set	nosy: + pitrou messages: + msg123710
2010-12-09 23:09:29	rhettinger	set	files: + time_counter.py
2010-12-09 23:07:52	rhettinger	set	assignee: rhettinger messages: + msg123707
2010-12-09 22:58:58	jpeel	create