This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Bad performance of colllections.Counter at initialisation from an iterable
Type: performance Stage:
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: SilentGhost, rhettinger
Priority: low Keywords:

Created on 2009-06-29 14:56 by SilentGhost, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
tmp_time_missing.py rhettinger, 2009-06-29 22:47 Timing suite
Messages (4)
msg89846 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2009-06-29 14:56
I'm comparing initialisation of Counter from an iterable with the
following function:

def unique(seq):
	"""Dict of unique values (keys) & their counts in original sequence"""

	out_dict = dict.fromkeys(set(seq), 0)
	for i in seq:
		out_dict[i] += 1
	return out_dict


iterable = list(range(43)) + list(range(43, 0, -1))

The timeit-obtained values show that it takes Counter four (4) times
longer to finish. As it's obvious from comparing my function and lines
429-430 of collections.py the only difference is preallocating the final
dictionary. When line 430 of collections is replaced with:

self[elem] = self.get(elem, 0) + 1

I was able to get about 25% time-performance increase (I assume
__missing__ is bypassed). I hope that it's possible to improve its
implementation even further.
msg89857 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-06-29 16:11
Your proposed function assumes the input is a sequence and it cannot
handle not a general purpose iterable (it gives the wrong answer when
the latter is submitted).

Also, it does two lookups per item which can be expensive if the
elements have costly hash functions (such as Decimal objects, tuple
objects, and strings that haven't been interned).  

And, it does not return a dict subclass, so it cannot provide all of the
methods currently offered by Counter (such as most_common() and
elements()), nor does it override regular dict methods that do not make
sense for counters (such as the update() method).

Test data:

# case that give *wrong* answer because the input isn't reiterable
>>> unique(c.lower() for c in 'Abracadabra')

# case that is slower because of expensive multiple lookups
>>> unique([abs(Decimal(x)/4) for x in range(-2000, 2000)])

# case that fails because update() was not overridden
>>> c = unique(range(10))
>>> c.update(range(5))

# cases that do not provide needed subclass methods
>>> unique('abracadabra').most_common()
>>> unique('abracadabra').elements()

Though the code for unique() is hopeless, I will take a look at the
"self[elem] = self.get(elem, 0) + 1" approach.  That shows some promise.
msg89863 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2009-06-29 16:36
I've never meant to suggest any kind of replacement of the Counter with
my example. I just tried to show that 

self[elem] += 1    # line 430 of collections.py

which at initialisation naturally propagates to __missing__ is somewhat
slow. and had it been possible to always have key in the Counter, it
might have speed up initialisation and/or update
msg89873 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-06-29 19:11
I run many timings and the dict.get() approach is decidedly faster. 
Thanks for pointing that out.  Applied in r73695, r73696, and 73697.

FWIW, a timing suite is attached.
History
Date User Action Args
2022-04-11 14:56:50adminsetgithub: 50619
2009-06-29 22:47:27rhettingersetfiles: + tmp_time_missing.py
2009-06-29 22:47:04rhettingersetfiles: - tmp_time_missing.py
2009-06-29 19:11:28rhettingersetstatus: open -> closed
files: + tmp_time_missing.py
versions: + Python 2.7, Python 3.2
messages: + msg89873

resolution: fixed
2009-06-29 16:36:35SilentGhostsetmessages: + msg89863
2009-06-29 16:11:15rhettingersetmessages: + msg89857
2009-06-29 15:07:32rhettingersetpriority: low
assignee: rhettinger

nosy: + rhettinger
2009-06-29 14:56:27SilentGhostcreate