classification
Title: collections.Counter's += copies the entire object
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: larsmans, petri.lehtinen, python-dev, rhettinger
Priority: low Keywords: patch

Created on 2011-10-07 08:50 by larsmans, last changed 2011-10-19 20:41 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
cpython-counter-iadd.diff larsmans, 2011-10-07 08:50 Patch that adds __iadd__ to collections.Counter review
counter.diff rhettinger, 2011-10-19 05:06 Patch for in-place multiset operations review
Messages (5)
msg145063 - (view) Author: Lars Buitinck (larsmans) Date: 2011-10-07 08:50
I've found some counterintuitive behavior in collections.Counter while hacking on the scikit-learn project [1]. I wanted to use a bunch of Counters to do some simple term counting in a set of documents, roughly as follows:

   count_total = Counter()
   for doc in documents:
       count_current = Counter(analyze(doc))
       count_total += count_current
       count_per_doc.append(count_current)

Performance was horrible. After some digging, I found out that Counter [2] does not have __iadd__ and += copies the entire left-hand side in __add__. I've attached a patch that fixes the issue (for += only, and I've not patched the testsuite.)

[1] https://github.com/scikit-learn/scikit-learn/commit/de6e93094499e4d81b8e3b15fc66b6b9252945af
msg145065 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2011-10-07 08:58
This is slightly backwards incompatible, as some people may depend on the old behavior. Otherwise sounds like a good idea to me.
msg145067 - (view) Author: Lars Buitinck (larsmans) Date: 2011-10-07 09:17
If this is not implemented because it is backwards incompat, then it might be useful to add a note to update's docstring explaining that it is much more efficient than +=. I was very surprised that it took *minutes* to add a few thousand moderate-sized Counters.
msg145071 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-10-07 11:49
I'll add the in-place methods including __iadd__, __isub__, __iand__, and __ior__.

If speed is your issue, you should continue to use the update() method which will always be faster because it doesn't have a step to strip zeros and negative values from the existing Counter.  Also, update() has a fast-path written in C.
msg145959 - (view) Author: Roundup Robot (python-dev) Date: 2011-10-19 20:40
New changeset 5cced40374df by Raymond Hettinger in branch 'default':
Issue #13121: Support in-place math operators for collections.Counter().
http://hg.python.org/cpython/rev/5cced40374df
History
Date User Action Args
2011-10-19 20:41:20rhettingersetstatus: open -> closed
resolution: fixed
2011-10-19 20:40:49python-devsetnosy: + python-dev
messages: + msg145959
2011-10-19 05:06:59rhettingersetfiles: + counter.diff
2011-10-07 11:49:28rhettingersetpriority: normal -> low

messages: + msg145071
2011-10-07 09:17:07larsmanssetmessages: + msg145067
2011-10-07 09:11:35mark.dickinsonsettype: behavior -> enhancement
2011-10-07 09:11:18mark.dickinsonsetassignee: rhettinger
2011-10-07 08:58:13petri.lehtinensetversions: + Python 3.3, - Python 3.4
nosy: + rhettinger, petri.lehtinen

messages: + msg145065

stage: patch review
2011-10-07 08:50:33larsmanscreate