Issue 36380: collections.Counter in-place operators are unexpectedly slow

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/80561

classification

Title:	collections.Counter in-place operators are unexpectedly slow
Type:	performance	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.8, Python 3.7, Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	Nikita Smetanin, josh.r, remi.lapeyre, rhettinger, xtreak
Priority:	normal	Keywords:

Created on 2019-03-20 12:13 by Nikita Smetanin, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (10)
msg338461 - (view)	Author: Nikita Smetanin (Nikita Smetanin)	Date: 2019-03-20 12:13
All of collections.Counter in-place operators: +=, -=, \|= and &= are obviously expected to have time complexity O(b) as in the following example: a = Counter(...) # e.g 1M elements b = Counter(...) # e.g. 10 elements a += b But in fact, all of them are having O(a + b) time complexity due to inefficient implementation of _keep_positive method, which checks ALL of the elements of "a" counter after modification while it's required only to check CHANGED elements (no more than a size of "b") having time complexity O(b). See https://github.com/python/cpython/blob/master/Lib/collections/__init__.py#L819 It also unclear if there's even a need to check for non-positives with _keep_positive in ops like __iadd__, __iand__ and __ior__ (except __isub__) as it expects Counters which are always positive. This unobvious inefficiency leads to unnecessary large execution times while, for example, iteratively accumulating some small Counters into a large one (frequent case). In this case .update method works much faster, but it doesn't check for zeros or negatives for some reason (is this a bug too?).
msg338463 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-03-20 12:39
> It also unclear if there's even a need to check for non-positives with _keep_positive in ops like __iadd__, __iand__ and __ior__ (except __isub__) as it expects Counters which are always positive. Counter accepts dictionaries and keyword arguments where keys can have negative or zero as values where _keep_positive needs to be checked for add operation too. >>> Counter(a=3) + Counter(a=-1) Counter({'a': 2}) Counter can have keys where the value is already 0 during construction that are not deleted and thus during __iadd__ all the keys have to be checked for zero. Below is the behavior in Python 3.7 and checking only for changed "b" key for non-zero in the example would leave "a" key also in the result with the optimization proposed? $ python3.7 Python 3.7.1rc2 (v3.7.1rc2:6c06ef7dc3, Oct 13 2018, 05:10:29) [Clang 6.0 (clang-600.0.57)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from collections import Counter >>> a = Counter(a=0, b=1) >>> a Counter({'b': 1, 'a': 0}) # Not removed during construction >>> b = Counter(b=1) >>> a += b >>> a Counter({'b': 2})
msg338465 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2019-03-20 12:58
@xtreak __init__ delegates work to update() which has the same behavior: Python 3.7.2 (default, Feb 12 2019, 08:15:36) [Clang 10.0.0 (clang-1000.11.45.5)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from collections import Counter >>> d = Counter() >>> d.update(a=0) >>> d Counter({'a': 0}) If we want to keep this behavior could we keep an internal list of keys whose value is 0 that we append to at https://github.com/python/cpython/blob/master/Lib/collections/__init__.py#L652 so we doing operations we just check values of keys which are updated and the content of this list?
msg338470 - (view)	Author: Nikita Smetanin (Nikita Smetanin)	Date: 2019-03-20 13:28
@xtreak I agree, also, this behavior is stated in documentation, but it's quite inconsistent in many ways, like in the following examples: Counter(a=-1) + Counter(a=-2) produces empty Counter() instead of Counter(a=-2) which is unexpected, but Counter(a=-1) + Counter(a=2) produces correct result Counter(a=1). The same is with in-place operators. That makes no sense. It's clear that in-place addition, for example, isn't a good place to remove negative elements which wasn't involved in this operation at all. The only possible optimization here (if we don't want to change the behavior) is to keep track of positive / non-positive elements in a separate set. The better solution could be in providing two classes — Counter for any (signed) values with consistent operations and MultisetCounter (e.g.) or specific methods (like .multiset_add) for positive-only values and positive-checks.
msg338471 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-03-20 13:36
I would defer to Raymond at this point and would prefer keeping current __iadd__ behavior and . See also issue23509 that discussed about C implementation of _keep_positive and some other optimizations. > Counter(a=-1) + Counter(a=-2) produces empty Counter() instead of Counter(a=-2) which is unexpected, but I think you meant Counter(a=-3) here
msg338477 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2019-03-20 14:44
@Nikita: Your examples aren't unexpected at all, per the documented behavior: >Several mathematical operations are provided for combining Counter objects to produce multisets (counters that have counts greater than zero). Addition and subtraction combine counters by adding or subtracting the counts of corresponding elements. Intersection and union return the minimum and maximum of corresponding counts. Each operation can accept inputs with signed counts, but the output will exclude results with counts of zero or less. That's pretty explicit; "Each operation can accept inputs with signed counts" means the inputs can have arbitrary counts (negative, zero or positive), but "the output will exclude results with counts of zero or less" means the output will only contain positive counts. This is repeated again in the design "Note" at the bottom: * The multiset methods are designed only for use cases with positive values. The inputs may be negative or zero, but only outputs with positive values are created. while also noting (reiterating in the case of subtract, whose method docs already specify this) that the update and subtract methods are appropriate for use when both inputs and outputs must preserve negative or zero values. So both of your examples are "expected" per the documentation. "Counter(a=-1) + Counter(a=-2) produces empty Counter()" is 100% required by the documentation. And, again per the documentation, _keep_positive is necessary, even for the in-place ops, because there is no expectation that the inputs are positive. So your original motivating case could be spelled: a = Counter(...) # e.g 1M elements b = Counter(...) # e.g. 10 elements # or just b = {...} or b = some_iterable_of_values_to_count since update works just # fine with plain dicts and iterables to count, and making a second Counter is # unnecessary overhead a.update(b) and it would work (and since _keep_positive isn't involved, it would run faster). Yes, if you want to eliminate non-positive counts you'll eventually need to do: a += Counter() (or abusing implementation details, a._keep_positive()), but that's largely unavoidable; the only way to adhere to the documented guarantees without a per mathematical operation _keep_positive would be to slow down many operations (and dramatically increase memory requirements) by separately tracking the non-positive values as they're created, just in case a math operation would need to remove them. And when I say "slow down", I don't mean by a little. Right now, Counter doesn't overload __setitem__ (it just inherits dict's implementation). So when you do: c = Counter() c['a'] += 1 c['a'] += 1 that calls dict.__getitem__ twice (the first of which ends up calling Counter.__missing__ because the key 'a' doesn't exist), and dict.__setitem__ twice (to store back 1, then 2). To make Counter track non-positive values live, you'd need to reimplement __setitem__ as: def __setitem__(self, elem, count): if count <= 0: self._nonpositive.add(elem) else: self._nonpositive.discard(elem) return super().__setitem__(elem, count) which would significant overhead, because: 1. Any movement from the C layer back to the Python layer is a massive performance hit in general 2. It wouldn't just slow __setitem__, it would also slow down the _count_elements accelerator for counting iterables (the thing that finally made Counter faster than defaultdict(int) for its intended use); the accelerator only uses the fast path when the mapping in question is a dict subclass and inherited dict's get and __setitem__ methods unmodified. If either one is overridden, it falls back to the slow path, which misses out on several major improvements available only with the concrete dict API. On top of the performance hit, it wouldn't be remotely thread-safe: Counter (and most Python level types) aren't fully thread-safe to start with, but they generally break in predictable ways (e.g. doing c[key] += 1 in two threads simultaneously might only increment once, not twice, but at least the logical outcome of at least one of the operations occurred, not something wholly unrelated). The race condition here would break code a long time after the actual race occurred; two threads could try to set the same value, one positive, one negative, and interleaved improperly, the negative value could be stored in the dict, but discarded from _nonpositive (or vice versa, but that's less problematic for correctness). Now when something should be eliminating non-positive values, it doesn't know that the negative value is even there, so it survives indefinitely (unless something else sets the count for it again anyway). For the record, if you want to make a Counter without this behavior, it's actually pretty easy: class SignedCounter(Counter): def _keep_positive(self): pass but the interlocking documented requirements on mathematical ops and the methods mean you definitely can't change Counter itself.
msg338486 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-03-20 15:50
As Josh pointed out, this is how counters are documented and tested to work, so it is an unchangeable decision (i.e. it would break code that relying on the saturating arithmetic feature). Accordingly, I'm marking this as closed, not a bug. On the plus side (no pun intended), a counter is just a kind of dictionary so you can easily implement your own behaviors with just a simple for-loop: for x in a: b[a] += a
msg338491 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2019-03-20 16:10
Hi what do you think of a patch to this effect that would speed up operations without changing the current semantics? diff --git a/Lib/collections/__init__.py b/Lib/collections/__init__.py index cff75a48d6..fe5d5b2dca 100644 --- a/Lib/collections/__init__.py +++ b/Lib/collections/__init__.py @@ -561,6 +561,7 @@ class Counter(dict): if len(args) > 1: raise TypeError('expected at most 1 arguments, got %d' % len(args)) super(Counter, self).__init__() + self._nonpositives = [] self.update(args, *kwds) def __missing__(self, key): @@ -652,6 +653,12 @@ class Counter(dict): self[elem] = count + self_get(elem, 0) else: super(Counter, self).update(iterable) # fast path when counter is empty + for elem, count in self.items(): + try: + if not count > 0: + self._nonpositives.append(elem) + except TypeError: + pass else: _count_elements(self, iterable) if kwds: @@ -818,9 +825,10 @@ class Counter(dict): def _keep_positive(self): '''Internal method to strip elements with a negative or zero count''' - nonpositive = [elem for elem, count in self.items() if not count > 0] - for elem in nonpositive: - del self[elem] + for elem in self._nonpositives: + if not self[elem] > 0: + del self[elem] + self._nonpositives.clear() return self def __iadd__(self, other): @@ -833,7 +841,10 @@ class Counter(dict): ''' for elem, count in other.items(): - self[elem] += count + count = self[elem] + count + self[elem] = count + if not count > 0: + self._nonpositives.append(elem) return self._keep_positive() def __isub__(self, other): @@ -846,7 +857,10 @@ class Counter(dict): ''' for elem, count in other.items(): - self[elem] -= count + count = self[elem] - count + self[elem] = count + if not count > 0: + del self[elem] return self._keep_positive() def __ior__(self, other): @@ -858,10 +872,11 @@ class Counter(dict): Counter({'b': 3, 'c': 2, 'a': 1}) ''' - for elem, other_count in other.items(): - count = self[elem] - if other_count > count: - self[elem] = other_count + for elem, count in other.items(): + count = max(count, self[elem]) + self[elem] = count + if not count > 0: + self._nonpositives.append(elem) return self._keep_positive() def __iand__(self, other): @@ -874,9 +889,10 @@ class Counter(dict): ''' for elem, count in self.items(): - other_count = other[elem] - if other_count < count: - self[elem] = other_count + count = min(other[elem], count) + self[elem] = count + if not count > 0: + self._nonpositives.append(elem) return self._keep_positive() It seems to give great improvements for the case Nikita Smetanin gave: ➜ cpython git:(speed-up-Counter) ✗ ./python.exe tests.py Setup took: 9.101021380999999 Computation took: 1.634299999864197e-05 ➜ cpython git:(speed-up-Counter) ✗ python3 tests.py Setup took: 11.615437155 Computation took: 0.34183154800000004 Is there a performance test suite for Counter I could use to check this does not add regressions in some cases?
msg338492 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-03-20 16:15
I don't want to complexify the normal case. Please don't go down the path of turning something simple into a mess. Because counters are just a dict subclass, users are free to make updates in any way they want.
msg338505 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-03-20 19:07
FWIW, the promised semantics of saturating arithmetic require that _keep_positive be run on entire the entire counter: >>> c1 = Counter(a=-3, b=4) >>> +c1 Counter({'b': 4}) >>> from collections import Counter >>> c1 = Counter(a=-3, b=4) >>> c2 = Counter(b=5, c=5) >>> c1 += c2 # The "a" entry gets removed >>> c1 Counter({'b': 9, 'c': 5}) When this behavior isn't wanted, use the update() method which is documented not to perform removal of non-positive values. That method is fast in pure python and has an even faster C helper function: >>> c1 = Counter(a=-3, b=4) >>> c2 = Counter(b=5, c=5) >>> c1.update(c2) >>> c1 Counter({'b': 9, 'c': 5, 'a': -3}) And as mentioned before, the Counter API is not encapsulated -- the underlying structure is fully exposed, so a user is free to do anything with it that they can do with a regular dictionary. A final thought is to be careful when proposing to rewrite the internals of the existing methods. These methods are careful to retain ordering of inputs, to not use auxiliary data fields (it is just a dict variant with no other structure), and to make only minimal assumptions about the datatype for the count. They are designed to not be magical.

History
Date	User	Action	Args
2022-04-11 14:59:12	admin	set	github: 80561
2019-03-20 19:07:07	rhettinger	set	messages: + msg338505
2019-03-20 16:15:49	rhettinger	set	messages: + msg338492
2019-03-20 16:10:33	remi.lapeyre	set	messages: + msg338491
2019-03-20 15:50:05	rhettinger	set	status: open -> closed resolution: not a bug messages: + msg338486 stage: resolved
2019-03-20 15:44:32	rhettinger	set	assignee: rhettinger
2019-03-20 14:44:42	josh.r	set	nosy: + josh.r messages: + msg338477
2019-03-20 13:36:48	xtreak	set	messages: + msg338471
2019-03-20 13:28:02	Nikita Smetanin	set	messages: + msg338470
2019-03-20 12:58:40	remi.lapeyre	set	nosy: + remi.lapeyre messages: + msg338465
2019-03-20 12:39:30	xtreak	set	nosy: + xtreak messages: + msg338463
2019-03-20 12:24:35	SilentGhost	set	versions: - Python 3.5, Python 3.6
2019-03-20 12:19:05	xtreak	set	nosy: + rhettinger
2019-03-20 12:13:26	Nikita Smetanin	create