
Author josh.r
Recipients Nikita Smetanin, josh.r, remi.lapeyre, rhettinger, xtreak
Date 2019-03-20.14:44:41
Message-id <1553093082.12.0.286405904477.issue36380@roundup.psfhosted.org>
Content
@Nikita: Your examples aren't unexpected at all, per the documented behavior:

>Several mathematical operations are provided for combining Counter objects to produce multisets (counters that have counts greater than zero). Addition and subtraction combine counters by adding or subtracting the counts of corresponding elements. Intersection and union return the minimum and maximum of corresponding counts. Each operation can accept inputs with signed counts, but the output will exclude results with counts of zero or less.

That's pretty explicit; "Each operation can accept inputs with signed counts" means the inputs can have arbitrary counts (negative, zero or positive), but "the output will exclude results with counts of zero or less" means the output will *only* contain positive counts.

This is repeated again in the design "Note" at the bottom:

* The multiset methods are designed only for use cases with positive values. The inputs may be negative or zero, but only outputs with positive values are created.

while also noting (reiterating, in the case of subtract, what that method's docs already specify) that the update and subtract methods are the ones to use when both inputs and outputs must preserve negative or zero counts.

So both of your examples are "expected" per the documentation. 
"Counter(a=-1) + Counter(a=-2) produces empty Counter()" is 100% required by the documentation.
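A quick interactive check of those documented semantics:

```python
from collections import Counter

# Addition and subtraction exclude results with counts <= 0:
print(Counter(a=-1) + Counter(a=-2))  # Counter() -- the sum -3 is excluded
print(Counter(a=3) - Counter(a=5))    # Counter() -- the difference -2 is excluded

# subtract() is the method that preserves negative counts:
c = Counter(a=3)
c.subtract(Counter(a=5))
print(c)                              # Counter({'a': -2})
```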

And, again per the documentation, _keep_positive is necessary, even for the in-place ops, because there is no expectation that the inputs are positive.

So your original motivating case could be spelled:

a = Counter(...)  # e.g. 1M elements
b = Counter(...)  # e.g. 10 elements
# or just b = {...} or b = some_iterable_of_values_to_count since update works just
# fine with plain dicts and iterables to count, and making a second Counter is
# unnecessary overhead

a.update(b)

and it would work (and since _keep_positive isn't involved, it would run faster). Yes, if you want to eliminate non-positive counts you'll eventually need to do:

a += Counter()

(or, abusing implementation details, a._keep_positive()), but that's largely unavoidable; the only way to honor the documented guarantees without a per-operation _keep_positive would be to slow down many operations (and dramatically increase memory requirements) by separately tracking the non-positive values as they're created, just in case a math operation later needs to remove them.
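For concreteness, a minimal sketch of that cleanup step (unary + has the same documented effect, producing a new Counter):

```python
from collections import Counter

c = Counter(a=2, b=0, d=-1)
c += Counter()   # any in-place multiset op triggers _keep_positive
print(c)         # Counter({'a': 2}) -- zero and negative counts dropped

print(+Counter(a=2, b=0, d=-1))  # same effect, as a new Counter
```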

And when I say "slow down", I don't mean by a little. Right now, Counter doesn't overload __setitem__ (it just inherits dict's implementation). So when you do:

c = Counter()
c['a'] += 1
c['a'] += 1

that calls dict.__getitem__ twice (the first of which ends up calling Counter.__missing__ because the key 'a' doesn't exist), and dict.__setitem__ twice (to store back 1, then 2).
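This is easy to verify from the interpreter; neither half of the item protocol is redefined on Counter:

```python
from collections import Counter

# Counter inherits the item protocol unchanged from dict:
print(Counter.__setitem__ is dict.__setitem__)  # True
print(Counter.get is dict.get)                  # True

c = Counter()
c['a'] += 1  # lookup -> Counter.__missing__ returns 0, then dict.__setitem__ stores 1
c['a'] += 1  # lookup -> 1, then dict.__setitem__ stores 2
print(c['a'])                                   # 2
```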

To make Counter track non-positive values live, you'd need to reimplement __setitem__ as:

def __setitem__(self, elem, count):
    if count <= 0:
        self._nonpositive.add(elem)
    else:
        self._nonpositive.discard(elem)
    return super().__setitem__(elem, count)

which would add significant overhead, because:

1. Any movement from the C layer back to the Python layer is a massive performance hit in general
2. It wouldn't just slow down __setitem__, it would also slow down the _count_elements accelerator for counting iterables (the thing that finally made Counter faster than defaultdict(int) for its intended use); the accelerator only takes the fast path when the mapping in question is a dict subclass that inherited dict's get and __setitem__ methods unmodified. If either one is overridden, it falls back to the slow path, which misses out on several major improvements available only with the concrete dict API.
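For reference, the pure-Python fallback for _count_elements in collections is roughly the following; the C accelerator does the same job through the concrete dict API when the fast-path conditions above hold:

```python
# Pure-Python fallback shape of collections._count_elements: tally
# elements from an iterable into a mapping via get/__setitem__.
def _count_elements(mapping, iterable):
    mapping_get = mapping.get
    for elem in iterable:
        mapping[elem] = mapping_get(elem, 0) + 1

d = {}
_count_elements(d, 'aab')
print(d)  # {'a': 2, 'b': 1}
```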

On top of the performance hit, it wouldn't be remotely thread-safe: Counter (and most Python level types) aren't fully thread-safe to start with, but they generally break in predictable ways (e.g. doing c[key] += 1 in two threads simultaneously might only increment once, not twice, but at least the logical outcome of at least *one* of the operations occurred, not something wholly unrelated).

The race condition here would break code a long time after the actual race occurred; two threads could try to set the same value, one positive, one negative, and interleaved improperly, the negative value could be stored in the dict, but discarded from _nonpositive (or vice versa, but that's less problematic for correctness). Now when something should be eliminating non-positive values, it doesn't know that the negative value is even there, so it survives indefinitely (unless something else sets the count for it again anyway).

For the record, if you want to make a Counter without this behavior, it's actually pretty easy:

class SignedCounter(Counter):
    def _keep_positive(self):
        # Must return self, not None: the in-place ops (__iadd__ etc.)
        # return self._keep_positive() as their result.
        return self

but the interlocking documented requirements on mathematical ops and the methods mean you definitely can't change Counter itself.
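For example (note the override must return self rather than just pass, since the in-place operators return self._keep_positive(); and this only changes the in-place ops, because the binary ops filter non-positive results inline):

```python
from collections import Counter

class SignedCounter(Counter):
    def _keep_positive(self):
        # The in-place ops return self._keep_positive(), so this
        # must return self, not fall through to None.
        return self

sc = SignedCounter(a=-1)
sc += Counter(a=-2)
print(sc)  # SignedCounter({'a': -3}) -- negative count preserved

pc = Counter(a=-1)
pc += Counter(a=-2)
print(pc)  # Counter() -- plain Counter drops it
```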