This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use PyNumber_InPlaceAdd in sum() for the second iteration onward
Type: performance Stage: resolved
Components: Versions:
process
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: Nosy List: brandtbucher, edk
Priority: normal Keywords: patch

Created on 2020-01-23 23:00 by edk, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 18156 closed ammar2, 2020-01-24 00:00
PR 18240 merged brandtbucher, 2020-01-28 17:08
Messages (4)
msg360583 - (view) Author: edk (edk) Date: 2020-01-23 23:00
The C implementation of sum() contains this comment:

        /* It's tempting to use PyNumber_InPlaceAdd instead of
           PyNumber_Add here, to avoid quadratic running time
           when doing 'sum(list_of_lists, [])'.  However, this
           would produce a change in behaviour: a snippet like
             empty = []
             sum([[x] for x in range(10)], empty)
           would change the value of empty. */

But that doesn't hold after PyNumber_Add has been called once,
because from that point onward the accumulator value is something
we got from an __add__, and the caller can't know if we reuse it.

The in-place version is substantially faster for some workloads--
the example in the comment is an obvious one, but the context in
which I ran into this was using a sum of Counters in a one-liner
a bit like this:

    sum((Counter({line.split("|", 3): len(line)}) for line in sys.stdin), Counter())

in which significant time seems to be spent adding the contents
of the previous accumulator value to the new one.

# before
; ./python -m timeit 'sum(([x] for x in range(1000)), [])'
500 loops, best of 5: 888 usec per loop

# after
; ./python -m timeit 'sum(([x] for x in range(1000)), [])'
5000 loops, best of 5: 65.3 usec per loop


; cat test_.py 
from collections import Counter
import timeit
import random

data = [Counter({random.choice(['foo', 'bar', 'baz', 'qux']): random.randint(1,1000000)}) for _ in range(10000)]

print(min(timeit.repeat('sum(data, Counter())', 'from collections import Counter', number=100, globals={'data': data})))
print(min(timeit.repeat('reduce(Counter.__iadd__, data, Counter())', 'from collections import Counter; from functools import reduce', number=100, globals={'data': data})))

# before
; ./python test_.py
1.8981186050223187
0.7094596439856105

# after
; ./python test_.py 
0.715508968976792
0.7050370009965263
msg360587 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-01-24 00:10
Prior discussion at https://bugs.python.org/issue18305. Note the final comment.

In short, this is a breaking semantic change, and the general consensus is that using sum for sequence concatenation is an anti-pattern.
msg360588 - (view) Author: edk (edk) Date: 2020-01-24 00:23
It's not only sequences. But I didn't know this had come up before, sorry. Maybe the nature of the warning in the source code s a bit misleading?
msg360884 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2020-01-28 17:08
Perhaps. I've opened a PR to update the comment with more info.
History
Date User Action Args
2022-04-11 14:59:25adminsetgithub: 83621
2020-01-28 17:08:51brandtbuchersetmessages: + msg360884
2020-01-28 17:08:17brandtbuchersetpull_requests: + pull_request17619
2020-01-24 00:23:00edksetstatus: open -> closed

messages: + msg360588
stage: patch review -> resolved
2020-01-24 00:20:01brandtbuchersetresolution: duplicate
2020-01-24 00:10:14brandtbuchersetnosy: + brandtbucher
messages: + msg360587
2020-01-24 00:00:01ammar2setkeywords: + patch
stage: patch review
pull_requests: + pull_request17542
2020-01-23 23:00:12edkcreate