Issue 39440: Use PyNumber_InPlaceAdd in sum() for the second iteration onward

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/83621

classification

Title:	Use PyNumber_InPlaceAdd in sum() for the second iteration onward
Type:	performance	Stage:	resolved
Components:		Versions:

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:
Assigned To:		Nosy List:	brandtbucher, edk
Priority:	normal	Keywords:	patch

Created on 2020-01-23 23:00 by edk, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 18156	closed	ammar2, 2020-01-24 00:00
PR 18240	merged	brandtbucher, 2020-01-28 17:08

Messages (4)
msg360583 - (view)	Author: edk (edk)	Date: 2020-01-23 23:00
The C implementation of sum() contains this comment: /* It's tempting to use PyNumber_InPlaceAdd instead of PyNumber_Add here, to avoid quadratic running time when doing 'sum(list_of_lists, [])'. However, this would produce a change in behaviour: a snippet like empty = [] sum([[x] for x in range(10)], empty) would change the value of empty. */ But that doesn't hold after PyNumber_Add has been called once, because from that point onward the accumulator value is something we got from an __add__, and the caller can't know if we reuse it. The in-place version is substantially faster for some workloads-- the example in the comment is an obvious one, but the context in which I ran into this was using a sum of Counters in a one-liner a bit like this: sum((Counter({line.split("\|", 3): len(line)}) for line in sys.stdin), Counter()) in which significant time seems to be spent adding the contents of the previous accumulator value to the new one. # before ; ./python -m timeit 'sum(([x] for x in range(1000)), [])' 500 loops, best of 5: 888 usec per loop # after ; ./python -m timeit 'sum(([x] for x in range(1000)), [])' 5000 loops, best of 5: 65.3 usec per loop ; cat test_.py from collections import Counter import timeit import random data = [Counter({random.choice(['foo', 'bar', 'baz', 'qux']): random.randint(1,1000000)}) for _ in range(10000)] print(min(timeit.repeat('sum(data, Counter())', 'from collections import Counter', number=100, globals={'data': data}))) print(min(timeit.repeat('reduce(Counter.__iadd__, data, Counter())', 'from collections import Counter; from functools import reduce', number=100, globals={'data': data}))) # before ; ./python test_.py 1.8981186050223187 0.7094596439856105 # after ; ./python test_.py 0.715508968976792 0.7050370009965263
msg360587 - (view)	Author: Brandt Bucher (brandtbucher) *	Date: 2020-01-24 00:10
Prior discussion at https://bugs.python.org/issue18305. Note the final comment. In short, this is a breaking semantic change, and the general consensus is that using sum for sequence concatenation is an anti-pattern.
msg360588 - (view)	Author: edk (edk)	Date: 2020-01-24 00:23
It's not only sequences. But I didn't know this had come up before, sorry. Maybe the nature of the warning in the source code s a bit misleading?
msg360884 - (view)	Author: Brandt Bucher (brandtbucher) *	Date: 2020-01-28 17:08
Perhaps. I've opened a PR to update the comment with more info.

History
Date	User	Action	Args
2022-04-11 14:59:25	admin	set	github: 83621
2020-01-28 17:08:51	brandtbucher	set	messages: + msg360884
2020-01-28 17:08:17	brandtbucher	set	pull_requests: + pull_request17619
2020-01-24 00:23:00	edk	set	status: open -> closed messages: + msg360588 stage: patch review -> resolved
2020-01-24 00:20:01	brandtbucher	set	resolution: duplicate
2020-01-24 00:10:14	brandtbucher	set	nosy: + brandtbucher messages: + msg360587
2020-01-24 00:00:01	ammar2	set	keywords: + patch stage: patch review pull_requests: + pull_request17542
2020-01-23 23:00:12	edk	create