Classification
Title: Improved performance for list addition
Type: performance
Stage: resolved
Components: Interpreter Core
Versions: Python 3.9

Process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: brandtbucher, pablogsal, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2019-10-10 19:16 by brandtbucher, last changed 2019-10-17 05:52 by brandtbucher. This issue is now closed.

Pull Requests
PR 16705: closed (brandtbucher, 2019-10-10 19:16)
Messages (10)
msg354399 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-10 19:16
The attached PR adds a fast path for BINARY_ADD instructions involving two lists, where the left list has a refcount of exactly 1.

In this case, we do a PySequence_InPlaceConcat operation instead. This has the effect of avoiding quadratic complexity for list summations, by keeping only one intermediate result and extending it in place.

For (potentially large) lists:

    a + b + c + d ...

Currently executed as:

    tmp = a + b
    tmp = tmp + c
    tmp = tmp + d
    ...

With this change:

    tmp = a + b
    tmp += c
    tmp += d
    ...
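The complexity difference sketched above can be demonstrated in pure Python (illustrative names and sizes, not the patched C code itself):

```python
# Repeated "tmp = tmp + c" copies the accumulated result on every step,
# giving quadratic total work; "tmp += c" extends one list in place,
# giving linear total work. Both produce the same result.
chunks = [[0] * 100 for _ in range(200)]

def copying():
    tmp = []
    for c in chunks:
        tmp = tmp + c      # allocates a brand-new list each iteration
    return tmp

def in_place():
    tmp = []
    for c in chunks:
        tmp += c           # extends the same list object in place
    return tmp

assert copying() == in_place()
```

Timing these two functions with `timeit` for growing numbers of chunks shows the copying version's runtime growing roughly quadratically while the in-place version stays linear.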

Here are the pyperformance results (optimizations enabled):

Slower (6):
- pickle_list: 3.63 us +- 0.08 us -> 3.78 us +- 0.10 us: 1.04x slower (+4%)
- xml_etree_generate: 102 ms +- 2 ms -> 104 ms +- 3 ms: 1.02x slower (+2%)
- xml_etree_parse: 166 ms +- 4 ms -> 169 ms +- 9 ms: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 5.42 ms +- 0.11 ms -> 5.47 ms +- 0.16 ms: 1.01x slower (+1%)
- pickle_dict: 23.6 us +- 0.8 us -> 23.8 us +- 0.4 us: 1.01x slower (+1%)
- scimark_fft: 413 ms +- 6 ms -> 416 ms +- 6 ms: 1.01x slower (+1%)

Faster (25):
- 2to3: 394 ms +- 18 ms -> 381 ms +- 8 ms: 1.03x faster (-3%)
- python_startup_no_site: 12.0 ms +- 0.2 ms -> 11.7 ms +- 0.2 ms: 1.03x faster (-3%)
- nqueens: 121 ms +- 2 ms -> 118 ms +- 3 ms: 1.03x faster (-2%)
- pathlib: 36.0 ms +- 1.3 ms -> 35.1 ms +- 0.9 ms: 1.02x faster (-2%)
- pickle_pure_python: 523 us +- 15 us -> 511 us +- 13 us: 1.02x faster (-2%)
- deltablue: 8.54 ms +- 0.25 ms -> 8.35 ms +- 0.27 ms: 1.02x faster (-2%)
- logging_silent: 220 ns +- 6 ns -> 215 ns +- 8 ns: 1.02x faster (-2%)
- raytrace: 566 ms +- 8 ms -> 554 ms +- 11 ms: 1.02x faster (-2%)
- hexiom: 11.2 ms +- 0.3 ms -> 11.0 ms +- 0.2 ms: 1.02x faster (-2%)
- go: 294 ms +- 6 ms -> 288 ms +- 5 ms: 1.02x faster (-2%)
- python_startup: 15.7 ms +- 0.4 ms -> 15.4 ms +- 0.5 ms: 1.02x faster (-2%)
- float: 136 ms +- 4 ms -> 134 ms +- 6 ms: 1.02x faster (-2%)
- scimark_sor: 232 ms +- 8 ms -> 228 ms +- 5 ms: 1.02x faster (-2%)
- fannkuch: 554 ms +- 8 ms -> 545 ms +- 9 ms: 1.02x faster (-2%)
- meteor_contest: 115 ms +- 2 ms -> 113 ms +- 3 ms: 1.02x faster (-2%)
- telco: 6.58 ms +- 0.22 ms -> 6.48 ms +- 0.15 ms: 1.02x faster (-2%)
- regex_effbot: 3.76 ms +- 0.08 ms -> 3.70 ms +- 0.09 ms: 1.02x faster (-2%)
- chaos: 129 ms +- 3 ms -> 127 ms +- 2 ms: 1.01x faster (-1%)
- regex_v8: 26.3 ms +- 0.4 ms -> 25.9 ms +- 0.6 ms: 1.01x faster (-1%)
- regex_compile: 200 ms +- 5 ms -> 197 ms +- 5 ms: 1.01x faster (-1%)
- logging_simple: 9.78 us +- 0.19 us -> 9.65 us +- 0.26 us: 1.01x faster (-1%)
- sqlite_synth: 3.24 us +- 0.10 us -> 3.20 us +- 0.06 us: 1.01x faster (-1%)
- unpickle: 16.4 us +- 0.5 us -> 16.2 us +- 0.5 us: 1.01x faster (-1%)
- logging_format: 10.8 us +- 0.2 us -> 10.7 us +- 0.3 us: 1.01x faster (-1%)
- json_loads: 30.2 us +- 0.8 us -> 29.9 us +- 0.6 us: 1.01x faster (-1%)

Benchmark hidden because not significant (14): json_dumps, nbody, pickle, pidigits, regex_dna, richards, scimark_lu, scimark_monte_carlo, spectral_norm, unpack_sequence, unpickle_list, unpickle_pure_python, xml_etree_iterparse, xml_etree_process

This is related to my earlier work in bpo-36229; however, this is a much less invasive change that's limited to ceval.c, where our knowledge of context is much better.
msg354411 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-10-10 22:07
From the PR:

Were the benchmarks executed with CPU isolation? I am a bit suspicious of this one:

- pickle_list: 3.63 us +- 0.08 us -> 3.78 us +- 0.10 us: 1.04x slower (+4%)
If not, can you repeat them with CPU isolation and affinity? Check this for more info.
msg354424 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-11 07:13
How often do you need to add three or more lists, in comparison with other uses of the "+" operator?

How large is the benefit of this optimization?

How much does it slow down other uses of the "+" operator?

It would be nice if you provided some numbers.

A more efficient way of concatenating several lists in a single expression is:

    [*a, *b, *c, *d, ...]
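For example (illustrative values), the unpacking spelling builds the result in a single pass, with no intermediate lists:

```python
# Both spellings produce the same result; "+" builds a new list at each
# step, while [*...] unpacking fills one result list directly.
a, b, c, d = [1, 2], [3], [4, 5], [6]

via_add = a + b + c + d        # creates intermediate lists a+b, a+b+c
via_unpack = [*a, *b, *c, *d]  # one result list, filled in one pass

assert via_add == via_unpack == [1, 2, 3, 4, 5, 6]
```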
msg354473 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-10-11 17:25
I have repeated the benchmark in the speed.python.org server with CPU isolation + PGO + LTO:

Slower (21):
- xml_etree_iterparse: 127 ms +- 2 ms -> 131 ms +- 2 ms: 1.03x slower (+3%)
- xml_etree_parse: 195 ms +- 1 ms -> 200 ms +- 2 ms: 1.03x slower (+3%)
- pickle_list: 5.62 us +- 0.05 us -> 5.76 us +- 0.06 us: 1.03x slower (+3%)
- pickle: 13.2 us +- 0.1 us -> 13.5 us +- 0.1 us: 1.02x slower (+2%)
- scimark_lu: 185 ms +- 3 ms -> 188 ms +- 5 ms: 1.02x slower (+2%)
- scimark_fft: 427 ms +- 2 ms -> 434 ms +- 2 ms: 1.02x slower (+2%)
- sympy_integrate: 25.0 ms +- 0.2 ms -> 25.3 ms +- 0.2 ms: 1.01x slower (+1%)
- spectral_norm: 166 ms +- 3 ms -> 169 ms +- 1 ms: 1.01x slower (+1%)
- scimark_sor: 246 ms +- 6 ms -> 249 ms +- 4 ms: 1.01x slower (+1%)
- sympy_sum: 203 ms +- 2 ms -> 205 ms +- 2 ms: 1.01x slower (+1%)
- chaos: 138 ms +- 1 ms -> 139 ms +- 1 ms: 1.01x slower (+1%)
- json_loads: 37.2 us +- 1.0 us -> 37.5 us +- 0.3 us: 1.01x slower (+1%)
- xml_etree_process: 91.0 ms +- 1.1 ms -> 91.7 ms +- 1.8 ms: 1.01x slower (+1%)
- sympy_str: 315 ms +- 2 ms -> 317 ms +- 2 ms: 1.01x slower (+1%)
- sqlalchemy_declarative: 210 ms +- 3 ms -> 211 ms +- 3 ms: 1.01x slower (+1%)
- regex_compile: 214 ms +- 1 ms -> 215 ms +- 1 ms: 1.01x slower (+1%)
- regex_dna: 268 ms +- 1 ms -> 270 ms +- 1 ms: 1.01x slower (+1%)
- logging_format: 11.2 us +- 0.1 us -> 11.3 us +- 0.2 us: 1.01x slower (+1%)
- unpickle_pure_python: 392 us +- 4 us -> 394 us +- 4 us: 1.00x slower (+0%)
- sympy_expand: 492 ms +- 2 ms -> 493 ms +- 3 ms: 1.00x slower (+0%)
- pidigits: 236 ms +- 1 ms -> 236 ms +- 1 ms: 1.00x slower (+0%)

Faster (21):
- unpack_sequence: 73.9 ns +- 1.3 ns -> 69.3 ns +- 1.0 ns: 1.07x faster (-6%)
- 2to3: 405 ms +- 2 ms -> 393 ms +- 2 ms: 1.03x faster (-3%)
- float: 146 ms +- 1 ms -> 142 ms +- 2 ms: 1.03x faster (-3%)
- scimark_sparse_mat_mult: 5.46 ms +- 0.02 ms -> 5.34 ms +- 0.06 ms: 1.02x faster (-2%)
- unpickle: 17.1 us +- 0.1 us -> 16.7 us +- 0.2 us: 1.02x faster (-2%)
- regex_effbot: 3.89 ms +- 0.11 ms -> 3.82 ms +- 0.03 ms: 1.02x faster (-2%)
- json_dumps: 15.9 ms +- 0.1 ms -> 15.7 ms +- 0.1 ms: 1.01x faster (-1%)
- python_startup: 12.4 ms +- 0.1 ms -> 12.3 ms +- 0.0 ms: 1.01x faster (-1%)
- go: 320 ms +- 4 ms -> 316 ms +- 3 ms: 1.01x faster (-1%)
- pathlib: 25.4 ms +- 0.4 ms -> 25.1 ms +- 0.3 ms: 1.01x faster (-1%)
- mako: 19.6 ms +- 0.2 ms -> 19.4 ms +- 0.1 ms: 1.01x faster (-1%)
- fannkuch: 598 ms +- 7 ms -> 591 ms +- 4 ms: 1.01x faster (-1%)
- hexiom: 12.2 ms +- 0.2 ms -> 12.1 ms +- 0.1 ms: 1.01x faster (-1%)
- richards: 85.9 ms +- 1.3 ms -> 85.0 ms +- 1.3 ms: 1.01x faster (-1%)
- python_startup_no_site: 9.18 ms +- 0.05 ms -> 9.09 ms +- 0.02 ms: 1.01x faster (-1%)
- genshi_xml: 74.8 ms +- 0.6 ms -> 74.2 ms +- 0.9 ms: 1.01x faster (-1%)
- nbody: 157 ms +- 2 ms -> 155 ms +- 2 ms: 1.01x faster (-1%)
- unpickle_list: 5.89 us +- 0.03 us -> 5.85 us +- 0.04 us: 1.01x faster (-1%)
- genshi_text: 36.7 ms +- 0.4 ms -> 36.5 ms +- 0.3 ms: 1.01x faster (-1%)
- dulwich_log: 80.5 ms +- 0.5 ms -> 80.0 ms +- 0.5 ms: 1.01x faster (-1%)
- regex_v8: 28.9 ms +- 0.1 ms -> 28.9 ms +- 0.1 ms: 1.00x faster (-0%)

Benchmark hidden because not significant (15): chameleon, crypto_pyaes, deltablue, logging_silent, logging_simple, meteor_contest, nqueens, pickle_dict, pickle_pure_python, raytrace, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, telco, xml_etree_generate
Ignored benchmarks (3) of json/2019-09-16_14-25-master-89b8933bb537.json.gz: django_template, html5lib, tornado_http
msg354479 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 18:20
Thanks, Pablo, for providing that. So the changes look mostly like a wash on these benchmarks.

Serhiy:

I do not see any significant change in + operator timing on my machine (again, just a rough test):

$ ./python.exe -m timeit -s z=0 z+0  # master
10000000 loops, best of 5: 21.3 nsec per loop

$ ./python.exe -m timeit -s z=0 z+0  # list-add
10000000 loops, best of 5: 21.2 nsec per loop

I'm aware that unpacking is "better". With that said, adding a list literal (or a slice of any list, or a list comprehension) to another list is a fairly common operation (I count several dozen examples in the stdlib). Even though these cases only have two operands, they will still see the speed-up.
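Forms like these leave a freshly created list (refcount 1) as the left operand, so they could hit the proposed fast path (a pure-Python sketch of the shapes involved, with illustrative names; the actual check lives in ceval.c):

```python
# Each left operand below is a brand-new list with no other references,
# which is the condition the proposed BINARY_ADD fast path tests for.
base = [4, 5, 6]

r1 = [1, 2, 3] + base              # list literal on the left
r2 = base[:2] + base               # slice on the left
r3 = [x * 2 for x in base] + base  # comprehension on the left

assert r1 == [1, 2, 3, 4, 5, 6]
```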

And the speed-up is good, even in these cases. You can compare using the new code:

$ ./python.exe -m timeit -s l=[0,1,2,3] [0,1,2,3]+l  # Hits new branch
5000000 loops, best of 5: 87.9 nsec per loop

$ ./python.exe -m timeit -s l=[0,1,2,3] l+[0,1,2,3]  # Hits old branch
5000000 loops, best of 5: 92.5 nsec per loop
msg354480 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 18:26
...and obviously the gains are more pronounced for more/longer lists.

In general I'm not married to this change, though. If the consensus is "not worth it", I get it.

But it seems like too easy of a win to me.
msg354485 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-11 18:43
> I do not see any significant change in + operator timing on my machine (again, just a rough test):

Because the measured time includes the loop overhead, which can be significant compared with the cost of adding two integers. Please use the pyperf module with --duplicate=1000.

> I'm aware that unpacking is "better". With that said, adding a list literal (or a slice of any list, or a list comprehension) to another list is a fairly common operation (I count several dozen examples in the stdlib).

Could you provide any numbers? For example, you could patch ceval.c to count the BINARY_ADD operations for which your optimization works and those for which it does not, and output the counts at interpreter exit. Run several Python programs with the modified interpreter. Tests are a bad example, but better than nothing. Something like hg would be better. You are free to find programs that would benefit the most from your optimization.
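The suggested instrumentation belongs in ceval.c, but a rough static approximation is possible in pure Python by scanning compiled code for add opcodes with the dis module (hypothetical helper; on CPython ≤3.10 the opcode is BINARY_ADD, on 3.11+ it is BINARY_OP with a "+" argument):

```python
# Statically count "+" opcodes in a piece of source, including nested
# code objects (functions, comprehensions). This counts occurrences in
# the bytecode, not dynamic executions as Serhiy's instrumented build would.
import dis

def count_add_ops(source):
    count = 0
    stack = [compile(source, "<example>", "exec")]
    while stack:
        co = stack.pop()
        for instr in dis.get_instructions(co):
            if instr.opname == "BINARY_ADD" or (
                instr.opname == "BINARY_OP" and instr.argrepr == "+"
            ):
                count += 1
        # descend into nested code objects stored in co_consts
        stack.extend(c for c in co.co_consts if hasattr(c, "co_code"))
    return count

print(count_add_ops("x = [1] + [2]\ny = a + b + c"))
```

A real measurement would increment counters inside the BINARY_ADD case of the evaluation loop and dump them at exit, as suggested above.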

> And the speed-up is good, even in these cases.

5%. It is not impressive for optimizing a single rare operation.
msg354487 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 19:24
Serhiy, here are the better performance measurements:


$ ./python.exe -m pyperf timeit --duplicate=1000 -s z=0 z+0  # list-add
.....................
Mean +- std dev: 17.3 ns +- 0.3 ns

$ ./python.exe -m pyperf timeit --duplicate=1000 -s z=0 z+0  # master
.....................
Mean +- std dev: 17.2 ns +- 0.3 ns

$ ./python.exe -m pyperf timeit --duplicate=1000 -s l=[0,1,2,3] [0,1,2,3]+l  # New branch
.....................
Mean +- std dev: 92.6 ns +- 1.7 ns

$ ./python.exe -m pyperf timeit --duplicate=1000 -s l=[0,1,2,3] l+[0,1,2,3]  # Old branch
.....................
Mean +- std dev: 99.8 ns +- 1.4 ns


Honestly, I'll defer to you here. If you feel like pursuing this is a total waste of time, I'll just close the issue/PR. Otherwise I can explore how often the branch is hit in production code, as you suggested.
msg354492 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 20:49
I went ahead and ran an instrumented build on some random production code (mostly financial data processing), just because I was curious:

BINARY_ADD ops: 3,720,776
BINARY_ADD ops with two lists: 100,452 (2.7% of total)
BINARY_ADD with new fast path: 26,357 (26.2% of list adds, 0.7% of total)
msg354831 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-17 05:52
I'm going to go ahead and close this, since the payoff doesn't seem to be worth the effort.

Thanks for the benchmarking and feedback!
History
2019-10-17 05:52:07  brandtbucher      set     status: open -> closed; resolution: rejected; messages: + msg354831; stage: patch review -> resolved
2019-10-11 20:49:18  brandtbucher      set     messages: + msg354492
2019-10-11 19:24:51  brandtbucher      set     messages: + msg354487
2019-10-11 18:43:44  serhiy.storchaka  set     messages: + msg354485
2019-10-11 18:26:12  brandtbucher      set     messages: + msg354480
2019-10-11 18:20:35  brandtbucher      set     messages: + msg354479
2019-10-11 17:25:26  pablogsal         set     messages: + msg354473
2019-10-11 07:13:31  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg354424
2019-10-10 22:07:13  pablogsal         set     nosy: + pablogsal; messages: + msg354411
2019-10-10 19:16:24  brandtbucher      set     keywords: + patch; stage: patch review; pull_requests: + pull_request16289
2019-10-10 19:16:04  brandtbucher      create