Classification
Title: Improved performance for list addition
Type: performance
Stage: resolved
Components: Interpreter Core
Versions: Python 3.9

Process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: brandtbucher, pablogsal, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2019-10-10 19:16 by brandtbucher, last changed 2019-10-17 05:52 by brandtbucher. This issue is now closed.

Pull Requests
PR 16705: closed (brandtbucher, 2019-10-10 19:16)
Messages (10)
msg354399 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-10 19:16
The attached PR adds a fast path for BINARY_ADD instructions involving two lists, where the left list has a refcount of exactly 1.

In this case, we do a PySequence_InPlaceConcat operation instead. This has the effect of avoiding quadratic complexity for list summations, by keeping only one intermediate result and extending it in place.

For (potentially large) lists:

    a + b + c + d ...

Currently executed as:

    tmp = a + b
    tmp = tmp + c
    tmp = tmp + d
    ...

With this change:

    tmp = a + b
    tmp += c
    tmp += d
    ...
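The complexity difference sketched above can be demonstrated in pure Python (illustrative names and sizes, not the patched C code itself):

```python
# Repeated "tmp = tmp + c" copies the accumulated result on every step,
# giving quadratic total work; "tmp += c" extends one list in place,
# giving linear total work. Both produce the same result.
chunks = [[0] * 100 for _ in range(200)]

def copying():
    tmp = []
    for c in chunks:
        tmp = tmp + c      # allocates a brand-new list each iteration
    return tmp

def in_place():
    tmp = []
    for c in chunks:
        tmp += c           # extends the same list object in place
    return tmp

assert copying() == in_place()
```

Timing these two functions with `timeit` for growing numbers of chunks shows the copying version's runtime growing roughly quadratically while the in-place version stays linear.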

Here are the pyperformance results (optimizations enabled):

Slower (6):
- pickle_list: 3.63 us +- 0.08 us -> 3.78 us +- 0.10 us: 1.04x slower (+4%)
- xml_etree_generate: 102 ms +- 2 ms -> 104 ms +- 3 ms: 1.02x slower (+2%)
- xml_etree_parse: 166 ms +- 4 ms -> 169 ms +- 9 ms: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 5.42 ms +- 0.11 ms -> 5.47 ms +- 0.16 ms: 1.01x slower (+1%)
- pickle_dict: 23.6 us +- 0.8 us -> 23.8 us +- 0.4 us: 1.01x slower (+1%)
- scimark_fft: 413 ms +- 6 ms -> 416 ms +- 6 ms: 1.01x slower (+1%)

Faster (25):
- 2to3: 394 ms +- 18 ms -> 381 ms +- 8 ms: 1.03x faster (-3%)
- python_startup_no_site: 12.0 ms +- 0.2 ms -> 11.7 ms +- 0.2 ms: 1.03x faster (-3%)
- nqueens: 121 ms +- 2 ms -> 118 ms +- 3 ms: 1.03x faster (-2%)
- pathlib: 36.0 ms +- 1.3 ms -> 35.1 ms +- 0.9 ms: 1.02x faster (-2%)
- pickle_pure_python: 523 us +- 15 us -> 511 us +- 13 us: 1.02x faster (-2%)
- deltablue: 8.54 ms +- 0.25 ms -> 8.35 ms +- 0.27 ms: 1.02x faster (-2%)
- logging_silent: 220 ns +- 6 ns -> 215 ns +- 8 ns: 1.02x faster (-2%)
- raytrace: 566 ms +- 8 ms -> 554 ms +- 11 ms: 1.02x faster (-2%)
- hexiom: 11.2 ms +- 0.3 ms -> 11.0 ms +- 0.2 ms: 1.02x faster (-2%)
- go: 294 ms +- 6 ms -> 288 ms +- 5 ms: 1.02x faster (-2%)
- python_startup: 15.7 ms +- 0.4 ms -> 15.4 ms +- 0.5 ms: 1.02x faster (-2%)
- float: 136 ms +- 4 ms -> 134 ms +- 6 ms: 1.02x faster (-2%)
- scimark_sor: 232 ms +- 8 ms -> 228 ms +- 5 ms: 1.02x faster (-2%)
- fannkuch: 554 ms +- 8 ms -> 545 ms +- 9 ms: 1.02x faster (-2%)
- meteor_contest: 115 ms +- 2 ms -> 113 ms +- 3 ms: 1.02x faster (-2%)
- telco: 6.58 ms +- 0.22 ms -> 6.48 ms +- 0.15 ms: 1.02x faster (-2%)
- regex_effbot: 3.76 ms +- 0.08 ms -> 3.70 ms +- 0.09 ms: 1.02x faster (-2%)
- chaos: 129 ms +- 3 ms -> 127 ms +- 2 ms: 1.01x faster (-1%)
- regex_v8: 26.3 ms +- 0.4 ms -> 25.9 ms +- 0.6 ms: 1.01x faster (-1%)
- regex_compile: 200 ms +- 5 ms -> 197 ms +- 5 ms: 1.01x faster (-1%)
- logging_simple: 9.78 us +- 0.19 us -> 9.65 us +- 0.26 us: 1.01x faster (-1%)
- sqlite_synth: 3.24 us +- 0.10 us -> 3.20 us +- 0.06 us: 1.01x faster (-1%)
- unpickle: 16.4 us +- 0.5 us -> 16.2 us +- 0.5 us: 1.01x faster (-1%)
- logging_format: 10.8 us +- 0.2 us -> 10.7 us +- 0.3 us: 1.01x faster (-1%)
- json_loads: 30.2 us +- 0.8 us -> 29.9 us +- 0.6 us: 1.01x faster (-1%)

Benchmark hidden because not significant (14): json_dumps, nbody, pickle, pidigits, regex_dna, richards, scimark_lu, scimark_monte_carlo, spectral_norm, unpack_sequence, unpickle_list, unpickle_pure_python, xml_etree_iterparse, xml_etree_process

This is related to my earlier work in bpo-36229; however, this is a much less invasive change that's limited to ceval.c, where our knowledge of context is much better.
msg354411 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-10-10 22:07
From the PR:

Were the benchmarks executed with CPU isolation? I am a bit suspicious of this one:

- pickle_list: 3.63 us +- 0.08 us -> 3.78 us +- 0.10 us: 1.04x slower (+4%)
If not, can you repeat them with CPU isolation and affinity? Check this for more info.
msg354424 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-11 07:13
How often do you need to add three or more lists, in comparison with other uses of the "+" operator?

How large is the benefit of this optimization?

How much does it slow down other uses of the "+" operator?

It would be nice if you provided some numbers.

A more efficient way of concatenating several lists in a single expression is:

    [*a, *b, *c, *d, ...]
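For example (illustrative values), the unpacking spelling builds the result in a single pass, with no intermediate lists:

```python
# Both spellings produce the same result; "+" builds a new list at each
# step, while [*...] unpacking fills one result list directly.
a, b, c, d = [1, 2], [3], [4, 5], [6]

via_add = a + b + c + d        # creates intermediate lists a+b, a+b+c
via_unpack = [*a, *b, *c, *d]  # one result list, filled in one pass

assert via_add == via_unpack == [1, 2, 3, 4, 5, 6]
```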
msg354473 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-10-11 17:25
I have repeated the benchmark in the speed.python.org server with CPU isolation + PGO + LTO:

Slower (21):
- xml_etree_iterparse: 127 ms +- 2 ms -> 131 ms +- 2 ms: 1.03x slower (+3%)
- xml_etree_parse: 195 ms +- 1 ms -> 200 ms +- 2 ms: 1.03x slower (+3%)
- pickle_list: 5.62 us +- 0.05 us -> 5.76 us +- 0.06 us: 1.03x slower (+3%)
- pickle: 13.2 us +- 0.1 us -> 13.5 us +- 0.1 us: 1.02x slower (+2%)
- scimark_lu: 185 ms +- 3 ms -> 188 ms +- 5 ms: 1.02x slower (+2%)
- scimark_fft: 427 ms +- 2 ms -> 434 ms +- 2 ms: 1.02x slower (+2%)
- sympy_integrate: 25.0 ms +- 0.2 ms -> 25.3 ms +- 0.2 ms: 1.01x slower (+1%)
- spectral_norm: 166 ms +- 3 ms -> 169 ms +- 1 ms: 1.01x slower (+1%)
- scimark_sor: 246 ms +- 6 ms -> 249 ms +- 4 ms: 1.01x slower (+1%)
- sympy_sum: 203 ms +- 2 ms -> 205 ms +- 2 ms: 1.01x slower (+1%)
- chaos: 138 ms +- 1 ms -> 139 ms +- 1 ms: 1.01x slower (+1%)
- json_loads: 37.2 us +- 1.0 us -> 37.5 us +- 0.3 us: 1.01x slower (+1%)
- xml_etree_process: 91.0 ms +- 1.1 ms -> 91.7 ms +- 1.8 ms: 1.01x slower (+1%)
- sympy_str: 315 ms +- 2 ms -> 317 ms +- 2 ms: 1.01x slower (+1%)
- sqlalchemy_declarative: 210 ms +- 3 ms -> 211 ms +- 3 ms: 1.01x slower (+1%)
- regex_compile: 214 ms +- 1 ms -> 215 ms +- 1 ms: 1.01x slower (+1%)
- regex_dna: 268 ms +- 1 ms -> 270 ms +- 1 ms: 1.01x slower (+1%)
- logging_format: 11.2 us +- 0.1 us -> 11.3 us +- 0.2 us: 1.01x slower (+1%)
- unpickle_pure_python: 392 us +- 4 us -> 394 us +- 4 us: 1.00x slower (+0%)
- sympy_expand: 492 ms +- 2 ms -> 493 ms +- 3 ms: 1.00x slower (+0%)
- pidigits: 236 ms +- 1 ms -> 236 ms +- 1 ms: 1.00x slower (+0%)

Faster (21):
- unpack_sequence: 73.9 ns +- 1.3 ns -> 69.3 ns +- 1.0 ns: 1.07x faster (-6%)
- 2to3: 405 ms +- 2 ms -> 393 ms +- 2 ms: 1.03x faster (-3%)
- float: 146 ms +- 1 ms -> 142 ms +- 2 ms: 1.03x faster (-3%)
- scimark_sparse_mat_mult: 5.46 ms +- 0.02 ms -> 5.34 ms +- 0.06 ms: 1.02x faster (-2%)
- unpickle: 17.1 us +- 0.1 us -> 16.7 us +- 0.2 us: 1.02x faster (-2%)
- regex_effbot: 3.89 ms +- 0.11 ms -> 3.82 ms +- 0.03 ms: 1.02x faster (-2%)
- json_dumps: 15.9 ms +- 0.1 ms -> 15.7 ms +- 0.1 ms: 1.01x faster (-1%)
- python_startup: 12.4 ms +- 0.1 ms -> 12.3 ms +- 0.0 ms: 1.01x faster (-1%)
- go: 320 ms +- 4 ms -> 316 ms +- 3 ms: 1.01x faster (-1%)
- pathlib: 25.4 ms +- 0.4 ms -> 25.1 ms +- 0.3 ms: 1.01x faster (-1%)
- mako: 19.6 ms +- 0.2 ms -> 19.4 ms +- 0.1 ms: 1.01x faster (-1%)
- fannkuch: 598 ms +- 7 ms -> 591 ms +- 4 ms: 1.01x faster (-1%)
- hexiom: 12.2 ms +- 0.2 ms -> 12.1 ms +- 0.1 ms: 1.01x faster (-1%)
- richards: 85.9 ms +- 1.3 ms -> 85.0 ms +- 1.3 ms: 1.01x faster (-1%)
- python_startup_no_site: 9.18 ms +- 0.05 ms -> 9.09 ms +- 0.02 ms: 1.01x faster (-1%)
- genshi_xml: 74.8 ms +- 0.6 ms -> 74.2 ms +- 0.9 ms: 1.01x faster (-1%)
- nbody: 157 ms +- 2 ms -> 155 ms +- 2 ms: 1.01x faster (-1%)
- unpickle_list: 5.89 us +- 0.03 us -> 5.85 us +- 0.04 us: 1.01x faster (-1%)
- genshi_text: 36.7 ms +- 0.4 ms -> 36.5 ms +- 0.3 ms: 1.01x faster (-1%)
- dulwich_log: 80.5 ms +- 0.5 ms -> 80.0 ms +- 0.5 ms: 1.01x faster (-1%)
- regex_v8: 28.9 ms +- 0.1 ms -> 28.9 ms +- 0.1 ms: 1.00x faster (-0%)

Benchmark hidden because not significant (15): chameleon, crypto_pyaes, deltablue, logging_silent, logging_simple, meteor_contest, nqueens, pickle_dict, pickle_pure_python, raytrace, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, telco, xml_etree_generate
Ignored benchmarks (3) of json/2019-09-16_14-25-master-89b8933bb537.json.gz: django_template, html5lib, tornado_http
msg354479 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 18:20
Thanks, Pablo, for providing that. So the changes look mostly like a wash on these benchmarks.

Serhiy:

I do not see any significant change in + operator timing on my machine (again, just a rough test):

$ ./python.exe -m timeit -s z=0 z+0  # master
10000000 loops, best of 5: 21.3 nsec per loop

$ ./python.exe -m timeit -s z=0 z+0  # list-add
10000000 loops, best of 5: 21.2 nsec per loop

I'm aware that unpacking is "better". With that said, adding a list literal (or a slice of any list, or a list comprehension) to another list is a fairly common operation (I count several dozen examples in the stdlib). Even though these cases only have two operands, they will still see the speed-up.
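Forms like these leave a freshly created list (refcount 1) as the left operand, so they could hit the proposed fast path (a pure-Python sketch of the shapes involved, with illustrative names; the actual check lives in ceval.c):

```python
# Each left operand below is a brand-new list with no other references,
# which is the condition the proposed BINARY_ADD fast path tests for.
base = [4, 5, 6]

r1 = [1, 2, 3] + base              # list literal on the left
r2 = base[:2] + base               # slice on the left
r3 = [x * 2 for x in base] + base  # comprehension on the left

assert r1 == [1, 2, 3, 4, 5, 6]
```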

And the speed-up is good, even in these cases. You can compare using the new code:

$ ./python.exe -m timeit -s l=[0,1,2,3] [0,1,2,3]+l  # Hits new branch
5000000 loops, best of 5: 87.9 nsec per loop

$ ./python.exe -m timeit -s l=[0,1,2,3] l+[0,1,2,3]  # Hits old branch
5000000 loops, best of 5: 92.5 nsec per loop
msg354480 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 18:26
...and obviously the gains are more pronounced for more/longer lists.

In general I'm not married to this change, though. If the consensus is "not worth it", I get it.

But it seems like too easy of a win to me.
msg354485 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-11 18:43
> I do not see any significant change in + operator timing on my machine (again, just a rough test):

Because the measured time includes the loop overhead, which can be significant compared with the cost of adding two integers. Please use the pyperf module with --duplicate=1000.

> I'm aware that unpacking is "better". With that said, adding a list literal (or a slice of any list, or a list comprehension) to another list is a fairly common operation (I count several dozen examples in the stdlib).

Could you provide any numbers? For example, you could patch ceval.c to count the BINARY_ADD operations for which your optimization works and those for which it does not, and output the counts at interpreter exit. Run several Python programs with the modified interpreter. Tests are a bad example, but better than nothing. Something like hg would be better. You are free to find programs that would benefit the most from your optimization.
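The suggested instrumentation belongs in ceval.c, but a rough static approximation is possible in pure Python by scanning compiled code for add opcodes with the dis module (hypothetical helper; on CPython ≤3.10 the opcode is BINARY_ADD, on 3.11+ it is BINARY_OP with a "+" argument):

```python
# Statically count "+" opcodes in a piece of source, including nested
# code objects (functions, comprehensions). This counts occurrences in
# the bytecode, not dynamic executions as Serhiy's instrumented build would.
import dis

def count_add_ops(source):
    count = 0
    stack = [compile(source, "<example>", "exec")]
    while stack:
        co = stack.pop()
        for instr in dis.get_instructions(co):
            if instr.opname == "BINARY_ADD" or (
                instr.opname == "BINARY_OP" and instr.argrepr == "+"
            ):
                count += 1
        # descend into nested code objects stored in co_consts
        stack.extend(c for c in co.co_consts if hasattr(c, "co_code"))
    return count

print(count_add_ops("x = [1] + [2]\ny = a + b + c"))
```

A real measurement would increment counters inside the BINARY_ADD case of the evaluation loop and dump them at exit, as suggested above.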

> And the speed-up is good, even in these cases.

5%. It is not impressive for optimizing a single rare operation.
msg354487 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 19:24
Serhiy, here are the better performance measurements:


$ ./python.exe -m pyperf timeit --duplicate=1000 -s z=0 z+0  # list-add
.....................
Mean +- std dev: 17.3 ns +- 0.3 ns

$ ./python.exe -m pyperf timeit --duplicate=1000 -s z=0 z+0  # master
.....................
Mean +- std dev: 17.2 ns +- 0.3 ns

$ ./python.exe -m pyperf timeit --duplicate=1000 -s l=[0,1,2,3] [0,1,2,3]+l  # New branch
.....................
Mean +- std dev: 92.6 ns +- 1.7 ns

$ ./python.exe -m pyperf timeit --duplicate=1000 -s l=[0,1,2,3] l+[0,1,2,3]  # Old branch
.....................
Mean +- std dev: 99.8 ns +- 1.4 ns


Honestly, I'll defer to you here. If you feel like pursuing this is a total waste of time, I'll just close the issue/PR. Otherwise I can explore how often the branch is hit in production code, as you suggested.
msg354492 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-11 20:49
I went ahead and ran an instrumented build on some random production code (mostly financial data processing), just because I was curious:

BINARY_ADD ops: 3,720,776
BINARY_ADD ops with two lists: 100,452 (2.7% of total)
BINARY_ADD with new fast path: 26,357 (26.2% of list adds, 0.7% of total)
msg354831 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2019-10-17 05:52
I'm going to go ahead and close this, since the payoff doesn't seem to be worth the effort.

Thanks for the benchmarking and feedback!
History
2019-10-17 05:52:07  brandtbucher      set     status: open -> closed; resolution: rejected; messages: + msg354831; stage: patch review -> resolved
2019-10-11 20:49:18  brandtbucher      set     messages: + msg354492
2019-10-11 19:24:51  brandtbucher      set     messages: + msg354487
2019-10-11 18:43:44  serhiy.storchaka  set     messages: + msg354485
2019-10-11 18:26:12  brandtbucher      set     messages: + msg354480
2019-10-11 18:20:35  brandtbucher      set     messages: + msg354479
2019-10-11 17:25:26  pablogsal         set     messages: + msg354473
2019-10-11 07:13:31  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg354424
2019-10-10 22:07:13  pablogsal         set     nosy: + pablogsal; messages: + msg354411
2019-10-10 19:16:24  brandtbucher      set     keywords: + patch; stage: patch review; pull_requests: + pull_request16289
2019-10-10 19:16:04  brandtbucher      create