classification
Title: Speed-up list.insert: use memmove()
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: jeethu, pitrou, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-01-11 13:52 by jeethu, last changed 2018-01-17 15:50 by jeethu.

Pull Requests
URL Status Linked Edit
PR 5159 open jeethu, 2018-01-11 13:57
Messages (26)
msg309806 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-11 13:52
I've noticed that replacing the for loop in the ins1 function in listobject.c with a memmove when the number of pointers to move is greater than 16 seems to speed up list.insert by about 3 to 4x on a contrived benchmark.

# Before
jeethu@dev:cpython  (master)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
200 loops, best of 5: 3.07 msec per loop

# After
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 744 usec per loop

Both builds were configured with --enable-optimizations and --with-lto
msg309809 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-11 14:17
What are the results when the threshold is set to 0 or 1?
msg309810 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-11 14:40
I tried it with a couple of different thresholds, twice each, ignoring the results of the first run. 16 seems to be the sweet spot.

THRESHOLD = 0
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 787 usec per loop

THRESHOLD = 4
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 781 usec per loop

THRESHOLD = 8
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 780 usec per loop

THRESHOLD = 16
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 758 usec per loop

THRESHOLD = 32
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 764 usec per loop
msg309812 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-11 14:59
In your benchmarks the difference between thresholds is only 4%. It would not be worth keeping a special case for such a small benefit.

But note that in your benchmarks you inserted into a list that grows up to 50000 elements. The value of the threshold affects only the first few insertions. You need to test inserting into short lists. Recreate the list at every iteration.
msg309917 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-14 10:57
I managed to tune an i7-7700K desktop running Ubuntu 17.10 per this doc[1], and ran the pyperformance benchmarks[2].
I also tried various thresholds with this benchmark and 16 still seems to be the sweet spot.
The geometric mean of the relative changes across all benchmarks indicates a 12.85% improvement.

[1]: http://pyperformance.readthedocs.io/usage.html#how-to-get-stable-benchmarks
[2]: https://gist.github.com/jeethu/4d9a8bd3eecbd067c00f085f7d2e7595
msg309929 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-14 18:15
The result likely varies quite a bit from compiler to compiler, processor to processor, and OS to OS.   It may also be affected by data size and caching.
msg309983 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-15 14:16
I'm quite surprised so many benchmarks would be sped up so significantly by a list.insert() optimization (why a 27% speedup on computing digits of pi, or a 33% speedup on an N-body simulation?).  Are you sure the two executables are similarly compiled?
msg310022 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-15 23:17
I rebased my branch off of master and rebuilt it, and also rebuilt the baseline from master. Both versions were configured with --with-lto and --enable-optimizations. The benchmark numbers are rather different this time[1]. pidigits is slower, but nbody is still inexplicably faster. Should I run the benchmark without a PGO build (i.e without --enable-optimizations)?

Also, I've attached the output from the pyperformance run to the aforementioned gist.

[1]: https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88
msg310023 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-15 23:19
Perhaps the patch is interfering weirdly with PGO?

> Should I run the benchmark without a PGO build (i.e without --enable-optimizations)?

That would help clear things up, IMHO.
msg310056 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-16 09:09
Built and benchmarked both the baseline and the patch without PGO; the differences are less pronounced, but still present.

https://gist.github.com/jeethu/abd404e39c6dfcbabb4c01661b9238d1
msg310059 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-16 09:13
Thanks.  That's really surprising.  I'll give it a try myself.
msg310064 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 09:30
> jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"

Please don't use timeit, but perf timeit to run such *microbenchmark* (time smaller than 1 ms).

Your benchmark also measures the performance of the loop itself, which might be significant for such a short loop (100 items). You may try to unroll the loop manually, and truncate the list:

vstinner@apu$ python3 -c 'print("l.insert(0, None); " * 3 + "l.clear();")'
l.insert(0, None); l.insert(0, None); l.insert(0, None); l.clear();

Example:

python3 -m perf timeit -s 'l=[]' $(python3 -c 'print("l.insert(0, None); " * 100 + "l.clear();")') --duplicate 100


Jeethu Rao: Are you using CPU isolation? What is your OS?
msg310094 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-16 17:05
Ok, I ran the benchmarks here (Ubuntu 16.04, Core i5-2500K, PGO and LTO disabled) and I don't get any consistent speedup, which is more in line with what I was expecting:
https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a
msg310120 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 00:07
Victor: I’m booting with the isolcpus and rcu_nocbs flags, and running pyperformance with the --affinity flag to pin the benchmark to the isolated CPU cores. I’ve also run `perf system tune`. And the OS is Ubuntu 17.10. Thanks for the tip on using perf timeit instead of timeit. I’ve run the benchmark that you've suggested with a minor change (to avoid the cost of LOAD_ATTR) and attached the output on a gist[1].

Antoine: Thanks for benchmarking it. After looking at the generated assembly[2], I found that ins1 is being inlined and the call to memmove was appearing before the loop (possibly because the compiler assumes that the call to memmove is more likely). I made a minor change and increased the threshold to 32. I’ve attached the generated assembly in a gist[3] (The relevant sequence is around line 8406, if you’re interested). And here’s the pyperformance comparison[4]. Could you please try benchmarking this version on your machine?

[1]: https://gist.github.com/jeethu/2d2de55afdb8ea4ad03b6a5d04d5227f
[2]: Generated with “gcc -DNDEBUG -fwrapv -O3 -std=c99  -I. -I./Include -DPy_BUILD_CORE -S -masm=intel Objects/listobject.c”
[3]: https://gist.github.com/jeethu/596bfc1251590bc51cc230046b52fb38
[4]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030
msg310121 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 00:28
> I’ve run the benchmark that you've suggested with a minor change (to avoid the cost of LOAD_ATTR)

Be careful. Moving the "l.insert" lookup out of the loop might make the code slower. I never looked into why. But Python 3.7 was also optimized in many places for calling methods, so I'm not sure anymore :)
msg310122 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 00:56
> Be careful. Moving the "l.insert" lookup out of the loop might make the code slower. I never looked into why. But Python 3.7 was also optimized in many places for calling methods, so I'm not sure anymore :)

Thanks again! Here's a gist without the hack[1].

[1]: https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470
msg310137 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-17 04:38
FWIW, we've encountered a number of situations in the past when something that improved the timings on one compiler would make timings worse on another compiler.  There was also variance between timings on 32-bit builds versus 64-bit builds.
msg310146 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 09:45
https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470

Mean +- std dev: 10.5 us +- 1.4 us => Mean +- std dev: 9.68 us +- 0.89 us

It's 1.08x faster (-7.8%). That's small for a microbenchmark; usually an optimization should make a *microbenchmark* at least 10% faster.

For this optimization, I have no strong opinion.


Using memmove() for large copies is a good idea. The main question is the "if (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?


Previously, Python had a Py_MEMCPY() macro which also had such threshold. Basically, it's a workaround for compiler performance issues:

#if defined(_MSC_VER)
#define Py_MEMCPY(target, source, length) do {                          \
        size_t i_, n_ = (length);                                       \
        char *t_ = (void*) (target);                                    \
        const char *s_ = (void*) (source);                              \
        if (n_ >= 16)                                                   \
            memcpy(t_, s_, n_);                                         \
        else                                                            \
            for (i_ = 0; i_ < n_; i_++)                                 \
                t_[i_] = s_[i_];                                        \
    } while (0)
#else
#define Py_MEMCPY memcpy
#endif

Fortunately, the macro has now simply become:

/* Py_MEMCPY is kept for backwards compatibility,
 * see https://bugs.python.org/issue28126 */
#define Py_MEMCPY memcpy

And it's no longer used.


I recall a performance issues with GCC memcmp() builtin function (replacing the libc function during the compilation): https://bugs.python.org/issue17628#msg186012

See also:
* https://bugs.python.org/issue13134
* https://bugs.python.org/issue29782
msg310160 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 13:36
> FWIW, we've encountered a number of situations in the past when something that improved the timings on one compiler would make timings worse on another compiler.  There was also variance between timings on 32-bit builds versus 64-bit builds.

I've verified that both clang and gcc generate similar assembly (memcpy is not inlined and the loop is not vectorized) in 32-bit mode. I'd wager that the improvement from vectorization (in memmove) would be even more pronounced on 32-bit systems, given that pointers are half the size and cache lines are still 64 bytes wide.

> It's 1.08x faster (-7.8%). That's small for a microbenchmark; usually an optimization should make a *microbenchmark* at least 10% faster.

That's true if we assume lists have 100 or fewer elements. On the other hand, in the pyperformance comparison I'd posted yesterday[1], there seems to be an average improvement of 1.27x on the first seven benchmarks, and the largest slowdown is only 1.03x. That said, the improvement cannot be more than a constant factor, even with the vectorized loop in memmove.

> Using memmove() for large copy is a good idea. The main question is the "if (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?

The overhead of calling memmove() makes it slower for small lists; that's how I arrived at this patch in the first place. I tried replacing the loop with a plain memmove() and it was slower on pyperformance, but faster when switching to memmove() only above a threshold.

> Previously, Python had a Py_MEMCPY() macro which also had such threshold. Basically, it's a workaround for compiler performance issues:

That's very interesting! I think it boils down to the pointer aliasing problem. The pointers in memcpy()'s signature have the `restrict` qualifier, which gives the compiler more leeway to optimize calls to memcpy, while the compiler has to be more conservative with memmove(). I wonder if it's worth trying out a Py_MEMMOVE() :)


[1]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030
msg310161 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-17 13:37
Le 17/01/2018 à 14:36, Jeethu Rao a écrit :
> 
> On the other hand, on the pyperformance comparison I'd posted yesterday[1], there seems to be an average improvement of 1.27x  on the first seven benchmarks, and the slowest slowdown is only 1.03x.

I still think those numbers are misleading or downright bogus.  There is
no existing proof that list.insert() is a critical path in those benchmarks.
msg310163 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 13:43
> I still think those numbers are misleading or downright bogus.  There is no existing proof that list.insert() is a critical path in those benchmarks.

Can someone check if these benchmarks really use list.insert() in hot code? If yes, why? :-) The cost of list.insert() is known, no?
msg310184 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 15:07
> > I still think those numbers are misleading or downright bogus.  There is no existing proof that list.insert() is a critical path in those benchmarks.

> Can someone check if these benchmarks really use list.insert() in hot code? If yes, why? :-) The cost of list.insert() is known, no?

Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294
msg310187 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 15:24
> Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294

I don't understand how to read this output.

"54640 ins1 called 40050 times"

What is 54640?

I'm interested to know which benchmarks call list.insert() 40k times.
msg310189 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 15:29
> What is 54640?

That's the pid of the process.

> I'm interested to know which benchmarks call list.insert() 40k times.

The django_template benchmark.
msg310193 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 15:39
In https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88 I read:

| django_template | 160 ms | 129 ms | 1.24x faster | Significant (t=15.05) |

So on this benchmark, the optimization seems significant. But in the same paste, I see other benchmarks 1.2x faster whereas they don't abuse list.insert(), and maybe don't use list.insert() at all.

Maybe Django template engine should be fixed to use deque instead ;-)
msg310195 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 15:50
It's also interesting that in https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a :

| django_template         | 307 ms    | 312 ms                   | 1.02x slower | Not significant        |

It seems to be slower, and the benchmarks before it (2to3, chameleon, chaos, crypto_pyaes, deltablue), which we now know barely use list.insert, are slightly faster :)

¯\_(ツ)_/¯
History
Date User Action Args
2018-01-17 15:50:49jeethusetmessages: + msg310195
2018-01-17 15:39:41vstinnersetmessages: + msg310193
2018-01-17 15:30:00jeethusetmessages: + msg310189
2018-01-17 15:24:52vstinnersetmessages: + msg310187
2018-01-17 15:07:24jeethusetmessages: + msg310184
2018-01-17 13:43:26vstinnersetmessages: + msg310163
2018-01-17 13:37:37pitrousetmessages: + msg310161
2018-01-17 13:36:07jeethusetmessages: + msg310160
2018-01-17 09:45:26vstinnersetmessages: + msg310146
2018-01-17 04:38:51rhettingersetmessages: + msg310137
2018-01-17 00:56:55jeethusetmessages: + msg310122
2018-01-17 00:28:43vstinnersetmessages: + msg310121
2018-01-17 00:07:53jeethusetmessages: + msg310120
2018-01-16 17:05:47pitrousetmessages: + msg310094
2018-01-16 09:44:18vstinnersettitle: Speed-up list.insert -> Speed-up list.insert: use memmove()
2018-01-16 09:30:58vstinnersetnosy: + vstinner
messages: + msg310064
2018-01-16 09:13:16pitrousetmessages: + msg310059
2018-01-16 09:09:41jeethusetmessages: + msg310056
2018-01-15 23:19:44pitrousetmessages: + msg310023
2018-01-15 23:17:52jeethusetmessages: + msg310022
2018-01-15 14:16:39pitrousetnosy: + pitrou
messages: + msg309983
2018-01-14 18:15:19rhettingersetnosy: + rhettinger
messages: + msg309929
2018-01-14 10:57:09jeethusetmessages: + msg309917
2018-01-11 14:59:24serhiy.storchakasetmessages: + msg309812
2018-01-11 14:40:00jeethusetmessages: + msg309810
2018-01-11 14:17:29serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg309809
2018-01-11 13:57:48jeethusetkeywords: + patch
stage: patch review
pull_requests: + pull_request5017
2018-01-11 13:52:14jeethucreate