classification
Title: Speed-up list.insert: use memmove()
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: jeethu, pitrou, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-01-11 13:52 by jeethu, last changed 2018-01-17 15:50 by jeethu.

Pull Requests
URL Status Linked Edit
PR 5159 open jeethu, 2018-01-11 13:57
Messages (26)
msg309806 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-11 13:52
I've noticed that replacing the for loop in the ins1 function in listobject.c with a memmove when the number of pointers to move is greater than 16 seems to speed up list.insert by about 3 to 4x on a contrived benchmark.

# Before
jeethu@dev:cpython  (master)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
200 loops, best of 5: 3.07 msec per loop

# After
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 744 usec per loop

Both builds were configured with --enable-optimizations and --with-lto
msg309809 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-11 14:17
What are the results when the threshold is set to 0 or 1?
msg309810 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-11 14:40
I tried it with a couple of different thresholds, twice each, ignoring the results of the first run. 16 seems to be the sweet spot.

THRESHOLD = 0
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 787 usec per loop

THRESHOLD = 4
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 781 usec per loop

THRESHOLD = 8
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 780 usec per loop

THRESHOLD = 16
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 758 usec per loop

THRESHOLD = 32
jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 764 usec per loop
msg309812 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-11 14:59
In your benchmarks the difference between thresholds is only 4%. It would not be worth keeping a special case for such a small benefit.

But note that in your benchmarks you inserted into a list that grows up to 50000 elements. The value of the threshold affects only the first few insertions. You need to test inserting into short lists. Recreate the list at every iteration.
msg309917 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-14 10:57
I managed to tune an i7-7700K desktop running Ubuntu 17.10 per this doc[1], and ran the pyperformance benchmarks[2].
I also tried various thresholds with this benchmark and 16 still seems to be the sweet spot.
The geometric mean of the relative changes across all benchmarks indicates a 12.85% improvement.

[1]: http://pyperformance.readthedocs.io/usage.html#how-to-get-stable-benchmarks
[2]: https://gist.github.com/jeethu/4d9a8bd3eecbd067c00f085f7d2e7595
msg309929 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-14 18:15
The result likely varies quite a bit from compiler to compiler, processor to processor, and OS to OS.   It may also be affected by data size and caching.
msg309983 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-15 14:16
I'm quite surprised so many benchmarks would be sped up so significantly by a list.insert() optimization (why a 27% speedup on computing digits of pi, or a 33% speedup on an N-body simulation?).  Are you sure the two executables are similarly compiled?
msg310022 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-15 23:17
I rebased my branch off of master and rebuilt it, and also rebuilt the baseline from master. Both versions were configured with --with-lto and --enable-optimizations. The benchmark numbers are rather different this time[1]. pidigits is slower, but nbody is still inexplicably faster. Should I run the benchmark without a PGO build (i.e without --enable-optimizations)?

Also, I've attached the output from the pyperformance run to the aforementioned gist.

[1]: https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88
msg310023 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-15 23:19
Perhaps the patch is interfering weirdly with PGO?

> Should I run the benchmark without a PGO build (i.e without --enable-optimizations)?

That would help clear things up, IMHO.
msg310056 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-16 09:09
Built and benchmarked both the baseline and the patch without PGO; the differences are less pronounced, but still present.

https://gist.github.com/jeethu/abd404e39c6dfcbabb4c01661b9238d1
msg310059 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-16 09:13
Thanks.  That's really surprising.  I'll give it a try myself.
msg310064 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-16 09:30
> jeethu@dev:cpython  (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"

Please don't use timeit, but perf timeit to run such *microbenchmark* (time smaller than 1 ms).

Your benchmark also measures the performance of the loop itself, which might be significant for such a short loop (100 items). You may try to unroll the loop manually, and truncate the list:

vstinner@apu$ python3 -c 'print("l.insert(0, None); " * 3 + "l.clear();")'
l.insert(0, None); l.insert(0, None); l.insert(0, None); l.clear();

Example:

python3 -m perf timeit -s 'l=[]' $(python3 -c 'print("l.insert(0, None); " * 100 + "l.clear();")') --duplicate 100


Jeethu Rao: Are you using CPU isolation? What is your OS?
msg310094 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-16 17:05
Ok, I ran the benchmarks here (Ubuntu 16.04, Core i5-2500K, PGO and LTO disabled) and I don't get any consistent speedup, which is more in line with what I was expecting:
https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a
msg310120 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 00:07
Victor: I’m booting with the isolcpus and rcu_nocbs flags, and running pyperformance with the --affinity flag to pin the benchmark to the isolated CPU cores. I’ve also run `perf system tune`. And the OS is Ubuntu 17.10. Thanks for the tip on using perf timeit instead of timeit. I’ve run the benchmark that you've suggested with a minor change (to avoid the cost of LOAD_ATTR) and attached the output on a gist[1].

Antoine: Thanks for benchmarking it. After looking at the generated assembly[2], I found that ins1 is being inlined and the call to memmove was appearing before the loop (possibly because the compiler assumes that the call to memmove is more likely). I made a minor change and increased the threshold to 32. I’ve attached the generated assembly in a gist[3] (The relevant sequence is around line 8406, if you’re interested). And here’s the pyperformance comparison[4]. Could you please try benchmarking this version on your machine?

[1]: https://gist.github.com/jeethu/2d2de55afdb8ea4ad03b6a5d04d5227f
[2]: Generated with “gcc -DNDEBUG -fwrapv -O3 -std=c99  -I. -I./Include -DPy_BUILD_CORE -S -masm=intel Objects/listobject.c”
[3]: https://gist.github.com/jeethu/596bfc1251590bc51cc230046b52fb38
[4]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030
msg310121 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 00:28
> I’ve run the benchmark that you've suggested with a minor change (to avoid the cost of LOAD_ATTR)

Be careful. Moving the "l.insert" lookup out of the loop might make the code slower. I never looked into why. But Python 3.7 was also optimized in many places for calling methods, so I'm not sure anymore :)
msg310122 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 00:56
> Be careful. Moving the "l.insert" lookup out of the loop might make the code slower. I never looked into why. But Python 3.7 was also optimized in many places for calling methods, so I'm not sure anymore :)

Thanks again! Here's a gist without the hack[1].

[1]: https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470
msg310137 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-17 04:38
FWIW, we've encountered a number of situations in the past when something that improved the timings on one compiler would make timings worse on another compiler.  There was also variance between timings on 32-bit builds versus 64-bit builds.
msg310146 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 09:45
https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470

Mean +- std dev: 10.5 us +- 1.4 us => Mean +- std dev: 9.68 us +- 0.89 us

It's 1.08x faster (-7.8%). That's small for a microbenchmark; usually an optimization should make a *microbenchmark* at least 10% faster.

For this optimization, I have no strong opinion.


Using memmove() for large copies is a good idea. The main question is the "if (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?


Previously, Python had a Py_MEMCPY() macro which also had such threshold. Basically, it's a workaround for compiler performance issues:

#if defined(_MSC_VER)
#define Py_MEMCPY(target, source, length) do {                          \
        size_t i_, n_ = (length);                                       \
        char *t_ = (void*) (target);                                    \
        const char *s_ = (void*) (source);                              \
        if (n_ >= 16)                                                   \
            memcpy(t_, s_, n_);                                         \
        else                                                            \
            for (i_ = 0; i_ < n_; i_++)                                 \
                t_[i_] = s_[i_];                                        \
    } while (0)
#else
#define Py_MEMCPY memcpy
#endif

Fortunately, the macro has now simply become:

/* Py_MEMCPY is kept for backwards compatibility,
 * see https://bugs.python.org/issue28126 */
#define Py_MEMCPY memcpy

And it's no longer used.


I recall a performance issues with GCC memcmp() builtin function (replacing the libc function during the compilation): https://bugs.python.org/issue17628#msg186012

See also:
* https://bugs.python.org/issue13134
* https://bugs.python.org/issue29782
msg310160 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 13:36
> FWIW, we've encountered a number of situations in the past when something that improved the timings on one compiler would make timings worse on another compiler.  There was also variance between timings on 32-bit builds versus 64-bit builds.

I've verified that both clang and gcc generate similar assembly (memcpy is not inlined and the loop is not vectorized) in 32-bit mode. I'd wager that the improvement from vectorization (in memmove) would be even more pronounced on 32-bit systems, given that pointers are half the size and cache lines are still 64 bytes wide.

> It's 1.08x faster (-7.8%). That's small for a microbenchmark; usually an optimization should make a *microbenchmark* at least 10% faster.

That's true if we assume lists have 100 or fewer elements. On the other hand, in the pyperformance comparison I'd posted yesterday[1], there seems to be an average improvement of 1.27x on the first seven benchmarks, and the largest slowdown is only 1.03x. That said, the improvement cannot be more than a constant factor, even with the vectorized loop in memmove.

> Using memmove() for large copy is a good idea. The main question is the "if (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?

The overhead of calling memmove() makes it slower for small lists; that's how I arrived at this patch in the first place. I tried replacing the loop with a plain memmove() and it was slower on pyperformance, but faster when switching to memmove() only above a threshold.

> Previously, Python had a Py_MEMCPY() macro which also had such threshold. Basically, it's a workaround for compiler performance issues:

That's very interesting! I think it boils down to the pointer aliasing problem. The pointers in memcpy()'s signature have the `restrict` qualifier, which gives the compiler more leeway to optimize calls to memcpy, while the compiler has to be more conservative with memmove(). I wonder if it's worth trying out a Py_MEMMOVE() :)


[1]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030
msg310161 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-01-17 13:37
Le 17/01/2018 à 14:36, Jeethu Rao a écrit :
> 
> On the other hand, on the pyperformance comparison I'd posted yesterday[1], there seems to be an average improvement of 1.27x  on the first seven benchmarks, and the slowest slowdown is only 1.03x.

I still think those numbers are misleading or downright bogus.  There is
no existing proof that list.insert() is a critical path in those benchmarks.
msg310163 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 13:43
> I still think those numbers are misleading or downright bogus.  There is no existing proof that list.insert() is a critical path in those benchmarks.

Can someone check if these benchmarks really use list.insert() in hot code? If yes, why? :-) The cost of list.insert() is known, no?
msg310184 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 15:07
> > I still think those numbers are misleading or downright bogus.  There is no existing proof that list.insert() is a critical path in those benchmarks.

> Can someone check if these benchmarks really use list.insert() in hot code? If yes, why? :-) The cost of list.insert() is known, no?

Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294
msg310187 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 15:24
> Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294

I don't understand how to read this output.

"54640 ins1 called 40050 times"

What is 54640?

I'm interested to know which benchmarks call list.insert() 40k times.
msg310189 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 15:29
> What is 54640?

That's the pid of the process.

> I'm interested to know which benchmarks call list.insert() 40k times.

The django_template benchmark.
msg310193 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-17 15:39
In https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88 I read:

| django_template | 160 ms | 129 ms | 1.24x faster | Significant (t=15.05) |

So on this benchmark, the optimization seems significant. But in the same paste, I see other benchmarks 1.2x faster whereas they don't abuse list.insert(), and maybe don't use list.insert() at all.

Maybe Django template engine should be fixed to use deque instead ;-)
msg310195 - (view) Author: Jeethu Rao (jeethu) * Date: 2018-01-17 15:50
It's also interesting that in https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a :

| django_template         | 307 ms    | 312 ms                   | 1.02x slower | Not significant        |

It seems to be slower, and the benchmarks before it (2to3, chameleon, chaos, crypto_pyaes, deltablue), which we now know barely use list.insert, are slightly faster :)

¯\_(ツ)_/¯
History
Date User Action Args
2018-01-17 15:50:49jeethusetmessages: + msg310195
2018-01-17 15:39:41vstinnersetmessages: + msg310193
2018-01-17 15:30:00jeethusetmessages: + msg310189
2018-01-17 15:24:52vstinnersetmessages: + msg310187
2018-01-17 15:07:24jeethusetmessages: + msg310184
2018-01-17 13:43:26vstinnersetmessages: + msg310163
2018-01-17 13:37:37pitrousetmessages: + msg310161
2018-01-17 13:36:07jeethusetmessages: + msg310160
2018-01-17 09:45:26vstinnersetmessages: + msg310146
2018-01-17 04:38:51rhettingersetmessages: + msg310137
2018-01-17 00:56:55jeethusetmessages: + msg310122
2018-01-17 00:28:43vstinnersetmessages: + msg310121
2018-01-17 00:07:53jeethusetmessages: + msg310120
2018-01-16 17:05:47pitrousetmessages: + msg310094
2018-01-16 09:44:18vstinnersettitle: Speed-up list.insert -> Speed-up list.insert: use memmove()
2018-01-16 09:30:58vstinnersetnosy: + vstinner
messages: + msg310064
2018-01-16 09:13:16pitrousetmessages: + msg310059
2018-01-16 09:09:41jeethusetmessages: + msg310056
2018-01-15 23:19:44pitrousetmessages: + msg310023
2018-01-15 23:17:52jeethusetmessages: + msg310022
2018-01-15 14:16:39pitrousetnosy: + pitrou
messages: + msg309983
2018-01-14 18:15:19rhettingersetnosy: + rhettinger
messages: + msg309929
2018-01-14 10:57:09jeethusetmessages: + msg309917
2018-01-11 14:59:24serhiy.storchakasetmessages: + msg309812
2018-01-11 14:40:00jeethusetmessages: + msg309810
2018-01-11 14:17:29serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg309809
2018-01-11 13:57:48jeethusetkeywords: + patch
stage: patch review
pull_requests: + pull_request5017
2018-01-11 13:52:14jeethucreate