Title: list.insert is slow, likely due to manual memmove
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.9
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Stefan Pochmann, rhettinger, tim.peters
Priority: low Keywords:

Created on 2020-02-29 16:11 by Stefan Pochmann, last changed 2020-05-18 01:31 by rhettinger.

Messages (10)
msg362986 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2020-02-29 16:11
Using a list's insert function is much slower than using slice assignment:

> python -m timeit -n 100000 -s "a=[]" "a.insert(0,0)"
100000 loops, best of 5: 19.2 usec per loop

> python -m timeit -n 100000 -s "a=[]" "a[0:0]=[0]"
100000 loops, best of 5: 6.78 usec per loop

(Note that the list starts empty but grows to 100,000 elements.)

At first I thought maybe it's the attribute lookup or function call overhead or so, but inserting near the end shows that that's negligible:

> python -m timeit -n 100000 -s "a=[]" "a.insert(-1,0)"
100000 loops, best of 5: 79.1 nsec per loop

I asked at StackOverflow and someone pointed out that list.insert uses a manual loop instead of memmove:
msg362993 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2020-02-29 17:07
I believe it also affects bisect.insort, which I occasionally use when I need a "self-sorting list" (I can't easily test it, as I'm not set up to modify the C version of bisect.insort).

And also the popular sortedcontainers package, which relies on such list operations to be fast:
msg362996 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2020-02-29 17:14
Benchmarking with two *Python* versions of bisect.insort, the "insert" version takes 1.08 seconds to insort the shuffled range(10**5) while the slice assignment version only takes 0.46 seconds:

from timeit import timeit
import random
from bisect import bisect_right

def insort1(a, x):
    lo = bisect_right(a, x)
    a.insert(lo, x)

def insort2(a, x):
    lo = bisect_right(a, x)
    a[lo:lo] = [x]

for _ in range(3):
    a = list(range(10**5))
    for insort in insort1, insort2:
        it = iter(a)
        s = []
        print(timeit(lambda: insort(s, next(it)), number=len(a)))
msg362998 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-02-29 17:28
Be cautious about this.  Many years ago I looked into this, mostly in the context of the list sort's binary insertion sort, and results were all over the map, depending on the compiler in use, the optimization level, and the range at which (if any!) memmove() was actually faster than a simple loop.  A fancy memmove() will (at least) arrange to copy (say) 8 bytes at a time, but the overhead of setting that up (+ the function call) made it a loser when the number of bytes to be moved was "too small".

Don't know whether the comment in listobject.c's binarysort() still applies:

           Caution: using memmove is much slower under MSVC 5;
           we're not usually moving many slots. */

Optimizing to move 100,000 elements isn't really sane - cutting the constant factor on an inherently quadratic-time too-simplistic basic approach isn't really doing you a favor ;-)

Also note that sortedcontainers does not use unbounded-length Python lists under the covers.  It puts an upper bound on the length of the Python lists it uses, precisely to avoid visibly quadratic-time behavior.
msg363000 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-02-29 18:11
Here's a diff you can try out:

diff --git a/Objects/listobject.c b/Objects/listobject.c
index 3c39c6444b..ac6c9cc07a 100644
--- a/Objects/listobject.c
+++ b/Objects/listobject.c
@@ -272,8 +272,7 @@ ins1(PyListObject *self, Py_ssize_t where, PyObject *v)
     if (where > n)
         where = n;
     items = self->ob_item;
-    for (i = n; --i >= where; )
-        items[i+1] = items[i];
+    memmove(items+where+1, items+where+0, (n - where) * sizeof(PyObject *));
     items[where] = v;
     return 0;

For me, that patch makes little difference in the timings.

Another thing that could be tried is to have ins1() call list_ass_slice() to do the work.  That way, both paths will use the same underlying insertion code.  It would be interesting to see if this makes any difference.

A few other thoughts:

* We usually prefer deques to lists when prepending because the former are O(1) and the latter are O(n).

* Modern compilers are very good at optimizing data movements in a simple for-loop.

* It is very difficult to get good timings for these kind of operations.  Internally, both make a call to list_resize() which in turn uses realloc().  The latter function can be very fast or very slow depending solely on whether it can resize in-place or has to move the data.  Accordingly, timings are easily confounded by minor changes to memory useage.  To get a clearcut timing, you would need to isolate just the for-loop or memmove operation and not include Python calling overhead, realloc operations, and everything else that is going on in the given timings.

* Another source of confounding is the compiler itself.  Changes that makes small improvements on Clang may hurt the GCC build or vice-versa.  The comparison can also be thrown-off by whether you're timing a PGO build (which tends to do an amazing job of optimizing for-foops).  The results may also differ between 32-bit builds and 64-builds.

* The timing isn't representative of how people use insert(x, 0).  It would be a mistake to optimize an uncommon case if it comes at the expense of the common case (insertions into small lists).  To guard against this, you would need to measure inserts into a small list to find-out whether memmove() has more setup time than the for-loop.  When we've done such investigations in the past, for-loops were found to beat memmove() for sizes under 16.  Almost certainly this changed as compilers and processors have changed over the years.

* For smaller list sizes, the calling overhead dominates the insertion time.  Disassembling the two different calls show a good deal of overhead for calls.

>>> from dis import dis
>>> dis('a.insert(0,0)')
  1           0 LOAD_NAME                0 (a)
              2 LOAD_METHOD              1 (insert)
              4 LOAD_CONST               0 (0)
              6 LOAD_CONST               0 (0)
              8 CALL_METHOD              2
             10 RETURN_VALUE
>>> dis('a[0:0]=[0]')
  1           0 LOAD_CONST               0 (0)
              2 BUILD_LIST               1
              4 LOAD_NAME                0 (a)
              6 LOAD_CONST               0 (0)
              8 LOAD_CONST               0 (0)
             10 BUILD_SLICE              2
             12 STORE_SUBSCR
             14 LOAD_CONST               1 (None)
             16 RETURN_VALUE
msg363001 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2020-02-29 18:24
Good point, Tim. Although binarysort really moves very few slots (I think at most 62, and average like 13). That might not be representative for how people use list.insert, either. I think I mainly use it for the mentioned case of a "self-sorting list", either naively via bisect.insort with a single list or via sortedcontainers' SortedList using multiple lists (used it only once so far). If I'm not mistaken, SortedList has a default load factor of 1000 and splits at double that size.

I might do more reasonable benchmarks later (haven't read Raymond's reply yet).

In any case, binarysort does its own inserting, so at least it wouldn't be affected if list.insert were changed.
msg363004 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-02-29 18:47
Marking this as "pending".  It is more of a personal research project than a bug report, feature request, or known optimization.

The underlying hypothesis is that compilers aren't smart enough to generate good code for a simple for-loop that moves data.

The evidence is weak as involves timing unusual cases, with significantly different calling patterns, and that make repeated calls to realloc() which can throw the results off wildly.
msg363034 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2020-02-29 23:35
I have better benchmarks now and am trying some more, though I guess we roughly agree about when memmove is faster and when it's slower but that the real question is how large the common case is. 

Do you know where I can find those past investigations? Was memmove *faster* for sizes *over* 16? What is the common case, i.e., how do people use list.insert?
msg363035 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2020-03-01 02:06
I think I have a decent way to isolate the memmove vs for-loop times, in Python code. Here are example results, five rounds of inserting with list.insert or with slice assignment. The times for each of the two ways are pretty stable:

  insert    slice
  0.024221  0.006754
  0.024157  0.006710
  0.024015  0.006734
  0.024206  0.006731
  0.024123  0.006769

For each run, I build a list of length 2**20 and append one value. Then I can insert 2**17 more values without the list internally resizing.

Then I measure inserting at index -1 and consider that the baseline, as it includes all the overhead costs like function calls and other differences between the two insertion methods.

Then I measure inserting at index -500 and subtract the baseline time. This difference should reflect how long the memmove or the for-loop took, as that's the difference between inserting at -1 and at -500. It's what I showed in those results above.

I chose -500 here because the use case I have in mind is sortedcontainers.SortedList, which I think mostly involves lists of length 1000 or more (up to 2000), and insertions somewhere in the middle.

Results for index -50 instead:

  insert    slice
  0.002479  0.002217
  0.002546  0.002113
  0.002566  0.002007
  0.002425  0.002081
  0.002555  0.002261

I'm btw running Python 3.8.1 32-bit on Windows 10 64-bit on an Intel i5-7200U.

Rest of this message is the benchmark code:

from timeit import repeat

statements = (
  'a[-1:0] = 0,',
  'a[-50:0] = 0,',

# Verify that the statements work and don't cause internal list resizes.
for stmt in statements:
    a = [0] * 2**20
    size_before = a.__sizeof__()
    stmt = compile(stmt, '', 'single')
    for _ in range(2**17):
    assert len(a) == 2**20 + 1 + 2**17
    assert a.__sizeof__() == size_before

# Run the benchmark.
print('  insert    slice')
for _ in range(5):
    times = []
    for stmt in statements:
        t = min(repeat(stmt, 'a=[0]*2**20; a.append(0)', number=2**17, repeat=20))
    print('%10.6f' * 2 % (times[1] - times[0], times[3] - times[2]))
msg363036 - (view) Author: Stefan Pochmann (Stefan Pochmann) * Date: 2020-03-01 02:14
I misspoke. I do the insertions at -1 and -500 not on the same list but each on a fresh list (of length 2**20+1).
Date User Action Args
2020-05-18 01:31:11rhettingersetassignee: rhettinger ->
2020-03-01 02:14:31Stefan Pochmannsetmessages: + msg363036
2020-03-01 02:06:31Stefan Pochmannsetstatus: pending -> open

messages: + msg363035
2020-02-29 23:36:51Stefan Pochmannsetstatus: open -> pending
2020-02-29 23:35:28Stefan Pochmannsetstatus: pending -> open

messages: + msg363034
2020-02-29 18:47:40rhettingersetpriority: normal -> low
status: open -> pending
messages: + msg363004

versions: + Python 3.9, - Python 3.8
2020-02-29 18:24:52Stefan Pochmannsetmessages: + msg363001
2020-02-29 18:11:27rhettingersetmessages: + msg363000
2020-02-29 17:28:41tim.peterssetnosy: + tim.peters
messages: + msg362998
2020-02-29 17:14:26Stefan Pochmannsetmessages: + msg362996
2020-02-29 17:07:01Stefan Pochmannsetmessages: + msg362993
2020-02-29 17:04:11rhettingersetassignee: rhettinger

nosy: + rhettinger
2020-02-29 16:27:12Stefan Pochmannsettitle: list.insert is slow due to manual memmove -> list.insert is slow, likely due to manual memmove
2020-02-29 16:11:28Stefan Pochmanncreate