Issue 39801: list.insert is slow, likely due to manual memmove

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/83982

classification

Title:	list.insert is slow, likely due to manual memmove
Type:	performance	Stage:
Components:	Interpreter Core	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Stefan Pochmann, rhettinger, tim.peters
Priority:	low	Keywords:

Created on 2020-02-29 16:11 by Stefan Pochmann, last changed 2022-04-11 14:59 by admin.

Messages (10)
msg362986 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2020-02-29 16:11
Using a list's insert function is much slower than using slice assignment: > python -m timeit -n 100000 -s "a=[]" "a.insert(0,0)" 100000 loops, best of 5: 19.2 usec per loop > python -m timeit -n 100000 -s "a=[]" "a[0:0]=[0]" 100000 loops, best of 5: 6.78 usec per loop (Note that the list starts empty but grows to 100,000 elements.) At first I thought maybe it's the attribute lookup or function call overhead or so, but inserting near the end shows that that's negligible: > python -m timeit -n 100000 -s "a=[]" "a.insert(-1,0)" 100000 loops, best of 5: 79.1 nsec per loop I asked at StackOverflow and someone pointed out that list.insert uses a manual loop instead of memmove: https://stackoverflow.com/a/60466572/12671057
msg362993 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2020-02-29 17:07
I believe it also affects bisect.insort, which I occasionally use when I need a "self-sorting list" (I can't easily test it, as I'm not set up to modify the C version of bisect.insort). And also the popular sortedcontainers package, which relies on such list operations to be fast: http://www.grantjenks.com/docs/sortedcontainers/
msg362996 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2020-02-29 17:14
Benchmarking with two Python versions of bisect.insort, the "insert" version takes 1.08 seconds to insort the shuffled range(105) while the slice assignment version only takes 0.46 seconds: from timeit import timeit import random from bisect import bisect_right def insort1(a, x): lo = bisect_right(a, x) a.insert(lo, x) def insort2(a, x): lo = bisect_right(a, x) a[lo:lo] = [x] for _ in range(3): a = list(range(105)) random.shuffle(a) for insort in insort1, insort2: it = iter(a) s = [] print(timeit(lambda: insort(s, next(it)), number=len(a))) print()
msg362998 - (view)	Author: Tim Peters (tim.peters) *	Date: 2020-02-29 17:28
Be cautious about this. Many years ago I looked into this, mostly in the context of the list sort's binary insertion sort, and results were all over the map, depending on the compiler in use, the optimization level, and the range at which (if any!) memmove() was actually faster than a simple loop. A fancy memmove() will (at least) arrange to copy (say) 8 bytes at a time, but the overhead of setting that up (+ the function call) made it a loser when the number of bytes to be moved was "too small". Don't know whether the comment in listobject.c's binarysort() still applies: Caution: using memmove is much slower under MSVC 5; we're not usually moving many slots. */ Optimizing to move 100,000 elements isn't really sane - cutting the constant factor on an inherently quadratic-time too-simplistic basic approach isn't really doing you a favor ;-) Also note that sortedcontainers does not use unbounded-length Python lists under the covers. It puts an upper bound on the length of the Python lists it uses, precisely to avoid visibly quadratic-time behavior.
msg363000 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2020-02-29 18:11
Here's a diff you can try out: diff --git a/Objects/listobject.c b/Objects/listobject.c index 3c39c6444b..ac6c9cc07a 100644 --- a/Objects/listobject.c +++ b/Objects/listobject.c @@ -272,8 +272,7 @@ ins1(PyListObject self, Py_ssize_t where, PyObject v) if (where > n) where = n; items = self->ob_item; - for (i = n; --i >= where; ) - items[i+1] = items[i]; + memmove(items+where+1, items+where+0, (n - where) * sizeof(PyObject )); Py_INCREF(v); items[where] = v; return 0; For me, that patch makes little difference in the timings. Another thing that could be tried is to have ins1() call list_ass_slice() to do the work. That way, both paths will use the same underlying insertion code. It would be interesting to see if this makes any difference. A few other thoughts: We usually prefer deques to lists when prepending because the former are O(1) and the latter are O(n). * Modern compilers are very good at optimizing data movements in a simple for-loop. * It is very difficult to get good timings for these kind of operations. Internally, both make a call to list_resize() which in turn uses realloc(). The latter function can be very fast or very slow depending solely on whether it can resize in-place or has to move the data. Accordingly, timings are easily confounded by minor changes to memory useage. To get a clearcut timing, you would need to isolate just the for-loop or memmove operation and not include Python calling overhead, realloc operations, and everything else that is going on in the given timings. * Another source of confounding is the compiler itself. Changes that makes small improvements on Clang may hurt the GCC build or vice-versa. The comparison can also be thrown-off by whether you're timing a PGO build (which tends to do an amazing job of optimizing for-foops). The results may also differ between 32-bit builds and 64-builds. * The timing isn't representative of how people use insert(x, 0). It would be a mistake to optimize an uncommon case if it comes at the expense of the common case (insertions into small lists). To guard against this, you would need to measure inserts into a small list to find-out whether memmove() has more setup time than the for-loop. When we've done such investigations in the past, for-loops were found to beat memmove() for sizes under 16. Almost certainly this changed as compilers and processors have changed over the years. * For smaller list sizes, the calling overhead dominates the insertion time. Disassembling the two different calls show a good deal of overhead for calls. >>> from dis import dis >>> dis('a.insert(0,0)') 1 0 LOAD_NAME 0 (a) 2 LOAD_METHOD 1 (insert) 4 LOAD_CONST 0 (0) 6 LOAD_CONST 0 (0) 8 CALL_METHOD 2 10 RETURN_VALUE >>> dis('a[0:0]=[0]') 1 0 LOAD_CONST 0 (0) 2 BUILD_LIST 1 4 LOAD_NAME 0 (a) 6 LOAD_CONST 0 (0) 8 LOAD_CONST 0 (0) 10 BUILD_SLICE 2 12 STORE_SUBSCR 14 LOAD_CONST 1 (None) 16 RETURN_VALUE
msg363001 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2020-02-29 18:24
Good point, Tim. Although binarysort really moves very few slots (I think at most 62, and average like 13). That might not be representative for how people use list.insert, either. I think I mainly use it for the mentioned case of a "self-sorting list", either naively via bisect.insort with a single list or via sortedcontainers' SortedList using multiple lists (used it only once so far). If I'm not mistaken, SortedList has a default load factor of 1000 and splits at double that size. I might do more reasonable benchmarks later (haven't read Raymond's reply yet). In any case, binarysort does its own inserting, so at least it wouldn't be affected if list.insert were changed.
msg363004 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2020-02-29 18:47
Marking this as "pending". It is more of a personal research project than a bug report, feature request, or known optimization. The underlying hypothesis is that compilers aren't smart enough to generate good code for a simple for-loop that moves data. The evidence is weak as involves timing unusual cases, with significantly different calling patterns, and that make repeated calls to realloc() which can throw the results off wildly.
msg363034 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2020-02-29 23:35
I have better benchmarks now and am trying some more, though I guess we roughly agree about when memmove is faster and when it's slower but that the real question is how large the common case is. Do you know where I can find those past investigations? Was memmove faster for sizes over 16? What is the common case, i.e., how do people use list.insert?
msg363035 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2020-03-01 02:06
I think I have a decent way to isolate the memmove vs for-loop times, in Python code. Here are example results, five rounds of inserting with list.insert or with slice assignment. The times for each of the two ways are pretty stable: insert slice 0.024221 0.006754 0.024157 0.006710 0.024015 0.006734 0.024206 0.006731 0.024123 0.006769 For each run, I build a list of length 220 and append one value. Then I can insert 217 more values without the list internally resizing. Then I measure inserting at index -1 and consider that the baseline, as it includes all the overhead costs like function calls and other differences between the two insertion methods. Then I measure inserting at index -500 and subtract the baseline time. This difference should reflect how long the memmove or the for-loop took, as that's the difference between inserting at -1 and at -500. It's what I showed in those results above. I chose -500 here because the use case I have in mind is sortedcontainers.SortedList, which I think mostly involves lists of length 1000 or more (up to 2000), and insertions somewhere in the middle. Results for index -50 instead: insert slice 0.002479 0.002217 0.002546 0.002113 0.002566 0.002007 0.002425 0.002081 0.002555 0.002261 I'm btw running Python 3.8.1 32-bit on Windows 10 64-bit on an Intel i5-7200U. Rest of this message is the benchmark code: from timeit import repeat statements = ( 'a.insert(-1,0)', 'a.insert(-50,0)', 'a[-1:0] = 0,', 'a[-50:0] = 0,', ) # Verify that the statements work and don't cause internal list resizes. for stmt in statements: a = [0] * 220 a.append(0) size_before = a.__sizeof__() stmt = compile(stmt, '', 'single') for _ in range(217): exec(stmt) assert len(a) == 220 + 1 + 217 assert a.__sizeof__() == size_before # Run the benchmark. print(' insert slice') for _ in range(5): times = [] for stmt in statements: t = min(repeat(stmt, 'a=[0]220; a.append(0)', number=217, repeat=20)) times.append(t) print('%10.6f' 2 % (times[1] - times[0], times[3] - times[2]))
msg363036 - (view)	Author: Stefan Pochmann (Stefan Pochmann) *	Date: 2020-03-01 02:14
I misspoke. I do the insertions at -1 and -500 not on the same list but each on a fresh list (of length 2**20+1).

History
Date	User	Action	Args
2022-04-11 14:59:27	admin	set	github: 83982
2020-05-18 01:31:11	rhettinger	set	assignee: rhettinger ->
2020-03-01 02:14:31	Stefan Pochmann	set	messages: + msg363036
2020-03-01 02:06:31	Stefan Pochmann	set	status: pending -> open messages: + msg363035
2020-02-29 23:36:51	Stefan Pochmann	set	status: open -> pending
2020-02-29 23:35:28	Stefan Pochmann	set	status: pending -> open messages: + msg363034
2020-02-29 18:47:40	rhettinger	set	priority: normal -> low status: open -> pending messages: + msg363004 versions: + Python 3.9, - Python 3.8
2020-02-29 18:24:52	Stefan Pochmann	set	messages: + msg363001
2020-02-29 18:11:27	rhettinger	set	messages: + msg363000
2020-02-29 17:28:41	tim.peters	set	nosy: + tim.peters messages: + msg362998
2020-02-29 17:14:26	Stefan Pochmann	set	messages: + msg362996
2020-02-29 17:07:01	Stefan Pochmann	set	messages: + msg362993
2020-02-29 17:04:11	rhettinger	set	assignee: rhettinger nosy: + rhettinger
2020-02-29 16:27:12	Stefan Pochmann	set	title: list.insert is slow due to manual memmove -> list.insert is slow, likely due to manual memmove
2020-02-29 16:11:28	Stefan Pochmann	create