This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Simplify the VM's stack manipulations
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: brandtbucher Nosy List: Dennis Sweeney, Mark.Shannon, barry, brandtbucher, kj
Priority: normal Keywords: patch

Created on 2022-01-26 01:39 by brandtbucher, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 30902 merged brandtbucher, 2022-01-26 01:48
PR 30970 merged brandtbucher, 2022-01-27 22:27
PR 30998 merged brandtbucher, 2022-01-28 23:17
PR 31039 merged brandtbucher, 2022-01-31 23:32
Messages (15)
msg411697 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-26 01:39
...as discussed in https://github.com/faster-cpython/ideas/discussions/228.

We can dramatically simplify our stack manipulations by getting rid of the `DUP_TOP*` and `ROT_*` families of instructions:

- Replace `DUP_TOP` with `COPY(1)`.
- Replace `DUP_TOP_TWO` with `COPY(2), COPY(2)`.
- Introduce a new `SWAP` instruction.
  - Replace `ROT_TWO` with `SWAP(2)`.
  - Replace `ROT_THREE` with `SWAP(3), SWAP(2)`.
  - Remove `ROT_FOUR`.
  - Replace `ROT_N(n)` with `SWAP(n), SWAP(n - 1), ..., SWAP(2)`.

It then becomes much simpler for the peephole optimizer to reason about code sequences involving these instructions (for example, it's pretty straightforward to truly *optimize* an arbitrary sequence of swaps).
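
To make the mapping above concrete, here is a minimal pure-Python model of the new instructions' stack effects, treating the value stack as a list whose last element is the top of stack (TOS). This is only an illustration of the semantics, not CPython's C implementation:

    def copy(stack, n):
        """COPY(n): push another reference to the n-th item from the top."""
        stack.append(stack[-n])

    def swap(stack, n):
        """SWAP(n): exchange TOS with the n-th item from the top."""
        stack[-1], stack[-n] = stack[-n], stack[-1]

    stack = ["a", "b", "c"]
    copy(stack, 1)                              # old DUP_TOP     -> ['a', 'b', 'c', 'c']

    stack = ["a", "b", "c"]
    copy(stack, 2); copy(stack, 2)              # old DUP_TOP_TWO -> ['a', 'b', 'c', 'b', 'c']

    stack = ["a", "b", "c"]
    swap(stack, 3); swap(stack, 2)              # old ROT_THREE   -> ['c', 'a', 'b']

    stack = ["a", "b", "c", "d"]
    for i in range(4, 1, -1):                   # old ROT_N(4), i.e. ROT_FOUR
        swap(stack, i)                          # -> ['d', 'a', 'b', 'c']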
msg411698 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2022-01-26 01:44
What are the (unoptimized) performance implications of replacing one instruction with multiple instructions?
msg411700 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-26 02:11
In practice, pretty much the only thing that will do more work now is augmented subscription:

>>> a[b] += c

main:

  1           2 LOAD_NAME                0 (a)
              4 LOAD_NAME                1 (b)
              6 DUP_TOP_TWO
              8 BINARY_SUBSCR
             10 LOAD_NAME                2 (c)
             12 BINARY_OP               13 (+=)
             14 ROT_THREE
             16 STORE_SUBSCR

patched:

  1           2 LOAD_NAME                0 (a)
              4 LOAD_NAME                1 (b)
              6 COPY                     2
              8 COPY                     2
             10 BINARY_SUBSCR
             12 LOAD_NAME                2 (c)
             14 BINARY_OP               13 (+=)
             16 SWAP                     3
             18 SWAP                     2
             20 STORE_SUBSCR

Pattern matching is the only place where we use ROT_N, and frankly it's *misused* there... often, it makes the cost of storing names quadratic at runtime. Even though we emit *more* instructions now, the peephole optimizer actually cuts them down significantly, and the result is more efficient than before. For example:

>>> match x:
...     case (a, b, c, None):
...         pass

main:

  1           2 LOAD_NAME                0 (x)

  2           4 MATCH_SEQUENCE
              6 POP_JUMP_IF_FALSE       20 (to 40)
              8 GET_LEN
             10 LOAD_CONST               0 (4)
             12 COMPARE_OP               2 (==)
             14 POP_JUMP_IF_FALSE       20 (to 40)
             16 UNPACK_SEQUENCE          4
             18 ROT_FOUR
             20 ROT_FOUR
             22 ROT_FOUR
             24 POP_JUMP_IF_NOT_NONE    18 (to 36)
             26 STORE_NAME               1 (a)
             28 STORE_NAME               2 (b)
             30 STORE_NAME               3 (c)

  3          32 LOAD_CONST               1 (None)
             34 RETURN_VALUE

  2     >>   36 POP_TOP
             38 POP_TOP
        >>   40 POP_TOP
             42 LOAD_CONST               1 (None)
             44 RETURN_VALUE

patched:

  1           2 LOAD_NAME                0 (x)

  2           4 MATCH_SEQUENCE
              6 POP_JUMP_IF_FALSE       20 (to 40)
              8 GET_LEN
             10 LOAD_CONST               0 (4)
             12 COMPARE_OP               2 (==)
             14 POP_JUMP_IF_FALSE       20 (to 40)
             16 UNPACK_SEQUENCE          4
             18 SWAP                     2
             20 SWAP                     3
             22 SWAP                     4
             24 POP_JUMP_IF_NOT_NONE    18 (to 36)
             26 STORE_NAME               1 (a)
             28 STORE_NAME               2 (b)
             30 STORE_NAME               3 (c)

  3          32 LOAD_CONST               1 (None)
             34 RETURN_VALUE

  2     >>   36 POP_TOP
             38 POP_TOP
        >>   40 POP_TOP
             42 LOAD_CONST               1 (None)
             44 RETURN_VALUE

Replacing the ROT_FOURs with SWAPs may seem minor, but it ends up being a *lot* less work at runtime.
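
To put rough numbers on that, here is an illustrative pure-Python comparison of how many stack slots each version writes for the capture pattern above (a sketch only; the real costs are in the C implementations of ROT_FOUR and SWAP):

    def rot_n(stack, n):
        """Old ROT_N(n) / ROT_FOUR: move TOS down n-1 slots, shifting the rest up."""
        stack[-n:] = [stack[-1]] + stack[-n:-1]
        return n                                             # slots written

    def swap(stack, n):
        """New SWAP(n): exchange TOS with the n-th item from the top."""
        stack[-1], stack[-n] = stack[-n], stack[-1]
        return 2                                             # slots written

    old = [None, "c", "b", "a"]                              # bottom ... top, as left by UNPACK_SEQUENCE 4
    new = list(old)

    old_writes = sum(rot_n(old, 4) for _ in range(3))        # ROT_FOUR x3
    new_writes = swap(new, 2) + swap(new, 3) + swap(new, 4)  # SWAP 2, SWAP 3, SWAP 4

    assert old == new                                        # identical net stack effect
    print(old_writes, new_writes)                            # 12 vs. 6 here; the gap widens
                                                             # as patterns bind more names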
msg411791 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-26 20:47
New changeset 85483668647e7840c7b9a1877caaf2ef14a4443f by Brandt Bucher in branch 'main':
bpo-46528: Simplify the VM's stack manipulations (GH-30902)
https://github.com/python/cpython/commit/85483668647e7840c7b9a1877caaf2ef14a4443f
msg411937 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2022-01-27 23:20
Very interesting.  Do we have any current (or even cutting edge, i.e. 3.11) timings for individual instructions?  Or a breakdown of how frequently different instructions are invoked?  I remember Carl Shapiro presenting his findings here several years ago (I think in the penultimate in-person Language Summit).
msg411938 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-27 23:28
In a typical run of the pyperformance benchmark suite, rotations account for a bit over 1% of all instructions executed.

I don't have timings for individual instructions, unfortunately.
msg411940 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2022-01-27 23:40
Timings for individual instructions are a bit meaningless, as out-of-order execution and speculation on modern CPUs makes it hard to pin down the timing of anything.

I did an experiment to double the number of instructions. It slowed things down by ~10%, so increasing the number of instructions by 1% would be expected to result in a slowdown of 0.1%.

In other words, this is going to make little or no difference to performance. It does make things cleaner and simpler though, which has its own benefits.
msg411941 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-27 23:46
> I did an experiment to double the number of instructions.

Were the extra instructions just NOPs, or were they actually doing any work?

If they were NOPs, then presumably those numbers tell us more about the overhead of dispatch and cache pressure than anything else.
msg411952 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2022-01-28 00:56
IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top.  I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.
msg411986 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2022-01-28 11:00
> IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top.  I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.

AFAIK, on *nix systems with PGO, the order doesn't matter. Brandt did some interesting research on this earlier https://github.com/faster-cpython/ideas/issues/96.

On Windows, it might be a different story. There are reports of up to 5% speedups in pyperformance just by moving LOAD_FAST and LOAD_GLOBAL out of the loop (to the top).  See the last few messages in https://bugs.python.org/issue45116. I've heard from some people that MSVC's PGO doesn't deal with gigantic switch-case statements very well, though I can't confirm the veracity of that info.
msg412024 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-01-28 18:47
Minor nit: I think swaptimize() should check for PyMem_Malloc() returning NULL here.

    // Create an array with elements {0, 1, 2, ..., depth - 1}:
    int *stack = PyMem_Malloc(depth * sizeof(int));
    for (int i = 0; i < depth; i++) {
        stack[i] = i;
    }
msg412030 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-28 20:13
Hm, yeah. Bummer that this needs error handling now.

I'll have a fix up soon.
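
For anyone who hasn't read the new pass: the `stack` array in the snippet above appears to be the identity permutation that swaptimize() permutes as it scans a run of SWAPs. The underlying idea (compose the whole run into one permutation, then re-emit it cycle by cycle) can be sketched in pure Python; this is a conceptual model only, not the C code:

    def compose(swaps, depth):
        """perm[i] = slot whose original value ends up in slot i (slot 0 is TOS)."""
        perm = list(range(depth))
        for n in swaps:                    # SWAP(n) exchanges slots 0 and n-1
            perm[0], perm[n - 1] = perm[n - 1], perm[0]
        return perm

    def reemit(perm):
        """Rebuild the permutation as SWAP opargs, one cycle at a time."""
        out, seen = [], set()
        for start in range(len(perm)):
            if start in seen or perm[start] == start:
                continue
            cycle, slot = [], start
            while slot not in seen:        # walk the cycle: start, perm[start], ...
                seen.add(slot)
                cycle.append(slot)
                slot = perm[slot]
            if start == 0:
                # A cycle through TOS needs one SWAP per remaining member.
                out += [s + 1 for s in reversed(cycle[1:])]
            else:
                # Other cycles route their values through TOS, then restore it.
                out += [s + 1 for s in reversed(cycle)] + [cycle[-1] + 1]
        return out

    run = [2, 4, 2, 3, 4]                  # an arbitrary run of SWAP opargs
    perm = compose(run, 4)
    optimized = reemit(perm)
    assert compose(optimized, 4) == perm   # same net stack effect
    print(run, "->", optimized)            # [2, 4, 2, 3, 4] -> [3, 4, 2]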
msg412314 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-02-01 21:41
New changeset a0e55a571cf01885fd5826266c37abaee307c309 by Brandt Bucher in branch 'main':
bpo-46528: Simplify BUILD_TUPLE/UNPACK_SEQUENCE folding (GH-31039)
https://github.com/python/cpython/commit/a0e55a571cf01885fd5826266c37abaee307c309
msg412939 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-02-09 19:31
New changeset 46328d8ae6529db916ccaabb9247fb0327ce0c1e by Brandt Bucher in branch 'main':
bpo-46528: Check PyMem_Malloc for NULL (GH-30998)
https://github.com/python/cpython/commit/46328d8ae6529db916ccaabb9247fb0327ce0c1e
msg412954 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-02-09 23:15
New changeset 78ae4cc6dc949e8bc39fab25fea5efe983dc0ad1 by Brandt Bucher in branch 'main':
bpo-46528: Attempt SWAPs at compile-time (GH-30970)
https://github.com/python/cpython/commit/78ae4cc6dc949e8bc39fab25fea5efe983dc0ad1
History
Date User Action Args
2022-04-11 14:59:55  admin           set  github: 90686
2022-02-09 23:16:39  brandtbucher    set  status: open -> closed
                                          resolution: fixed
                                          stage: patch review -> resolved
2022-02-09 23:15:52  brandtbucher    set  messages: + msg412954
2022-02-09 19:31:15  brandtbucher    set  messages: + msg412939
2022-02-01 21:41:42  brandtbucher    set  messages: + msg412314
2022-01-31 23:32:55  brandtbucher    set  pull_requests: + pull_request29222
2022-01-28 23:17:27  brandtbucher    set  pull_requests: + pull_request29180
2022-01-28 20:13:57  brandtbucher    set  messages: + msg412030
2022-01-28 18:47:41  Dennis Sweeney  set  nosy: + Dennis Sweeney
                                          messages: + msg412024
2022-01-28 11:00:26  kj              set  nosy: + kj
                                          messages: + msg411986
2022-01-28 00:56:39  barry           set  messages: + msg411952
2022-01-27 23:46:08  brandtbucher    set  messages: + msg411941
2022-01-27 23:40:21  Mark.Shannon    set  messages: + msg411940
2022-01-27 23:28:04  brandtbucher    set  messages: + msg411938
2022-01-27 23:20:12  barry           set  messages: + msg411937
2022-01-27 22:27:25  brandtbucher    set  pull_requests: + pull_request29148
2022-01-26 20:47:52  brandtbucher    set  messages: + msg411791
2022-01-26 02:11:23  brandtbucher    set  messages: + msg411700
2022-01-26 01:48:32  brandtbucher    set  keywords: + patch
                                          stage: patch review
                                          pull_requests: + pull_request29081
2022-01-26 01:44:17  barry           set  nosy: + barry
                                          messages: + msg411698
2022-01-26 01:39:59  brandtbucher    create