msg411697 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-01-26 01:39 |
...as discussed in https://github.com/faster-cpython/ideas/discussions/228.
We can dramatically simplify our stack manipulations by getting rid of the `DUP_TOP*` and `ROT_*` families of instructions:
- Replace `DUP_TOP` with `COPY(1)`.
- Replace `DUP_TOP_TWO` with `COPY(2), COPY(2)`.
- Introduce a new `SWAP` instruction.
- Replace `ROT_TWO` with `SWAP(2)`.
- Replace `ROT_THREE` with `SWAP(3), SWAP(2)`.
- Remove `ROT_FOUR`.
- Replace `ROT_N(n)` with `SWAP(n), SWAP(n - 1), ..., SWAP(2)`.
It then becomes much simpler for the peephole optimizer to reason about code sequences involving these instructions (for example, it's pretty straightforward to truly *optimize* an arbitrary sequence of swaps).
|
msg411698 - (view) |
Author: Barry A. Warsaw (barry) * |
Date: 2022-01-26 01:44 |
What are the (unoptimized) performance implications of replacing one instruction with multiple instructions?
|
msg411700 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-01-26 02:11 |
In practice, pretty much the only thing that will do more work now is augmented subscription:
>>> a[b] += c
main:
1 2 LOAD_NAME 0 (a)
4 LOAD_NAME 1 (b)
6 DUP_TOP_TWO
8 BINARY_SUBSCR
10 LOAD_NAME 2 (c)
12 BINARY_OP 13 (+=)
14 ROT_THREE
16 STORE_SUBSCR
patched:
1 2 LOAD_NAME 0 (a)
4 LOAD_NAME 1 (b)
6 COPY 2
8 COPY 2
10 BINARY_SUBSCR
12 LOAD_NAME 2 (c)
14 BINARY_OP 13 (+=)
16 SWAP 3
18 SWAP 2
20 STORE_SUBSCR
Pattern matching is the only place where we use ROT_N, and frankly it's *misused* there... often, it makes the cost of storing names quadratic at runtime. Even though we emit *more* instructions now, the peephole optimizer actually cuts them down significantly, and the result is more efficient than before. For example:
>>> match x:
... case (a, b, c, None):
... pass
main:
1 2 LOAD_NAME 0 (x)
2 4 MATCH_SEQUENCE
6 POP_JUMP_IF_FALSE 20 (to 40)
8 GET_LEN
10 LOAD_CONST 0 (4)
12 COMPARE_OP 2 (==)
14 POP_JUMP_IF_FALSE 20 (to 40)
16 UNPACK_SEQUENCE 4
18 ROT_FOUR
20 ROT_FOUR
22 ROT_FOUR
24 POP_JUMP_IF_NOT_NONE 18 (to 36)
26 STORE_NAME 1 (a)
28 STORE_NAME 2 (b)
30 STORE_NAME 3 (c)
3 32 LOAD_CONST 1 (None)
34 RETURN_VALUE
2 >> 36 POP_TOP
38 POP_TOP
>> 40 POP_TOP
42 LOAD_CONST 1 (None)
44 RETURN_VALUE
patched:
1 2 LOAD_NAME 0 (x)
2 4 MATCH_SEQUENCE
6 POP_JUMP_IF_FALSE 20 (to 40)
8 GET_LEN
10 LOAD_CONST 0 (4)
12 COMPARE_OP 2 (==)
14 POP_JUMP_IF_FALSE 20 (to 40)
16 UNPACK_SEQUENCE 4
18 SWAP 2
20 SWAP 3
22 SWAP 4
24 POP_JUMP_IF_NOT_NONE 18 (to 36)
26 STORE_NAME 1 (a)
28 STORE_NAME 2 (b)
30 STORE_NAME 3 (c)
3 32 LOAD_CONST 1 (None)
34 RETURN_VALUE
2 >> 36 POP_TOP
38 POP_TOP
>> 40 POP_TOP
42 LOAD_CONST 1 (None)
44 RETURN_VALUE
Replacing the ROT_FOURs with SWAPs may seem minor, but it ends up being a *lot* less work at runtime.
|
msg411791 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-01-26 20:47 |
New changeset 85483668647e7840c7b9a1877caaf2ef14a4443f by Brandt Bucher in branch 'main':
bpo-46528: Simplify the VM's stack manipulations (GH-30902)
https://github.com/python/cpython/commit/85483668647e7840c7b9a1877caaf2ef14a4443f
|
msg411937 - (view) |
Author: Barry A. Warsaw (barry) * |
Date: 2022-01-27 23:20 |
Very interesting. Do we have any current (or even cutting edge, i.e. 3.11) timings for individual instructions? Or a breakdown of how frequently different instructions are invoked? I remember Carl Shapiro presenting his findings here several years ago (I think in the penultimate in-person Language Summit).
|
msg411938 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-01-27 23:28 |
In a typical run of the pyperformance benchmark suite, rotations account for a bit over 1% of all instructions executed.
I don't have timings for individual instructions, unfortunately.
|
msg411940 - (view) |
Author: Mark Shannon (Mark.Shannon) * |
Date: 2022-01-27 23:40 |
Timings for individual instructions are a bit meaningless, as out-of-order execution and speculation on modern CPUs makes it hard to pin down the timing of anything.
I did an experiment to double the number of instructions. It slowed things down by ~10%, so increasing the number of instructions by 1% would be expected to result in a slowdown of 0.1%.
In other words, this is going to make little or no difference to performance. It does make things cleaner and simpler though, which has its own benefits.
|
msg411941 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-01-27 23:46 |
> I did an experiment to double the number of instructions.
Were the extra instructions just NOPs, or were they actually doing any work?
If they were NOPs, then presumably those numbers tell us more about the overhead of dispatch and cache pressure than anything else.
|
msg411952 - (view) |
Author: Barry A. Warsaw (barry) * |
Date: 2022-01-28 00:56 |
IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top. I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.
|
msg411986 - (view) |
Author: Ken Jin (kj) * |
Date: 2022-01-28 11:00 |
> IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top. I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.
AFAIK, on nix systems with PGO, the order doesn't matter. Brandt did some interesting research on this earlier https://github.com/faster-cpython/ideas/issues/96.
On Windows, it might be a different story. There are reports of up to 5% speedups in pyperformance just by moving LOAD_FAST and LOAD_GLOBAL out of the loop (to the top). See the last few messages in https://bugs.python.org/issue45116. I've heard from some people that MSVC's PGO doesn't deal with gigantic switch-case statements very well, though I can't confirm the veracity of that info.
|
msg412024 - (view) |
Author: Dennis Sweeney (Dennis Sweeney) * |
Date: 2022-01-28 18:47 |
Minor nit: I think swaptimize() should check the for PyMem_Malloc returning NULL here.
// Create an array with elements {0, 1, 2, ..., depth - 1}:
int *stack = PyMem_Malloc(depth * sizeof(int));
for (int i = 0; i < depth; i++) {
stack[i] = i;
}
|
msg412030 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-01-28 20:13 |
Hm, yeah. Bummer that this needs error handling now.
I'll have a fix up soon.
|
msg412314 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-02-01 21:41 |
New changeset a0e55a571cf01885fd5826266c37abaee307c309 by Brandt Bucher in branch 'main':
bpo-46528: Simplify BUILD_TUPLE/UNPACK_SEQUENCE folding (GH-31039)
https://github.com/python/cpython/commit/a0e55a571cf01885fd5826266c37abaee307c309
|
msg412939 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-02-09 19:31 |
New changeset 46328d8ae6529db916ccaabb9247fb0327ce0c1e by Brandt Bucher in branch 'main':
bpo-46528: Check PyMem_Malloc for NULL (GH-30998)
https://github.com/python/cpython/commit/46328d8ae6529db916ccaabb9247fb0327ce0c1e
|
msg412954 - (view) |
Author: Brandt Bucher (brandtbucher) * |
Date: 2022-02-09 23:15 |
New changeset 78ae4cc6dc949e8bc39fab25fea5efe983dc0ad1 by Brandt Bucher in branch 'main':
bpo-46528: Attempt SWAPs at compile-time (GH-30970)
https://github.com/python/cpython/commit/78ae4cc6dc949e8bc39fab25fea5efe983dc0ad1
|
|
Date |
User |
Action |
Args |
2022-04-11 14:59:55 | admin | set | github: 90686 |
2022-02-09 23:16:39 | brandtbucher | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
2022-02-09 23:15:52 | brandtbucher | set | messages:
+ msg412954 |
2022-02-09 19:31:15 | brandtbucher | set | messages:
+ msg412939 |
2022-02-01 21:41:42 | brandtbucher | set | messages:
+ msg412314 |
2022-01-31 23:32:55 | brandtbucher | set | pull_requests:
+ pull_request29222 |
2022-01-28 23:17:27 | brandtbucher | set | pull_requests:
+ pull_request29180 |
2022-01-28 20:13:57 | brandtbucher | set | messages:
+ msg412030 |
2022-01-28 18:47:41 | Dennis Sweeney | set | nosy:
+ Dennis Sweeney messages:
+ msg412024
|
2022-01-28 11:00:26 | kj | set | nosy:
+ kj messages:
+ msg411986
|
2022-01-28 00:56:39 | barry | set | messages:
+ msg411952 |
2022-01-27 23:46:08 | brandtbucher | set | messages:
+ msg411941 |
2022-01-27 23:40:21 | Mark.Shannon | set | messages:
+ msg411940 |
2022-01-27 23:28:04 | brandtbucher | set | messages:
+ msg411938 |
2022-01-27 23:20:12 | barry | set | messages:
+ msg411937 |
2022-01-27 22:27:25 | brandtbucher | set | pull_requests:
+ pull_request29148 |
2022-01-26 20:47:52 | brandtbucher | set | messages:
+ msg411791 |
2022-01-26 02:11:23 | brandtbucher | set | messages:
+ msg411700 |
2022-01-26 01:48:32 | brandtbucher | set | keywords:
+ patch stage: patch review pull_requests:
+ pull_request29081 |
2022-01-26 01:44:17 | barry | set | nosy:
+ barry messages:
+ msg411698
|
2022-01-26 01:39:59 | brandtbucher | create | |