This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Simplify the VM's stack manipulations
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: brandtbucher Nosy List: Dennis Sweeney, Mark.Shannon, barry, brandtbucher, kj
Priority: normal Keywords: patch

Created on 2022-01-26 01:39 by brandtbucher, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 30902 merged brandtbucher, 2022-01-26 01:48
PR 30970 merged brandtbucher, 2022-01-27 22:27
PR 30998 merged brandtbucher, 2022-01-28 23:17
PR 31039 merged brandtbucher, 2022-01-31 23:32
Messages (15)
msg411697 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-26 01:39
...as discussed in https://github.com/faster-cpython/ideas/discussions/228.

We can dramatically simplify our stack manipulations by getting rid of the `DUP_TOP*` and `ROT_*` families of instructions:

- Replace `DUP_TOP` with `COPY(1)`.
- Replace `DUP_TOP_TWO` with `COPY(2), COPY(2)`.
- Introduce a new `SWAP` instruction.
  - Replace `ROT_TWO` with `SWAP(2)`.
  - Replace `ROT_THREE` with `SWAP(3), SWAP(2)`.
  - Remove `ROT_FOUR`.
  - Replace `ROT_N(n)` with `SWAP(n), SWAP(n - 1), ..., SWAP(2)`.

It then becomes much simpler for the peephole optimizer to reason about code sequences involving these instructions (for example, it's pretty straightforward to truly *optimize* an arbitrary sequence of swaps).
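
To make the mapping above concrete, here is a minimal pure-Python model of the new instructions' stack effects, treating the value stack as a list whose last element is the top of stack (TOS). This is only an illustration of the semantics, not CPython's C implementation:

    def copy(stack, n):
        """COPY(n): push another reference to the n-th item from the top."""
        stack.append(stack[-n])

    def swap(stack, n):
        """SWAP(n): exchange TOS with the n-th item from the top."""
        stack[-1], stack[-n] = stack[-n], stack[-1]

    stack = ["a", "b", "c"]
    copy(stack, 1)                              # old DUP_TOP     -> ['a', 'b', 'c', 'c']

    stack = ["a", "b", "c"]
    copy(stack, 2); copy(stack, 2)              # old DUP_TOP_TWO -> ['a', 'b', 'c', 'b', 'c']

    stack = ["a", "b", "c"]
    swap(stack, 3); swap(stack, 2)              # old ROT_THREE   -> ['c', 'a', 'b']

    stack = ["a", "b", "c", "d"]
    for i in range(4, 1, -1):                   # old ROT_N(4), i.e. ROT_FOUR
        swap(stack, i)                          # -> ['d', 'a', 'b', 'c']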
msg411698 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2022-01-26 01:44
What are the (unoptimized) performance implications of replacing one instruction with multiple instructions?
msg411700 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-26 02:11
In practice, pretty much the only thing that will do more work now is augmented subscription:

>>> a[b] += c

main:

  1           2 LOAD_NAME                0 (a)
              4 LOAD_NAME                1 (b)
              6 DUP_TOP_TWO
              8 BINARY_SUBSCR
             10 LOAD_NAME                2 (c)
             12 BINARY_OP               13 (+=)
             14 ROT_THREE
             16 STORE_SUBSCR

patched:

  1           2 LOAD_NAME                0 (a)
              4 LOAD_NAME                1 (b)
              6 COPY                     2
              8 COPY                     2
             10 BINARY_SUBSCR
             12 LOAD_NAME                2 (c)
             14 BINARY_OP               13 (+=)
             16 SWAP                     3
             18 SWAP                     2
             20 STORE_SUBSCR

Pattern matching is the only place where we use ROT_N, and frankly it's *misused* there... often, it makes the cost of storing names quadratic at runtime. Even though we emit *more* instructions now, the peephole optimizer actually cuts them down significantly, and the result is more efficient than before. For example:

>>> match x:
...     case (a, b, c, None):
...         pass

main:

  1           2 LOAD_NAME                0 (x)

  2           4 MATCH_SEQUENCE
              6 POP_JUMP_IF_FALSE       20 (to 40)
              8 GET_LEN
             10 LOAD_CONST               0 (4)
             12 COMPARE_OP               2 (==)
             14 POP_JUMP_IF_FALSE       20 (to 40)
             16 UNPACK_SEQUENCE          4
             18 ROT_FOUR
             20 ROT_FOUR
             22 ROT_FOUR
             24 POP_JUMP_IF_NOT_NONE    18 (to 36)
             26 STORE_NAME               1 (a)
             28 STORE_NAME               2 (b)
             30 STORE_NAME               3 (c)

  3          32 LOAD_CONST               1 (None)
             34 RETURN_VALUE

  2     >>   36 POP_TOP
             38 POP_TOP
        >>   40 POP_TOP
             42 LOAD_CONST               1 (None)
             44 RETURN_VALUE

patched:

  1           2 LOAD_NAME                0 (x)

  2           4 MATCH_SEQUENCE
              6 POP_JUMP_IF_FALSE       20 (to 40)
              8 GET_LEN
             10 LOAD_CONST               0 (4)
             12 COMPARE_OP               2 (==)
             14 POP_JUMP_IF_FALSE       20 (to 40)
             16 UNPACK_SEQUENCE          4
             18 SWAP                     2
             20 SWAP                     3
             22 SWAP                     4
             24 POP_JUMP_IF_NOT_NONE    18 (to 36)
             26 STORE_NAME               1 (a)
             28 STORE_NAME               2 (b)
             30 STORE_NAME               3 (c)

  3          32 LOAD_CONST               1 (None)
             34 RETURN_VALUE

  2     >>   36 POP_TOP
             38 POP_TOP
        >>   40 POP_TOP
             42 LOAD_CONST               1 (None)
             44 RETURN_VALUE

Replacing the ROT_FOURs with SWAPs may seem minor, but it ends up being a *lot* less work at runtime.
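
To put rough numbers on that, here is an illustrative pure-Python comparison of how many stack slots each version writes for the capture pattern above (a sketch only; the real costs are in the C implementations of ROT_FOUR and SWAP):

    def rot_n(stack, n):
        """Old ROT_N(n) / ROT_FOUR: move TOS down n-1 slots, shifting the rest up."""
        stack[-n:] = [stack[-1]] + stack[-n:-1]
        return n                                             # slots written

    def swap(stack, n):
        """New SWAP(n): exchange TOS with the n-th item from the top."""
        stack[-1], stack[-n] = stack[-n], stack[-1]
        return 2                                             # slots written

    old = [None, "c", "b", "a"]                              # bottom ... top, as left by UNPACK_SEQUENCE 4
    new = list(old)

    old_writes = sum(rot_n(old, 4) for _ in range(3))        # ROT_FOUR x3
    new_writes = swap(new, 2) + swap(new, 3) + swap(new, 4)  # SWAP 2, SWAP 3, SWAP 4

    assert old == new                                        # identical net stack effect
    print(old_writes, new_writes)                            # 12 vs. 6 here; the gap widens
                                                             # as patterns bind more names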
msg411791 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-26 20:47
New changeset 85483668647e7840c7b9a1877caaf2ef14a4443f by Brandt Bucher in branch 'main':
bpo-46528: Simplify the VM's stack manipulations (GH-30902)
https://github.com/python/cpython/commit/85483668647e7840c7b9a1877caaf2ef14a4443f
msg411937 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2022-01-27 23:20
Very interesting.  Do we have any current (or even cutting edge, i.e. 3.11) timings for individual instructions?  Or a breakdown of how frequently different instructions are invoked?  I remember Carl Shapiro presenting his findings here several years ago (I think in the penultimate in-person Language Summit).
msg411938 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-27 23:28
In a typical run of the pyperformance benchmark suite, rotations account for a bit over 1% of all instructions executed.

I don't have timings for individual instructions, unfortunately.
msg411940 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2022-01-27 23:40
Timings for individual instructions are a bit meaningless, as out-of-order execution and speculation on modern CPUs makes it hard to pin down the timing of anything.

I did an experiment to double the number of instructions. It slowed things down by ~10%, so increasing the number of instructions by 1% would be expected to result in a slowdown of 0.1%.

In other words, this is going to make little or no difference to performance. It does make things cleaner and simpler though, which has its own benefits.
msg411941 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-27 23:46
> I did an experiment to double the number of instructions.

Were the extra instructions just NOPs, or were they actually doing any work?

If they were NOPs, then presumably those numbers tell us more about the overhead of dispatch and cache pressure than anything else.
msg411952 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2022-01-28 00:56
IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top.  I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.
msg411986 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2022-01-28 11:00
> IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top.  I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.

AFAIK, on *nix systems with PGO, the order doesn't matter. Brandt did some interesting research on this earlier https://github.com/faster-cpython/ideas/issues/96.

On Windows, it might be a different story. There are reports of up to 5% speedups in pyperformance just by moving LOAD_FAST and LOAD_GLOBAL out of the loop (to the top).  See the last few messages in https://bugs.python.org/issue45116. I've heard from some people that MSVC's PGO doesn't deal with gigantic switch-case statements very well, though I can't confirm the veracity of that info.
msg412024 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-01-28 18:47
Minor nit: I think swaptimize() should check for PyMem_Malloc() returning NULL here.

    // Create an array with elements {0, 1, 2, ..., depth - 1}:
    int *stack = PyMem_Malloc(depth * sizeof(int));
    for (int i = 0; i < depth; i++) {
        stack[i] = i;
    }
msg412030 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-01-28 20:13
Hm, yeah. Bummer that this needs error handling now.

I'll have a fix up soon.
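
For anyone who hasn't read the new pass: the `stack` array in the snippet above appears to be the identity permutation that swaptimize() permutes as it scans a run of SWAPs. The underlying idea (compose the whole run into one permutation, then re-emit it cycle by cycle) can be sketched in pure Python; this is a conceptual model only, not the C code:

    def compose(swaps, depth):
        """perm[i] = slot whose original value ends up in slot i (slot 0 is TOS)."""
        perm = list(range(depth))
        for n in swaps:                    # SWAP(n) exchanges slots 0 and n-1
            perm[0], perm[n - 1] = perm[n - 1], perm[0]
        return perm

    def reemit(perm):
        """Rebuild the permutation as SWAP opargs, one cycle at a time."""
        out, seen = [], set()
        for start in range(len(perm)):
            if start in seen or perm[start] == start:
                continue
            cycle, slot = [], start
            while slot not in seen:        # walk the cycle: start, perm[start], ...
                seen.add(slot)
                cycle.append(slot)
                slot = perm[slot]
            if start == 0:
                # A cycle through TOS needs one SWAP per remaining member.
                out += [s + 1 for s in reversed(cycle[1:])]
            else:
                # Other cycles route their values through TOS, then restore it.
                out += [s + 1 for s in reversed(cycle)] + [cycle[-1] + 1]
        return out

    run = [2, 4, 2, 3, 4]                  # an arbitrary run of SWAP opargs
    perm = compose(run, 4)
    optimized = reemit(perm)
    assert compose(optimized, 4) == perm   # same net stack effect
    print(run, "->", optimized)            # [2, 4, 2, 3, 4] -> [3, 4, 2]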
msg412314 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-02-01 21:41
New changeset a0e55a571cf01885fd5826266c37abaee307c309 by Brandt Bucher in branch 'main':
bpo-46528: Simplify BUILD_TUPLE/UNPACK_SEQUENCE folding (GH-31039)
https://github.com/python/cpython/commit/a0e55a571cf01885fd5826266c37abaee307c309
msg412939 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-02-09 19:31
New changeset 46328d8ae6529db916ccaabb9247fb0327ce0c1e by Brandt Bucher in branch 'main':
bpo-46528: Check PyMem_Malloc for NULL (GH-30998)
https://github.com/python/cpython/commit/46328d8ae6529db916ccaabb9247fb0327ce0c1e
msg412954 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-02-09 23:15
New changeset 78ae4cc6dc949e8bc39fab25fea5efe983dc0ad1 by Brandt Bucher in branch 'main':
bpo-46528: Attempt SWAPs at compile-time (GH-30970)
https://github.com/python/cpython/commit/78ae4cc6dc949e8bc39fab25fea5efe983dc0ad1
History
Date User Action Args
2022-04-11 14:59:55  admin           set  github: 90686
2022-02-09 23:16:39  brandtbucher    set  status: open -> closed
                                          resolution: fixed
                                          stage: patch review -> resolved
2022-02-09 23:15:52  brandtbucher    set  messages: + msg412954
2022-02-09 19:31:15  brandtbucher    set  messages: + msg412939
2022-02-01 21:41:42  brandtbucher    set  messages: + msg412314
2022-01-31 23:32:55  brandtbucher    set  pull_requests: + pull_request29222
2022-01-28 23:17:27  brandtbucher    set  pull_requests: + pull_request29180
2022-01-28 20:13:57  brandtbucher    set  messages: + msg412030
2022-01-28 18:47:41  Dennis Sweeney  set  nosy: + Dennis Sweeney
                                          messages: + msg412024
2022-01-28 11:00:26  kj              set  nosy: + kj
                                          messages: + msg411986
2022-01-28 00:56:39  barry           set  messages: + msg411952
2022-01-27 23:46:08  brandtbucher    set  messages: + msg411941
2022-01-27 23:40:21  Mark.Shannon    set  messages: + msg411940
2022-01-27 23:28:04  brandtbucher    set  messages: + msg411938
2022-01-27 23:20:12  barry           set  messages: + msg411937
2022-01-27 22:27:25  brandtbucher    set  pull_requests: + pull_request29148
2022-01-26 20:47:52  brandtbucher    set  messages: + msg411791
2022-01-26 02:11:23  brandtbucher    set  messages: + msg411700
2022-01-26 01:48:32  brandtbucher    set  keywords: + patch
                                          stage: patch review
                                          pull_requests: + pull_request29081
2022-01-26 01:44:17  barry           set  nosy: + barry
                                          messages: + msg411698
2022-01-26 01:39:59  brandtbucher    create