classification
Title: Add combined opcodes
Type: performance Stage: patch review
Components: Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, Mark.Shannon, brandtbucher, eric.snow, gregory.p.smith, gvanrossum, pablogsal, rhettinger, serhiy.storchaka, terry.reedy, tim.peters
Priority: normal Keywords: patch

Created on 2021-03-31 22:40 by gvanrossum, last changed 2021-04-06 21:41 by gvanrossum.

Pull Requests
URL Status Linked Edit
PR 25090 open gvanrossum, 2021-03-31 22:40
Messages (11)
msg389939 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-31 22:40
I'm lining up some PRs (inspired by some of Mark Shannon's ideas) that add new opcodes which are straightforward combinations of existing opcodes. For example, ADD_INT is equivalent to LOAD_CONST + BINARY_ADD, for certain small (common) integer constants.

Each of these adds only a minor speedup, but after a dozen or so of these the speedup is (hopefully) significant enough to warrant the opcode churn.
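The pair being fused is easy to see with `dis`. A minimal sketch (exact opcode names vary across CPython versions, and `ADD_INT` itself exists only on the PR branch):

```python
import dis

def f(x):
    # On 3.10, "x + 1" compiles to LOAD_FAST, LOAD_CONST 1, BINARY_ADD;
    # the proposed ADD_INT would fuse the last two into one instruction.
    return x + 1

dis.dis(f)
```

On newer interpreters the addition shows up as `BINARY_OP` instead of `BINARY_ADD`, but the `LOAD_CONST` feeding it is the same.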
msg389967 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-04-01 09:53
We have usually rejected such proposals unless there was evidence of a significant effect on performance. I myself withdrew several of my patches after analyzing the statistics.

What percent of opcodes in compiled bytecode will be ADD_INT? What percent of executed opcodes will be ADD_INT? Compare it with other opcodes, in particular BINARY_ADD. What are the microbenchmark results?
msg390101 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-02 21:23
Microbench results for ADD_INT:
https://github.com/python/cpython/pull/25090#issuecomment-810441738

I have a few others like this lined up -- Mark recommended submitting them one at a time to make review easier.

My philosophy here (which I learned from Tim Peters in the early 2000s) is that even though each individual improvement has no measurable effect on a general benchmark (as shown in the same comment), the combined effect of a number of tiny improvements can be significant.

We have reached a point where there are very few individual changes possible that still have a significant effect, but we should not use this as an excuse to stop trying to improve performance.

I have mislaid the results that made me think this particular opcode is a good candidate; I will go back to reproduce that result and get back to you here.
msg390104 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-02 21:58
"""
My philosophy here (which I learned from Tim Peters in the early 2000s) is that even though each individual improvement has no measurable effect on a general benchmark (as shown in the same comment), the combined effect of a number of tiny improvements can be significant.
"""

And sometimes more so than the naïve sum of their individual contributions! Although this was often much clearer on older architectures and compilers (both more predictable).

Indeed, in the old days I routinely checked "optimizations" in that _slowed_ microbenchmarks, provided they reduced the work on the critical path, and didn't make the code markedly harder to follow (but reducing the critical path usually makes the code clearer!).

Because, sooner or later, compilers will get smart enough to see what I saw, and generate better code. And then these can compound. Like "oh! these three temp variables don't actually need to be materialized at all anymore, and suddenly I have few enough that do need to be materialized that _all_ of them can live in fast registers...". So, at some point, the next seemingly insignificant optimization checked in could _appear_ to have an "inexplicable" benefit, by breaking the bottleneck on _some_ inscrutable HW resource overtaxed on the critical path by the original code.

So optimize for what a smart compiler will eventually do, not what they happen to do today ;-) Profile-guided optimization was a major step in that direction.

Saddest to me: layers of indirection introduced to support marginal features, because "well, they don't slow down pybench by much at all". That's the opposite: over time, they can compound to do worse damage than the sum of their individual burdens.

In the end, Everything Counts™.
msg390105 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-02 22:18
Statically, in the stdlib, about 1 in a 1000 opcodes is an ADD_INT candidate.

I'll have to do some more work to come up with a dynamic count.
msg390109 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-02 22:43
> Statically, in the stdlib, about 1 in a 1000 opcodes 
> is an ADD_INT candidate.

I would expect that ADD_INT would typically occur in loops, so even if it is a rare opcode statically, it will really count when it is used.
msg390263 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-05 21:33
So dynamically, I see about 0.5% of all opcodes are ADD_INT, so higher than statically (which was 0.1%) but still small enough that I'm abandoning work on ADD_INT.

I've got a branch where I'm adding POP_JUMP_IF_NONE and POP_JUMP_IF_NOT_NONE (which capture if x is None, if x is not None) but the static count there is still pretty low, around 0.2% statically. These numbers are remarkably stable -- I get 0.2% exactly for mypy, 0.18% for the stdlib. (A bit more in the azure-cli source code: 0.35%.)

I expect that the dynamic count here would be a bit higher too: without running it, I'd expect around 1%, if we take it that the average loop does about 5 iterations -- my results for ADD_INT would lead to that conclusion, and this was independently verified by an Instagram blog post from 2017. (https://instagram-engineering.com/profiling-cpython-at-instagram-89d4cbeeb898, search for "loopyness")

Is that enough to persevere? I really don't know.

I have a few other potential opcode combinations, RETURN_CONST and RETURN_NONE, but since those (by definition) don't occur in a loop, I'm even less optimistic about the value of those.

We really should optimize things like LOAD_FAST + LOAD_FAST, which occurs much more frequently. However the way to encode that combination is pretty gross, and I'd rather do that another time.
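The `if x is None` pattern mentioned above can be inspected the same way. A hypothetical sketch (the fused opcodes discussed here are not in 3.10; the comment describes 3.10's output):

```python
import dis

def check(x):
    # On 3.10, "if x is None" compiles to LOAD_FAST, LOAD_CONST (None),
    # IS_OP, POP_JUMP_IF_FALSE -- the sequence that the proposed
    # POP_JUMP_IF_NONE / POP_JUMP_IF_NOT_NONE opcodes would collapse.
    if x is None:
        return "none"
    return "some"

dis.dis(check)
```

(On 3.11 and later, CPython did in fact grow POP_JUMP_IF_NONE-style opcodes, so the disassembly there already shows the fused form.)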
msg390292 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-04-06 08:01
Interesting. What code did you use to collect statistics? I used the patches in issue27255. Perhaps it is worth adding optionally compiled code for collecting dynamic opcode statistics permanently.

We have around 125 opcodes, so 0.5% looks too small. But what is the rank of that opcode? How many opcodes are more frequent? And what is the rank of BINARY_ADD, for comparison?

RETURN_CONST and RETURN_NONE were already discussed before. They can save a few bytes in bytecode, but they are never used in tight loops, and the time difference is dwarfed by the frame creation overhead in any case.

For frequently used pairs of opcodes we have the PREDICT() macro, which reduces dispatch overhead.
msg390350 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-06 16:00
> Interesting. What code did you use to collect statistics?

For static statistics I wrote my own script: https://github.com/python/cpython/pull/25090/files#diff-994d3592c951c78cbe71084d562590d1507ddfed767e2ec040f5e2610845a11c. I can add that to Tools permanently (I have a slightly improved version that can look for different things).

> I used the patches in issue27255. Perhaps it is worth adding optionally compiled code for collecting dynamic opcode statistics permanently.

We have that already: compile with -DDYNAMIC_EXECUTION_PROFILE -DDXPAIRS. There's a script (Tools/scripts/analyze_dxp.py) that works with such data, but it's primitive. Eric is working on improvements; I added my own hacky script to run an app and collect the data.

I collected a few different sets of statistics: static stats for the stdlib and for mypy, dynamic stats for running mypy and a few of the benchmarks in pyperformance. Eric and I have plans to do this more systematically; we'll then publish our tools and results.
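The static-count side of this can be approximated in a few lines. A hypothetical sketch (not the script linked above): compile a source string, walk the nested code objects, and tally opcode names with `dis`.

```python
import dis
import types
from collections import Counter

def walk_code(code):
    """Yield a code object and every code object nested in its constants
    (function bodies, comprehensions, class bodies, ...)."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const)

def static_opcode_counts(source, filename="<source>"):
    """Count how often each opcode appears in the compiled bytecode."""
    counts = Counter()
    for code in walk_code(compile(source, filename, "exec")):
        for instr in dis.get_instructions(code):
            counts[instr.opname] += 1
    return counts

stats = static_opcode_counts("def f(x):\n    return x + 1\n")
```

Counting candidate *pairs* for combined opcodes falls out of the same loop: tally consecutive `(previous, current)` opname pairs instead of single names.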
msg390373 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-04-06 20:43
> I collected a few different sets of statistics: static stats for the stdlib and for mypy, dynamic stats for running mypy and a few of the benchmarks in pyperformance.

I'm sure you've considered this, but I'd be *very* careful using opcode stats from the official benchmarks to inform work on these improvements. It's probably better to stick to in-sample data from the stdlib / mypy / etc. and use the benchmarks *only* for measuring out-of-sample results.

Otherwise we may just end up hacking the benchmark suite. :)
msg390378 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-06 21:41
Yeah, I've received many warnings about benchmark hacking already, and we
won't fall to that.

For static statistics, it's easy enough to either download a thousand of
the most popular projects from GitHub or PyPI and just try to compile all
the .py files found there.

For dynamic statistics, the problem is always to come up with a
representative run. I know how to do that for mypy (just running mypy over
itself should be pretty good), but we can't do that for random projects
from PyPI or GitHub -- not only do I not want to run 3rd party code I
haven't carefully reviewed first, but usually there's no way to
know how to even run it -- configuration, environment, input data all need
to be fixed, and dependencies need to be installed.

Running test suites (whether it's the stdlib tests or some other set of
unit tests for some other code base) is likely to give skewed results,
given how people often write tests (trying to hit each line of code at
least once but as few times as possible so the tests run quickly). This is
where benchmarks provide some limited value, since they come with a
standard way to run non-interactively.

I wish it was the case that static stats can be a good proxy for dynamic
stats, since then we could just compute static stats for a large amount of
3rd party code, but I'm not too sure of that. So I would like to have more
examples that we can measure both statically and dynamically -- if the
dynamic numbers look sufficiently similar to the static numbers, we can see
that as an indication that static numbers are a good indicator, and if they
look different, it means that we're going to have to collect more examples
in order to make progress.
History
Date User Action Args
2021-04-06 21:41:49  gvanrossum  set  messages: + msg390378
2021-04-06 20:43:39  brandtbucher  set  nosy: + brandtbucher; messages: + msg390373
2021-04-06 16:00:14  gvanrossum  set  messages: + msg390350
2021-04-06 08:01:22  serhiy.storchaka  set  messages: + msg390292
2021-04-06 07:55:19  gregory.p.smith  set  nosy: + gregory.p.smith
2021-04-05 21:33:05  gvanrossum  set  messages: + msg390263
2021-04-03 01:52:36  terry.reedy  set  nosy: + terry.reedy
2021-04-02 22:43:00  rhettinger  set  nosy: + rhettinger; messages: + msg390109
2021-04-02 22:18:21  gvanrossum  set  messages: + msg390105
2021-04-02 21:58:54  tim.peters  set  nosy: + tim.peters; messages: + msg390104
2021-04-02 21:23:12  gvanrossum  set  messages: + msg390101
2021-04-01 09:53:05  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg389967
2021-04-01 09:44:24  BTaskaya  set  nosy: + BTaskaya
2021-03-31 22:40:57  gvanrossum  set  keywords: + patch; stage: patch review; pull_requests: + pull_request23868
2021-03-31 22:40:32  gvanrossum  create