classification
Title: Add combined opcodes
Type: performance Stage: patch review
Components: Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, Mark.Shannon, brandtbucher, eric.snow, gregory.p.smith, gvanrossum, pablogsal, rhettinger, serhiy.storchaka, terry.reedy, tim.peters
Priority: normal Keywords: patch

Created on 2021-03-31 22:40 by gvanrossum, last changed 2021-04-06 21:41 by gvanrossum.

Pull Requests
URL Status Linked Edit
PR 25090 open gvanrossum, 2021-03-31 22:40
Messages (11)
msg389939 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-31 22:40
I'm lining up some PRs (inspired by some of Mark Shannon's ideas) that add new opcodes which are straightforward combinations of existing opcodes. For example, ADD_INT is equivalent to LOAD_CONST + BINARY_ADD, for certain small (common) integer constants.

Each of these adds only a minor speedup, but after a dozen or so of these the speedup is (hopefully) significant enough to warrant the opcode churn.
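The pair being fused is easy to see with `dis`. A minimal sketch (exact opcode names vary across CPython versions, and `ADD_INT` itself exists only on the PR branch):

```python
import dis

def f(x):
    # On 3.10, "x + 1" compiles to LOAD_FAST, LOAD_CONST 1, BINARY_ADD;
    # the proposed ADD_INT would fuse the last two into one instruction.
    return x + 1

dis.dis(f)
```

On newer interpreters the addition shows up as `BINARY_OP` instead of `BINARY_ADD`, but the `LOAD_CONST` feeding it is the same.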
msg389967 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-04-01 09:53
We have usually rejected such proposals unless there was evidence of a significant effect on performance. I myself withdrew several of my patches after analyzing the statistics.

What percent of opcodes in compiled bytecode will be ADD_INT? What percent of executed opcodes will be ADD_INT? Compare it with other opcodes, in particular BINARY_ADD. What are the microbenchmark results?
msg390101 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-02 21:23
Microbench results for ADD_INT:
https://github.com/python/cpython/pull/25090#issuecomment-810441738

I have a few others like this lined up -- Mark recommended submitting them one at a time to make review easier.

My philosophy here (which I learned from Tim Peters in the early 2000s) is that even though each individual improvement has no measurable effect on a general benchmark (as shown in the same comment), the combined effect of a number of tiny improvements can be significant.

We have reached a point where there are very few individual changes possible that still have a significant effect, but we should not use this as an excuse to stop trying to improve performance.

I have mislaid the results that made me think this particular opcode is a good candidate; I will go back to reproduce that result and get back to you here.
msg390104 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-02 21:58
"""
My philosophy here (which I learned from Tim Peters in the early 2000s) is that even though each individual improvement has no measurable effect on a general benchmark (as shown in the same comment), the combined effect of a number of tiny improvements can be significant.
"""

And sometimes more so than the naïve sum of their individual contributions! Although this was often much clearer on older architectures and compilers (both more predictable).

Indeed, in the old days I routinely checked "optimizations" in that _slowed_ microbenchmarks, provided they reduced the work on the critical path, and didn't make the code markedly harder to follow (but reducing the critical path usually makes the code clearer!).

Because, sooner or later, compilers will get smart enough to see what I saw, and generate better code. And then these can compound. Like "oh! these three temp variables don't actually need to be materialized at all anymore, and suddenly I have few enough that do need to be materialized that _all_ of them can live in fast registers...". So, at some point, the next seemingly insignificant optimization checked in could _appear_ to have an "inexplicable" benefit, by breaking the bottleneck on _some_ inscrutable HW resource overtaxed on the critical path by the original code.

So optimize for what a smart compiler will eventually do, not what they happen to do today ;-) Profile-guided optimization was a major step in that direction.

Saddest to me: layers of indirection introduced to support marginal features, because "well, they don't slow down pybench by much at all". That's the opposite: over time, they can compound to do worse damage than the sum of their individual burdens.

In the end, Everything Counts™.
msg390105 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-02 22:18
Statically, in the stdlib, about 1 in a 1000 opcodes is an ADD_INT candidate.

I'll have to do some more work to come up with a dynamic count.
msg390109 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-02 22:43
> Statically, in the stdlib, about 1 in a 1000 opcodes 
> is an ADD_INT candidate.

I would expect that ADD_INT would typically occur in loops, so even if it is a rare opcode statically, it will really count when it is used.
msg390263 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-05 21:33
So dynamically, I see about 0.5% of all opcodes are ADD_INT, so higher than statically (which was 0.1%) but still small enough that I'm abandoning work on ADD_INT.

I've got a branch where I'm adding POP_JUMP_IF_NONE and POP_JUMP_IF_NOT_NONE (which capture if x is None, if x is not None) but the static count there is still pretty low, around 0.2% statically. These numbers are remarkably stable -- I get 0.2% exactly for mypy, 0.18% for the stdlib. (A bit more in the azure-cli source code: 0.35%.)

I expect that the dynamic count here would be a bit higher too: without running it, I'd expect around 1%, if we take it that the average loop does about 5 iterations -- my results for ADD_INT would lead to that conclusion, and this was independently verified by an Instagram blog post from 2017. (https://instagram-engineering.com/profiling-cpython-at-instagram-89d4cbeeb898, search for "loopyness")

Is that enough to persevere? I really don't know.

I have a few other potential opcode combinations, RETURN_CONST and RETURN_NONE, but since those (by definition) don't occur in a loop, I'm even less optimistic about the value of those.

We really should optimize things like LOAD_FAST + LOAD_FAST, which occurs much more frequently. However the way to encode that combination is pretty gross, and I'd rather do that another time.
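The `if x is None` pattern mentioned above can be inspected the same way. A hypothetical sketch (the fused opcodes discussed here are not in 3.10; the comment describes 3.10's output):

```python
import dis

def check(x):
    # On 3.10, "if x is None" compiles to LOAD_FAST, LOAD_CONST (None),
    # IS_OP, POP_JUMP_IF_FALSE -- the sequence that the proposed
    # POP_JUMP_IF_NONE / POP_JUMP_IF_NOT_NONE opcodes would collapse.
    if x is None:
        return "none"
    return "some"

dis.dis(check)
```

(On 3.11 and later, CPython did in fact grow POP_JUMP_IF_NONE-style opcodes, so the disassembly there already shows the fused form.)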
msg390292 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-04-06 08:01
Interesting. What code did you use to collect statistics? I used the patches in issue27255. Perhaps it is worth adding optionally compiled code for collecting dynamic opcode statistics permanently.

We have around 125 opcodes, so 0.5% looks too small. But what is the rank of that opcode? How many opcodes are more frequent? And what is the rank of BINARY_ADD, for comparison?

RETURN_CONST and RETURN_NONE were already discussed before. They can save a few bytes in bytecode, but they are never used in tight loops, and the time difference is dwarfed by the frame creation overhead in any case.

For frequently used pairs of opcodes we have the PREDICT() macro, which reduces dispatch overhead.
msg390350 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-06 16:00
> Interesting. What code did you use to collect statistics?

For static statistics I wrote my own script: https://github.com/python/cpython/pull/25090/files#diff-994d3592c951c78cbe71084d562590d1507ddfed767e2ec040f5e2610845a11c. I can add that to Tools permanently (I have a slightly improved version that can look for different things).

> I used the patches in issue27255. Perhaps it is worth adding optionally compiled code for collecting dynamic opcode statistics permanently.

We have that already: compile with -DDYNAMIC_EXECUTION_PROFILE -DDXPAIRS. There's a script (Tools/scripts/analyze_dxp.py) that works with such data, but it's primitive. Eric is working on improvements; I added my own hacky script to run an app and collect the data.

I collected a few different sets of statistics: static stats for the stdlib and for mypy, dynamic stats for running mypy and a few of the benchmarks in pyperformance. Eric and I have plans to do this more systematically; we'll then publish our tools and results.
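The static-count side of this can be approximated in a few lines. A hypothetical sketch (not the script linked above): compile a source string, walk the nested code objects, and tally opcode names with `dis`.

```python
import dis
import types
from collections import Counter

def walk_code(code):
    """Yield a code object and every code object nested in its constants
    (function bodies, comprehensions, class bodies, ...)."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const)

def static_opcode_counts(source, filename="<source>"):
    """Count how often each opcode appears in the compiled bytecode."""
    counts = Counter()
    for code in walk_code(compile(source, filename, "exec")):
        for instr in dis.get_instructions(code):
            counts[instr.opname] += 1
    return counts

stats = static_opcode_counts("def f(x):\n    return x + 1\n")
```

Counting candidate *pairs* for combined opcodes falls out of the same loop: tally consecutive `(previous, current)` opname pairs instead of single names.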
msg390373 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-04-06 20:43
> I collected a few different sets of statistics: static stats for the stdlib and for mypy, dynamic stats for running mypy and a few of the benchmarks in pyperformance.

I'm sure you've considered this, but I'd be *very* careful using opcode stats from the official benchmarks to inform work on these improvements. It's probably better to stick to in-sample data from the stdlib / mypy / etc. and use the benchmarks *only* for measuring out-of-sample results.

Otherwise we may just end up hacking the benchmark suite. :)
msg390378 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-04-06 21:41
Yeah, I've received many warnings about benchmark hacking already, and we
won't fall to that.

For static statistics, it's easy enough to either download a thousand of
the most popular projects from GitHub or PyPI and just try to compile all
the .py files found there.

For dynamic statistics, the problem is always to come up with a
representative run. I know how to do that for mypy (just running mypy over
itself should be pretty good), but we can't do that for random projects
from PyPI or GitHub -- not only do I not want to run 3rd party code I
haven't carefully reviewed first, but usually there's no way to
know how to even run it -- configuration, environment, input data all need
to be fixed, and dependencies need to be installed.

Running test suites (whether it's the stdlib tests or some other set of
unit tests for some other code base) is likely to give skewed results,
given how people often write tests (trying to hit each line of code at
least once but as few times as possible so the tests run quickly). This is
where benchmarks provide some limited value, since they come with a
standard way to run non-interactively.

I wish it was the case that static stats can be a good proxy for dynamic
stats, since then we could just compute static stats for a large amount of
3rd party code, but I'm not too sure of that. So I would like to have more
examples that we can measure both statically and dynamically -- if the
dynamic numbers look sufficiently similar to the static numbers, we can see
that as an indication that static numbers are a good indicator, and if they
look different, it means that we're going to have to collect more examples
in order to make progress.
History
Date User Action Args
2021-04-06 21:41:49  gvanrossum  set  messages: + msg390378
2021-04-06 20:43:39  brandtbucher  set  nosy: + brandtbucher; messages: + msg390373
2021-04-06 16:00:14  gvanrossum  set  messages: + msg390350
2021-04-06 08:01:22  serhiy.storchaka  set  messages: + msg390292
2021-04-06 07:55:19  gregory.p.smith  set  nosy: + gregory.p.smith
2021-04-05 21:33:05  gvanrossum  set  messages: + msg390263
2021-04-03 01:52:36  terry.reedy  set  nosy: + terry.reedy
2021-04-02 22:43:00  rhettinger  set  nosy: + rhettinger; messages: + msg390109
2021-04-02 22:18:21  gvanrossum  set  messages: + msg390105
2021-04-02 21:58:54  tim.peters  set  nosy: + tim.peters; messages: + msg390104
2021-04-02 21:23:12  gvanrossum  set  messages: + msg390101
2021-04-01 09:53:05  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg389967
2021-04-01 09:44:24  BTaskaya  set  nosy: + BTaskaya
2021-03-31 22:40:57  gvanrossum  set  keywords: + patch; stage: patch review; pull_requests: + pull_request23868
2021-03-31 22:40:32  gvanrossum  create