Title: Reduce overhead for cache hits in specialized opcodes.
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.11
Status: closed Resolution: fixed
Assigned To: Mark.Shannon Nosy List: Mark.Shannon, kj, lukasz.langa
Created on 2021-10-19 17:24 by Mark.Shannon, last changed 2021-10-20 18:54 by lukasz.langa.

PR 29092 merged Mark.Shannon, 2021-10-20 13:39
Author: Mark Shannon (Mark.Shannon) Date: 2021-10-19 17:24
Every time we get a cache hit in, e.g., LOAD_ATTR_CACHED, we increment the saturating counter. This takes a dependent load and a store, as well as the shift. For fast instructions like BINARY_ADD_FLOAT, this represents a significant portion of the work done in the instruction.

If we don't bother to record the hit, we reduce the overhead of fast, specialized instructions.

The cost is that we may have to re-optimize more often.
For those instructions with high hit-to-miss ratios, which is most of them, this should be barely measurable.
The cost for type-unstable and un-optimizable instructions shouldn't change much.

Initial experiments show ~1% speedup.
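
To make the trade-off concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not the CPython implementation: `COUNTER_MAX`, `CacheEntry`, and `simulate` are hypothetical names, and the counter arithmetic is simplified (no shift, a plain reset on de-optimization). It compares a policy that updates the counter on every hit and miss with one that only updates it on misses.

```python
from dataclasses import dataclass

COUNTER_MAX = 4  # hypothetical saturation limit / miss budget


@dataclass
class CacheEntry:
    counter: int = COUNTER_MAX
    counter_updates: int = 0  # per-instruction load+store pairs on the counter
    deopts: int = 0           # times the instruction had to be re-optimized


def simulate(outcomes, count_hits):
    """Run a sequence of cache outcomes (True = hit, False = miss).

    count_hits=True  : hits increment the saturating counter, misses decrement it.
    count_hits=False : hits do not touch the counter, only misses decrement it.
    """
    entry = CacheEntry()
    for hit in outcomes:
        if hit:
            if count_hits:
                # Saturating increment: the store happens even when already at max.
                entry.counter = min(entry.counter + 1, COUNTER_MAX)
                entry.counter_updates += 1
        else:
            entry.counter -= 1
            entry.counter_updates += 1
            if entry.counter == 0:
                # Too many misses: de-optimize and start over with a fresh budget.
                entry.deopts += 1
                entry.counter = COUNTER_MAX
    return entry


# Mostly-hit workload: 999 hits and 2 misses.
stable = [True] * 500 + [False] + [True] * 499 + [False]
# Type-unstable workload: hits and misses alternate.
unstable = [True, False] * 8

for name, workload in (("stable", stable), ("unstable", unstable)):
    for count_hits in (True, False):
        e = simulate(workload, count_hits)
        print(name, "count_hits" if count_hits else "misses_only",
              "updates:", e.counter_updates, "deopts:", e.deopts)
```

In this model the mostly-hit workload drops from one counter update per instruction to one per miss, with no change in de-optimizations. The alternating workload de-optimizes under the misses-only policy but not under the hit-counting one, which is the "re-optimize more often" cost in miniature.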
Author: Ken Jin (kj) Date: 2021-10-20 10:43
Strong +1 from me.

Not to mention that some instructions don't even need to read the _PyAdaptiveEntry apart from recording cache hits, so for those this saves one more dependent load and store too.

Extremely cheap instructions off the top of my head:
Author: Łukasz Langa (lukasz.langa) Date: 2021-10-20 18:53
New changeset bc85eb7a4f16e9e2b6fb713be2466ebb132fd7f2 by Mark Shannon in branch 'main':
bpo-45527: Don't count cache hits, just misses. (GH-29092)
Author: Łukasz Langa (lukasz.langa) Date: 2021-10-20 18:54
Thanks, Mark! ✨ 🍰 ✨
