classification
Title: Caching infrastructure for the evaluation loop: specialised opcodes
Type: Stage: resolved
Components: C API Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, Guido.van.Rossum, Johan Dahlin, Mark.Shannon, barry, bismatrimony, corona10, gvanrossum, kj, methane, nascheme, pablogsal, yselivanov
Priority: normal Keywords: patch

Created on 2020-10-22 03:21 by pablogsal, last changed 2021-10-20 16:28 by kj. This issue is now closed.

Files
File name Uploaded Description Edit
results.zip methane, 2020-12-16 05:18
result.zip pablogsal, 2020-12-16 12:31
patch pablogsal, 2021-01-30 02:41
Pull Requests
URL Status Linked Edit
PR 22907 closed pablogsal, 2020-10-23 01:54
PR 23503 closed pablogsal, 2020-11-24 22:09
Messages (35)
msg379272 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-22 03:21
After https://bugs.python.org/issue42093 and https://bugs.python.org/issue26219 is being clear that we can leverage some cache for different information in the evaluation loop to speed up CPython. This observation is also based on the fact that although Python is dynamic, there is plenty of code that does not exercise said dynamism and therefore factoring out the "dynamic" parts of the execution by using a cache mechanism can yield excellent results. 

So far we have two big improvements in performance for caching LOAD_ATTR and LOAD_GLOBAL (in some cases up to 10-14%) but I think we can do much much more. Here are some observations of what I think we can do:

* Instead of adding more caches using the current mechanism, which adds some inlined code in every opcode in the evaluation loop, we can try to formalize some kind of caching mechanism that has some better API that will allow adding more opcodes in the future. Having the code inline in ceval.c is going to become difficult to maintain if we keep adding more stuff directly there.

* Instead of handling the specialization in the same opcode as the original one (LOAD_ATTR is doing the slow and the fast path) we could mutate the original code object and replacing the slow and generic opcodes for the more specialized ones and these will also be in charge of changing it back to the generic and slow ones if the assumptions that activated them appear.

Obviously, mutating code objects is scary, so we could have some "specialized" version of the bytecode in the cache and use that if is present. Ideas that we could do with this cached stuff:

- For binary operators, we can grab both operands, resolve the addition function and cache that together with the types and the version tags and if the types and the version tags are the same, use directly the addition function instead of resolving it.

- For loading methods, we could cache the bound method as proposed by Yury originally here: https://mail.python.org/pipermail/python-dev/2016-January/142945.html.

- We could also do the same for operations like "some_container[]" if the container is some builtin. We can substitute/specialize the opcode for someone that directly uses built-in operations instead of the generic BINARY_SUBSCR.

The plan will be:

- Making some infrastructure/framework for the caching that allows us to optimize/deoptimize individual opcodes.
- Refactor the existing specialization for LOAD_GLOBAL/LOAD_ATTR to use said infrastructure.
- Thinking of what operations could benefit from specialization and start adding them one by one.
msg379273 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-22 03:24
To clarify what I mean with:

> - We could also do the same for operations like "some_container[]" if the container is some builtin. We can substitute/specialize the opcode for someone that directly uses built-in operations instead of the generic BINARY_SUBSCR.

If a given function has a BINARY_SUBSCR opcode and when executing it a given number of times we see that the object to get the subscript is always a list, we change the BINARY_SUBSCR to BINARY_SUBSCR_FOR_LISTS and it that opcode we do a quick check for PyList_CheckExact and if is correct we call "Objects/listobject.c:list_subscript" directly and if is false we put back a BINARY_SUBSCR.
msg379274 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-22 03:27
Also, many of these ideas are not new, and many of them are inspired or taken from Yury's email (https://mail.python.org/pipermail/python-dev/2016-January/142945.html) but I wanted to add that I think that with some coordination between us we can achieve some excellent speedups for Python 3.10!
msg379276 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-10-22 04:38
Few thoughts in no particular order:

- I'd suggest implementing the cache for 2-3 more opcodes on top of the existing infrastructure to get more experience and then refactoring it to make it more generic.

- Generalizing LOAD_METHOD to work for methods with **kwargs, caching concrete operator implementations for opcodes like BINARY_ADD etc. are all possible on top of the current infra.

- Rewriting code objects in place is wrong, IMO: you always need to have a way to deoptimize the entire thing, so you need to keep the original one. It might be that you have well defined and static types for the first 10000 invocations and something entirely different on 10001. So IMO we need a SpecializedCode object with the necessary bailout guards. But that's not a simple thing to implement, so unless someone will be working on this fulltime for a long time I'd suggest working off what we have now. (That said I'd take a close look at what Dino is building).

- There are multiple different approaches we can choose for optimizing CPython, ranging from hidden classes to a full blown JIT. I hope someone will do them one day. But IMO the current simple "opcode cache" (I wish we had a better name) mechanism we have would allow us to squeeze up to 15-25% median improvement in our benchmarks with relatively limited dev time. Maybe that's good enough for 3.10.
msg379277 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-22 04:45
> - Rewriting code objects in place is wrong, IMO: you always need to have a way to deoptimize the entire thing, so you need to keep the original one. It might be that you have well defined and static types for the first 10000 invocations and something entirely different on 10001. So IMO we need a SpecializedCode object with the necessary bailout guards.

Imagine that we have a secondary copy of the bytecode in the cache inside the code object and we mutate that instead. The key difference with the current cache infrastructure is that we don't accumulate all the optimizations on the same opcode, which can be very verbose. Instead, we change the generic opcode to a more specialised to optimize and we change it back to deoptimize. The advantage is that BINARY_SUBSCRIPT for example won't be this gigantic block of text that will do different things depending if is specialising for dicts or lists or tuples, but we will have a different opcode for every of them, which I think is much easier to manage.
msg379278 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-10-22 05:29
> Imagine that we have a secondary copy of the bytecode in the cache inside the code object and we mutate that instead. The key difference with the current cache infrastructure is that we don't accumulate all the optimizations on the same opcode, which can be very verbose. Instead, we change the generic opcode to a more specialised to optimize and we change it back to deoptimize.

Yeah, I follow. As long as we keep the original list of opcodes we're good ;)
msg379282 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-22 09:37
FWIW, php7 is about 5x faster than Python on spectral norm benchmark.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/php-python3.html

There two major reasons:

* PHP uses scalar type for float and int
* PHP uses type-specialized bytecode (PHP8 will use JIT, but PHP7 dosn't)

Source code is here:
php: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/spectralnorm-php-1.html
Python: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/spectralnorm-python3-8.html

The most hot function is eval_A()

```
def eval_A(i, j): # i and j are int.
    ij = i + j
    return ij * (ij + 1) // 2 + i + 1
```

And its bytecode:

```
Disassembly of <code object eval_A at 0x107fd8500, file "x.py", line 1>:
  2           0 LOAD_FAST                0 (i)
              2 LOAD_FAST                1 (j)
              4 BINARY_ADD
              6 STORE_FAST               2 (ij)

  3           8 LOAD_FAST                2 (ij)
             10 LOAD_FAST                2 (ij)
             12 LOAD_CONST               1 (1)
             14 BINARY_ADD
             16 BINARY_MULTIPLY
             18 LOAD_CONST               2 (2)
             20 BINARY_FLOOR_DIVIDE
             22 LOAD_FAST                0 (i)
             24 BINARY_ADD
             26 LOAD_CONST               1 (1)
             28 BINARY_ADD
             30 RETURN_VALUE
```

My thoughts:

* bytecode specialized for `int op int` will some help.
* there are many incref/decref overhead.
  * multi operand bytecode (e.g. BINARY_ADD_FAST_FAST, BINARY_ADD_FAST_CONST, etc) will reduce refcount overhead.
msg379284 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-22 09:42
One more idea: BINARY_ADD_INT. Operand is int immediate.
This idea can be implemented without opcode cache. I will try it.
msg379333 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-10-22 20:03
> This idea can be implemented without opcode cache. I will try it.

I'd actually encourage trying to use the opcode cache because this way the optimization will be more generic. E.g. `decimal + decimal` would also be specialized via the cache, because you'd cache a pointer to the specific `+` operator implementation. I'm really not sure that adding specialized byte code is a good idea.

> * PHP uses scalar type for float and int

While at it, maybe we push the number of bits per int digit to 60?
msg380142 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-11-01 14:34
> because you'd cache a pointer to the specific `+` operator implementation

You should have a look at "INCA: Inline Caching meets Quickening in Python 3.3":
https://bugs.python.org/issue14757

Stefan Brunthaler wrote a paper on his work:
"Inline Caching Meets Quickening" (Published in ECOOP 2010)
https://publications.sba-research.org/publications/ecoop10.pdf
msg381777 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-11-24 22:12
I may not know the what is the average airspeed velocity of a laden swallow, but I know the average speed of adding a LOAD_METHOD opcode cache as in PR 23503 (measured with PGO + LTO + CPU isolation):

+-------------------------+--------------------------------------+-------------------------------------------+
| Benchmark               | 2020-11-24_18-08-master-0ec34cab9dd4 | 2020-11-24_19-41-load_method-0c43ca99644b |
+=========================+======================================+===========================================+
| pidigits                | 266 ms                               | 233 ms: 1.14x faster (-12%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| regex_v8                | 32.7 ms                              | 30.4 ms: 1.07x faster (-7%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| unpickle                | 17.7 us                              | 17.1 us: 1.04x faster (-4%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| scimark_sparse_mat_mult | 6.21 ms                              | 5.99 ms: 1.04x faster (-3%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| go                      | 283 ms                               | 274 ms: 1.03x faster (-3%)                |
+-------------------------+--------------------------------------+-------------------------------------------+
| mako                    | 20.4 ms                              | 19.9 ms: 1.03x faster (-3%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| pickle_pure_python      | 546 us                               | 531 us: 1.03x faster (-3%)                |
+-------------------------+--------------------------------------+-------------------------------------------+
| logging_simple          | 9.58 us                              | 9.34 us: 1.03x faster (-3%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| telco                   | 8.21 ms                              | 8.03 ms: 1.02x faster (-2%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| regex_effbot            | 3.97 ms                              | 3.88 ms: 1.02x faster (-2%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| regex_dna               | 267 ms                               | 261 ms: 1.02x faster (-2%)                |
+-------------------------+--------------------------------------+-------------------------------------------+
| pathlib                 | 21.2 ms                              | 21.7 ms: 1.02x slower (+2%)               |
+-------------------------+--------------------------------------+-------------------------------------------+
| xml_etree_iterparse     | 129 ms                               | 132 ms: 1.02x slower (+2%)                |
+-------------------------+--------------------------------------+-------------------------------------------+
| xml_etree_parse         | 186 ms                               | 193 ms: 1.03x slower (+3%)                |
+-------------------------+--------------------------------------+-------------------------------------------+

Not significant (46): scimark_fft; nbody; pickle; pickle_list; pickle_dict; logging_format; scimark_lu; deltablue; chaos; meteor_contest; sqlalchemy_imperative; scimark_sor; raytrace; scimark_monte_carlo; sympy_integrate; unpickle_pure_python; dulwich_log; logging_silent; nqueens; json_loads; crypto_pyaes; hexiom; django_template; sympy_str; regex_compile; sympy_expand; xml_etree_generate; unpack_sequence; spectral_norm; xml_etree_process; sqlite_synth; genshi_xml; pyflate; json_dumps; python_startup_no_site; 2to3; python_startup; genshi_text; unpickle_list; fannkuch; sympy_sum; sqlalchemy_declarative; chameleon; float; richards; tornado_http;
msg383034 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-12-15 04:21
pidigits and regex_v8 are LOAD_ATTR heavy benchmarks?
I will run benchmarks in my machine to confirm the results.
msg383051 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-15 13:08
> pidigits and regex_v8 are LOAD_ATTR heavy benchmarks?

The PR is for LOAD_METHOD infrastructure, not for LOAD_ATTR (There was an incorrect title in the PR that I corrected but the contents are correct).

> I will run benchmarks in my machine to confirm the results.

Thanks a lot, Inada-san
msg383116 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-12-16 05:18
$ ./python -m pyperf compare_to master.json load_method.json -G --min-speed=1
Slower (15):
- unpack_sequence: 63.2 ns +- 0.8 ns -> 68.5 ns +- 14.8 ns: 1.08x slower (+8%)
- pathlib: 23.1 ms +- 0.3 ms -> 24.4 ms +- 0.4 ms: 1.05x slower (+5%)
- scimark_fft: 418 ms +- 2 ms -> 433 ms +- 3 ms: 1.04x slower (+4%)
- scimark_monte_carlo: 113 ms +- 2 ms -> 116 ms +- 3 ms: 1.03x slower (+3%)
- unpickle_list: 5.30 us +- 0.03 us -> 5.44 us +- 0.28 us: 1.03x slower (+3%)
- nqueens: 107 ms +- 1 ms -> 109 ms +- 3 ms: 1.02x slower (+2%)
- spectral_norm: 154 ms +- 1 ms -> 157 ms +- 1 ms: 1.02x slower (+2%)
- float: 136 ms +- 1 ms -> 138 ms +- 1 ms: 1.02x slower (+2%)
- unpickle_pure_python: 324 us +- 3 us -> 329 us +- 4 us: 1.02x slower (+2%)
- genshi_xml: 71.9 ms +- 1.9 ms -> 73.1 ms +- 1.0 ms: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 5.54 ms +- 0.03 ms -> 5.63 ms +- 0.03 ms: 1.02x slower (+2%)
- regex_effbot: 3.45 ms +- 0.14 ms -> 3.49 ms +- 0.04 ms: 1.01x slower (+1%)
- sqlite_synth: 3.37 us +- 0.06 us -> 3.41 us +- 0.07 us: 1.01x slower (+1%)
- regex_v8: 27.0 ms +- 0.2 ms -> 27.3 ms +- 0.1 ms: 1.01x slower (+1%)
- pickle_dict: 29.7 us +- 0.1 us -> 30.1 us +- 0.2 us: 1.01x slower (+1%)

Faster (10):
- go: 256 ms +- 3 ms -> 246 ms +- 2 ms: 1.04x faster (-4%)
- pickle_pure_python: 497 us +- 5 us -> 478 us +- 5 us: 1.04x faster (-4%)
- logging_simple: 9.84 us +- 0.17 us -> 9.48 us +- 0.22 us: 1.04x faster (-4%)
- mako: 17.1 ms +- 0.2 ms -> 16.5 ms +- 0.2 ms: 1.03x faster (-3%)
- logging_format: 10.9 us +- 0.2 us -> 10.6 us +- 0.2 us: 1.03x faster (-3%)
- logging_silent: 201 ns +- 6 ns -> 198 ns +- 5 ns: 1.02x faster (-2%)
- meteor_contest: 125 ms +- 1 ms -> 123 ms +- 0 ms: 1.02x faster (-2%)
- json_loads: 32.0 us +- 0.3 us -> 31.6 us +- 0.2 us: 1.01x faster (-1%)
- sqlalchemy_imperative: 37.8 ms +- 0.9 ms -> 37.4 ms +- 1.0 ms: 1.01x faster (-1%)
- unpickle: 14.5 us +- 0.3 us -> 14.3 us +- 0.1 us: 1.01x faster (-1%)

Benchmark hidden because not significant (35): 2to3, chameleon, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, genshi_text, hexiom, json_dumps, nbody, pickle, pickle_list, pidigits, pyflate, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, richards, scimark_lu, scimark_sor, sqlalchemy_declarative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00 (slower)
msg383152 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-16 11:16
Oh, I am quite confused about what's going on with pidigits and regex_v8. I will try to run the benchmarks again. Did you compare the current master against the PR? If that's the case we should rebase the PR it first to make sure we are comparing it correctly.
msg383160 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-16 12:31
OK, I have repeated the benchmarks after rebasing and this is what I get:

venv ❯ python -m pyperf compare_to json/2020-12-16_11-20-master-8203c73f3bb1.json.gz json/2020-12-16_11-22-load_method-21b1566125b3.json.gz -G --min-speed=1
Slower (13):
- regex_v8: 30.5 ms +- 0.1 ms -> 32.8 ms +- 0.2 ms: 1.08x slower (+8%)
- pidigits: 253 ms +- 0 ms -> 264 ms +- 0 ms: 1.05x slower (+5%)
- richards: 84.1 ms +- 1.1 ms -> 87.4 ms +- 1.4 ms: 1.04x slower (+4%)
- fannkuch: 573 ms +- 2 ms -> 589 ms +- 14 ms: 1.03x slower (+3%)
- regex_effbot: 3.87 ms +- 0.03 ms -> 3.98 ms +- 0.05 ms: 1.03x slower (+3%)
- scimark_sor: 223 ms +- 4 ms -> 227 ms +- 2 ms: 1.02x slower (+2%)
- pathlib: 20.9 ms +- 0.2 ms -> 21.3 ms +- 0.3 ms: 1.02x slower (+2%)
- regex_compile: 209 ms +- 1 ms -> 212 ms +- 3 ms: 1.01x slower (+1%)
- xml_etree_process: 93.7 ms +- 0.7 ms -> 94.9 ms +- 1.1 ms: 1.01x slower (+1%)
- nqueens: 122 ms +- 1 ms -> 123 ms +- 1 ms: 1.01x slower (+1%)
- regex_dna: 263 ms +- 1 ms -> 266 ms +- 1 ms: 1.01x slower (+1%)
- django_template: 56.1 ms +- 0.4 ms -> 56.7 ms +- 0.4 ms: 1.01x slower (+1%)
- raytrace: 566 ms +- 7 ms -> 572 ms +- 6 ms: 1.01x slower (+1%)

Faster (15):
- unpack_sequence: 80.4 ns +- 0.9 ns -> 70.8 ns +- 0.8 ns: 1.14x faster (-12%)
- scimark_sparse_mat_mult: 6.26 ms +- 0.02 ms -> 5.97 ms +- 0.06 ms: 1.05x faster (-5%)
- pickle_pure_python: 547 us +- 5 us -> 523 us +- 5 us: 1.05x faster (-4%)
- unpickle: 17.7 us +- 0.1 us -> 17.0 us +- 0.2 us: 1.04x faster (-4%)
- mako: 20.0 ms +- 0.1 ms -> 19.5 ms +- 0.1 ms: 1.02x faster (-2%)
- unpickle_pure_python: 361 us +- 4 us -> 353 us +- 4 us: 1.02x faster (-2%)
- logging_simple: 9.59 us +- 0.14 us -> 9.39 us +- 0.12 us: 1.02x faster (-2%)
- scimark_fft: 462 ms +- 4 ms -> 455 ms +- 4 ms: 1.02x faster (-2%)
- chameleon: 11.6 ms +- 0.2 ms -> 11.4 ms +- 0.1 ms: 1.02x faster (-2%)
- pickle: 13.0 us +- 0.1 us -> 12.8 us +- 0.1 us: 1.02x faster (-1%)
- telco: 8.30 ms +- 0.23 ms -> 8.18 ms +- 0.14 ms: 1.01x faster (-1%)
- go: 277 ms +- 2 ms -> 274 ms +- 2 ms: 1.01x faster (-1%)
- pickle_list: 5.30 us +- 0.07 us -> 5.23 us +- 0.06 us: 1.01x faster (-1%)
- logging_format: 10.3 us +- 0.1 us -> 10.2 us +- 0.1 us: 1.01x faster (-1%)
- meteor_contest: 136 ms +- 0 ms -> 134 ms +- 1 ms: 1.01x faster (-1%)

Benchmark hidden because not significant (32): 2to3, chaos, crypto_pyaes, deltablue, dulwich_log, float, genshi_text, genshi_xml, hexiom, json_dumps, json_loads, logging_silent, nbody, pickle_dict, pyflate, python_startup, python_startup_no_site, scimark_lu, scimark_monte_carlo, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate
msg383162 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-16 12:33
I think I am closing the PR as it seems that the gains are not good enough (and there is quite a lot of noise by comparing the benchmarks together).
msg383260 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-12-17 19:43
> I think I am closing the PR as it seems that the gains are not good enough (and there is quite a lot of noise by comparing the benchmarks together).

IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.
msg383261 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-17 20:08
> IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

I will try to do some prototyping around that to see how much can we gain in that route. In any case, adding LOAD_METHOD support for all kinds of calls should be an improvement by itself even without caching, no?
msg383262 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2020-12-17 20:22
> I will try to do some prototyping around that to see how much can we gain in that route. In any case, adding LOAD_METHOD support for all kinds of calls should be an improvement by itself even without caching, no?

Exactly.

As one argument for generalizing of LOAD_METHOD is that I, for example, try not to use kwargs in hot paths because I know it will be slower, which feels very wrong. I shouldn't care of internal implementation details of CPython and focus on writing readable code.
msg384463 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-01-06 01:09
I've skimmed several papers by Stefan Brunthaler about something called Quickening that I first found via the Pyston blog, which linked to an ancient bpo issue with a patch created by Stefan (https://bugs.python.org/issue14757).

The gist seems to be to have extra opcodes that only work for certain situations (e.g. INT_BINARY_ADD). In a hot function we can rewrite opcodes with their specialized counterpart. The new opcode contains a guard that rewrites itself back if the guard fails (and then it stays unoptimized).

I think the idea is that it's pretty cheap to keep an extra pointer to a "alternate bytecode" array.
msg384466 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2021-01-06 03:42
> The gist seems to be to have extra opcodes that only work for certain situations (e.g. INT_BINARY_ADD). In a hot function we can rewrite opcodes with their specialized counterpart. The new opcode contains a guard that rewrites itself back if the guard fails (and then it stays unoptimized).

This is also roughly what I suggested in https://bugs.python.org/msg379333. Except that I don't think it's necessary to add new opcodes.
msg384470 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-01-06 05:03
Do we have good intuition or data about which operations need speeding up most? Everybody always assumes it's BINARY_ADD, but much Python code isn't actually numeric, and binary operations aren't all that common. (For example, IIRC at my previous employer the hottest code was some kind of web template expansion and another bottleneck was marshalling RPC requests and results.)

Optimizers usually evolve to be excellent at optimizing the benchmark du jour -- what benchmark do we want to use?
msg384474 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2021-01-06 06:17
> Do we have good intuition or data about which operations need speeding up most? Everybody always assumes it's BINARY_ADD, but much Python code isn't actually numeric, and binary operations aren't all that common.

IMO, we shouldn't focus too much on optimizing binops. Regular webapps/network kind of code wouldn't benefit from that, and scientific code that uses numpy/ml libraries already offsets most of expensive computation to outside of CPython eval. Instead, I think, we should continue focusing on lowering the cost of function/method calls and attribute access.
msg384497 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-01-06 12:32
I subscribe that but is also a matter of what are the optimizations with higher ratio impact/(cost+complexity). Caching the function pointers on binary operations or (better IMHO) comparisons is likely a good candidate
msg384526 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-01-06 17:43
Undoubtedly -- but impact should be measured on what typical users would see, not on narrow benchmarks.
msg384534 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-01-06 19:09
Agreed. In that regard the most standard thing that we have is the pyperformance suite, which are almost all macro benchmarks. Is also what is displayed in speed.python.org
msg385922 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-01-29 17:31
I had a simpler idea for an inline cache for LOAD_METHOD than GH-23503. The essential part goes like this (sorry for the hybrid C/Python):

if <optimized etc.>:
    if type == lm->type and type->tp_version_tag == lm->tp_version_tag:
        meth = lm->meth
        SET_TOP(meth)
        PUSH(obj)
        DISPATCH()

name = GETITEM(names, oparg)
meth_found = _PyObject_GetMethod(obj, name, &meth)
<error check>

if meth_found:
    SET_TOP(meth)
    PUSH(obj)
    if <optimizing etc.>:
        lm = ...
        lm->type = type
        lm->meth = meth

<etc.>

What am I missing? Why is the hash of the name needed?


Oh, it's probably because there could still be an overriding value in obj.__dict__. But certainly we could check for type == lm->type before the other checks (HasFeature and tp_version_tag).

But what if we only did this for classes without an instance dict? That could work for things like list.append and str.find.
msg385966 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-01-30 02:13
> What am I missing? Why is the hash of the name needed?

To speed up the call to get the method from the dictionary using _PyDict_GetItem_KnownHash. The reason I was not caching the method is that as you mention there could still be an overriding value in the dictionary.

> But what if we only did this for classes without an instance dict? 

This is an interesting idea. For PR23503 seems that the machinery was too costly that it was killing the advantages, but maybe we could simplify this for classes without dictionaries so we can still gain overall.
msg385967 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-01-30 02:41
I am attaching to this issue a patch with PR 23503 restricted to only classes without dict. I can measure a speedup but is super small (5%) in a function that keeps calling a method for a builtin:

def foo(obj):
    for _ in range(10000):
        res = obj.index(42)
    return res
msg385971 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-01-30 05:39
Thanks!

The loop overhead is presumably dramatic in that example. I think I measured a somewhat bigger speedup using timeit, which IIRC subtracts the cost of an empty loop. But you're right that 5% on a micro-benchmark is not very exciting.

I wonder if there are cases where the speedup is larger, e.g. in a class with __slots__ where the method is defined in a base class (also with __slots__). In that case LOAD_METHOD would require two dict lookups, both of which we'd avoid here.

This could be relevant if we had a hidden class mechanism (I'm playing with that but it's too soon to show anything). Let's keep your patch around for that.



Has anyone looked at storing the cache literally inline? If we had a way to rewrite bytecode that allowed us to move code (requiring us to fix up jumps) the cache data could be stored inline in the bytecode array (I guess that's why they call it "inline cache" :-). So e.g. LOAD_METHOD could have a layout like this:

LOAD_METHOD opcode
oparg
(padding)
optimized
type
tp_version_tag
meth

This would avoid the pointer chasing needed to find the opcache struct, and increase memory locality. Obviously the bytecode would have to be copied to mutable memory.

We could even make it so that "LOAD_METHOD with opcache entry" is a new opcode, so there would be less overhead for unoptimized code (and for optimized code :-).

We'd have to do something funky for concurrent invocations of the same function (due to threading or recursion), but that can be dealt with; e.g., the frame could hold a pointer to the bytecode array in use.
msg393295 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-05-09 04:03
> IMO you need to implement LOAD_METHOD support for all kinds of calls, including the ones that use kwargs, to see any improvement.

Recently I played around with that idea and extended LOAD/CALL_METHOD to keyword arguments (so CALL_FUNCTION_KW is replaced). I then reapplied Pablo's patch. Here's the pyperformance results (LTO + PGO):

> pyperf compare_to 2021-05-08_03-25-general_load_method-42fcad26a487.json.gz 2021-05-08_08-50-general_load_method-dd24d58ce940.json.gz -G --min-speed=1

Slower (5):
- pathlib: 50.3 ms +- 0.8 ms -> 51.9 ms +- 1.1 ms: 1.03x slower
- unpack_sequence: 141 ns +- 2 ns -> 144 ns +- 7 ns: 1.02x slower
- telco: 16.6 ms +- 0.6 ms -> 16.9 ms +- 0.5 ms: 1.02x slower
- unpickle: 33.4 us +- 0.5 us -> 33.9 us +- 0.8 us: 1.01x slower
- nqueens: 239 ms +- 2 ms -> 242 ms +- 2 ms: 1.01x slower

Faster (23):
- logging_silent: 438 ns +- 16 ns -> 411 ns +- 18 ns: 1.07x faster
- go: 572 ms +- 55 ms -> 539 ms +- 8 ms: 1.06x faster
- pickle_pure_python: 1.10 ms +- 0.01 ms -> 1.06 ms +- 0.02 ms: 1.04x faster
- meteor_contest: 274 ms +- 4 ms -> 265 ms +- 4 ms: 1.03x faster
- logging_simple: 20.6 us +- 0.3 us -> 19.9 us +- 0.4 us: 1.03x faster
- hexiom: 23.0 ms +- 1.0 ms -> 22.2 ms +- 0.2 ms: 1.03x faster
- mako: 37.0 ms +- 0.5 ms -> 35.9 ms +- 0.4 ms: 1.03x faster
- json_dumps: 32.3 ms +- 0.4 ms -> 31.4 ms +- 0.3 ms: 1.03x faster
- scimark_sor: 496 ms +- 7 ms -> 484 ms +- 7 ms: 1.03x faster
- spectral_norm: 372 ms +- 5 ms -> 363 ms +- 4 ms: 1.02x faster
- richards: 177 ms +- 5 ms -> 173 ms +- 4 ms: 1.02x faster
- django_template: 114 ms +- 1 ms -> 112 ms +- 1 ms: 1.02x faster
- dulwich_log: 170 ms +- 2 ms -> 167 ms +- 2 ms: 1.02x faster
- pyflate: 1.65 sec +- 0.02 sec -> 1.62 sec +- 0.02 sec: 1.02x faster
- xml_etree_iterparse: 282 ms +- 5 ms -> 278 ms +- 3 ms: 1.02x faster
- unpickle_list: 12.1 us +- 0.1 us -> 11.9 us +- 0.2 us: 1.02x faster
- nbody: 343 ms +- 5 ms -> 338 ms +- 7 ms: 1.02x faster
- logging_format: 22.6 us +- 0.4 us -> 22.3 us +- 0.5 us: 1.01x faster
- sympy_str: 833 ms +- 7 ms -> 822 ms +- 7 ms: 1.01x faster
- float: 271 ms +- 3 ms -> 268 ms +- 5 ms: 1.01x faster
- scimark_fft: 1.05 sec +- 0.01 sec -> 1.03 sec +- 0.01 sec: 1.01x faster
- deltablue: 17.9 ms +- 0.3 ms -> 17.6 ms +- 0.4 ms: 1.01x faster
- scimark_lu: 403 ms +- 9 ms -> 399 ms +- 9 ms: 1.01x faster

Benchmark hidden because not significant (30): (...)
Geometric mean: 1.01x faster

Most benchmarks don't use keyword arguments. I'm going to try extend LOAD_METHOD to *args and **kwargs next.

For anyone interested, here's the diff (unstable) https://github.com/python/cpython/compare/main...Fidget-Spinner:general_load_method

I hope this helps :).
msg393348 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-05-10 03:39
Moving the needle on the pyperformance benchmarks is really hard!
msg404481 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-10-20 15:59
Closing this as we have been implementing this idea already with the adaptative interpreter
msg404491 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-10-20 16:28
For future reference, the following opcodes specialized via the PEP 659 specializing adaptive interpreter:

- LOAD_GLOBAL Issue44338
- LOAD_ATTR Issue44337
- STORE_ATTR Issue44826 (2% faster pyperformance)
- BINARY_SUBSCR Issue26280 (2% faster pyperformance)
- LOAD_METHOD Issue44889 (2% faster pyperformance)
- CALL_FUNCTION Issue44525 (2% faster pyperformance)
- BINARY_ADD Issue44945 (1% faster pyperformance)
- BINARY_MULTIPLY Issue45367 (10% faster nbody, 0% faster pyperformance)

Combined instructions:
Issue44900 (2% faster pyperformance)
History
Date User Action Args
2021-10-20 16:28:15kjsetstatus: open -> closed
resolution: fixed
messages: + msg404491

stage: patch review -> resolved
2021-10-20 15:59:52pablogsalsetmessages: + msg404481
2021-05-10 03:39:27gvanrossumsetmessages: + msg393348
2021-05-09 04:03:05kjsetnosy: + kj
messages: + msg393295
2021-01-30 05:39:42gvanrossumsetmessages: + msg385971
2021-01-30 02:42:00pablogsalsetfiles: + patch

messages: + msg385967
2021-01-30 02:13:28pablogsalsetmessages: + msg385966
2021-01-29 17:31:14gvanrossumsetmessages: + msg385922
2021-01-27 15:16:17pablogsalsetmessages: - msg385782
2021-01-27 15:01:56bismatrimonysetnosy: + bismatrimony
messages: + msg385782
2021-01-09 13:29:22Johan Dahlinsetnosy: + Johan Dahlin
2021-01-06 19:09:31pablogsalsetmessages: + msg384534
2021-01-06 17:43:06gvanrossumsetmessages: + msg384526
2021-01-06 12:32:34pablogsalsetmessages: + msg384497
2021-01-06 06:17:03yselivanovsetmessages: + msg384474
2021-01-06 05:03:37gvanrossumsetmessages: + msg384470
2021-01-06 03:42:49yselivanovsetmessages: + msg384466
2021-01-06 01:09:02gvanrossumsetnosy: + gvanrossum
messages: + msg384463
2021-01-05 02:45:02Guido.van.Rossumsetnosy: + Guido.van.Rossum
2020-12-17 20:22:48yselivanovsetmessages: + msg383262
2020-12-17 20:08:33pablogsalsetmessages: + msg383261
2020-12-17 19:43:54yselivanovsetmessages: + msg383260
2020-12-16 12:33:04pablogsalsetmessages: + msg383162
2020-12-16 12:31:56pablogsalsetfiles: + result.zip

messages: + msg383160
2020-12-16 11:16:26pablogsalsetmessages: + msg383152
2020-12-16 05:18:39methanesetfiles: + results.zip

messages: + msg383116
2020-12-15 13:08:54pablogsalsetmessages: + msg383051
2020-12-15 04:21:01methanesetmessages: + msg383034
2020-11-24 22:19:07vstinnersetnosy: - vstinner
2020-11-24 22:12:20pablogsalsetmessages: + msg381777
2020-11-24 22:09:34pablogsalsetpull_requests: + pull_request22392
2020-11-15 22:01:28BTaskayasetnosy: + BTaskaya
2020-11-01 14:34:09vstinnersetmessages: + msg380142
2020-10-23 01:54:59pablogsalsetkeywords: + patch
stage: patch review
pull_requests: + pull_request21838
2020-10-22 20:03:33yselivanovsetmessages: + msg379333
2020-10-22 20:03:29yselivanovsetmessages: - msg379330
2020-10-22 19:48:47yselivanovsetmessages: + msg379330
2020-10-22 16:42:52barrysetnosy: + barry
2020-10-22 10:42:56corona10setnosy: + corona10
2020-10-22 09:42:40methanesetmessages: + msg379284
2020-10-22 09:37:46methanesetmessages: + msg379282
2020-10-22 05:29:50yselivanovsetmessages: + msg379278
2020-10-22 04:45:10pablogsalsetmessages: + msg379277
2020-10-22 04:38:39yselivanovsetmessages: + msg379276
2020-10-22 03:27:38pablogsalsetmessages: + msg379274
2020-10-22 03:24:42pablogsalsetmessages: + msg379273
2020-10-22 03:21:38pablogsalcreate