Speedup method calls 1.2x #70298

Closed
1st1 opened this issue Jan 14, 2016 · 31 comments
Assignees
Labels
3.7 (EOL) end of life · interpreter-core (Objects, Python, Grammar, and Parser dirs) · performance (Performance or resource usage)

Comments

@1st1
Member

1st1 commented Jan 14, 2016

BPO 26110
Nosy @Yhg1s, @brettcannon, @jcea, @ncoghlan, @vstinner, @benjaminp, @DinoV, @methane, @serhiy-storchaka, @soltysh, @jdemeyer, @1st1, @jstasiak, @MojoVampire, @Vgr255, @florinpapa, @efahl, @dopplershift, @Fidget-Spinner
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
  • bpo-26110: Add CALL_METHOD_KW opcode to speedup method calls with keywords #26014
  • bpo-26110: Document CALL_METHOD_KW #26159
Files
  • call_method_1.patch
  • call_method_2.patch
  • call_method_3.patch
  • call_method_4.patch: updated for Python 3.7
  • default.json.gz
  • callmethod4.json.gz
  • call_method_5.patch
  • default.json.gz
  • patched.json.gz
  • callmethod-doc.patch
  • callmethod-doc2.patch
  • callmethod-doc3.patch
  • call-method-doc4.patch
  • call-method-doc5.patch
    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/1st1'
    closed_at = <Date 2017-01-16.08:23:59.319>
    created_at = <Date 2016-01-14.17:09:06.979>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Speedup method calls 1.2x'
    updated_at = <Date 2021-05-16.09:26:18.303>
    user = 'https://github.com/1st1'

    bugs.python.org fields:

    activity = <Date 2021-05-16.09:26:18.303>
    actor = 'kj'
    assignee = 'yselivanov'
    closed = True
    closed_date = <Date 2017-01-16.08:23:59.319>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2016-01-14.17:09:06.979>
    creator = 'yselivanov'
    dependencies = []
    files = ['41614', '43599', '43601', '45786', '45787', '45788', '45866', '45887', '45888', '45889', '45891', '45896', '45963', '46294']
    hgrepos = []
    issue_num = 26110
    keywords = ['patch']
    message_count = 31.0
    messages = ['258204', '258205', '258206', '258211', '258220', '258222', '259333', '259339', '264659', '269662', '269665', '282620', '282621', '282622', '282627', '283075', '283077', '283093', '283094', '283147', '283148', '283149', '283160', '283161', '283166', '283168', '283177', '285415', '285511', '285544', '345335']
    nosy_count = 22.0
    nosy_names = ['twouters', 'brett.cannon', 'jcea', 'ncoghlan', 'vstinner', 'benjamin.peterson', 'dino.viehland', 'methane', 'santagada', 'python-dev', 'serhiy.storchaka', 'maciej.szulik', 'jdemeyer', 'yselivanov', 'jstasiak', 'josh.r', 'abarry', 'alecsandru.patrascu', 'florin.papa', 'eric.fahlgren', 'Ryan May', 'kj']
    pr_nums = ['552', '26014', '26159']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26110'
    versions = ['Python 3.7']

    @1st1
    Member Author

    1st1 commented Jan 14, 2016

    This issue supersedes issue bpo-6033. I decided to open a new one, since the patch is about Python 3.6 (not 2.7) and is written from scratch.

    The idea is to add new opcodes to avoid instantiation of BoundMethods. The patch only affects method calls of Python functions with positional arguments.

    I'm working on the attached patch in this repo: https://github.com/1st1/cpython/tree/call_meth2

    If the patch gets accepted, I'll update it with the docs etc.

    Performance Improvements
    ------------------------

    Method calls in micro-benchmarks are 1.2x faster:

    ### call_method ###
    Avg: 0.330371 -> 0.281452: 1.17x faster

    ### call_method_slots ###
    Avg: 0.333489 -> 0.280612: 1.19x faster

    ### call_method_unknown ###
    Avg: 0.304949 -> 0.251623: 1.21x faster

    Improvements in mini-benchmarks, such as Richards, are less impressive; I'd say it's a 3-7% improvement. The full output of benchmarks/perf.py is here: https://gist.github.com/1st1/e00f11586329f68fd490

    When the full benchmark suite is run, some benchmarks report slowdowns. When I ran them separately several times, they all showed no real slowdowns; it's just that some of them (such as nbody) are highly unstable.

    It's actually possible to improve performance by another 1-3% if we fuse __PyObject_GetMethod with the ceval/LOAD_METHOD code. I've tried this here: https://github.com/1st1/cpython/tree/call_meth4; however, I don't like leaking so many details of object.c into ceval.c.

    Changes in the Core
    -------------------

    Two new opcodes are added -- LOAD_METHOD and CALL_METHOD. Whenever the compiler sees a method call "obj.method(..)" with positional arguments, it compiles it as follows:

      LOAD_FAST(obj)
      LOAD_METHOD(method)
      {call arguments}
      CALL_METHOD

    The LOAD_METHOD implementation in ceval looks up "method" on obj's type and checks that it wasn't overridden in obj.__dict__. Apparently, even with the __dict__ check this is still faster than creating a BoundMethod instance etc.

    If the method is found and not overridden, LOAD_METHOD pushes the resolved method object and 'obj'. If the method was overridden, the resolved method object and NULL are pushed to the stack.
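
    As a rough pure-Python model of the behaviour just described (the helper name below is hypothetical; the real logic lives in C, in object.c and ceval.c):

      def load_method(obj, name):
          # Look the attribute up on the type first.
          func = getattr(type(obj), name, None)
          instance_dict = getattr(obj, "__dict__", {})
          if callable(func) and name not in instance_dict:
              # Fast path: no bound method is created; CALL_METHOD will later
              # call func(obj, *args) directly.
              return func, obj
          # Slow path: regular attribute lookup; CALL_METHOD sees NULL (None
          # here) and performs an ordinary call.
          return getattr(obj, name), None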

    CALL_METHOD then looks at the two stack values below the call arguments. If the first one isn't NULL, we have a method call.
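
    Putting it together, on a build with this patch a simple method call disassembles roughly like this (offsets and argument indices are illustrative and depend on the exact code):

      import dis

      def f(obj):
          return obj.method(1, 2)

      # On a patched interpreter (as released in CPython 3.7), dis.dis(f)
      # prints roughly:
      #   LOAD_FAST      0 (obj)
      #   LOAD_METHOD    0 (method)
      #   LOAD_CONST     1 (1)
      #   LOAD_CONST     2 (2)
      #   CALL_METHOD    2
      #   RETURN_VALUE
      dis.dis(f)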

    Why CALL_METHOD?
    ----------------

    It's actually possible to hack CALL_FUNCTION to support LOAD_METHOD. I've tried this approach in https://github.com/1st1/cpython/tree/call_meth3. It looks like adding extra checks in CALL_FUNCTION has a negative impact on many benchmarks. It's really easier to add another opcode.

    Why only pure-Python methods?
    -----------------------------

    LOAD_METHOD currently works only with methods defined in pure Python. C methods, such as list.append, are still wrapped in a descriptor that creates a PyCFunction object on each attribute access. I've tried to handle those too in https://github.com/1st1/cpython/tree/call_cmeth. It does improve C method calls, although my implementation is very hacky. It still uses the LOAD_METHOD and CALL_METHOD opcodes, so my idea is to consider merging this patch first, and then introduce the necessary refactoring of PyCFunction and MethodDescriptors in a separate issue.
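
    The per-access wrapper mentioned above is easy to observe (illustrative snippet):

      lst = []
      # Each attribute access on a built-in type's method creates a fresh
      # builtin-method (PyCFunction) object wrapping the underlying C function.
      m1 = lst.append
      m2 = lst.append
      print(type(m1).__name__)   # builtin_function_or_method
      print(m1 is m2)            # False -- a new wrapper per access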

    Why only calls with positional arguments?
    -----------------------------------------

    As shown in "Why CALL_METHOD?", making CALL_FUNCTION work with LOAD_METHOD slows it down. For keyword and var-arg calls we have three more opcodes -- CALL_FUNCTION_VAR, CALL_FUNCTION_KW, and CALL_FUNCTION_VAR_KW. I suspect that making them work with LOAD_METHOD would slow them down too, which will probably require us to add three (!) more opcodes for LOAD_METHOD.

    And these kinds of calls carry much more overhead anyway; I don't expect them to be as optimizable as positional-argument calls.

    @1st1 1st1 self-assigned this Jan 14, 2016
    @1st1 1st1 added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jan 14, 2016
    @vstinner
    Member

    For keyword and var-arg calls we have three more opcodes -- CALL_FUNCTION_VAR, CALL_FUNCTION_KW, and CALL_FUNCTION_VAR_KW. I suspect that making them work with LOAD_METHOD would slow them down too, which will probably require us to add three (!) more opcodes for LOAD_METHOD.

    I don't think that it's an issue to add 3 more opcodes for performance. If you prefer to limit the number of opcodes, you can pass a flag in arguments. For example, use 2 bytes for 2 arguments instead of only 1?

    See also the (now old) WPython project, which proposed a kind of CISC instruction set:
    https://code.google.com/p/wpython/

    @1st1
    Member Author

    1st1 commented Jan 14, 2016

    If you prefer to limit the number of opcodes, you can pass a flag in arguments. For example, use 2 bytes for 2 arguments instead of only 1?

    I tried two approaches:

    1. Push one more leading NULL onto the stack in LOAD_METHOD (which makes it push 3 values). Then, in CALL_FUNCTION, detect when the callable is NULL and do the stuff that CALL_METHOD does.

    2. Use a bitflag on oparg for CALL_FUNCTION and CALL_* opcodes. This adds EXTENDED_ARG opcode overhead.

    Long story short, the only option would be to add dedicated opcodes to work with LOAD_METHOD. However, I'd prefer to first merge this patch, as it's relatively small and easy to review, and then focus on improving other things (keyword/*arg calls, C methods, etc). This is just a first step.

    @gvanrossum
    Member

    I like this idea! I like the limitations to positional-only calls. I do think that it would be nice if we could speed up C calls too -- today, s.startswith('abc') is slower than s[:3] == 'abc' precisely because of the lookup. But I'm all for doing this one step at a time, so we can be sure it is solid before taking the next step(s).

    @1st1
    Member Author

    1st1 commented Jan 14, 2016

    I like this idea! I like the limitations to positional-only calls. I do think that it would be nice if we could speed up C calls too -- today, s.startswith('abc') is slower than s[:3] == 'abc' precisely because of the lookup. But I'm all for doing this one step at a time, so we can be sure it is solid before taking the next step(s).

    Yes, I think we can make <built-in type>.method(..) calls much faster with LOAD_METHOD. Since those types lack __dict__ and their __class__ is read-only, we can use a far better optimized code path without extra lookups and creation of BoundMethod/PyCFunction objects.
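
    For reference, the kind of comparison mentioned above can be reproduced with timeit (numbers vary by machine and Python version; this is just an illustrative micro-benchmark):

      import timeit

      setup = "s = 'abcdef' * 10"
      # Method call: attribute lookup + a fresh PyCFunction wrapper per call.
      t_method = timeit.timeit("s.startswith('abc')", setup=setup, number=1_000_000)
      # Slice + comparison: no attribute lookup at all.
      t_slice = timeit.timeit("s[:3] == 'abc'", setup=setup, number=1_000_000)
      print(f"startswith: {t_method:.3f}s   slice: {t_slice:.3f}s")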

    @vstinner
    Member

    I'm happy to see people working on optimizing CPython ;-)

    @1st1
    Member Author

    1st1 commented Feb 1, 2016

    For those interested in reviewing this patch at some point: please wait until I upload a new version. The current patch is somewhat outdated.

    @alecsandrupatrascu
    Mannequin

    alecsandrupatrascu mannequin commented Feb 1, 2016

    Yury, thank you for the heads up! Here at Intel, in the Dynamic Scripting Languages Optimization Team, we can help the community with reviewing and measuring this patch in our quiet and stable environment, the same one that we use to provide public CPython daily measurements. We will wait for your update.

    @gvanrossum
    Member

    This patch doesn't apply cleanly any more. Is it easy to update?

    @methane
    Member

    methane commented Jul 1, 2016

    Updated, based on 102241:908b801f8a62

    @methane
    Member

    methane commented Jul 1, 2016

    Oops, the previous patch doesn't update the magic number in PC/launcher.c.
    Should I update it, or leave it untouched to avoid additional conflicts?

    @methane methane added the 3.7 (EOL) end of life label Dec 7, 2016
    @serhiy-storchaka
    Member

    Please increase the magic number by 10. We need to reserve a few numbers in case of bytecode bug fixes in 3.6.

    @serhiy-storchaka
    Member

    Added comments on Rietveld. Please document new bytecodes in the dis module documentation and What's New.

    @methane
    Member

    methane commented Dec 7, 2016

    $ ./python-default -m perf compare_to default.json callmethod4.json -G
    Slower (7):
    - pickle_dict: 66.0 us +- 4.6 us -> 77.0 us +- 5.9 us: 1.17x slower
    - json_loads: 63.7 us +- 0.7 us -> 68.4 us +- 1.4 us: 1.07x slower
    - unpack_sequence: 120 ns +- 2 ns -> 125 ns +- 3 ns: 1.04x slower
    - scimark_lu: 499 ms +- 12 ms -> 514 ms +- 24 ms: 1.03x slower
    - scimark_monte_carlo: 272 ms +- 10 ms -> 278 ms +- 9 ms: 1.02x slower
    - scimark_sor: 517 ms +- 9 ms -> 526 ms +- 10 ms: 1.02x slower
    - regex_effbot: 5.25 ms +- 0.15 ms -> 5.27 ms +- 0.17 ms: 1.00x slower

    Faster (52):

    • call_method: 18.1 ms +- 0.1 ms -> 14.8 ms +- 0.1 ms: 1.22x faster
    • call_method_slots: 17.9 ms +- 0.2 ms -> 14.7 ms +- 0.2 ms: 1.22x faster
    • call_method_unknown: 20.0 ms +- 0.9 ms -> 16.5 ms +- 0.1 ms: 1.21x faster
    • xml_etree_parse: 304 ms +- 13 ms -> 283 ms +- 6 ms: 1.07x faster
    • float: 310 ms +- 4 ms -> 292 ms +- 4 ms: 1.06x faster
    • call_simple: 14.2 ms +- 0.6 ms -> 13.4 ms +- 0.3 ms: 1.06x faster
    • pickle_pure_python: 1.34 ms +- 0.02 ms -> 1.27 ms +- 0.02 ms: 1.05x faster
    • richards: 189 ms +- 3 ms -> 180 ms +- 7 ms: 1.05x faster
    • logging_silent: 793 ns +- 14 ns -> 754 ns +- 9 ns: 1.05x faster
    • deltablue: 18.4 ms +- 0.2 ms -> 17.6 ms +- 0.1 ms: 1.05x faster
    • go: 623 ms +- 5 ms -> 594 ms +- 5 ms: 1.05x faster
    • genshi_xml: 202 ms +- 2 ms -> 193 ms +- 3 ms: 1.04x faster
    • fannkuch: 1.10 sec +- 0.02 sec -> 1.05 sec +- 0.02 sec: 1.04x faster
    • raytrace: 1.38 sec +- 0.02 sec -> 1.32 sec +- 0.00 sec: 1.04x faster
    • django_template: 419 ms +- 5 ms -> 402 ms +- 4 ms: 1.04x faster
    • hexiom: 24.2 ms +- 0.2 ms -> 23.3 ms +- 0.2 ms: 1.04x faster
    • xml_etree_iterparse: 225 ms +- 4 ms -> 216 ms +- 5 ms: 1.04x faster
    • spectral_norm: 282 ms +- 14 ms -> 271 ms +- 4 ms: 1.04x faster
    • python_startup_no_site: 9.83 ms +- 0.02 ms -> 9.46 ms +- 0.03 ms: 1.04x faster
    • python_startup: 16.4 ms +- 0.0 ms -> 15.7 ms +- 0.0 ms: 1.04x faster
    • unpickle_pure_python: 918 us +- 24 us -> 884 us +- 24 us: 1.04x faster
    • unpickle: 33.0 us +- 0.4 us -> 31.8 us +- 0.4 us: 1.04x faster
    • dulwich_log: 158 ms +- 1 ms -> 153 ms +- 1 ms: 1.04x faster
    • html5lib: 234 ms +- 6 ms -> 225 ms +- 6 ms: 1.04x faster
    • scimark_fft: 735 ms +- 19 ms -> 709 ms +- 22 ms: 1.04x faster
    • sqlalchemy_declarative: 325 ms +- 4 ms -> 314 ms +- 3 ms: 1.04x faster
    • mako: 45.4 ms +- 0.3 ms -> 43.8 ms +- 0.7 ms: 1.04x faster
    • genshi_text: 90.1 ms +- 0.9 ms -> 87.0 ms +- 0.9 ms: 1.04x faster
    • tornado_http: 388 ms +- 4 ms -> 375 ms +- 4 ms: 1.03x faster
    • 2to3: 758 ms +- 4 ms -> 734 ms +- 4 ms: 1.03x faster
    • scimark_sparse_mat_mult: 8.75 ms +- 0.25 ms -> 8.48 ms +- 0.39 ms: 1.03x faster
    • json_dumps: 29.9 ms +- 0.4 ms -> 29.0 ms +- 0.3 ms: 1.03x faster
    • sqlite_synth: 10.1 us +- 0.2 us -> 9.80 us +- 0.19 us: 1.03x faster
    • regex_dna: 286 ms +- 6 ms -> 278 ms +- 4 ms: 1.03x faster
    • crypto_pyaes: 235 ms +- 2 ms -> 228 ms +- 2 ms: 1.03x faster
    • pidigits: 320 ms +- 0 ms -> 311 ms +- 1 ms: 1.03x faster
    • logging_format: 38.1 us +- 0.5 us -> 37.1 us +- 0.4 us: 1.03x faster
    • meteor_contest: 206 ms +- 3 ms -> 201 ms +- 1 ms: 1.03x faster
    • xml_etree_process: 248 ms +- 4 ms -> 241 ms +- 4 ms: 1.03x faster
    • chameleon: 30.9 ms +- 0.4 ms -> 30.1 ms +- 0.4 ms: 1.03x faster
    • logging_simple: 32.8 us +- 0.4 us -> 32.0 us +- 0.3 us: 1.03x faster
    • sqlalchemy_imperative: 61.3 ms +- 1.6 ms -> 59.9 ms +- 1.6 ms: 1.02x faster
    • chaos: 308 ms +- 2 ms -> 301 ms +- 2 ms: 1.02x faster
    • unpickle_list: 7.90 us +- 0.04 us -> 7.73 us +- 0.10 us: 1.02x faster
    • sympy_integrate: 47.9 ms +- 0.3 ms -> 46.9 ms +- 0.4 ms: 1.02x faster
    • sympy_sum: 216 ms +- 6 ms -> 212 ms +- 6 ms: 1.02x faster
    • sympy_str: 481 ms +- 5 ms -> 473 ms +- 2 ms: 1.02x faster
    • sympy_expand: 1.08 sec +- 0.01 sec -> 1.06 sec +- 0.01 sec: 1.02x faster
    • regex_compile: 449 ms +- 3 ms -> 442 ms +- 5 ms: 1.02x faster
    • nqueens: 275 ms +- 3 ms -> 271 ms +- 3 ms: 1.02x faster
    • pathlib: 51.6 ms +- 0.6 ms -> 50.9 ms +- 0.7 ms: 1.01x faster
    • pickle_list: 8.85 us +- 0.09 us -> 8.79 us +- 0.11 us: 1.01x faster

    Benchmark hidden because not significant (5): nbody, pickle, regex_v8, telco, xml_etree_generate

    @1st1
    Member Author

    1st1 commented Dec 7, 2016

    Please don't merge this without my review.

    @serhiy-storchaka
    Member

    Technically the patch LGTM. But we should find the cause of the regression in some benchmarks.

    And it would be nice to extend the optimization to C functions.

    In any case, this optimization is worth mentioning in What's New.

    @methane
    Member

    methane commented Dec 13, 2016

    Technically the patch LGTM. But we should find the cause of the regression in some benchmarks.

    The benchmark was run on Sandy Bridge (Core i5 2400) and I didn't use a PGO build.
    perf_event reported a branch-miss rate increase in cpickle's save function.

    I'll rerun the benchmark with a PGO build. I hope PGO is as friendly to CPU branch
    prediction as it is to the L1/L2 caches.

    Anyway, recent amd64 CPUs have a much larger branch history.

    And it would be nice to extend the optimization to C functions.
    In any case, this optimization is worth mentioning in What's New.

    I'll do them.

    @methane
    Member

    methane commented Dec 13, 2016

    And it would be nice to extend the optimization to C functions.

    I tried it, but skipping the creation of the PyCFunction seems impossible for now.

    My current idea is to add a new tp_fastcall slot which has the same signature
    as _PyFunction_FastCallDict or _PyCFunction_FastCallDict.

    If MethodDescrObject implemented it, we could skip the temporary PyCFunction
    object and argument tuple.

    But I think it should be a separate issue. The patch is large enough already.

    @serhiy-storchaka
    Member

    But I think it should be a separate issue.

    Agreed, if it's that hard.

    @1st1
    Member Author

    1st1 commented Dec 13, 2016

    Inada-san, when I tested the patch last time, I think there was a regression somewhere, related to the descriptor protocol. Have you fixed that one?

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 14, 2016

    New changeset 64afd5cab40a by Yury Selivanov in branch 'default':
    Issue bpo-26110: Add LOAD_METHOD/CALL_METHOD opcodes.
    https://hg.python.org/cpython/rev/64afd5cab40a

    @1st1
    Member Author

    1st1 commented Dec 14, 2016

    Inada-san, when I tested the patch last time, I think there was a regression somewhere, related to the descriptor protocol. Have you fixed that one?

    It seems to be either fixed, or maybe those bugs were related to my opcode cache patch. Anyway, I decided to commit the patch to 3.7; otherwise it might miss the commit window as it did for 3.6. Let's fix any regressions right in the repo now.

    My current idea is to add a new tp_fastcall slot which has the same signature
    as _PyFunction_FastCallDict or _PyCFunction_FastCallDict.

    I like this idea!

    Thanks Inada-san for pushing this patch through, and thanks to Serhiy for reviewing it.

    @methane
    Member

    methane commented Dec 14, 2016

    I'm working on changing the stack layout slightly:

    current patch: callable | NULL | arg1 | ... argN
    next patch:    NULL | callable | arg1 | ... argN

    After benchmarking with a PGO build, I'll post it.

    @methane
    Member

    methane commented Dec 14, 2016

    I hadn't noticed the patch was committed already.

    Changing the stack layout slightly is to make it easier to document, not for performance.
    Please don't close this issue until the documentation is added to Doc/library/dis.rst.

    @methane
    Member

    methane commented Dec 14, 2016

    @methane
    Member

    methane commented Dec 14, 2016

    This patch modifies the stack layout slightly and adds documentation to Doc/library/dis.rst.

    @vstinner
    Member

    INADA Naoki: "My current idea is to add a new tp_fastcall slot which has the same signature as _PyFunction_FastCallDict or _PyCFunction_FastCallDict."

    FYI, I'm working on a solution to avoid using a tuple and a dict to pass parameters to tp_new, tp_init and tp_call. I have PoC implementations (yeah, more than one). Please contact me directly if you are interested; this issue is not the right place to discuss it.

    @vstinner
    Member

    I just created the issue bpo-29263: "Implement LOAD_METHOD/CALL_METHOD for C functions".

    @methane
    Member

    methane commented Jan 15, 2017

    Yury, could you review this?

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 16, 2017

    New changeset a6241b2073c6 by INADA Naoki in branch 'default':
    Issue bpo-26110: Add document for LOAD_METHOD and CALL_METHOD opcode.
    https://hg.python.org/cpython/rev/a6241b2073c6

    @methane methane closed this as completed Jan 16, 2017
    @jdemeyer
    Contributor

    Note that this idea has been generalized by PEP-590: any type can support this optimization by setting the Py_TPFLAGS_METHOD_DESCRIPTOR flag.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022