classification
Title: Implement LOAD_METHOD/CALL_METHOD for C functions
Type: performance Stage: resolved
Components: Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: haypo, inada.naoki, python-dev, serhiy.storchaka, yselivanov
Priority: normal Keywords: patch

Created on 2017-01-13 16:09 by haypo, last changed 2017-02-02 23:13 by inada.naoki. This issue is now closed.

Files
File name Uploaded Description Edit
loadmethod-fastcall.patch inada.naoki, 2017-01-14 05:36 review
loadmethod-methoddescr.patch inada.naoki, 2017-02-02 05:10 review
loadmethod-methoddescr-2.patch inada.naoki, 2017-02-02 16:48 review
Messages (18)
msg285414 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 16:09
The issue #29259 implements tp_fastcall on method_descriptor (PyMethodDescr_Type). According to http://bugs.python.org/issue26110#msg283093 it would allow to implement LOAD_METHOD and CALL_METHOD for C functions.
msg285457 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-14 05:36
This is proof of concept patch to support LOAD_METHOD & CALL_METHOD for METH_FASTCALL methods.
(This patch should be applied after fastcall.patch)

$ ./python -m perf timeit --compare-to `pwd`/python-fastcall -s "d = b''" -- "d.decode()"
python-fastcall: ..................... 91.0 ns +- 1.0 ns
python: ..................... 80.3 ns +- 0.3 ns

Median +- std dev: [python-fastcall] 91.0 ns +- 1.0 ns -> [python] 80.3 ns +- 0.3 ns: 1.13x faster

$ ./python -m perf timeit --compare-to `pwd`/python-fastcall -s "d = b''" -- "d.decode('ascii')"
python-fastcall: ..................... 116 ns +- 1 ns
python: ..................... 106 ns +- 1 ns

Median +- std dev: [python-fastcall] 116 ns +- 1 ns -> [python] 106 ns +- 1 ns: 1.10x faster


Since PyCFunction is lighter than PyMethodObject, performance benefit seems smaller than
Python method (up to 20%).

Sadly speaking, there are only few METH_FASTCALL in builtin type.
msg285459 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-14 06:36
Maybe, we should do

* Make clinic use more METH_FASTCALL
* Use clinic more in builtin methods

before trying this optimization.
msg285711 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-18 09:39
New changeset 0b219348ec9e by Victor Stinner in branch 'default':
Optimize methoddescr_call(): avoid temporary PyCFunction
https://hg.python.org/cpython/rev/0b219348ec9e
msg286651 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-01 16:49
INADA Naoki: "Maybe, we should do: * Make clinic use more METH_FASTCALL * Use clinic more in builtin methods; before trying this optimization."

I created the issue #29286 "Use METH_FASTCALL in str methods" which is now fixed. I started to propose patches for other funnctions, types and modules.

Would you mind to rebase your loadmethod-fastcall.patch patch Naoki?

With my change 0b219348ec9e, I'm not sure that your change on methoddescr_call() is still needed.

I tried to rebase your change, but I failed to get the "1.10x faster" result. Maybe I did something wrong, maybe my change 0b219348ec9e already optimized method calls for C functions?
msg286728 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-02-02 01:17
LOAD_METHOD support was based on tp_fastcall.

Without LOAD_METHOD support, methoddescr_get is called and
stack layout after LOAD_METHOD is:

  NULL, (bound)PyCFunction, arg1, arg2, ... argN

With LOAD_METHOD support, stack layout is:

  PyMethodDescrObject, self, arg1, ... argN

And tp_call (or tp_fastcall) of method descriptor is called when CALL_METHOD.

Without tp_fastcall, it means pack&unpack tuple happens.
It is more heavy than creating temporary PyCFunction.

---

Other ideas to support LOAD_METHOD for C function are:

* Add _PyMethodDescr_FastCallKeywords and call it in call_function.
* Create unbound PyCFunction and cache it
msg286731 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-02-02 01:27
> * Add _PyMethodDescr_FastCallKeywords and call it in call_function.

This option seems to be the easiest to implement.
msg286732 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-02 01:45
INADA Naoki: "LOAD_METHOD support was based on tp_fastcall. (...) Other ideas to support LOAD_METHOD for C function are: * Add _PyMethodDescr_FastCallKeywords and call it in call_function."

It seems like a very small subset of tp_fastcall. If it provides a significant speedup, I really like the idea :-) I would be happy to see a "subset of tp_fastcall" merged. At least my whole tp_fastcall experiment wouldn't be worthless :-)

What would be the main change which would provide a speedup? Avoid the creation of a temporary tuple to pass positional arguments?
msg286733 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-02 01:49
> * Create unbound PyCFunction and cache it

Does it mean create a new private type, similar to PyCFunction_Type (PyCFunctionObject) but without the self attribute?

IMHO creating a new type is complex. I'm not sure that it's worth it, especially if the _PyMethodDescr_FastCallKeywords() idea is simpler to implement and "fast enough".
msg286739 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-02-02 05:10
I'm running pyperformance.
msg286740 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-02-02 05:17
Here is the result. I'll investigate mako result.

+ ../python.default -m perf compare_to default.json patched.json -G --min-speed=1
Slower (20):
- mako: 40.9 ms +- 0.4 ms -> 44.7 ms +- 0.6 ms: 1.09x slower (+9%)
- chaos: 298 ms +- 3 ms -> 308 ms +- 2 ms: 1.03x slower (+3%)
- telco: 19.7 ms +- 0.6 ms -> 20.2 ms +- 0.5 ms: 1.03x slower (+3%)
- scimark_sor: 478 ms +- 7 ms -> 491 ms +- 11 ms: 1.03x slower (+3%)
- xml_etree_process: 238 ms +- 3 ms -> 245 ms +- 4 ms: 1.03x slower (+3%)
- logging_simple: 31.1 us +- 0.3 us -> 31.9 us +- 0.3 us: 1.03x slower (+3%)
- logging_format: 36.4 us +- 0.3 us -> 37.2 us +- 0.4 us: 1.02x slower (+2%)
- nqueens: 264 ms +- 3 ms -> 270 ms +- 3 ms: 1.02x slower (+2%)
- fannkuch: 1.07 sec +- 0.02 sec -> 1.09 sec +- 0.03 sec: 1.02x slower (+2%)
- django_template: 401 ms +- 5 ms -> 409 ms +- 3 ms: 1.02x slower (+2%)
- call_method_slots: 14.6 ms +- 0.2 ms -> 14.9 ms +- 0.1 ms: 1.02x slower (+2%)
- meteor_contest: 199 ms +- 2 ms -> 203 ms +- 2 ms: 1.02x slower (+2%)
- logging_silent: 735 ns +- 9 ns -> 746 ns +- 11 ns: 1.02x slower (+2%)
- sqlalchemy_declarative: 314 ms +- 4 ms -> 318 ms +- 4 ms: 1.01x slower (+1%)
- genshi_text: 89.1 ms +- 1.3 ms -> 90.2 ms +- 1.3 ms: 1.01x slower (+1%)
- unpack_sequence: 131 ns +- 2 ns -> 133 ns +- 2 ns: 1.01x slower (+1%)
- call_method: 15.1 ms +- 0.2 ms -> 15.3 ms +- 0.1 ms: 1.01x slower (+1%)
- scimark_monte_carlo: 263 ms +- 6 ms -> 266 ms +- 6 ms: 1.01x slower (+1%)
- unpickle_pure_python: 834 us +- 16 us -> 843 us +- 15 us: 1.01x slower (+1%)
- pathlib: 51.2 ms +- 0.9 ms -> 51.7 ms +- 0.7 ms: 1.01x slower (+1%)

Faster (11):
- scimark_lu: 457 ms +- 49 ms -> 407 ms +- 7 ms: 1.12x faster (-11%)
- chameleon: 30.2 ms +- 0.6 ms -> 28.8 ms +- 0.4 ms: 1.05x faster (-5%)
- xml_etree_parse: 314 ms +- 12 ms -> 303 ms +- 8 ms: 1.04x faster (-4%)
- sqlite_synth: 10.3 us +- 0.2 us -> 9.94 us +- 0.25 us: 1.03x faster (-3%)
- pickle_pure_python: 1.28 ms +- 0.02 ms -> 1.25 ms +- 0.02 ms: 1.03x faster (-3%)
- xml_etree_iterparse: 223 ms +- 5 ms -> 218 ms +- 7 ms: 1.02x faster (-2%)
- scimark_fft: 730 ms +- 19 ms -> 716 ms +- 18 ms: 1.02x faster (-2%)
- nbody: 243 ms +- 5 ms -> 239 ms +- 6 ms: 1.02x faster (-2%)
- tornado_http: 377 ms +- 4 ms -> 373 ms +- 4 ms: 1.01x faster (-1%)
- sympy_expand: 1.02 sec +- 0.01 sec -> 1.01 sec +- 0.01 sec: 1.01x faster (-1%)
- unpickle: 32.1 us +- 0.3 us -> 31.8 us +- 0.2 us: 1.01x faster (-1%)

Benchmark hidden because not significant (33):
msg286741 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-02-02 05:20
microbench:
$ ./python.patched -m perf timeit --compare-to `pwd`/python.default -s "d = b''" -- "d.decode('ascii')"
python.default: ..................... 109 ns +- 1 ns
python.patched: ..................... 102 ns +- 1 ns

Median +- std dev: [python.default] 109 ns +- 1 ns -> [python.patched] 102 ns +- 1 ns: 1.08x faster (-7%)

$ ./python.patched -m perf timeit --compare-to `pwd`/python.default -s "l = [42]" -- "l.count(42)"                            
python.default: ..................... 66.0 ns +- 0.9 ns
python.patched: ..................... 60.0 ns +- 0.3 ns

Median +- std dev: [python.default] 66.0 ns +- 0.9 ns -> [python.patched] 60.0 ns +- 0.3 ns: 1.10x faster (-9%)
msg286757 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-02 10:04
I tried to benchmark b''.decode(encoding='ascii'): CALL_METHOD is not used for this call, but CALL_FUNCTION_KW :-/ So the call is not affected by your patch.

I also ran a quick benchmark on loadmethod-methoddescr.patch:

b''.decode(): 71.1 ns +- 0.5 ns -> 65.4 ns +- 0.2 ns: 1.09x faster (-8%)
b''.decode('utf8'): 92.8 ns +- 0.4 ns -> 85.5 ns +- 0.3 ns: 1.09x faster (-8%)

I confirm the speedup.
msg286762 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-02-02 11:10
I confirmed bm_mako performance degrade is caused by L1 cache miss.
(I didn't use --with-optimization)
https://gist.github.com/methane/b33edbf1f123ae026e704b0e005c3606
msg286764 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-02 11:20
Naoki> I confirmed bm_mako performance degrade is caused by L1 cache miss.

I know this performance instability very well, the issue is called "code placement":
https://haypo.github.io/journey-to-stable-benchmark-deadcode.html

I tried to fight it with GCC __attribute__((hot)) in the issue #28618, but it doesn't fix the issue (at least, not completely):
http://bugs.python.org/issue28618#msg281459

In my experience, the best fix is PGO. Slowly, I consider that it's worthless to try to "fight" against code placement, and that benchmark results are only reliable when PGO compilation was used. Otherwise, you should ignore small performance differences. Problem: What is the threshold? 5%? 10%? I already noticed a difference up to 70% only caused by code placement!
https://haypo.github.io/analysis-python-performance-issue.html

--

I ran benchmarks on loadmethod-methoddescr.patch on haypo@speed-python with LTO+PGO. If you only show performance difference of at least 5%, only 3 benchmarks are significant and are faster:
---
haypo@speed-python$ python3 -m perf compare_to ~/benchmarks/*762a93935afd*json loadmethod-methoddesc_ref_762a93935afd.json  -G --min-speed=5
Faster (3):
- regex_v8: 50.3 ms +- 0.4 ms -> 43.2 ms +- 0.3 ms: 1.17x faster (-14%)
- scimark_monte_carlo: 230 ms +- 6 ms -> 208 ms +- 4 ms: 1.11x faster (-10%)
- scimark_lu: 390 ms +- 17 ms -> 370 ms +- 13 ms: 1.05x faster (-5%)

Benchmark hidden because not significant (61): (...)
---

In my experience, regex_v8 and scimark_* benchmarks are not really reliable. I'm not surprised to not see a major speedup on performance, since the speedup on microbenchmarks is only 10% faster.

IHMO 10% faster on method calls is significant enough, since it's a core, very common, and widely used Python feature.

--

To be more explicitl:  loadmethod-methoddescr.patch LGTM except of minor comments on the review.
msg286809 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-02 16:59
loadmethod-methoddescr-2.patch LGTM.
msg286832 - (view) Author: Roundup Robot (python-dev) Date: 2017-02-02 22:43
New changeset 135a9a0c09f9 by INADA Naoki in branch 'default':
Issue #29263: LOAD_METHOD support for C methods
https://hg.python.org/cpython/rev/135a9a0c09f9
msg286833 - (view) Author: Roundup Robot (python-dev) Date: 2017-02-02 23:00
New changeset 48bde5d2cc5b9e98f55cd23ee57f3996d81caeb5 by INADA Naoki in branch 'master':
Issue #29263: LOAD_METHOD support for C methods
https://github.com/python/cpython/commit/48bde5d2cc5b9e98f55cd23ee57f3996d81caeb5
History
Date User Action Args
2017-02-02 23:13:20inada.naokisetstatus: open -> closed
dependencies: - Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects
resolution: fixed
stage: resolved
2017-02-02 23:00:21python-devsetmessages: + msg286833
2017-02-02 22:43:14python-devsetmessages: + msg286832
2017-02-02 16:59:31hayposetmessages: + msg286809
2017-02-02 16:48:15inada.naokisetfiles: + loadmethod-methoddescr-2.patch
2017-02-02 11:20:59hayposetmessages: + msg286764
2017-02-02 11:10:28inada.naokisetmessages: + msg286762
2017-02-02 10:12:12serhiy.storchakasetnosy: + serhiy.storchaka
2017-02-02 10:04:40hayposetmessages: + msg286757
2017-02-02 05:20:42inada.naokisetmessages: + msg286741
2017-02-02 05:17:42inada.naokisetmessages: + msg286740
2017-02-02 05:10:56inada.naokisetfiles: + loadmethod-methoddescr.patch

messages: + msg286739
2017-02-02 01:49:33hayposetmessages: + msg286733
2017-02-02 01:45:44hayposetmessages: + msg286732
2017-02-02 01:27:59yselivanovsetmessages: + msg286731
2017-02-02 01:17:26inada.naokisetmessages: + msg286728
2017-02-01 16:49:10hayposetmessages: + msg286651
2017-01-18 09:39:51python-devsetnosy: + python-dev
messages: + msg285711
2017-01-14 06:36:15inada.naokisetmessages: + msg285459
2017-01-14 05:36:13inada.naokisetfiles: + loadmethod-fastcall.patch
keywords: + patch
messages: + msg285457
2017-01-13 16:10:46hayposetdependencies: + Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects
2017-01-13 16:09:52haypocreate