Issue 29263: Implement LOAD_METHOD/CALL_METHOD for C functions

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73449

classification

Title:	Implement LOAD_METHOD/CALL_METHOD for C functions
Type:	performance	Stage:	resolved
Components:		Versions:	Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	methane, python-dev, serhiy.storchaka, vstinner, yselivanov
Priority:	normal	Keywords:	patch

Created on 2017-01-13 16:09 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
loadmethod-fastcall.patch	methane, 2017-01-14 05:36		review
loadmethod-methoddescr.patch	methane, 2017-02-02 05:10		review
loadmethod-methoddescr-2.patch	methane, 2017-02-02 16:48		review

Messages (18)
msg285414 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-01-13 16:09
The issue #29259 implements tp_fastcall on method_descriptor (PyMethodDescr_Type). According to http://bugs.python.org/issue26110#msg283093 it would allow to implement LOAD_METHOD and CALL_METHOD for C functions.
msg285457 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-14 05:36
This is proof of concept patch to support LOAD_METHOD & CALL_METHOD for METH_FASTCALL methods. (This patch should be applied after fastcall.patch) $ ./python -m perf timeit --compare-to `pwd`/python-fastcall -s "d = b''" -- "d.decode()" python-fastcall: ..................... 91.0 ns +- 1.0 ns python: ..................... 80.3 ns +- 0.3 ns Median +- std dev: [python-fastcall] 91.0 ns +- 1.0 ns -> [python] 80.3 ns +- 0.3 ns: 1.13x faster $ ./python -m perf timeit --compare-to `pwd`/python-fastcall -s "d = b''" -- "d.decode('ascii')" python-fastcall: ..................... 116 ns +- 1 ns python: ..................... 106 ns +- 1 ns Median +- std dev: [python-fastcall] 116 ns +- 1 ns -> [python] 106 ns +- 1 ns: 1.10x faster Since PyCFunction is lighter than PyMethodObject, performance benefit seems smaller than Python method (up to 20%). Sadly speaking, there are only few METH_FASTCALL in builtin type.
msg285459 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-14 06:36
Maybe, we should do * Make clinic use more METH_FASTCALL * Use clinic more in builtin methods before trying this optimization.
msg285711 - (view)	Author: Roundup Robot (python-dev)	Date: 2017-01-18 09:39
New changeset 0b219348ec9e by Victor Stinner in branch 'default': Optimize methoddescr_call(): avoid temporary PyCFunction https://hg.python.org/cpython/rev/0b219348ec9e
msg286651 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-01 16:49
INADA Naoki: "Maybe, we should do: * Make clinic use more METH_FASTCALL * Use clinic more in builtin methods; before trying this optimization." I created the issue #29286 "Use METH_FASTCALL in str methods" which is now fixed. I started to propose patches for other funnctions, types and modules. Would you mind to rebase your loadmethod-fastcall.patch patch Naoki? With my change 0b219348ec9e, I'm not sure that your change on methoddescr_call() is still needed. I tried to rebase your change, but I failed to get the "1.10x faster" result. Maybe I did something wrong, maybe my change 0b219348ec9e already optimized method calls for C functions?
msg286728 - (view)	Author: Inada Naoki (methane) *	Date: 2017-02-02 01:17
LOAD_METHOD support was based on tp_fastcall. Without LOAD_METHOD support, methoddescr_get is called and stack layout after LOAD_METHOD is: NULL, (bound)PyCFunction, arg1, arg2, ... argN With LOAD_METHOD support, stack layout is: PyMethodDescrObject, self, arg1, ... argN And tp_call (or tp_fastcall) of method descriptor is called when CALL_METHOD. Without tp_fastcall, it means pack&unpack tuple happens. It is more heavy than creating temporary PyCFunction. --- Other ideas to support LOAD_METHOD for C function are: * Add _PyMethodDescr_FastCallKeywords and call it in call_function. * Create unbound PyCFunction and cache it
msg286731 - (view)	Author: Yury Selivanov (yselivanov) *	Date: 2017-02-02 01:27
> * Add _PyMethodDescr_FastCallKeywords and call it in call_function. This option seems to be the easiest to implement.
msg286732 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-02 01:45
INADA Naoki: "LOAD_METHOD support was based on tp_fastcall. (...) Other ideas to support LOAD_METHOD for C function are: * Add _PyMethodDescr_FastCallKeywords and call it in call_function." It seems like a very small subset of tp_fastcall. If it provides a significant speedup, I really like the idea :-) I would be happy to see a "subset of tp_fastcall" merged. At least my whole tp_fastcall experiment wouldn't be worthless :-) What would be the main change which would provide a speedup? Avoid the creation of a temporary tuple to pass positional arguments?
msg286733 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-02 01:49
> * Create unbound PyCFunction and cache it Does it mean create a new private type, similar to PyCFunction_Type (PyCFunctionObject) but without the self attribute? IMHO creating a new type is complex. I'm not sure that it's worth it, especially if the _PyMethodDescr_FastCallKeywords() idea is simpler to implement and "fast enough".
msg286739 - (view)	Author: Inada Naoki (methane) *	Date: 2017-02-02 05:10
I'm running pyperformance.
msg286740 - (view)	Author: Inada Naoki (methane) *	Date: 2017-02-02 05:17
Here is the result. I'll investigate mako result. + ../python.default -m perf compare_to default.json patched.json -G --min-speed=1 Slower (20): - mako: 40.9 ms +- 0.4 ms -> 44.7 ms +- 0.6 ms: 1.09x slower (+9%) - chaos: 298 ms +- 3 ms -> 308 ms +- 2 ms: 1.03x slower (+3%) - telco: 19.7 ms +- 0.6 ms -> 20.2 ms +- 0.5 ms: 1.03x slower (+3%) - scimark_sor: 478 ms +- 7 ms -> 491 ms +- 11 ms: 1.03x slower (+3%) - xml_etree_process: 238 ms +- 3 ms -> 245 ms +- 4 ms: 1.03x slower (+3%) - logging_simple: 31.1 us +- 0.3 us -> 31.9 us +- 0.3 us: 1.03x slower (+3%) - logging_format: 36.4 us +- 0.3 us -> 37.2 us +- 0.4 us: 1.02x slower (+2%) - nqueens: 264 ms +- 3 ms -> 270 ms +- 3 ms: 1.02x slower (+2%) - fannkuch: 1.07 sec +- 0.02 sec -> 1.09 sec +- 0.03 sec: 1.02x slower (+2%) - django_template: 401 ms +- 5 ms -> 409 ms +- 3 ms: 1.02x slower (+2%) - call_method_slots: 14.6 ms +- 0.2 ms -> 14.9 ms +- 0.1 ms: 1.02x slower (+2%) - meteor_contest: 199 ms +- 2 ms -> 203 ms +- 2 ms: 1.02x slower (+2%) - logging_silent: 735 ns +- 9 ns -> 746 ns +- 11 ns: 1.02x slower (+2%) - sqlalchemy_declarative: 314 ms +- 4 ms -> 318 ms +- 4 ms: 1.01x slower (+1%) - genshi_text: 89.1 ms +- 1.3 ms -> 90.2 ms +- 1.3 ms: 1.01x slower (+1%) - unpack_sequence: 131 ns +- 2 ns -> 133 ns +- 2 ns: 1.01x slower (+1%) - call_method: 15.1 ms +- 0.2 ms -> 15.3 ms +- 0.1 ms: 1.01x slower (+1%) - scimark_monte_carlo: 263 ms +- 6 ms -> 266 ms +- 6 ms: 1.01x slower (+1%) - unpickle_pure_python: 834 us +- 16 us -> 843 us +- 15 us: 1.01x slower (+1%) - pathlib: 51.2 ms +- 0.9 ms -> 51.7 ms +- 0.7 ms: 1.01x slower (+1%) Faster (11): - scimark_lu: 457 ms +- 49 ms -> 407 ms +- 7 ms: 1.12x faster (-11%) - chameleon: 30.2 ms +- 0.6 ms -> 28.8 ms +- 0.4 ms: 1.05x faster (-5%) - xml_etree_parse: 314 ms +- 12 ms -> 303 ms +- 8 ms: 1.04x faster (-4%) - sqlite_synth: 10.3 us +- 0.2 us -> 9.94 us +- 0.25 us: 1.03x faster (-3%) - pickle_pure_python: 1.28 ms +- 0.02 ms -> 1.25 ms +- 0.02 ms: 1.03x faster (-3%) - xml_etree_iterparse: 223 ms +- 5 ms -> 218 ms +- 7 ms: 1.02x faster (-2%) - scimark_fft: 730 ms +- 19 ms -> 716 ms +- 18 ms: 1.02x faster (-2%) - nbody: 243 ms +- 5 ms -> 239 ms +- 6 ms: 1.02x faster (-2%) - tornado_http: 377 ms +- 4 ms -> 373 ms +- 4 ms: 1.01x faster (-1%) - sympy_expand: 1.02 sec +- 0.01 sec -> 1.01 sec +- 0.01 sec: 1.01x faster (-1%) - unpickle: 32.1 us +- 0.3 us -> 31.8 us +- 0.2 us: 1.01x faster (-1%) Benchmark hidden because not significant (33):
msg286741 - (view)	Author: Inada Naoki (methane) *	Date: 2017-02-02 05:20
microbench: $ ./python.patched -m perf timeit --compare-to `pwd`/python.default -s "d = b''" -- "d.decode('ascii')" python.default: ..................... 109 ns +- 1 ns python.patched: ..................... 102 ns +- 1 ns Median +- std dev: [python.default] 109 ns +- 1 ns -> [python.patched] 102 ns +- 1 ns: 1.08x faster (-7%) $ ./python.patched -m perf timeit --compare-to `pwd`/python.default -s "l = [42]" -- "l.count(42)" python.default: ..................... 66.0 ns +- 0.9 ns python.patched: ..................... 60.0 ns +- 0.3 ns Median +- std dev: [python.default] 66.0 ns +- 0.9 ns -> [python.patched] 60.0 ns +- 0.3 ns: 1.10x faster (-9%)
msg286757 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-02 10:04
I tried to benchmark b''.decode(encoding='ascii'): CALL_METHOD is not used for this call, but CALL_FUNCTION_KW :-/ So the call is not affected by your patch. I also ran a quick benchmark on loadmethod-methoddescr.patch: b''.decode(): 71.1 ns +- 0.5 ns -> 65.4 ns +- 0.2 ns: 1.09x faster (-8%) b''.decode('utf8'): 92.8 ns +- 0.4 ns -> 85.5 ns +- 0.3 ns: 1.09x faster (-8%) I confirm the speedup.
msg286762 - (view)	Author: Inada Naoki (methane) *	Date: 2017-02-02 11:10
I confirmed bm_mako performance degrade is caused by L1 cache miss. (I didn't use --with-optimization) https://gist.github.com/methane/b33edbf1f123ae026e704b0e005c3606
msg286764 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-02 11:20
Naoki> I confirmed bm_mako performance degrade is caused by L1 cache miss. I know this performance instability very well, the issue is called "code placement": https://haypo.github.io/journey-to-stable-benchmark-deadcode.html I tried to fight it with GCC __attribute__((hot)) in the issue #28618, but it doesn't fix the issue (at least, not completely): http://bugs.python.org/issue28618#msg281459 In my experience, the best fix is PGO. Slowly, I consider that it's worthless to try to "fight" against code placement, and that benchmark results are only reliable when PGO compilation was used. Otherwise, you should ignore small performance differences. Problem: What is the threshold? 5%? 10%? I already noticed a difference up to 70% only caused by code placement! https://haypo.github.io/analysis-python-performance-issue.html -- I ran benchmarks on loadmethod-methoddescr.patch on haypo@speed-python with LTO+PGO. If you only show performance difference of at least 5%, only 3 benchmarks are significant and are faster: --- haypo@speed-python$ python3 -m perf compare_to ~/benchmarks/762a93935afdjson loadmethod-methoddesc_ref_762a93935afd.json -G --min-speed=5 Faster (3): - regex_v8: 50.3 ms +- 0.4 ms -> 43.2 ms +- 0.3 ms: 1.17x faster (-14%) - scimark_monte_carlo: 230 ms +- 6 ms -> 208 ms +- 4 ms: 1.11x faster (-10%) - scimark_lu: 390 ms +- 17 ms -> 370 ms +- 13 ms: 1.05x faster (-5%) Benchmark hidden because not significant (61): (...) --- In my experience, regex_v8 and scimark_* benchmarks are not really reliable. I'm not surprised to not see a major speedup on performance, since the speedup on microbenchmarks is only 10% faster. IHMO 10% faster on method calls is significant enough, since it's a core, very common, and widely used Python feature. -- To be more explicitl: loadmethod-methoddescr.patch LGTM except of minor comments on the review.
msg286809 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-02 16:59
loadmethod-methoddescr-2.patch LGTM.
msg286832 - (view)	Author: Roundup Robot (python-dev)	Date: 2017-02-02 22:43
New changeset 135a9a0c09f9 by INADA Naoki in branch 'default': Issue #29263: LOAD_METHOD support for C methods https://hg.python.org/cpython/rev/135a9a0c09f9
msg286833 - (view)	Author: Roundup Robot (python-dev)	Date: 2017-02-02 23:00
New changeset 48bde5d2cc5b9e98f55cd23ee57f3996d81caeb5 by INADA Naoki in branch 'master': Issue #29263: LOAD_METHOD support for C methods https://github.com/python/cpython/commit/48bde5d2cc5b9e98f55cd23ee57f3996d81caeb5

History
Date	User	Action	Args
2022-04-11 14:58:42	admin	set	github: 73449
2017-02-02 23:13:20	methane	set	status: open -> closed dependencies: - Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects resolution: fixed stage: resolved
2017-02-02 23:00:21	python-dev	set	messages: + msg286833
2017-02-02 22:43:14	python-dev	set	messages: + msg286832
2017-02-02 16:59:31	vstinner	set	messages: + msg286809
2017-02-02 16:48:15	methane	set	files: + loadmethod-methoddescr-2.patch
2017-02-02 11:20:59	vstinner	set	messages: + msg286764
2017-02-02 11:10:28	methane	set	messages: + msg286762
2017-02-02 10:12:12	serhiy.storchaka	set	nosy: + serhiy.storchaka
2017-02-02 10:04:40	vstinner	set	messages: + msg286757
2017-02-02 05:20:42	methane	set	messages: + msg286741
2017-02-02 05:17:42	methane	set	messages: + msg286740
2017-02-02 05:10:56	methane	set	files: + loadmethod-methoddescr.patch messages: + msg286739
2017-02-02 01:49:33	vstinner	set	messages: + msg286733
2017-02-02 01:45:44	vstinner	set	messages: + msg286732
2017-02-02 01:27:59	yselivanov	set	messages: + msg286731
2017-02-02 01:17:26	methane	set	messages: + msg286728
2017-02-01 16:49:10	vstinner	set	messages: + msg286651
2017-01-18 09:39:51	python-dev	set	nosy: + python-dev messages: + msg285711
2017-01-14 06:36:15	methane	set	messages: + msg285459
2017-01-14 05:36:13	methane	set	files: + loadmethod-fastcall.patch keywords: + patch messages: + msg285457
2017-01-13 16:10:46	vstinner	set	dependencies: + Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects
2017-01-13 16:09:52	vstinner	create