msg285414 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-01-13 16:09 |
Issue #29259 implements tp_fastcall on method_descriptor (PyMethodDescr_Type). According to http://bugs.python.org/issue26110#msg283093 it would make it possible to implement LOAD_METHOD and CALL_METHOD for C functions.
|
msg285457 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-01-14 05:36 |
This is a proof-of-concept patch to support LOAD_METHOD & CALL_METHOD for METH_FASTCALL methods.
(This patch should be applied after fastcall.patch.)
$ ./python -m perf timeit --compare-to `pwd`/python-fastcall -s "d = b''" -- "d.decode()"
python-fastcall: ..................... 91.0 ns +- 1.0 ns
python: ..................... 80.3 ns +- 0.3 ns
Median +- std dev: [python-fastcall] 91.0 ns +- 1.0 ns -> [python] 80.3 ns +- 0.3 ns: 1.13x faster
$ ./python -m perf timeit --compare-to `pwd`/python-fastcall -s "d = b''" -- "d.decode('ascii')"
python-fastcall: ..................... 116 ns +- 1 ns
python: ..................... 106 ns +- 1 ns
Median +- std dev: [python-fastcall] 116 ns +- 1 ns -> [python] 106 ns +- 1 ns: 1.10x faster
Since PyCFunction is lighter than PyMethodObject, the performance benefit is smaller than
for Python methods (up to 20%).
Sadly, only a few builtin type methods use METH_FASTCALL so far.
|
msg285459 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-01-14 06:36 |
Maybe we should:
* make Argument Clinic use METH_FASTCALL more
* use Argument Clinic in more builtin methods
before trying this optimization.
|
msg285711 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2017-01-18 09:39 |
New changeset 0b219348ec9e by Victor Stinner in branch 'default':
Optimize methoddescr_call(): avoid temporary PyCFunction
https://hg.python.org/cpython/rev/0b219348ec9e
|
msg286651 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-01 16:49 |
INADA Naoki: "Maybe, we should do: * Make clinic use more METH_FASTCALL * Use clinic more in builtin methods; before trying this optimization."
I created issue #29286 "Use METH_FASTCALL in str methods", which is now fixed. I have started proposing patches for other functions, types and modules.
Would you mind rebasing your loadmethod-fastcall.patch, Naoki?
With my change 0b219348ec9e, I'm not sure that your change on methoddescr_call() is still needed.
I tried to rebase your change, but I failed to get the "1.10x faster" result. Maybe I did something wrong, maybe my change 0b219348ec9e already optimized method calls for C functions?
|
msg286728 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-02-02 01:17 |
LOAD_METHOD support was based on tp_fastcall.
Without LOAD_METHOD support, methoddescr_get is called and the
stack layout after LOAD_METHOD is:
NULL, (bound) PyCFunction, arg1, arg2, ... argN
With LOAD_METHOD support, the stack layout is:
PyMethodDescrObject, self, arg1, ... argN
and the tp_call (or tp_fastcall) of the method descriptor is invoked at CALL_METHOD.
Without tp_fastcall, that means an argument tuple is packed and unpacked,
which is heavier than creating a temporary PyCFunction.
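The difference between the two layouts can be mimicked in pure Python (a hypothetical sketch, not CPython's actual C implementation): without the optimization, the descriptor's __get__ builds the temporary bound object; with it, the descriptor is called directly with self as the first argument:

```python
# Hypothetical pure-Python sketch of the two call paths described above.
descr = bytes.__dict__['decode']   # a method_descriptor (PyMethodDescr_Type)

# Old path: methoddescr_get creates a temporary bound PyCFunction...
bound = descr.__get__(b'spam', bytes)
result_old = bound('ascii')        # ...which CALL_METHOD then calls

# Optimized path: the descriptor stays on the stack and is called with
# self passed explicitly -- no temporary bound object is created.
result_new = descr(b'spam', 'ascii')

assert result_old == result_new == 'spam'
```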
---
Other ideas to support LOAD_METHOD for C function are:
* Add _PyMethodDescr_FastCallKeywords and call it in call_function.
* Create unbound PyCFunction and cache it
|
msg286731 - (view) |
Author: Yury Selivanov (yselivanov) * |
Date: 2017-02-02 01:27 |
> * Add _PyMethodDescr_FastCallKeywords and call it in call_function.
This option seems to be the easiest to implement.
|
msg286732 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-02 01:45 |
INADA Naoki: "LOAD_METHOD support was based on tp_fastcall. (...) Other ideas to support LOAD_METHOD for C function are: * Add _PyMethodDescr_FastCallKeywords and call it in call_function."
It seems like a very small subset of tp_fastcall. If it provides a significant speedup, I really like the idea :-) I would be happy to see a "subset of tp_fastcall" merged. At least my whole tp_fastcall experiment wouldn't be worthless :-)
What would be the main change which would provide a speedup? Avoid the creation of a temporary tuple to pass positional arguments?
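The tuple pack/unpack in question can be illustrated at the Python level (a hypothetical micro-example, not the C code): tp_call receives positional arguments packed into a tuple, whereas the FASTCALL convention passes them as a plain C array:

```python
# Hypothetical illustration of the overhead the FASTCALL convention avoids:
# packing arguments into a temporary tuple and unpacking them again.
def target(a, b):
    return a + b

def call_direct():
    return target(1, 2)        # arguments passed directly (FASTCALL-like)

def call_via_tuple():
    args = (1, 2)              # pack into a temporary tuple (tp_call-like)...
    return target(*args)       # ...then unpack it on the callee side

assert call_direct() == call_via_tuple() == 3
```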
|
msg286733 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-02 01:49 |
> * Create unbound PyCFunction and cache it
Does it mean creating a new private type, similar to PyCFunction_Type (PyCFunctionObject) but without the self attribute?
IMHO creating a new type is complex. I'm not sure it's worth it, especially if the _PyMethodDescr_FastCallKeywords() idea is simpler to implement and "fast enough".
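As a rough pure-Python analogy for the caching idea (a hypothetical sketch, not the proposed C type): looking the method up once on the type and passing self explicitly avoids creating a bound object on every call:

```python
# Hypothetical analogy for caching an unbound callable: one lookup on the
# type, then self is passed explicitly -- no per-call bound object.
unbound_decode = bytes.decode          # cached once

chunks = [b'spam', b'eggs']
decoded = [unbound_decode(c, 'ascii') for c in chunks]
assert decoded == ['spam', 'eggs']
```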
|
msg286739 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-02-02 05:10 |
I'm running pyperformance.
|
msg286740 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-02-02 05:17 |
Here is the result. I'll investigate mako result.
+ ../python.default -m perf compare_to default.json patched.json -G --min-speed=1
Slower (20):
- mako: 40.9 ms +- 0.4 ms -> 44.7 ms +- 0.6 ms: 1.09x slower (+9%)
- chaos: 298 ms +- 3 ms -> 308 ms +- 2 ms: 1.03x slower (+3%)
- telco: 19.7 ms +- 0.6 ms -> 20.2 ms +- 0.5 ms: 1.03x slower (+3%)
- scimark_sor: 478 ms +- 7 ms -> 491 ms +- 11 ms: 1.03x slower (+3%)
- xml_etree_process: 238 ms +- 3 ms -> 245 ms +- 4 ms: 1.03x slower (+3%)
- logging_simple: 31.1 us +- 0.3 us -> 31.9 us +- 0.3 us: 1.03x slower (+3%)
- logging_format: 36.4 us +- 0.3 us -> 37.2 us +- 0.4 us: 1.02x slower (+2%)
- nqueens: 264 ms +- 3 ms -> 270 ms +- 3 ms: 1.02x slower (+2%)
- fannkuch: 1.07 sec +- 0.02 sec -> 1.09 sec +- 0.03 sec: 1.02x slower (+2%)
- django_template: 401 ms +- 5 ms -> 409 ms +- 3 ms: 1.02x slower (+2%)
- call_method_slots: 14.6 ms +- 0.2 ms -> 14.9 ms +- 0.1 ms: 1.02x slower (+2%)
- meteor_contest: 199 ms +- 2 ms -> 203 ms +- 2 ms: 1.02x slower (+2%)
- logging_silent: 735 ns +- 9 ns -> 746 ns +- 11 ns: 1.02x slower (+2%)
- sqlalchemy_declarative: 314 ms +- 4 ms -> 318 ms +- 4 ms: 1.01x slower (+1%)
- genshi_text: 89.1 ms +- 1.3 ms -> 90.2 ms +- 1.3 ms: 1.01x slower (+1%)
- unpack_sequence: 131 ns +- 2 ns -> 133 ns +- 2 ns: 1.01x slower (+1%)
- call_method: 15.1 ms +- 0.2 ms -> 15.3 ms +- 0.1 ms: 1.01x slower (+1%)
- scimark_monte_carlo: 263 ms +- 6 ms -> 266 ms +- 6 ms: 1.01x slower (+1%)
- unpickle_pure_python: 834 us +- 16 us -> 843 us +- 15 us: 1.01x slower (+1%)
- pathlib: 51.2 ms +- 0.9 ms -> 51.7 ms +- 0.7 ms: 1.01x slower (+1%)
Faster (11):
- scimark_lu: 457 ms +- 49 ms -> 407 ms +- 7 ms: 1.12x faster (-11%)
- chameleon: 30.2 ms +- 0.6 ms -> 28.8 ms +- 0.4 ms: 1.05x faster (-5%)
- xml_etree_parse: 314 ms +- 12 ms -> 303 ms +- 8 ms: 1.04x faster (-4%)
- sqlite_synth: 10.3 us +- 0.2 us -> 9.94 us +- 0.25 us: 1.03x faster (-3%)
- pickle_pure_python: 1.28 ms +- 0.02 ms -> 1.25 ms +- 0.02 ms: 1.03x faster (-3%)
- xml_etree_iterparse: 223 ms +- 5 ms -> 218 ms +- 7 ms: 1.02x faster (-2%)
- scimark_fft: 730 ms +- 19 ms -> 716 ms +- 18 ms: 1.02x faster (-2%)
- nbody: 243 ms +- 5 ms -> 239 ms +- 6 ms: 1.02x faster (-2%)
- tornado_http: 377 ms +- 4 ms -> 373 ms +- 4 ms: 1.01x faster (-1%)
- sympy_expand: 1.02 sec +- 0.01 sec -> 1.01 sec +- 0.01 sec: 1.01x faster (-1%)
- unpickle: 32.1 us +- 0.3 us -> 31.8 us +- 0.2 us: 1.01x faster (-1%)
Benchmark hidden because not significant (33):
|
msg286741 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-02-02 05:20 |
microbench:
$ ./python.patched -m perf timeit --compare-to `pwd`/python.default -s "d = b''" -- "d.decode('ascii')"
python.default: ..................... 109 ns +- 1 ns
python.patched: ..................... 102 ns +- 1 ns
Median +- std dev: [python.default] 109 ns +- 1 ns -> [python.patched] 102 ns +- 1 ns: 1.08x faster (-7%)
$ ./python.patched -m perf timeit --compare-to `pwd`/python.default -s "l = [42]" -- "l.count(42)"
python.default: ..................... 66.0 ns +- 0.9 ns
python.patched: ..................... 60.0 ns +- 0.3 ns
Median +- std dev: [python.default] 66.0 ns +- 0.9 ns -> [python.patched] 60.0 ns +- 0.3 ns: 1.10x faster (-9%)
|
msg286757 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-02 10:04 |
I tried to benchmark b''.decode(encoding='ascii'): CALL_METHOD is not used for this call, but CALL_FUNCTION_KW :-/ So this call is not affected by your patch.
I also ran a quick benchmark on loadmethod-methoddescr.patch:
b''.decode(): 71.1 ns +- 0.5 ns -> 65.4 ns +- 0.2 ns: 1.09x faster (-8%)
b''.decode('utf8'): 92.8 ns +- 0.4 ns -> 85.5 ns +- 0.3 ns: 1.09x faster (-8%)
I confirm the speedup.
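The observation above can be checked with the dis module (exact opcode names vary across CPython versions, but the positional and keyword call forms compile to different bytecode sequences):

```python
import dis

def positional():
    return b''.decode('ascii')            # eligible for the method-call fast path

def keyword():
    return b''.decode(encoding='ascii')   # keyword call compiles differently

pos_ops = [i.opname for i in dis.get_instructions(positional)]
kw_ops = [i.opname for i in dis.get_instructions(keyword)]

assert pos_ops != kw_ops   # the two call forms use different opcode sequences
assert positional() == keyword() == ''
```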
|
msg286762 - (view) |
Author: Inada Naoki (methane) * |
Date: 2017-02-02 11:10 |
I confirmed that the bm_mako performance degradation is caused by L1 cache misses.
(I didn't build with --enable-optimizations.)
https://gist.github.com/methane/b33edbf1f123ae026e704b0e005c3606
|
msg286764 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-02 11:20 |
Naoki> I confirmed bm_mako performance degrade is caused by L1 cache miss.
I know this performance instability very well; the issue is called "code placement":
https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
I tried to fight it with GCC __attribute__((hot)) in issue #28618, but it didn't fix the issue (at least, not completely):
http://bugs.python.org/issue28618#msg281459
In my experience, the best fix is PGO. I have slowly come to consider that it's pointless to "fight" code placement, and that benchmark results are only reliable when PGO compilation is used. Otherwise, you should ignore small performance differences. Problem: what is the threshold? 5%? 10%? I once saw a difference of up to 70% caused only by code placement!
https://haypo.github.io/analysis-python-performance-issue.html
--
I ran benchmarks for loadmethod-methoddescr.patch on haypo@speed-python with LTO+PGO. Showing only performance differences of at least 5%, just 3 benchmarks are significant, and all of them are faster:
---
haypo@speed-python$ python3 -m perf compare_to ~/benchmarks/*762a93935afd*json loadmethod-methoddesc_ref_762a93935afd.json -G --min-speed=5
Faster (3):
- regex_v8: 50.3 ms +- 0.4 ms -> 43.2 ms +- 0.3 ms: 1.17x faster (-14%)
- scimark_monte_carlo: 230 ms +- 6 ms -> 208 ms +- 4 ms: 1.11x faster (-10%)
- scimark_lu: 390 ms +- 17 ms -> 370 ms +- 13 ms: 1.05x faster (-5%)
Benchmark hidden because not significant (61): (...)
---
In my experience, the regex_v8 and scimark_* benchmarks are not really reliable. I'm not surprised not to see a major speedup in pyperformance, since the speedup on microbenchmarks is only about 10%.
IMHO 10% faster method calls are significant enough, since method calls are a core, very common, and widely used Python feature.
--
To be more explicit: loadmethod-methoddescr.patch LGTM apart from minor comments on the review.
|
msg286809 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-02-02 16:59 |
loadmethod-methoddescr-2.patch LGTM.
|
msg286832 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2017-02-02 22:43 |
New changeset 135a9a0c09f9 by INADA Naoki in branch 'default':
Issue #29263: LOAD_METHOD support for C methods
https://hg.python.org/cpython/rev/135a9a0c09f9
|
msg286833 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2017-02-02 23:00 |
New changeset 48bde5d2cc5b9e98f55cd23ee57f3996d81caeb5 by INADA Naoki in branch 'master':
Issue #29263: LOAD_METHOD support for C methods
https://github.com/python/cpython/commit/48bde5d2cc5b9e98f55cd23ee57f3996d81caeb5
|
|
Date | User | Action | Args |
2022-04-11 14:58:42 | admin | set | github: 73449 |
2017-02-02 23:13:20 | methane | set | status: open -> closed dependencies:
- Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects resolution: fixed stage: resolved |
2017-02-02 23:00:21 | python-dev | set | messages:
+ msg286833 |
2017-02-02 22:43:14 | python-dev | set | messages:
+ msg286832 |
2017-02-02 16:59:31 | vstinner | set | messages:
+ msg286809 |
2017-02-02 16:48:15 | methane | set | files:
+ loadmethod-methoddescr-2.patch |
2017-02-02 11:20:59 | vstinner | set | messages:
+ msg286764 |
2017-02-02 11:10:28 | methane | set | messages:
+ msg286762 |
2017-02-02 10:12:12 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka
|
2017-02-02 10:04:40 | vstinner | set | messages:
+ msg286757 |
2017-02-02 05:20:42 | methane | set | messages:
+ msg286741 |
2017-02-02 05:17:42 | methane | set | messages:
+ msg286740 |
2017-02-02 05:10:56 | methane | set | files:
+ loadmethod-methoddescr.patch
messages:
+ msg286739 |
2017-02-02 01:49:33 | vstinner | set | messages:
+ msg286733 |
2017-02-02 01:45:44 | vstinner | set | messages:
+ msg286732 |
2017-02-02 01:27:59 | yselivanov | set | messages:
+ msg286731 |
2017-02-02 01:17:26 | methane | set | messages:
+ msg286728 |
2017-02-01 16:49:10 | vstinner | set | messages:
+ msg286651 |
2017-01-18 09:39:51 | python-dev | set | nosy:
+ python-dev messages:
+ msg285711
|
2017-01-14 06:36:15 | methane | set | messages:
+ msg285459 |
2017-01-14 05:36:13 | methane | set | files:
+ loadmethod-fastcall.patch keywords:
+ patch messages:
+ msg285457
|
2017-01-13 16:10:46 | vstinner | set | dependencies:
+ Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects |
2017-01-13 16:09:52 | vstinner | create | |