Issue29358
Created on 2017-01-24 08:25 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Pull Requests
URL | Status | Linked | Edit
---|---|---|---
PR 72 | | vstinner, 2017-01-24 08:25 |
Messages (10)
msg286151 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-24 08:25 | |
After #29259 "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects", the last two slots which still use the (args: tuple, kwargs: dict) calling convention are tp_new and tp_init, the two major slots used to instantiate objects. I implemented tp_fastnew/tp_fastinit on top of the issue #29259 pull request (tp_fastcall). The implementation is a WIP; it is just complete enough to start benchmarking the code.

Example of benchmarks on the two types currently optimized in my WIP fast_init branch, list and _asyncio.Future:

---
haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List()' --duplicate=100 --compare-to=../default-ref/python
Median +- std dev: [ref] 81.9 ns +- 0.2 ns -> [fast_init] 69.3 ns +- 0.4 ns: 1.18x faster (-15%)

haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List((1,2,3))' --duplicate=100 --compare-to=../default-ref/python
Median +- std dev: [ref] 137 ns +- 6 ns -> [fast_init] 107 ns +- 0 ns: 1.28x faster (-22%)

haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
Median +- std dev: [ref] 411 ns +- 20 ns -> [fast_init] 355 ns +- 18 ns: 1.16x faster (-14%)
---

The speedup of tp_fastnew + tp_fastinit is between 1.16x and 1.28x. The question is now whether it is worth it.

Warning: the code is not fully optimized and is likely to have subtle bugs. The pull request is not ready for a review, but you may take a look if you would like an idea of the required changes. The trickiest changes are in typeobject.c, to support backward compatibility (call tp_new if tp_fastnew is not set) and the stable API (support a Python 3.6 PyTypeObject without the two new slots).

Note: the tp_fastnew and tp_fastinit slots are the expected end of my large FASTCALL optimization project.
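For readers unfamiliar with the two conventions being compared, here is a minimal Python sketch of the difference. This is an illustrative model only: the real tp_init/tp_fastinit slots are C function pointers, and the names `init_classic`/`init_fastcall` are invented for this example. The classic convention receives a positional tuple plus a keyword dict; FASTCALL receives a flat array of values plus a tuple of keyword names, avoiding the temporary tuple and dict.

```python
# Illustrative Python model of the two C calling conventions discussed
# in this issue; the real slots are C function pointers.

def init_classic(self, args, kwargs):
    # (args: tuple, kwargs: dict) -- the tp_new/tp_init convention
    return len(args) + len(kwargs)

def init_fastcall(self, stack, nargs, kwnames):
    # FASTCALL: 'stack' holds the positional values followed by the
    # keyword values; 'kwnames' is a tuple of the keyword names.
    positional = stack[:nargs]
    keywords = dict(zip(kwnames, stack[nargs:]))
    return len(positional) + len(keywords)

obj = object()
# Same logical call, two representations: f(1, 2, x=3)
assert init_classic(obj, (1, 2), {"x": 3}) == 3
assert init_fastcall(obj, [1, 2, 3], 2, ("x",)) == 3
```

The point of the optimization is that a FASTCALL caller never has to allocate the `(1, 2)` tuple or the `{"x": 3}` dict at all.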
msg286161 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-24 09:24 | |
Serhiy Storchaka: "Calling _PyStack_AsDict() with non-string or non-unique keywords means that the var-keyword argument was first unpacked to arrays of keyword names and values and then converted back to a dict. Seems something is done non-efficiently." (msg286159 of issue #29360)

Python code from test_dict:

dict(**invalid)

Bytecode:

LOAD_GLOBAL 1 (dict)
BUILD_TUPLE 0
LOAD_FAST 1 (invalid)
CALL_FUNCTION_EX 1

Call stack:

* _PyEval_EvalFrameDefault()
* do_call_core()
* PyObject_Call()
* _Py_RawFastCallDict() -- conversion from dict to stack+kwnames
* type_call()
* call_init()
* _PyStack_AsDict() -- conversion from stack+kwnames to dict

Oh right, there are two conversions using a temporary FASTCALL format for (keyword) arguments. This code is not upstream yet; it comes from the pull request of this issue.

Maybe we need two flavors of type_call(): type_call(args: tuple, kwargs: dict) if tp_fastinit isn't set, and type_fastcall(stack, nargs, kwnames) (FASTCALL) if tp_fastinit is set. But it seems that the logic should be implemented in PyObject_Call() and _PyObject_FastCallDict(); it cannot be implemented in type_call(), as it's too late at that point. For best performance (avoiding any kind of conversion and any temporary object), we should implement something like that.
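The wasteful round trip Serhiy describes can be modeled in Python. The helper names below are hypothetical analogues of the C functions `_Py_RawFastCallDict()` and `_PyStack_AsDict()`; this is a sketch of the data flow, not the actual CPython code.

```python
# Model of the double conversion: a kwargs dict is unpacked into the
# FASTCALL (stack, kwnames) format, then rebuilt into a dict for tp_init.

def dict_to_fastcall(kwargs):
    # analogue of _Py_RawFastCallDict(): dict -> stack + kwnames
    kwnames = tuple(kwargs)
    stack = [kwargs[name] for name in kwnames]
    return stack, kwnames

def fastcall_to_dict(stack, kwnames):
    # analogue of _PyStack_AsDict(): stack + kwnames -> dict
    return dict(zip(kwnames, stack))

original = {"loop": "LOOP", "result": 42}
stack, kwnames = dict_to_fastcall(original)
rebuilt = fastcall_to_dict(stack, kwnames)
assert rebuilt == original  # two conversions, same dict: pure overhead
```

Both conversions allocate intermediate objects, which is exactly what the FASTCALL project tries to avoid.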
msg286203 - (view) | Author: Yury Selivanov (yselivanov) * | Date: 2017-01-24 18:50 | |
Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement. |
msg286207 - (view) | Author: Josh Rosenberg (josh.r) * | Date: 2017-01-24 19:30 | |
So just to be clear, the issue with non-string or non-unique keywords is purely about performance, not correctness or crash-safety, right?

Non-string keywords, while technically accepted by CPython, are at best barely legal by the language standard. The language standard for calls specifies that even when using the **expression syntax, the "expression must evaluate to a mapping, the contents of which are treated as additional keyword arguments", and keyword_arguments only allows identifier=expression, where identifier is a legal variable name (and therefore a str).

If misuse of ** unpacking is slower but still works, and correct usage is significantly faster, I wouldn't consider an inability to fix the performance of the misuse case a blocking issue; nice to have, but making common, good code faster is worth making rare, bad code slower.
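The string-keyword requirement Josh describes is easy to observe at the Python level. The sketch below shows the behavior for Python-defined functions in CPython 3; C-implemented callables historically handled ** unpacking differently, which is the corner case this issue exercises.

```python
# ** unpacking into a Python-level function requires string keys;
# CPython rejects non-string keys with a TypeError.

def f(**kwargs):
    return kwargs

assert f(**{"a": 1}) == {"a": 1}   # string keys: fine

try:
    f(**{1: "not a valid keyword"})  # non-string key: misuse of **
except TypeError:
    rejected = True
else:
    rejected = False
assert rejected
```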
msg286216 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-24 21:03 | |
First benchmark results on the fast_init branch compared to the default branch. Warning: the patch is still a WIP, and there may be performance issues (in which case this benchmark should help to identify them).

---
haypo@speed-python$ python3 -m perf compare_to 2017-01-23_14-02-default-cebc9c7ad195.json fast_init-0_ref_cebc9c7ad195.json -G --min-speed=2

Slower (6):
- telco: 13.0 ms +- 0.2 ms -> 13.8 ms +- 0.3 ms: 1.06x slower (+6%)
- scimark_sor: 391 ms +- 7 ms -> 413 ms +- 11 ms: 1.06x slower (+6%)
- fannkuch: 938 ms +- 7 ms -> 973 ms +- 13 ms: 1.04x slower (+4%)
- pickle_dict: 53.7 us +- 3.4 us -> 55.5 us +- 5.0 us: 1.03x slower (+3%)
- sqlite_synth: 8.00 us +- 0.12 us -> 8.25 us +- 0.14 us: 1.03x slower (+3%)
- scimark_monte_carlo: 202 ms +- 5 ms -> 207 ms +- 5 ms: 1.03x slower (+3%)

Faster (19):
- scimark_lu: 362 ms +- 13 ms -> 335 ms +- 17 ms: 1.08x faster (-8%)
- chameleon: 23.5 ms +- 0.3 ms -> 21.7 ms +- 0.3 ms: 1.08x faster (-7%)
- xml_etree_process: 179 ms +- 3 ms -> 168 ms +- 3 ms: 1.07x faster (-6%)
- pickle_pure_python: 1.03 ms +- 0.02 ms -> 982 us +- 11 us: 1.05x faster (-5%)
- regex_dna: 283 ms +- 1 ms -> 271 ms +- 1 ms: 1.05x faster (-4%)
- chaos: 239 ms +- 2 ms -> 230 ms +- 3 ms: 1.04x faster (-4%)
- django_template: 356 ms +- 5 ms -> 343 ms +- 5 ms: 1.04x faster (-4%)
- call_simple: 10.7 ms +- 0.4 ms -> 10.3 ms +- 0.3 ms: 1.04x faster (-3%)
- hexiom: 18.0 ms +- 0.1 ms -> 17.4 ms +- 0.1 ms: 1.04x faster (-3%)
- pathlib: 42.3 ms +- 0.5 ms -> 40.9 ms +- 0.4 ms: 1.03x faster (-3%)
- regex_compile: 379 ms +- 2 ms -> 367 ms +- 4 ms: 1.03x faster (-3%)
- go: 501 ms +- 7 ms -> 485 ms +- 4 ms: 1.03x faster (-3%)
- xml_etree_generate: 208 ms +- 4 ms -> 202 ms +- 3 ms: 1.03x faster (-3%)
- deltablue: 14.5 ms +- 0.3 ms -> 14.1 ms +- 0.2 ms: 1.03x faster (-3%)
- unpickle_pure_python: 652 us +- 8 us -> 634 us +- 9 us: 1.03x faster (-3%)
- richards: 147 ms +- 2 ms -> 143 ms +- 4 ms: 1.03x faster (-3%)
- nqueens: 213 ms +- 2 ms -> 208 ms +- 2 ms: 1.03x faster (-2%)
- raytrace: 1.08 sec +- 0.01 sec -> 1.05 sec +- 0.02 sec: 1.02x faster (-2%)
- unpickle: 31.0 us +- 0.4 us -> 30.3 us +- 1.0 us: 1.02x faster (-2%)

Benchmarks hidden because not significant (34): (...)

Ignored benchmarks (5) of 2017-01-23_14-02-default-cebc9c7ad195.json: mako, sympy_expand, sympy_integrate, sympy_str, sympy_sum
---

Reminder: the fast_init branch is based on issue #29259 (tp_fastcall), so the result combines tp_fastcall and tp_fastnew/tp_fastinit.
msg286271 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-25 17:42 | |
I implemented more optimizations:

* type_call() uses FASTCALL
* object_new() and object_init() use FASTCALL
* _asyncio.Future new & init use FASTCALL

As a result, creating an _asyncio.Future is now 1.45x faster:

haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
Median +- std dev: [ref] 411 ns +- 8 ns -> [fast_init] 283 ns +- 10 ns: 1.45x faster (-31%)

Yury> Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement.

While bm_richards.py no longer calls _PyStack_AsTuple() anywhere in the benchmark (it seems to fully use FASTCALL), there is simply zero impact on performance :-(

$ ./python -m perf compare_to richards_ref.json richards_fastinit.json
Benchmark hidden because not significant (1): richards
msg286290 - (view) | Author: Roundup Robot (python-dev) | Date: 2017-01-26 02:18 | |
New changeset 8feb6a5ce4c6 by Victor Stinner in branch 'default': Issue #29358: Add postcondition checks on types https://hg.python.org/cpython/rev/8feb6a5ce4c6 |
msg286315 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-26 14:14 | |
While there are obvious speedups on microbenchmarks, the change doesn't speed up the "macro" benchmarks from performance as much as I expected.

The changes required in typeobject.c to support "tp_new or tp_fastnew" and "tp_init or tp_fastinit" are large and very complex. I'm not sure that my code is correct. The update_one_slot(), add_operators() and PyType_Ready() functions are complex beasts, sadly almost undocumented, except for comments such as...

/* The __new__ wrapper is not a wrapper descriptor, so must be special-cased differently. If we don't do this, creating an instance will always use slot_tp_new which will look up __new__ in the MRO which will call tp_new_wrapper which will look through the base classes looking for a static base and call its tp_new (usually PyType_GenericNew), after performing various sanity checks and constructing a new argument list. Cut all that nonsense short -- this speeds up instance creation tremendously. */
specific = (void *)type->tp_new;
/* XXX I'm not 100% sure that there isn't a hole in this reasoning that requires additional sanity checks. I'll buy the first person to point out a bug in this reasoning a beer. */

How am I supposed to be confident in front of such a comment :-D

I expect that calling a function (tp_call) is a more common operation than instantiating a new object (tp_new + tp_init), so I don't think that the overall change is worth it.

For all these reasons, I close the issue as REJECTED.
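The instance-creation path that type_call() and slot_tp_new implement in C can be sketched at the Python level. This is a simplified model, not the actual CPython source: calling a class runs the metaclass's __call__, which invokes __new__ to create the instance and then __init__ to initialize it (the real code also forwards the constructor arguments to __new__, which is omitted here for brevity).

```python
# Simplified Python model of what type_call() does in C: create the
# instance via __new__, then run __init__ on it if the result is an
# instance of the requested class.

def model_type_call(cls, *args, **kwargs):
    obj = cls.__new__(cls)              # simplified: args not forwarded
    if isinstance(obj, cls):
        cls.__init__(obj, *args, **kwargs)
    return obj

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = model_type_call(Point, 1, 2)
assert (p.x, p.y) == (1, 2)
```

Every step here (the __new__ lookup, the isinstance check, the __init__ dispatch) corresponds to a branch in typeobject.c that a tp_fastnew/tp_fastinit patch has to duplicate for both calling conventions, which is why the change grew so complex.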
msg286316 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2017-01-26 14:52 | |
As for update_one_slot() see also issue5322 and issue25731. |
msg286317 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-26 14:58 | |
Serhiy Storchaka:
> As for update_one_slot() see also issue5322 and issue25731.

Oh, thanks for the pointers! Now I understand these bugs much better. I'm quite sure that there are still flaws in this code when a type is modified after PyType_Ready(), like in sysmodule.c:

/* prevent user from creating new instances */
FlagsType.tp_init = NULL;
FlagsType.tp_new = NULL;

But each time I have to dig into typeobject.c, my head is about to explode :-D I may try to understand it one more time, but not today ;-)
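A rough Python-level analogue of that sysmodule.c trick may help picture what the C code does (the class and function names below are invented for illustration; the C code nulls the slots directly, and keeping such late modifications consistent is exactly what makes update_one_slot() tricky):

```python
# Analogue of "FlagsType.tp_new = NULL": after the class exists,
# replace __new__ so users can no longer create instances.

class Flags:
    pass

first = Flags()                      # instantiation works initially

def _blocked_new(cls, *args, **kwargs):
    raise TypeError(f"cannot create '{cls.__name__}' instances")

Flags.__new__ = _blocked_new         # late modification of the type

try:
    Flags()
except TypeError:
    blocked = True
else:
    blocked = False
assert blocked
assert isinstance(first, Flags)      # existing instances still work
```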
History
Date | User | Action | Args
---|---|---|---
2022-04-11 14:58:42 | admin | set | github: 73544 |
2017-01-26 14:58:30 | vstinner | set | messages: + msg286317 |
2017-01-26 14:52:42 | serhiy.storchaka | set | messages: + msg286316 |
2017-01-26 14:14:22 | vstinner | set | status: open -> closed; resolution: rejected; messages: + msg286315 |
2017-01-26 02:18:06 | python-dev | set | nosy: + python-dev; messages: + msg286290 |
2017-01-25 17:42:16 | vstinner | set | messages: + msg286271 |
2017-01-24 21:03:35 | vstinner | set | messages: + msg286216 |
2017-01-24 19:30:34 | josh.r | set | nosy: + josh.r; messages: + msg286207 |
2017-01-24 18:50:12 | yselivanov | set | messages: + msg286203 |
2017-01-24 09:24:34 | vstinner | set | messages: + msg286161 |
2017-01-24 08:25:29 | vstinner | create |