Issue29358
Created on 2017-01-24 08:25 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Pull Requests
URL | Status | Linked | Edit
---|---|---|---
PR 72 | | vstinner, 2017-01-24 08:25 |
Messages (10)
msg286151 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-24 08:25 | |
After #29259 "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects", the last two slots which still use the (args: tuple, kwargs: dict) calling convention are tp_new and tp_init, the two major slots used to instantiate objects. I implemented tp_fastnew/tp_fastinit on top of the issue #29259 pull request (tp_fastcall). The implementation is a WIP; it is just complete enough to start benchmarking the code.

Example of benchmarks on the two types currently optimized in my WIP fast_init branch, list and _asyncio.Future:

---
haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List()' --duplicate=100 --compare-to=../default-ref/python
Median +- std dev: [ref] 81.9 ns +- 0.2 ns -> [fast_init] 69.3 ns +- 0.4 ns: 1.18x faster (-15%)

haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List((1,2,3))' --duplicate=100 --compare-to=../default-ref/python
Median +- std dev: [ref] 137 ns +- 6 ns -> [fast_init] 107 ns +- 0 ns: 1.28x faster (-22%)

haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
Median +- std dev: [ref] 411 ns +- 20 ns -> [fast_init] 355 ns +- 18 ns: 1.16x faster (-14%)
---

The speedup of tp_fastnew + tp_fastinit is between 1.16x and 1.28x. The question is now whether it is worth it.

Warning: the code is not fully optimized and is likely to have subtle bugs. The pull request is not ready for a review, but you may take a look if you would like an idea of the required changes. The trickiest changes are in typeobject.c, to support backward compatibility (call tp_new if tp_fastnew is not set) and the stable API (support a Python 3.6 PyTypeObject without the two new slots).

Note: the tp_fastnew and tp_fastinit slots are the expected end of my large FASTCALL optimization project.
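For readers unfamiliar with the two conventions being compared, here is a minimal Python sketch of the difference. This is an illustrative model only: the real tp_init/tp_fastinit slots are C function pointers, and the names `init_classic`/`init_fastcall` are invented for this example. The classic convention receives a positional tuple plus a keyword dict; FASTCALL receives a flat array of values plus a tuple of keyword names, avoiding the temporary tuple and dict.

```python
# Illustrative Python model of the two C calling conventions discussed
# in this issue; the real slots are C function pointers.

def init_classic(self, args, kwargs):
    # (args: tuple, kwargs: dict) -- the tp_new/tp_init convention
    return len(args) + len(kwargs)

def init_fastcall(self, stack, nargs, kwnames):
    # FASTCALL: 'stack' holds the positional values followed by the
    # keyword values; 'kwnames' is a tuple of the keyword names.
    positional = stack[:nargs]
    keywords = dict(zip(kwnames, stack[nargs:]))
    return len(positional) + len(keywords)

obj = object()
# Same logical call, two representations: f(1, 2, x=3)
assert init_classic(obj, (1, 2), {"x": 3}) == 3
assert init_fastcall(obj, [1, 2, 3], 2, ("x",)) == 3
```

The point of the optimization is that a FASTCALL caller never has to allocate the `(1, 2)` tuple or the `{"x": 3}` dict at all.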
msg286161 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-24 09:24 | |
Serhiy Storchaka: "Calling _PyStack_AsDict() with non-string or non-unique keywords means that the var-keyword argument was first unpacked to arrays of keyword names and values and then converted back to a dict. Seems something is done non-efficiently." (msg286159 of issue #29360)

Python code from test_dict:

dict(**invalid)

Bytecode:

LOAD_GLOBAL 1 (dict)
BUILD_TUPLE 0
LOAD_FAST 1 (invalid)
CALL_FUNCTION_EX 1

Call stack:

* _PyEval_EvalFrameDefault()
* do_call_core()
* PyObject_Call()
* _Py_RawFastCallDict() -- conversion from dict to stack+kwnames
* type_call()
* call_init()
* _PyStack_AsDict() -- conversion from stack+kwnames to dict

Oh right, there are two conversions using a temporary FASTCALL format for (keyword) arguments. This code is not upstream yet; it comes from the pull request of this issue.

Maybe we need two flavors of type_call(): type_call(args: tuple, kwargs: dict) if tp_fastinit isn't set, and type_fastcall(stack, nargs, kwnames) (FASTCALL) if tp_fastinit is set. But it seems that the logic should be implemented in PyObject_Call() and _PyObject_FastCallDict(); it cannot be implemented in type_call(), as it's too late at that point. For best performance (avoiding any kind of conversion and any temporary object), we should implement something like that.
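The wasteful round trip Serhiy describes can be modeled in Python. The helper names below are hypothetical analogues of the C functions `_Py_RawFastCallDict()` and `_PyStack_AsDict()`; this is a sketch of the data flow, not the actual CPython code.

```python
# Model of the double conversion: a kwargs dict is unpacked into the
# FASTCALL (stack, kwnames) format, then rebuilt into a dict for tp_init.

def dict_to_fastcall(kwargs):
    # analogue of _Py_RawFastCallDict(): dict -> stack + kwnames
    kwnames = tuple(kwargs)
    stack = [kwargs[name] for name in kwnames]
    return stack, kwnames

def fastcall_to_dict(stack, kwnames):
    # analogue of _PyStack_AsDict(): stack + kwnames -> dict
    return dict(zip(kwnames, stack))

original = {"loop": "LOOP", "result": 42}
stack, kwnames = dict_to_fastcall(original)
rebuilt = fastcall_to_dict(stack, kwnames)
assert rebuilt == original  # two conversions, same dict: pure overhead
```

Both conversions allocate intermediate objects, which is exactly what the FASTCALL project tries to avoid.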
msg286203 - (view) | Author: Yury Selivanov (yselivanov) * | Date: 2017-01-24 18:50 | |
Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement. |
msg286207 - (view) | Author: Josh Rosenberg (josh.r) * | Date: 2017-01-24 19:30 | |
So just to be clear, the issue with non-string or non-unique keywords is purely about performance, not correctness or crash-safety, right?

Non-string keywords, while technically accepted by CPython, are at best barely legal by the language standard. The language standard for calls specifies that even when using the **expression syntax, the "expression must evaluate to a mapping, the contents of which are treated as additional keyword arguments", and keyword_arguments only allows identifier=expression, where identifier is a legal variable name (and therefore a str).

If misuse of ** unpacking is slower but still works, and correct usage is significantly faster, I wouldn't consider an inability to fix the performance of the misuse case a blocking issue; nice to have, but making common, good code faster is worth making rare, bad code slower.
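The string-keyword requirement Josh describes is easy to observe at the Python level. The sketch below shows the behavior for Python-defined functions in CPython 3; C-implemented callables historically handled ** unpacking differently, which is the corner case this issue exercises.

```python
# ** unpacking into a Python-level function requires string keys;
# CPython rejects non-string keys with a TypeError.

def f(**kwargs):
    return kwargs

assert f(**{"a": 1}) == {"a": 1}   # string keys: fine

try:
    f(**{1: "not a valid keyword"})  # non-string key: misuse of **
except TypeError:
    rejected = True
else:
    rejected = False
assert rejected
```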
msg286216 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-24 21:03 | |
First benchmark results on the fast_init branch compared to the default branch. Warning: the patch is still a WIP, and there may be performance issues (in which case this benchmark should help to identify them).

---
haypo@speed-python$ python3 -m perf compare_to 2017-01-23_14-02-default-cebc9c7ad195.json fast_init-0_ref_cebc9c7ad195.json -G --min-speed=2

Slower (6):
- telco: 13.0 ms +- 0.2 ms -> 13.8 ms +- 0.3 ms: 1.06x slower (+6%)
- scimark_sor: 391 ms +- 7 ms -> 413 ms +- 11 ms: 1.06x slower (+6%)
- fannkuch: 938 ms +- 7 ms -> 973 ms +- 13 ms: 1.04x slower (+4%)
- pickle_dict: 53.7 us +- 3.4 us -> 55.5 us +- 5.0 us: 1.03x slower (+3%)
- sqlite_synth: 8.00 us +- 0.12 us -> 8.25 us +- 0.14 us: 1.03x slower (+3%)
- scimark_monte_carlo: 202 ms +- 5 ms -> 207 ms +- 5 ms: 1.03x slower (+3%)

Faster (19):
- scimark_lu: 362 ms +- 13 ms -> 335 ms +- 17 ms: 1.08x faster (-8%)
- chameleon: 23.5 ms +- 0.3 ms -> 21.7 ms +- 0.3 ms: 1.08x faster (-7%)
- xml_etree_process: 179 ms +- 3 ms -> 168 ms +- 3 ms: 1.07x faster (-6%)
- pickle_pure_python: 1.03 ms +- 0.02 ms -> 982 us +- 11 us: 1.05x faster (-5%)
- regex_dna: 283 ms +- 1 ms -> 271 ms +- 1 ms: 1.05x faster (-4%)
- chaos: 239 ms +- 2 ms -> 230 ms +- 3 ms: 1.04x faster (-4%)
- django_template: 356 ms +- 5 ms -> 343 ms +- 5 ms: 1.04x faster (-4%)
- call_simple: 10.7 ms +- 0.4 ms -> 10.3 ms +- 0.3 ms: 1.04x faster (-3%)
- hexiom: 18.0 ms +- 0.1 ms -> 17.4 ms +- 0.1 ms: 1.04x faster (-3%)
- pathlib: 42.3 ms +- 0.5 ms -> 40.9 ms +- 0.4 ms: 1.03x faster (-3%)
- regex_compile: 379 ms +- 2 ms -> 367 ms +- 4 ms: 1.03x faster (-3%)
- go: 501 ms +- 7 ms -> 485 ms +- 4 ms: 1.03x faster (-3%)
- xml_etree_generate: 208 ms +- 4 ms -> 202 ms +- 3 ms: 1.03x faster (-3%)
- deltablue: 14.5 ms +- 0.3 ms -> 14.1 ms +- 0.2 ms: 1.03x faster (-3%)
- unpickle_pure_python: 652 us +- 8 us -> 634 us +- 9 us: 1.03x faster (-3%)
- richards: 147 ms +- 2 ms -> 143 ms +- 4 ms: 1.03x faster (-3%)
- nqueens: 213 ms +- 2 ms -> 208 ms +- 2 ms: 1.03x faster (-2%)
- raytrace: 1.08 sec +- 0.01 sec -> 1.05 sec +- 0.02 sec: 1.02x faster (-2%)
- unpickle: 31.0 us +- 0.4 us -> 30.3 us +- 1.0 us: 1.02x faster (-2%)

Benchmarks hidden because not significant (34): (...)

Ignored benchmarks (5) of 2017-01-23_14-02-default-cebc9c7ad195.json: mako, sympy_expand, sympy_integrate, sympy_str, sympy_sum
---

Reminder: the fast_init branch is based on issue #29259 (tp_fastcall), so the result combines tp_fastcall and tp_fastnew/tp_fastinit.
msg286271 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-25 17:42 | |
I implemented more optimizations:

* type_call() uses FASTCALL
* object_new() and object_init() use FASTCALL
* _asyncio.Future new & init use FASTCALL

As a result, creating an _asyncio.Future is now 1.45x faster:

haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
Median +- std dev: [ref] 411 ns +- 8 ns -> [fast_init] 283 ns +- 10 ns: 1.45x faster (-31%)

Yury> Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement.

While bm_richards.py no longer calls _PyStack_AsTuple() anywhere in the benchmark (it seems to fully use FASTCALL), there is simply zero impact on performance :-(

$ ./python -m perf compare_to richards_ref.json richards_fastinit.json
Benchmark hidden because not significant (1): richards
msg286290 - (view) | Author: Roundup Robot (python-dev) | Date: 2017-01-26 02:18 | |
New changeset 8feb6a5ce4c6 by Victor Stinner in branch 'default': Issue #29358: Add postcondition checks on types https://hg.python.org/cpython/rev/8feb6a5ce4c6 |
msg286315 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-26 14:14 | |
While there are obvious speedups on microbenchmarks, the change doesn't speed up the "macro" benchmarks from performance as much as I expected.

The changes required in typeobject.c to support "tp_new or tp_fastnew" and "tp_init or tp_fastinit" are large and very complex. I'm not sure that my code is correct. The update_one_slot(), add_operators() and PyType_Ready() functions are complex beasts, sadly almost undocumented, except for comments such as...

/* The __new__ wrapper is not a wrapper descriptor, so must be special-cased differently. If we don't do this, creating an instance will always use slot_tp_new which will look up __new__ in the MRO which will call tp_new_wrapper which will look through the base classes looking for a static base and call its tp_new (usually PyType_GenericNew), after performing various sanity checks and constructing a new argument list. Cut all that nonsense short -- this speeds up instance creation tremendously. */
specific = (void *)type->tp_new;
/* XXX I'm not 100% sure that there isn't a hole in this reasoning that requires additional sanity checks. I'll buy the first person to point out a bug in this reasoning a beer. */

How am I supposed to be confident in front of such a comment :-D

I expect that calling a function (tp_call) is a more common operation than instantiating a new object (tp_new + tp_init), so I don't think that the overall change is worth it.

For all these reasons, I close the issue as REJECTED.
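The instance-creation path that type_call() and slot_tp_new implement in C can be sketched at the Python level. This is a simplified model, not the actual CPython source: calling a class runs the metaclass's __call__, which invokes __new__ to create the instance and then __init__ to initialize it (the real code also forwards the constructor arguments to __new__, which is omitted here for brevity).

```python
# Simplified Python model of what type_call() does in C: create the
# instance via __new__, then run __init__ on it if the result is an
# instance of the requested class.

def model_type_call(cls, *args, **kwargs):
    obj = cls.__new__(cls)              # simplified: args not forwarded
    if isinstance(obj, cls):
        cls.__init__(obj, *args, **kwargs)
    return obj

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = model_type_call(Point, 1, 2)
assert (p.x, p.y) == (1, 2)
```

Every step here (the __new__ lookup, the isinstance check, the __init__ dispatch) corresponds to a branch in typeobject.c that a tp_fastnew/tp_fastinit patch has to duplicate for both calling conventions, which is why the change grew so complex.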
msg286316 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2017-01-26 14:52 | |
As for update_one_slot() see also issue5322 and issue25731. |
msg286317 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-01-26 14:58 | |
Serhiy Storchaka:
> As for update_one_slot() see also issue5322 and issue25731.

Oh, thanks for the pointers! Now I understand these bugs much better. I'm quite sure that there are still flaws in this code when a type is modified after PyType_Ready(), like in sysmodule.c:

/* prevent user from creating new instances */
FlagsType.tp_init = NULL;
FlagsType.tp_new = NULL;

But each time I have to dig into typeobject.c, my head is about to explode :-D I may try to understand it one more time, but not today ;-)
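A rough Python-level analogue of that sysmodule.c trick may help picture what the C code does (the class and function names below are invented for illustration; the C code nulls the slots directly, and keeping such late modifications consistent is exactly what makes update_one_slot() tricky):

```python
# Analogue of "FlagsType.tp_new = NULL": after the class exists,
# replace __new__ so users can no longer create instances.

class Flags:
    pass

first = Flags()                      # instantiation works initially

def _blocked_new(cls, *args, **kwargs):
    raise TypeError(f"cannot create '{cls.__name__}' instances")

Flags.__new__ = _blocked_new         # late modification of the type

try:
    Flags()
except TypeError:
    blocked = True
else:
    blocked = False
assert blocked
assert isinstance(first, Flags)      # existing instances still work
```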
History
Date | User | Action | Args
---|---|---|---
2022-04-11 14:58:42 | admin | set | github: 73544 |
2017-01-26 14:58:30 | vstinner | set | messages: + msg286317 |
2017-01-26 14:52:42 | serhiy.storchaka | set | messages: + msg286316 |
2017-01-26 14:14:22 | vstinner | set | status: open -> closed; resolution: rejected; messages: + msg286315 |
2017-01-26 02:18:06 | python-dev | set | nosy: + python-dev; messages: + msg286290 |
2017-01-25 17:42:16 | vstinner | set | messages: + msg286271 |
2017-01-24 21:03:35 | vstinner | set | messages: + msg286216 |
2017-01-24 19:30:34 | josh.r | set | nosy: + josh.r; messages: + msg286207 |
2017-01-24 18:50:12 | yselivanov | set | messages: + msg286203 |
2017-01-24 09:24:34 | vstinner | set | messages: + msg286161 |
2017-01-24 08:25:29 | vstinner | create |