classification
Title: Add tp_fastnew and tp_fastinit to PyTypeObject, 15-20% faster object instanciation
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: haypo, inada.naoki, josh.r, python-dev, rhettinger, serhiy.storchaka, yselivanov
Priority: normal Keywords:

Created on 2017-01-24 08:25 by haypo, last changed 2017-01-26 14:58 by haypo. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 72 haypo, 2017-01-24 08:25
Messages (10)
msg286151 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-24 08:25
After #29259 "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects", the two last slots which still use the (args: tuple, kwargs: dict) calling convention are tp_new and tp_init, two major slots to instanciate objects.

I implemented tp_fastnew/tp_fastinit on top of the issue #29259 pull request (tp_fastcall). The implementation is a WIP, it's just complete enough to start benchmarking the code.

Example of benchmarks on the two types currently optimized in my WIP fast_init branch, list and _asyncio.Future:
---
haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List()' --duplicate=100 --compare-to=../default-ref/python 
Median +- std dev: [ref] 81.9 ns +- 0.2 ns -> [fast_init] 69.3 ns +- 0.4 ns: 1.18x faster (-15%)

haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List((1,2,3))' --duplicate=100 --compare-to=../default-ref/python 
Median +- std dev: [ref] 137 ns +- 6 ns -> [fast_init] 107 ns +- 0 ns: 1.28x faster (-22%)

haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
Median +- std dev: [ref] 411 ns +- 20 ns -> [fast_init] 355 ns +- 18 ns: 1.16x faster (-14%)
---

The speedup of tp_fastnew + tp_fastinit is between 1.16x faster and 1.28x faster. The question is now if it is worth it.

Warning: The code is not fully optimized and is likely to have subtle bugs. The pull request is not ready for a review, but you may take a look if you would like to have an idea of the required changes. The most tricky changes are made in typeobject.c to support backward compatibility (call tp_new if tp_fastnew is not set) and stable API (support Python 3.6 PyTypeObject without the two slots).

Note: tp_fastnew and tp_fastinit slots are the expected end of my large FASTCALL optimization project.
msg286161 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-24 09:24
Serhiy Storchaka: "Calling _PyStack_AsDict() with non-string or non-unique keywords means that the var-keyword argument was first unpacked to arrays of keyword names and values and then converted back to a dict. Seems something is done non-efficiently." (msg286159 of the issue #29360)

Python code from test_dict:

   dict(**invalid)

Bytecode:

    LOAD_GLOBAL              1 (dict)
    BUILD_TUPLE              0
    LOAD_FAST                1 (invalid)
    CALL_FUNCTION_EX         1

Call stack:

* _PyEval_EvalFrameDefault()
* do_call_core()
* PyObject_Call()
* _Py_RawFastCallDict() -- conversion from dict to stack+kwnames
* type_call()
* call_init()
* _PyStack_AsDict() -- convertsion from stack+kwnames to dict

Oh right, there are two conversions using a temporary FASTCALL format for (keyword) arguments.

This code was not upstream yet, it comes from the pull request of this issue.

Maybe we need two flavors of type_call(): type_call(args: tuple, kwargs: tuple) if tp_fastinit isn't set, type_fastcall(stack, nargs, kwnames) (FASTCALL) if tp_fastinit is set.

But it seems that the logic should be implemented in PyObject_Call() and _PyObject_FastCallDict(), it cannot be implemented in type_call(), it's too late.

For best performances (avoid any kind of conversion and avoid any temporary object), we should implement something like that.
msg286203 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017-01-24 18:50
Great. Can you share the Richards benchmark results?  It specifically stresses class instantiation, so we should be able to see the improvement.
msg286207 - (view) Author: Josh Rosenberg (josh.r) * Date: 2017-01-24 19:30
So just to be clear, the issue with non-string or non-unique keywords is purely about performance, right, not correctness/non-crashiness? Non-string keywords, while technically accepted by CPython, are at best barely legal by the language standard. The language standard for calls specifies that even using the **expression syntax "expression must evaluate to a mapping, the contents of which are treated as additional keyword arguments" and keywords_arguments only allows identifier=expression, where identifier is a legal variable name (and therefore, a str).

If misuse of ** unpacking is slower, but still works, and correct usage is significantly faster, I wouldn't consider an inability to fix the performance for the misuse case a blocking issue; nice to have, but making common, good code faster is worth making rare, bad code slower.
msg286216 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-24 21:03
First benchmark result on the fast_init compared to the default branch. Warning: the patch is still a WIP, there are maybe performance issues (in this case, this benchmark should help to identify them).
---
haypo@speed-python$ python3 -m perf compare_to 2017-01-23_14-02-default-cebc9c7ad195.json fast_init-0_ref_cebc9c7ad195.json -G --min-speed=2
Slower (6):
- telco: 13.0 ms +- 0.2 ms -> 13.8 ms +- 0.3 ms: 1.06x slower (+6%)
- scimark_sor: 391 ms +- 7 ms -> 413 ms +- 11 ms: 1.06x slower (+6%)
- fannkuch: 938 ms +- 7 ms -> 973 ms +- 13 ms: 1.04x slower (+4%)
- pickle_dict: 53.7 us +- 3.4 us -> 55.5 us +- 5.0 us: 1.03x slower (+3%)
- sqlite_synth: 8.00 us +- 0.12 us -> 8.25 us +- 0.14 us: 1.03x slower (+3%)
- scimark_monte_carlo: 202 ms +- 5 ms -> 207 ms +- 5 ms: 1.03x slower (+3%)

Faster (19):
- scimark_lu: 362 ms +- 13 ms -> 335 ms +- 17 ms: 1.08x faster (-8%)
- chameleon: 23.5 ms +- 0.3 ms -> 21.7 ms +- 0.3 ms: 1.08x faster (-7%)
- xml_etree_process: 179 ms +- 3 ms -> 168 ms +- 3 ms: 1.07x faster (-6%)
- pickle_pure_python: 1.03 ms +- 0.02 ms -> 982 us +- 11 us: 1.05x faster (-5%)
- regex_dna: 283 ms +- 1 ms -> 271 ms +- 1 ms: 1.05x faster (-4%)
- chaos: 239 ms +- 2 ms -> 230 ms +- 3 ms: 1.04x faster (-4%)
- django_template: 356 ms +- 5 ms -> 343 ms +- 5 ms: 1.04x faster (-4%)
- call_simple: 10.7 ms +- 0.4 ms -> 10.3 ms +- 0.3 ms: 1.04x faster (-3%)
- hexiom: 18.0 ms +- 0.1 ms -> 17.4 ms +- 0.1 ms: 1.04x faster (-3%)
- pathlib: 42.3 ms +- 0.5 ms -> 40.9 ms +- 0.4 ms: 1.03x faster (-3%)
- regex_compile: 379 ms +- 2 ms -> 367 ms +- 4 ms: 1.03x faster (-3%)
- go: 501 ms +- 7 ms -> 485 ms +- 4 ms: 1.03x faster (-3%)
- xml_etree_generate: 208 ms +- 4 ms -> 202 ms +- 3 ms: 1.03x faster (-3%)
- deltablue: 14.5 ms +- 0.3 ms -> 14.1 ms +- 0.2 ms: 1.03x faster (-3%)
- unpickle_pure_python: 652 us +- 8 us -> 634 us +- 9 us: 1.03x faster (-3%)
- richards: 147 ms +- 2 ms -> 143 ms +- 4 ms: 1.03x faster (-3%)
- nqueens: 213 ms +- 2 ms -> 208 ms +- 2 ms: 1.03x faster (-2%)
- raytrace: 1.08 sec +- 0.01 sec -> 1.05 sec +- 0.02 sec: 1.02x faster (-2%)
- unpickle: 31.0 us +- 0.4 us -> 30.3 us +- 1.0 us: 1.02x faster (-2%)

Benchmark hidden because not significant (34): (...)
Ignored benchmarks (5) of 2017-01-23_14-02-default-cebc9c7ad195.json: mako, sympy_expand, sympy_integrate, sympy_str, sympy_sum
---

Reminder: The fast_init branch is based on the issue #29259 (tp_fastcall), so the result combines tp_fastcall and tp_fastnew/fastinit.
msg286271 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-25 17:42
I implemented more optimizations:

* type_call() uses FASTCALL
* object_new() and object_init() use FASTCALL
* _asyncio.Future new & init use FASTCALL

As a result, create an _asyncio.Future is now 1.45x faster:

haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python 
Median +- std dev: [ref] 411 ns +- 8 ns -> [fast_init] 283 ns +- 10 ns: 1.45x faster (-31%)


Yury> Great. Can you share the Richards benchmark results?  It specifically stresses class instantiation, so we should be able to see the improvement.

While bm_richards.py doesn't call _PyStack_AsTuple() anyone in the benchmark (it seems like it full uses FASTCALL), it seems like there is simply zero impact on performance :-(

$ ./python -m perf compare_to richards_ref.json richards_fastinit.json 
Benchmark hidden because not significant (1): richards
msg286290 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-26 02:18
New changeset 8feb6a5ce4c6 by Victor Stinner in branch 'default':
Issue #29358: Add postcondition checks on types
https://hg.python.org/cpython/rev/8feb6a5ce4c6
msg286315 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-26 14:14
While they are obvious speedup on microbenchmarks, it doesn't speedup "macro" benchmarks from performance as much as I expected.

The changes required in typeobject.c to support "tp_new or tp_fastnew" and "tp_init or tp_fastinit" are large and very complex. I'm not sure that my code is correct. update_one_slot(), add_operators() and PyType_Ready() functions are complex beast, sadly almost not documented, except such comment...

            /* The __new__ wrapper is not a wrapper descriptor,
               so must be special-cased differently.
               If we don't do this, creating an instance will
               always use slot_tp_new which will look up
               __new__ in the MRO which will call tp_new_wrapper
               which will look through the base classes looking
               for a static base and call its tp_new (usually
               PyType_GenericNew), after performing various
               sanity checks and constructing a new argument
               list.  Cut all that nonsense short -- this speeds
               up instance creation tremendously. */
            specific = (void *)type->tp_new;
            /* XXX I'm not 100% sure that there isn't a hole
               in this reasoning that requires additional
               sanity checks.  I'll buy the first person to
               point out a bug in this reasoning a beer. */

How am I supposed to be confident in front of such coment :-D

I expect that calling a functions (tp_call) is a more common operation than instanciating a new object (tp_new + tp_init). So I don't think that the overall change is worth it.

For all these reaons, I close the issue as REJECTED.
msg286316 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-26 14:52
As for update_one_slot() see also issue5322 and issue25731.
msg286317 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-26 14:58
Serhiy Storchaka:
> As for update_one_slot() see also issue5322 and issue25731.

Oh, thanks for the pointers! Now I understand much better these bugs.

I'm quite sure that they are still flaws in this code when a type is modified after PyType_Ready(), like sysmodule.c::

    /* prevent user from creating new instances */
    FlagsType.tp_init = NULL;
    FlagsType.tp_new = NULL;


But each time I had to dig into typeobject.c, my head is going to explode :-D I may try to understand one more time, but not today ;-)
History
Date User Action Args
2017-01-26 14:58:30hayposetmessages: + msg286317
2017-01-26 14:52:42serhiy.storchakasetmessages: + msg286316
2017-01-26 14:14:22hayposetstatus: open -> closed
resolution: rejected
messages: + msg286315
2017-01-26 02:18:06python-devsetnosy: + python-dev
messages: + msg286290
2017-01-25 17:42:16hayposetmessages: + msg286271
2017-01-24 21:03:35hayposetmessages: + msg286216
2017-01-24 19:30:34josh.rsetnosy: + josh.r
messages: + msg286207
2017-01-24 18:50:12yselivanovsetmessages: + msg286203
2017-01-24 09:24:34hayposetmessages: + msg286161
2017-01-24 08:25:29haypocreate