
Add tp_fastnew and tp_fastinit to PyTypeObject, 15-20% faster object instanciation #73544

Closed
vstinner opened this issue Jan 24, 2017 · 10 comments
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@vstinner
Member

BPO 29358
Nosy @rhettinger, @vstinner, @methane, @serhiy-storchaka, @1st1, @MojoVampire
PRs
  • bpo-29863: Add json.COMPACT constant #72
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-01-26.14:14:22.680>
    created_at = <Date 2017-01-24.08:25:29.831>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'Add tp_fastnew and tp_fastinit to PyTypeObject, 15-20% faster object instanciation'
    updated_at = <Date 2017-01-26.14:58:30.895>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2017-01-26.14:58:30.895>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-01-26.14:14:22.680>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2017-01-24.08:25:29.831>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 29358
    keywords = []
    message_count = 10.0
    messages = ['286151', '286161', '286203', '286207', '286216', '286271', '286290', '286315', '286316', '286317']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'vstinner', 'methane', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'josh.r']
    pr_nums = ['72']
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue29358'
    versions = ['Python 3.7']

    @vstinner
    Member Author

    After bpo-29259 "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects", the last two slots which still use the (args: tuple, kwargs: dict) calling convention are tp_new and tp_init, the two major slots used to instantiate objects.

    I implemented tp_fastnew/tp_fastinit on top of the issue bpo-29259 pull request (tp_fastcall). The implementation is a work in progress; it is just complete enough to start benchmarking the code.

    Example of benchmarks on the two types currently optimized in my WIP fast_init branch, list and _asyncio.Future:
    ---
    haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List()' --duplicate=100 --compare-to=../default-ref/python
    Median +- std dev: [ref] 81.9 ns +- 0.2 ns -> [fast_init] 69.3 ns +- 0.4 ns: 1.18x faster (-15%)

    haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List((1,2,3))' --duplicate=100 --compare-to=../default-ref/python
    Median +- std dev: [ref] 137 ns +- 6 ns -> [fast_init] 107 ns +- 0 ns: 1.28x faster (-22%)

    haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
    Median +- std dev: [ref] 411 ns +- 20 ns -> [fast_init] 355 ns +- 18 ns: 1.16x faster (-14%)
    ---
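The microbenchmarks above can be roughly approximated with the stdlib timeit module when the third-party perf module is not available. This is only a sketch, not part of the original report; perf produces far more stable numbers (calibration, multiple processes, statistics):

```python
import timeit

# Rough stdlib equivalent of:
#   ./python -m perf timeit -s 'List=list' 'List()' --duplicate=100
# Take the best of several runs to reduce scheduling noise.
loops = 10**6
best = min(timeit.repeat('List()', setup='List=list',
                         number=loops, repeat=5))
print(f"{best / loops * 1e9:.1f} ns per call")
```

Running this under both the reference and patched builds gives the kind of per-call nanosecond figures quoted above.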

    The speedup of tp_fastnew + tp_fastinit is between 1.16x and 1.28x. The question now is whether it is worth it.
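For reference, the "Nx faster (-P%)" figures perf prints are two views of the same ratio; a minimal illustration using the list() numbers above:

```python
# How perf's "1.18x faster (-15%)" figures relate, using the list()
# microbenchmark numbers above (81.9 ns before, 69.3 ns after).
ref = 81.9   # ns, reference build
new = 69.3   # ns, fast_init build

factor = ref / new            # the "Nx faster" multiplier
percent = (new - ref) / ref   # signed change relative to the reference

print(f"{factor:.2f}x faster ({percent:+.0%})")  # → 1.18x faster (-15%)
```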

    Warning: The code is not fully optimized and is likely to have subtle bugs. The pull request is not ready for a review, but you may take a look if you would like to have an idea of the required changes. The most tricky changes are made in typeobject.c to support backward compatibility (call tp_new if tp_fastnew is not set) and stable API (support Python 3.6 PyTypeObject without the two slots).

    Note: tp_fastnew and tp_fastinit slots are the expected end of my large FASTCALL optimization project.

    @vstinner vstinner added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jan 24, 2017
    @vstinner
    Member Author

    Serhiy Storchaka: "Calling _PyStack_AsDict() with non-string or non-unique keywords means that the var-keyword argument was first unpacked to arrays of keyword names and values and then converted back to a dict. Seems something is done non-efficiently." (msg286159 of the issue bpo-29360)

    Python code from test_dict:

       dict(**invalid)

    Bytecode:

    LOAD_GLOBAL              1 (dict)
    BUILD_TUPLE              0
    LOAD_FAST                1 (invalid)
    CALL_FUNCTION_EX         1
    
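The listing above can be reproduced with the dis module. Note that exact opcodes vary between CPython versions; the listing in this comment is from the 3.7 development branch:

```python
import dis

# Disassemble the test_dict pattern discussed above. On 3.6/3.7 the
# ** unpacking compiles to a CALL_FUNCTION_EX call of dict.
def f(invalid):
    return dict(**invalid)

dis.dis(f)
```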

    Call stack:

    • _PyEval_EvalFrameDefault()
    • do_call_core()
    • PyObject_Call()
    • _Py_RawFastCallDict() -- conversion from dict to stack+kwnames
    • type_call()
    • call_init()
    • _PyStack_AsDict() -- conversion from stack+kwnames to dict

    Oh right, there are two conversions using a temporary FASTCALL format for (keyword) arguments.

    This code is not upstream yet; it comes from the pull request of this issue.

    Maybe we need two flavors of type_call(): type_call(args: tuple, kwargs: dict) if tp_fastinit isn't set, and type_fastcall(stack, nargs, kwnames) (FASTCALL) if tp_fastinit is set.

    But it seems that the logic should be implemented in PyObject_Call() and _PyObject_FastCallDict(); it cannot be implemented in type_call(), since by then it's too late.

    For best performance (avoiding any kind of conversion and any temporary object), we should implement something like this.
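The dispatch idea can be sketched in Python. All names here are hypothetical stand-ins for the C slots; the real logic would live in C inside PyObject_Call() / _PyObject_FastCallDict():

```python
# Hypothetical sketch: prefer the FASTCALL-style slot (stack, nargs,
# kwnames) when a type defines it, and fall back to the legacy
# (args: tuple, kwargs: dict) convention otherwise.

def call_init(obj, args, kwargs):
    tp = type(obj)
    fast = getattr(tp, 'tp_fastinit', None)   # hypothetical slot name
    if fast is not None:
        # FASTCALL convention: positional args as a flat stack, keyword
        # values appended to the stack, keyword names as a separate tuple.
        stack = args + tuple(kwargs.values())
        kwnames = tuple(kwargs)
        return fast(obj, stack, len(args), kwnames)
    # Legacy convention: build the (tuple, dict) pair.
    return tp.tp_init(obj, args, dict(kwargs))
```

The point of the sketch is that choosing the convention up front avoids the double conversion seen in the call stack above (dict → stack+kwnames → dict).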

    @1st1
    Member

    1st1 commented Jan 24, 2017

    Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement.

    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Jan 24, 2017

    So just to be clear, the issue with non-string or non-unique keywords is purely about performance, not about correctness or crash safety, right? Non-string keywords, while technically accepted by CPython, are at best barely legal by the language standard. The language standard for calls specifies that even with the **expression syntax, the "expression must evaluate to a mapping, the contents of which are treated as additional keyword arguments", and keyword_arguments only allows identifier=expression, where identifier is a legal variable name (and therefore a str).

    If misuse of ** unpacking is slower, but still works, and correct usage is significantly faster, I wouldn't consider an inability to fix the performance for the misuse case a blocking issue; nice to have, but making common, good code faster is worth making rare, bad code slower.
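The distinction MojoVampire draws can be observed directly: CPython rejects non-string keys in ** unpacking at call time, so only string-keyed unpacking is on the table for the fast path. A small illustration:

```python
# Correct usage: ** unpacking with string keys works (the common case
# that the fast path should optimize).
assert dict(**{"a": 1}) == {"a": 1}

# Misuse: non-string keys in ** unpacking raise TypeError at call time.
try:
    dict(**{1: 2})
except TypeError:
    print("non-string keywords raise TypeError")
```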

    @vstinner
    Member Author

    First benchmark results on the fast_init branch compared to the default branch. Warning: the patch is still a WIP, and there may still be performance issues (in that case, this benchmark should help to identify them).
    ---
    haypo@speed-python$ python3 -m perf compare_to 2017-01-23_14-02-default-cebc9c7ad195.json fast_init-0_ref_cebc9c7ad195.json -G --min-speed=2
    Slower (6):

    • telco: 13.0 ms +- 0.2 ms -> 13.8 ms +- 0.3 ms: 1.06x slower (+6%)
    • scimark_sor: 391 ms +- 7 ms -> 413 ms +- 11 ms: 1.06x slower (+6%)
    • fannkuch: 938 ms +- 7 ms -> 973 ms +- 13 ms: 1.04x slower (+4%)
    • pickle_dict: 53.7 us +- 3.4 us -> 55.5 us +- 5.0 us: 1.03x slower (+3%)
    • sqlite_synth: 8.00 us +- 0.12 us -> 8.25 us +- 0.14 us: 1.03x slower (+3%)
    • scimark_monte_carlo: 202 ms +- 5 ms -> 207 ms +- 5 ms: 1.03x slower (+3%)

    Faster (19):

    • scimark_lu: 362 ms +- 13 ms -> 335 ms +- 17 ms: 1.08x faster (-8%)
    • chameleon: 23.5 ms +- 0.3 ms -> 21.7 ms +- 0.3 ms: 1.08x faster (-7%)
    • xml_etree_process: 179 ms +- 3 ms -> 168 ms +- 3 ms: 1.07x faster (-6%)
    • pickle_pure_python: 1.03 ms +- 0.02 ms -> 982 us +- 11 us: 1.05x faster (-5%)
    • regex_dna: 283 ms +- 1 ms -> 271 ms +- 1 ms: 1.05x faster (-4%)
    • chaos: 239 ms +- 2 ms -> 230 ms +- 3 ms: 1.04x faster (-4%)
    • django_template: 356 ms +- 5 ms -> 343 ms +- 5 ms: 1.04x faster (-4%)
    • call_simple: 10.7 ms +- 0.4 ms -> 10.3 ms +- 0.3 ms: 1.04x faster (-3%)
    • hexiom: 18.0 ms +- 0.1 ms -> 17.4 ms +- 0.1 ms: 1.04x faster (-3%)
    • pathlib: 42.3 ms +- 0.5 ms -> 40.9 ms +- 0.4 ms: 1.03x faster (-3%)
    • regex_compile: 379 ms +- 2 ms -> 367 ms +- 4 ms: 1.03x faster (-3%)
    • go: 501 ms +- 7 ms -> 485 ms +- 4 ms: 1.03x faster (-3%)
    • xml_etree_generate: 208 ms +- 4 ms -> 202 ms +- 3 ms: 1.03x faster (-3%)
    • deltablue: 14.5 ms +- 0.3 ms -> 14.1 ms +- 0.2 ms: 1.03x faster (-3%)
    • unpickle_pure_python: 652 us +- 8 us -> 634 us +- 9 us: 1.03x faster (-3%)
    • richards: 147 ms +- 2 ms -> 143 ms +- 4 ms: 1.03x faster (-3%)
    • nqueens: 213 ms +- 2 ms -> 208 ms +- 2 ms: 1.03x faster (-2%)
    • raytrace: 1.08 sec +- 0.01 sec -> 1.05 sec +- 0.02 sec: 1.02x faster (-2%)
    • unpickle: 31.0 us +- 0.4 us -> 30.3 us +- 1.0 us: 1.02x faster (-2%)

    Benchmark hidden because not significant (34): (...)
    Ignored benchmarks (5) of 2017-01-23_14-02-default-cebc9c7ad195.json: mako, sympy_expand, sympy_integrate, sympy_str, sympy_sum
    ---

    Reminder: The fast_init branch is based on the issue bpo-29259 (tp_fastcall), so the result combines tp_fastcall and tp_fastnew/fastinit.

    @vstinner
    Member Author

    I implemented more optimizations:

    • type_call() uses FASTCALL
    • object_new() and object_init() use FASTCALL
    • _asyncio.Future new & init use FASTCALL

    As a result, creating an _asyncio.Future is now 1.45x faster:

    haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
    Median +- std dev: [ref] 411 ns +- 8 ns -> [fast_init] 283 ns +- 10 ns: 1.45x faster (-31%)

    Yury> Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement.

    While bm_richards.py no longer calls _PyStack_AsTuple() anywhere in the benchmark (it seems to use FASTCALL throughout), there is simply zero impact on performance :-(

    $ ./python -m perf compare_to richards_ref.json richards_fastinit.json 
    Benchmark hidden because not significant (1): richards

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 26, 2017

    New changeset 8feb6a5ce4c6 by Victor Stinner in branch 'default':
    Issue bpo-29358: Add postcondition checks on types
    https://hg.python.org/cpython/rev/8feb6a5ce4c6

    @vstinner
    Member Author

    While there are obvious speedups on microbenchmarks, the change doesn't speed up the "macro" benchmarks from performance as much as I expected.

    The changes required in typeobject.c to support "tp_new or tp_fastnew" and "tp_init or tp_fastinit" are large and very complex. I'm not sure that my code is correct. The update_one_slot(), add_operators() and PyType_Ready() functions are complex beasts, sadly almost undocumented, except for comments such as this one...

            /* The __new__ wrapper is not a wrapper descriptor,
               so must be special-cased differently.
               If we don't do this, creating an instance will
               always use slot_tp_new which will look up
               __new__ in the MRO which will call tp_new_wrapper
               which will look through the base classes looking
               for a static base and call its tp_new (usually
               PyType_GenericNew), after performing various
               sanity checks and constructing a new argument
               list.  Cut all that nonsense short -- this speeds
               up instance creation tremendously. */
            specific = (void *)type->tp_new;
            /* XXX I'm not 100% sure that there isn't a hole
               in this reasoning that requires additional
               sanity checks.  I'll buy the first person to
               point out a bug in this reasoning a beer. */
    

    How am I supposed to be confident when faced with such a comment :-D

    I expect that calling a function (tp_call) is a more common operation than instantiating a new object (tp_new + tp_init), so I don't think that the overall change is worth it.

    For all these reasons, I close the issue as REJECTED.

    @serhiy-storchaka
    Member

    As for update_one_slot() see also bpo-5322 and bpo-25731.

    @vstinner
    Member Author

    Serhiy Storchaka:

    As for update_one_slot() see also bpo-5322 and bpo-25731.

    Oh, thanks for the pointers! Now I understand much better these bugs.

    I'm quite sure that there are still flaws in this code when a type is modified after PyType_Ready(), as in sysmodule.c:

    /* prevent user from creating new instances */
    FlagsType.tp_init = NULL;
    FlagsType.tp_new = NULL;
    

    But each time I had to dig into typeobject.c, my head is going to explode :-D I may try to understand one more time, but not today ;-)
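A rough Python-level analogue of what that sysmodule.c snippet does (clearing tp_new/tp_init so no new instances can be created) would be overriding __new__ to refuse construction. This is only an illustration of the intent; the C code clears the slots directly:

```python
# Python-level analogue of "FlagsType.tp_new = NULL": make the type
# refuse to create new instances after its initial setup.
class Flags:
    def __new__(cls, *args, **kwargs):
        raise TypeError("cannot create 'Flags' instances")
```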

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022