
Tuple creation is too slow #67695

Closed
serhiy-storchaka opened this issue Feb 24, 2015 · 25 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@serhiy-storchaka
Member

BPO 23507
Nosy @rhettinger, @pitrou, @scoder, @vstinner, @ezio-melotti, @serhiy-storchaka, @1st1, @jstasiak, @serprex
Files
  • reuse_argtuples.patch
  • reuse_argtuples_2.patch
  • bench_builtins.py
  • reuse_argtuples_3.patch
  • stack_overflow.py
  • bench_builtins.py
  • reuse_argtuples_4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2016-12-18.13:02:51.224>
    created_at = <Date 2015-02-24.08:08:32.468>
    labels = ['interpreter-core', 'performance']
    title = 'Tuple creation is too slow'
    updated_at = <Date 2016-12-18.13:02:51.222>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2016-12-18.13:02:51.222>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-12-18.13:02:51.224>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2015-02-24.08:08:32.468>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['38223', '38244', '42569', '42571', '45725', '45726', '45953']
    hgrepos = []
    issue_num = 23507
    keywords = ['patch']
    message_count = 25.0
    messages = ['236480', '236654', '236656', '236661', '236691', '236692', '236696', '240693', '240845', '240862', '263991', '264012', '264026', '264061', '282179', '282180', '282186', '282187', '282188', '282190', '282195', '282214', '282217', '282229', '283554']
    nosy_count = 10.0
    nosy_names = ['rhettinger', 'pitrou', 'scoder', 'vstinner', 'ezio.melotti', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'jstasiak', 'Demur Rumed']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue23507'
    versions = ['Python 3.5']

    @serhiy-storchaka
    Member Author

    Currently tuple creation uses free lists, but this optimization is not enough. By reusing a cached tuple for the arguments of repeatedly called functions, the performance of some builtins can be increased significantly.

    $ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(filter(f, s))"
    Unpatched: 1000 loops, best of 3: 773 usec per loop
    Patched:   1000 loops, best of 3: 558 usec per loop
    
    $ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(map(f, s))"
    Unpatched: 1000 loops, best of 3: 689 usec per loop
    Patched:   1000 loops, best of 3: 556 usec per loop
    
    $ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "sorted(s, key=f)"
    Unpatched: 1000 loops, best of 3: 758 usec per loop
    Patched:   1000 loops, best of 3: 550 usec per loop

    The same effect can be achieved for itertools functions.

    I don't propose to commit this complicated patch, but these results can be used as a guide for optimizing tuple creation. It is surprising to me that this patch has any effect at all.
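
    For illustration, here is a minimal C sketch of the caching trick (the helper name and the cached slot are illustrative, not taken from the patch; the real diff from reuse_argtuples_2.patch is quoted later in this thread). The core idea: keep an empty 1-tuple around, fill it, call, and reclaim it only when the refcount proves nobody else kept it:

        #include <Python.h>

        /* Reuse a 1-element argument tuple across calls.  'cached' would be a
         * field in the iterator object holding an empty tuple whose sole
         * reference we own.  Note the Py_REFCNT() == 1 check: without it,
         * reusing a tuple that the callee saved would be incorrect. */
        static PyObject *
        call_with_cached_argtuple(PyObject *func, PyObject *item,
                                  PyObject **cached)
        {
            PyObject *args = *cached;
            *cached = NULL;
            if (args == NULL) {
                args = PyTuple_New(1);
                if (args == NULL)
                    return NULL;
            }
            Py_INCREF(item);
            PyTuple_SET_ITEM(args, 0, item);   /* steals the new reference */
            PyObject *result = PyObject_Call(func, args, NULL);
            if (Py_REFCNT(args) == 1) {
                /* The callee did not keep args alive: empty the tuple and
                 * park it in the cache for the next call. */
                PyTuple_SET_ITEM(args, 0, NULL);
                Py_DECREF(item);
                *cached = args;
            }
            else {
                /* Someone saved a reference to args: it is no longer safe
                 * to reuse, so give the tuple up entirely. */
                Py_DECREF(args);
            }
            return result;
        }

    Note that the cached tuple transiently holds a NULL item slot; this is exactly the kind of fragile invariant that is mentioned later in this thread as a source of segfaults (bpo-26811).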

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 24, 2015
    @serhiy-storchaka serhiy-storchaka changed the title Tuple creation is two slow Tuple creation is too slow Feb 24, 2015
    @rhettinger
    Contributor

    I had done something like this for zip() long ago and had gotten a nice speed boost. I remember looking at whether it should be applied elsewhere but don't remember why I decided against it.

    @rhettinger rhettinger self-assigned this Feb 26, 2015
    @serhiy-storchaka
    Member Author

    Note the check for Py_REFCNT() == 1. Without it the code would be incorrect.

    @serhiy-storchaka
    Member Author

    Fixed bugs found by Victor. Thank you Victor.

    Yes, zip() already uses a similar approach.

    @pitrou
    Member

    pitrou commented Feb 26, 2015

    Maybe we want a facility to create on-stack static-size tuples?

    @pitrou
    Member

    pitrou commented Feb 26, 2015

    How many functions can benefit from this approach, though?

    @serhiy-storchaka
    Member Author

    Maybe we want a facility to create on-stack static-size tuples?

    There is no guarantee that the called function doesn't save a reference to args.

    How many functions can benefit from this approach, though?

    From hand-written caching? Any function that repeatedly calls another function many times. In addition to sorted()/list.sort() with the key argument, filter() and map(), these are min() and max() with the key argument, and some iterators from itertools (groupby(), dropwhile(), takewhile(), accumulate(), filterfalse()). Maybe also some functions that are thin wrappers around a special method (e.g. round()).

    But it is more interesting to investigate why PyTuple_New() is so slow in comparison with hand-written caching. If it can be sped up, then much more code can benefit from this.

    @pitrou
    Member

    pitrou commented Apr 13, 2015

    By the way, is PyTuple_New() really the culprit here? Or is it the va_list handling in PyObject_CallFunctionObjArgs()?
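
    For context, here is a simplified sketch of what PyObject_CallFunctionObjArgs() has to do on every call (a teaching reduction, not the actual CPython source): the varargs are walked once to count them and again to copy them into a freshly allocated tuple, so every call pays for both the va_list handling and PyTuple_New():

        #include <Python.h>
        #include <stdarg.h>

        static PyObject *
        call_objargs_sketch(PyObject *func, ...)   /* NULL-terminated args */
        {
            va_list va, count_va;
            Py_ssize_t n = 0;

            va_start(va, func);
            va_copy(count_va, va);
            while (va_arg(count_va, PyObject *) != NULL)
                n++;                                /* first pass: count */
            va_end(count_va);

            PyObject *args = PyTuple_New(n);        /* heap allocation */
            if (args == NULL) {
                va_end(va);
                return NULL;
            }
            for (Py_ssize_t i = 0; i < n; i++) {
                PyObject *arg = va_arg(va, PyObject *);
                Py_INCREF(arg);
                PyTuple_SET_ITEM(args, i, arg);     /* second pass: copy */
            }
            va_end(va);

            PyObject *result = PyObject_Call(func, args, NULL);
            Py_DECREF(args);                        /* tuple thrown away */
            return result;
        }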

    @rhettinger
    Contributor

    Unassigning because I don't have an interest in going further down this path. I used this technique with zip() because tuple reuse nicely benefited the common case of "for a,b,c in zip(alist, blist, clist): ...".

    On another thread, I did see mention of removing the memset() from tuple creation and that seems like a promising direction.

    @rhettinger rhettinger removed their assignment Apr 14, 2015
    @serhiy-storchaka
    Member Author

    map() doesn't use PyObject_CallFunctionObjArgs(). Comparing the results for map() with the results for filter() and list.sort(), it seems that PyTuple_New() is responsible for 24% and PyObject_CallFunctionObjArgs() adds another 14%.

    @vstinner
    Member

    Hi, I started to work on a new top-down approach that completely avoids temporary tuples/dicts: "FASTCALL", issue bpo-26814. As you may expect, the patch is much larger than Serhiy's patches attached to this issue, but IMHO the final code is much simpler: it doesn't have to use complex tricks on the reference count or set tuple items to NULL (which led to segfaults like issue bpo-26811).

    Antoine Pitrou: "Maybe we want a facility to create on-stack static-size tuples?"

    My current implementation of FASTCALL uses a buffer allocated on the stack to handle calls with up to 10 parameters (10 positional parameters, or 5 keyword parameters, since each keyword uses 2 PyObject* slots).
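
    Roughly, the caller side looks like this (a sketch assuming a fastcall-style entry point that takes a C array plus a count, as the private _PyObject_FastCall() does in 3.6; the function name call_two_args_fast and the constant are illustrative):

        #include <Python.h>

        #define SMALL_STACK_SIZE 10   /* the limit quoted above; larger
                                         calls fall back to the heap */

        static PyObject *
        call_two_args_fast(PyObject *func, PyObject *a, PyObject *b)
        {
            /* The argument "tuple" lives on the C stack: no PyTuple_New(),
             * no refcount tricks, nothing for the callee to accidentally
             * keep alive. */
            PyObject *small_stack[SMALL_STACK_SIZE];
            small_stack[0] = a;
            small_stack[1] = b;
            return _PyObject_FastCall(func, small_stack, 2);
        }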

    Antoine Pitrou: "How many functions can benefit from this approach, though?"

    In my experimental FASTCALL branch, slot wrappers, the PyObject_Call*() functions (except PyObject_Call() itself!), ceval.c, etc. benefit from FASTCALL. The question is not really how many functions benefit from FASTCALL, but rather how many changes are worth making to avoid temporary tuples/dicts. For example, right now I decided not to add a FASTCALL flavor of tp_new and tp_init (instantiating a class or creating a new type) because it becomes much more complex when you have to handle inheritance.

    Maybe my FASTCALL requires too many changes and is overkill. Maybe we need to find a compromise between FASTCALL (issue bpo-26814) and Serhiy's changes, which are limited to a few functions.

    Today it's too early to decide, but I'm having fun with my FASTCALL experiment ;-) Slot wrappers are 40% faster, getting a namedtuple attribute is 25% faster (whereas Serhiy already optimized this specific case!), etc.

    @vstinner
    Member

    vstinner commented Apr 22, 2016

    msg264009, Serhiy Storchaka: "Could you compare filter(), map() and sorted() performance with your patch and with the bpo-23507 patch?"

    Here are the results of bench_builtins.py. The performance looks to be about the same, though I expect better results with less trivial callbacks (where fastcall would avoid the creation of other temporary tuples).

    ------------------------------------------------+-------------+--------------
    Tests                                           |   argtuples |      fastcall
    ------------------------------------------------+-------------+--------------
    filter(lambda x: x, range(1000))                | 80.2 us (*) | 72.7 us (-9%)
    map(lambda x: x, range(1000))                   | 79.5 us (*) |       79.9 us
    map(lambda x, y: x+y, range(1000), range(1000)) |  111 us (*) |        109 us
    sorted(list, key=lambda x: x)                   | 82.6 us (*) | 75.7 us (-8%)
    sorted(list)                                    | 15.7 us (*) |       15.6 us
    ------------------------------------------------+-------------+--------------
    Total                                           |  369 us (*) |        353 us
    ------------------------------------------------+-------------+--------------
    

    Comparison to the original Python 3.6:

    ------------------------------------------------+-------------+----------------+---------------
    Tests                                           |    original |      argtuples |       fastcall
    ------------------------------------------------+-------------+----------------+---------------
    filter(lambda x: x, range(1000))                |  109 us (*) | 80.2 us (-27%) | 72.7 us (-33%)
    map(lambda x: x, range(1000))                   | 96.9 us (*) | 79.5 us (-18%) | 79.9 us (-18%)
    map(lambda x, y: x+y, range(1000), range(1000)) |  126 us (*) |  111 us (-12%) |  109 us (-13%)
    sorted(list, key=lambda x: x)                   |  114 us (*) | 82.6 us (-28%) | 75.7 us (-34%)
    sorted(list)                                    | 16.1 us (*) |        15.7 us |        15.6 us
    ------------------------------------------------+-------------+----------------+---------------
    Total                                           |  463 us (*) |  369 us (-20%) |  353 us (-24%)
    ------------------------------------------------+-------------+----------------+---------------
    

    The difference lies more in the changes to bltinmodule.c. Example with filter_next():

    Using fastcall:

    -            good = PyObject_CallFunctionObjArgs(lz->func, item, NULL);
    +            good = PyObject_CallArg1(lz->func, item);

    reuse_argtuples_2.patch:

    -            good = PyObject_CallFunctionObjArgs(lz->func,
    -                                                item, NULL);
    +            PyObject *argtuple = lz->argtuple;
    +            lz->argtuple = NULL;
    +            if (argtuple == NULL) {
    +                argtuple = PyTuple_New(1);
    +                if (argtuple == NULL) {
    +                    Py_DECREF(item);
    +                    return NULL;
    +                }
    +                Py_INCREF(argtuple);
    +            }
    +            assert(Py_REFCNT(argtuple) == 2);
    +            assert(Py_SIZE(argtuple) == 1);
    +            assert(PyTuple_GET_ITEM(argtuple, 0) == NULL);
    +            PyTuple_SET_ITEM(argtuple, 0, item);
    +            good = PyObject_Call(lz->func, argtuple, NULL);
    +            if (Py_REFCNT(argtuple) == 2) {
    +                PyTuple_SET_ITEM(argtuple, 0, NULL);
    +                if (lz->argtuple == NULL)
    +                    lz->argtuple = argtuple;
    +                else {
    +                    Py_DECREF(argtuple);
    +                    Py_DECREF(argtuple);
    +                }
    +            }
    +            else {
    +                Py_INCREF(item);
    +                Py_DECREF(argtuple);
    +                Py_DECREF(argtuple);
    +            }

    (reuse_argtuples_2.patch requires a few other changes related to filter_next().)

    @vstinner
    Member

    reuse_argtuples_3.patch: rebased reuse_argtuples_2.patch.

    @serprex
    Mannequin

    serprex mannequin commented Apr 23, 2016

    This code could be made to look a lot less hackish if the potential tuple reuse were abstracted behind a PyTuple_UniqueOrNew(PyTuple) helper, which returns its argument if REFCNT == 1 and otherwise allocates a tuple of the same size and returns that, decrefing the original when REFCNT != 1.
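
    In code, the proposed helper would look roughly like this (PyTuple_UniqueOrNew() is hypothetical and was never added to CPython; this merely restates the comment above):

        #include <Python.h>

        static PyObject *
        PyTuple_UniqueOrNew(PyObject *tuple)
        {
            if (Py_REFCNT(tuple) == 1) {
                /* Sole owner: the caller may overwrite the items in place. */
                return tuple;
            }
            /* Shared: drop our reference and hand back a fresh tuple of the
             * same size (NULL on allocation failure). */
            Py_ssize_t size = PyTuple_GET_SIZE(tuple);
            Py_DECREF(tuple);
            return PyTuple_New(size);
        }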

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 1, 2016

    New changeset b9c9691c72c5 by Victor Stinner in branch 'default':
    Replace PyObject_CallFunctionObjArgs() with fastcall
    https://hg.python.org/cpython/rev/b9c9691c72c5

    @serhiy-storchaka
    Member Author

    I think it would be better to optimize PyObject_CallFunctionObjArgs().

    @serhiy-storchaka
    Member Author

    Here is an example that shows increased stack consumption.

    With the old code it counts to 74700; with the patch it counts only to 65400.

    Please revert your changes.

    @vstinner
    Member

    vstinner commented Dec 1, 2016

    I rewrote bench_builtins.py to use my new perf module.

    Python 3.7 is between 1.27x and 1.42x faster than Python 3.5, but sorted(list) is slower. I don't know why sorted(list) is slower: it doesn't use a key function, and so should not be impacted by the FASTCALL changes made since Python 3.6.

    Benchmark results of Python 3.7 (include latest FASTCALL patches) compared to Python 3.5 (no FASTCALL):

    filter(lambda x: x, range(1000)): Median +- std dev: [3.5] 126 us +- 7 us -> [3.7] 94.6 us +- 4.8 us: 1.34x faster (-25%)
    map(lambda x, y: x+y, range(1000), range(1000)): Median +- std dev: [3.5] 163 us +- 10 us -> [3.7] 115 us +- 7 us: 1.42x faster (-30%)
    map(lambda x: x, range(1000)): Median +- std dev: [3.5] 115 us +- 6 us -> [3.7] 90.9 us +- 5.4 us: 1.27x faster (-21%)
    sorted(list): Median +- std dev: [3.5] 17.5 us +- 1.0 us -> [3.7] 19.7 us +- 1.1 us: 1.12x slower (+12%)
    sorted(list, key=lambda x: x): Median +- std dev: [3.5] 122 us +- 10 us -> [3.7] 91.8 us +- 5.7 us: 1.33x faster (-25%)

    Note: you need the development version of perf to run the script (future version 0.9.2).

    @vstinner
    Member

    vstinner commented Dec 1, 2016

    Serhiy: "I don't propose to commit this complicated patch, but these results can be used as a guide to the optimization of tuple creating. It is surprising to me that this patch has any effect at all."

    A new "fast call" calling convention was added to Python 3.6 to avoid the creation of temporary tuples to pass function arguments. Benchmarks confirm that Python 3.7 is faster than Python 3.5 of functions that Serhiy suggested to optimize. I confirm that I was also surprised that fast call has a significant effect on performance!

    Fast calls don't require complex code like reuse_argtuples_3.patch, which requires carefully checking the reference counter. In the past, a similar optimization on property_descr_get() introduced complex bugs: see issue bpo-24276 and issue bpo-26811. Fast calls don't have such issues.

    The implementation of fast calls is currently incomplete: the tp_new, tp_init and tp_call slots still require a tuple and a dict for positional and keyword arguments. I'm working on an enhancement for Python 3.7 to support fast calls for these slots as well. In the meantime, callbacks using tp_call don't yet benefit from fast calls.

    That's why I had to keep the tuple/refcount optimization in property_descr_get(), so as not to introduce a performance regression.

    I consider that this issue can now be closed as "fixed".

    @vstinner vstinner closed this as completed Dec 1, 2016
    @serhiy-storchaka
    Member Author

    Victor, your changes introduced a regression.

    @vstinner
    Member

    vstinner commented Dec 1, 2016

    Serhiy Storchaka: "Victor, your changes introduced a regression."

    I suppose that you are referring to sorted(list), which seems to be slower. Do you have a specific change in mind? As I wrote, it doesn't use a key function and so should not be impacted by my work on fast calls. I opened a new issue, bpo-28852, to track this performance regression.

    @vstinner
    Member

    vstinner commented Dec 2, 2016

    I opened a new issue bpo-28852 to track this performance regression.

    So can I close this issue again?

    @vstinner vstinner closed this as completed Dec 2, 2016
    @serhiy-storchaka
    Member Author

    No, I'm referring to the crash. Code that worked before your changes now crashes. Please revert your changes and find a way to optimize function calls without increasing stack consumption.

    @vstinner
    Member

    vstinner commented Dec 2, 2016

    haypo: "status: open -> closed"

    Oops, it's a mistake, sorry. I only wanted to ask the question, not to close the issue.

    Serhiy Storchaka: "No, I'm referring to the crashing. The code that worked before your changes now are crashing."

    Sorry, I didn't notice that you had attached a script! I opened issue bpo-28858.

    @serhiy-storchaka
    Member Author

    Here is a rebased patch (just for the history). Caching the argtuple is not faster (or is even slower) than using fast call.

    The regression in stack consumption is fixed in bpo-28858.

    Thanks Victor!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022