
Avoid copy in call_function_var when no extra stack args are passed #70989

Closed
llllllllll mannequin opened this issue Apr 19, 2016 · 20 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

@llllllllll
Mannequin

llllllllll mannequin commented Apr 19, 2016

BPO 26802
Nosy @vstinner, @vadmium, @serhiy-storchaka, @1st1, @MojoVampire, @llllllllll
Files
  • call-function-var.patch
  • call-function-var.patch
  • call-function-var-3.patch
  • bench.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-04-19.20:47:10.666>
    created_at = <Date 2016-04-19.01:10:12.132>
    labels = ['interpreter-core', 'performance']
    title = 'Avoid copy in call_function_var when no extra stack args are passed'
    updated_at = <Date 2016-04-20.07:55:49.849>
    user = 'https://github.com/llllllllll'

    bugs.python.org fields:

    activity = <Date 2016-04-20.07:55:49.849>
    actor = 'vstinner'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-04-19.20:47:10.666>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2016-04-19.01:10:12.132>
    creator = 'llllllllll'
    dependencies = []
    files = ['42512', '42515', '42521', '42524']
    hgrepos = []
    issue_num = 26802
    keywords = ['patch']
    message_count = 20.0
    messages = ['263702', '263712', '263717', '263741', '263742', '263750', '263751', '263754', '263765', '263770', '263771', '263772', '263774', '263775', '263777', '263778', '263782', '263784', '263808', '263809']
    nosy_count = 7.0
    nosy_names = ['vstinner', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'llllllllll']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26802'
    versions = ['Python 3.6']

    @llllllllll
    Mannequin Author

    llllllllll mannequin commented Apr 19, 2016

    When star unpacking positional arguments into a function, we can avoid a copy if and only if there are no extra arguments coming from the stack. Syntactically this looks like: f(*args)

    This reduces the overhead of the call by about 25% (~30 ns on my machine) based on some light profiling of just the * unpacking. I can imagine this having a slight impact on code that uses this feature in a loop.

    The cost is minimal when we do need to make the copy, because the extra int != 0 check is dwarfed by the allocation; I did not notice a significant difference between the slow path and the original code.

    The macro benchmark suite did not show a difference on my machine, but the patch does not add much complexity either.
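
    (For context, a minimal sketch of the fast path described above; the helper name, signature, and surrounding details are illustrative rather than taken from the committed patch, and stararg is assumed to have already been converted to a tuple by the caller, as ceval.c does for non-tuple sequences.)

        #include <Python.h>

        /* Minimal sketch, not the committed patch: build the positional
         * argument tuple for a CALL_FUNCTION_VAR call.  "stack" holds the
         * nstack extra positional arguments pushed before *stararg, and
         * stararg is the unpacked object (already a tuple or tuple
         * subclass), or NULL. */
        static PyObject *
        build_call_args(PyObject **stack, Py_ssize_t nstack, PyObject *stararg)
        {
            Py_ssize_t nstar = stararg ? PyTuple_GET_SIZE(stararg) : 0;
            PyObject *callargs;
            Py_ssize_t i;

            if (nstack == 0 && stararg != NULL && PyTuple_CheckExact(stararg)) {
                /* f(*args) with an exact tuple and no extra stack arguments:
                   reuse the tuple instead of copying it. */
                Py_INCREF(stararg);
                return stararg;
            }

            /* Unchanged slow path: allocate a fresh tuple and copy the stack
               arguments followed by the unpacked items. */
            callargs = PyTuple_New(nstack + nstar);
            if (callargs == NULL)
                return NULL;
            for (i = 0; i < nstack; i++) {
                Py_INCREF(stack[i]);
                PyTuple_SET_ITEM(callargs, i, stack[i]);
            }
            for (i = 0; i < nstar; i++) {
                PyObject *item = PyTuple_GET_ITEM(stararg, i);
                Py_INCREF(item);
                PyTuple_SET_ITEM(callargs, nstack + i, item);
            }
            return callargs;
        }

    The slow path below the guard is the pre-existing copy, so the only cost added there is the integer check mentioned above.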

    @llllllllll llllllllll mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Apr 19, 2016
    @SilentGhost SilentGhost mannequin added the type-bug An unexpected behavior, bug, or error label Apr 19, 2016
    @serhiy-storchaka
    Member

    stararg can be not an exact tuple, but an instance of a tuple subclass.

    Could you provide a microbenchmark that shows the effect of the optimization?

    @serhiy-storchaka serhiy-storchaka added performance Performance or resource usage and removed type-bug An unexpected behavior, bug, or error labels Apr 19, 2016
    @llllllllll
    Mannequin Author

    llllllllll mannequin commented Apr 19, 2016

    In _PyEval_EvalCodeWithName we will have already pulled the underlying array out of the tuple subclass, and then we will reconstruct the vararg value in the locals from this array (see the sketch below). This means that with this patch we can get:

    >>> class C(tuple): pass
    ... 
    >>> def f(*args): print(type(args))
    ... 
    >>> f(*C([1, 2, 3]))
    <class 'tuple'>
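
    (For reference, a self-contained sketch of that rebuild step; the helper name and signature are illustrative, not the _PyEval_EvalCodeWithName source. The point is that the callee's *args value is always built as a fresh, exact tuple from a raw C array of arguments, whatever object was unpacked at the call site.)

        #include <Python.h>

        /* Illustrative only: rebuild the callee's *args value from the raw
         * C array of positional arguments, roughly as the frame setup does.
         * n is the number of named positional parameters; everything after
         * them lands in *args as a fresh, exact PyTuple. */
        static PyObject *
        rebuild_varargs(PyObject **args, Py_ssize_t argcount, Py_ssize_t n)
        {
            Py_ssize_t i;
            PyObject *u = PyTuple_New(argcount - n);
            if (u == NULL)
                return NULL;
            for (i = n; i < argcount; i++) {
                Py_INCREF(args[i]);
                PyTuple_SET_ITEM(u, i - n, args[i]);
            }
            return u;
        }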

    The only case where we get potentially strange behavior is if we have a type whose tp_call did not forward to PyArg_Parse* or the PyTupleObject API and instead went through the abstract object interface. I ran the tests and the benchmark suite and did not run into any issues here, but I can see how this could be potentially problematic behavior.

    I have updated the code to also do a PyTuple_CheckExact against stararg. The tests I ran are:

    timeit("h(*(a, b, c, d, e, f, g))", 'def h(a, b, c, d, e, f, g): pass\na, b, c, d, e, f, g = range(7)', number=int(10e7))

    default: 17.191688070015516
    patch: 14.060781285981648

    There is also a really great case where you star unpack a tuple of literals, which gets pushed into co_consts:

    timeit("f(*('a', 'b', 'c', 'd', 'e', 'f', 'g'))", 'def f(a, b, c, d, e, f, g): pass', number=int(10e7))

    default: 13.842308042978402
    patch: 9.696972723002546

    This case seems far less common, but maybe someone has something like:

    a = 1, 2, 3
    ...
    f(*a)

    @llllllllll llllllllll mannequin added performance Performance or resource usage and removed performance Performance or resource usage labels Apr 19, 2016
    @vstinner
    Member

    +    if (!nstar) {
    +        /* There are no positional arguments on the stack or in a
    +           sequence that was unpacked. */
    +        return PyTuple_New(0);
    +    }

    It's possible that stararg is already an empty tuple. Is it worth reusing it?

    +    if (!nstar) {
    +        if (stararg != NULL && PyTuple_CheckExact(stararg)) {
    +            Py_INCREF(stararg);
    +            return stararg;
    +        }
    +        else {
    +            /* There are no positional arguments on the stack or in a
    +               sequence that was unpacked. */
    +            return PyTuple_New(0);
    +        }
    +    }

    @vstinner
    Member

    call-function-var.patch LGTM.

    @vstinner
    Member

    FYI I ran the full test suite with the second attached patch and all tests pass.

    By the way, please add a suffix like -2, -3, etc. to your patches next time. It's easier to identify them if they have a different name ;-)

    @llllllllll
    Mannequin Author

    llllllllll mannequin commented Apr 19, 2016

    It's possible that stararg is already an empty tuple. Is it worth reusing it?

    PyTuple_New(0) should return the cached empty tuple in most cases:

    #if PyTuple_MAXSAVESIZE > 0
        if (size == 0 && free_list[0]) {
            op = free_list[0];
            Py_INCREF(op);
    #ifdef COUNT_ALLOCS
            tuple_zero_allocs++;
    #endif
            return (PyObject *) op;
        }

    Looking at this again, I think that checking whether stararg is non-NULL is clearer, so I will update the patch (as call-function-var-3.patch). I cannot rewrite the if exactly as the control flow you show, because that would cause non-tuple subclasses to forward a stararg of () instead of copying it into a normal tuple when no arguments are passed on the stack.

    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Apr 19, 2016

    BTW, for a realistic use case that would be sped up by this patch (possibly a lot), consider a recursive function that continually does a "virtual" pop of its first argument and receives the rest as varargs, which are then unpacked for the recursive call, e.g. a function to make a multidimensional list populated with zeroes:

        def nestedzeroes(firstdim, *dims):
            '''Make multidimensional list populated with 0s'''
            if not dims:
                return [0] * firstdim
            return [nestedzeroes(*dims) for _ in range(firstdim)]

    It's somewhat less artificial in that it multiplies the effect of the optimization, producing a microbenchmark-like scenario where the work involved in the star-args-only function calls is an appreciable percentage of the total. Running nestedzeroes(5, 5, 5) to create a 5x5x5 structure would involve 30 recursive calls; nestedzeroes(10, 10, 10), 110 recursive calls.

    I've actually used code like this (admittedly rarely) to avoid the hassle of inlining a many times nested set of list comprehensions to initialize a multidimensional list.

    @serhiy-storchaka
    Member

    call-function-var-3.patch LGTM. Thank you for your contribution, Joe.

    @serhiy-storchaka serhiy-storchaka self-assigned this Apr 19, 2016
    @python-dev
    Mannequin

    python-dev mannequin commented Apr 19, 2016

    New changeset 15b20201cfa5 by Serhiy Storchaka in branch 'default':
    Issue bpo-26802: Optimized calling a function with *args only positional arguments.
    https://hg.python.org/cpython/rev/15b20201cfa5

    @serhiy-storchaka
    Member

    FYI, there is a proposal to construct the arguments tuple and dict in bytecode instead of in ceval.c. This would significantly simplify CALL_FUNCTION (which would receive just one optional tuple and one optional dict). Likely this idea will be implemented after the change to wordcode.
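
    (Purely as an illustration of that calling convention, not of the proposal's actual implementation: with the positional arguments already packed into one tuple and the keyword arguments into one optional dict by preceding bytecode, the call opcode could go straight through the existing C call API.)

        #include <Python.h>

        /* Illustrative sketch of the proposed calling convention: by the
         * time the call opcode runs, the bytecode has already built one
         * tuple of positional arguments and (optionally) one dict of
         * keyword arguments, so the eval loop can call directly. */
        static PyObject *
        simplified_call(PyObject *func, PyObject *args, PyObject *kwargs)
        {
            /* args must be a tuple; kwargs may be NULL or a dict. */
            return PyObject_Call(func, args, kwargs);
        }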

    @vstinner
    Member

    Good news. On a microbenchmark, the patch makes func(*tuple) between 14% and 18% faster. See attached bench.py. I ran the benchmark on Linux with isolated CPUs to get reliable results.
    https://haypo-notes.readthedocs.org/microbenchmark.html

    $ ./python-orig ~/prog/HG/misc/python/benchmark.py script bench.py --file=orig
    $ ./python-patched ~/prog/HG/misc/python/benchmark.py script bench.py --file=patch
    $ ./python-orig ~/prog/HG/misc/python/benchmark.py compare_to orig patch

    Common platform:
    CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
    SCM: hg revision=75f40345d784 branch=default date="2016-04-19 22:29 +0200"
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    Timer: time.perf_counter
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    Python unicode implementation: PEP-393
    CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
    Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three

    Platform of campaign orig:
    Timer precision: 62 ns
    Python version: 3.6.0a0 (default:75f40345d784, Apr 19 2016, 23:07:49) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
    Date: 2016-04-19 23:08:21

    Platform of campaign patch:
    Timer precision: 61 ns
    Python version: 3.6.0a0 (default:084d5315dd50, Apr 19 2016, 23:06:38) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
    Date: 2016-04-19 23:08:39

    -----------------------------------------------------+------------+--------------
    Tests                                                 | orig       | patch
    -----------------------------------------------------+------------+--------------
    def func(a, b, c): pass args=(1, 2, 3); func(*args)  | 127 ns (*) | 104 ns (-18%)
    def func(*args): pass args=(1, 2, 3); func(*args)    | 147 ns (*) | 126 ns (-14%)
    def func(*args): pass args=(2, 3, 4); func(1, *args) | 154 ns (*) |        154 ns
    -----------------------------------------------------+------------+--------------
    Total                                                 | 429 ns (*) | 384 ns (-10%)
    -----------------------------------------------------+------------+--------------

    @vstinner
    Member

    New changeset 15b20201cfa5 by Serhiy Storchaka in branch 'default':

    Oh, I was cooking a commit; you were faster than me :-)

    IMHO it's worth mentioning the optimization in Misc/NEWS, or maybe even in What's New in Python 3.6.

    I prepared the text:

    Issue bpo-26802: Optimize function calls only using unpacking like func(*tuple) (no other positional argument, no keyword): avoid copying the tuple. Such function calls are up to 15% faster. Patch written by Joe Jevnik.

    @vstinner
    Member

    FYI, there is a proposal to construct the arguments tuple and dict in bytecode instead of in ceval.c. This would significantly simplify CALL_FUNCTION (which would receive just one optional tuple and one optional dict). Likely this idea will be implemented after the change to wordcode.

    Wordcode is issue bpo-26647.

    Do you have a reference to the more efficient implementation of CALL_FUNCTION? I vaguely recall this idea.

    --

    It was also proposed to pass keyword arguments as positional arguments: replace func(a=1, b=2) with func(1, 2) for "def func(a, b): ...". But this optimization requires something like FAT Python to disable it if the function is replaced at runtime.

    This optimization is more complex; maybe it's not worth it.

    @serhiy-storchaka
    Member

    Do you have a reference to the more efficient implementation of CALL_FUNCTION?

    Actually it was about MAKE_FUNCTION. But I think the same is applicable to CALL_FUNCTION.

    http://comments.gmane.org/gmane.comp.python.devel/157321

    @llllllllll
    Mannequin Author

    llllllllll mannequin commented Apr 19, 2016

    CALL_FUNCTION doesn't use extended arg; I don't see what we get by adding some opcode to convert the sequence into a tuple. We also don't want to box up the stack arguments into a tuple, because when we do a CALL_FUNCTION to a Python function we just copy the stack[-n:] elements into the locals of the next frame. Emitting a BUILD_TUPLE here would be an unneeded allocation.
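
    (A minimal sketch of that point, with illustrative names rather than the ceval.c source: for a plain positional call to a Python function, the evaluator copies the top n value-stack slots directly into the new frame's local slots, so no intermediate tuple is ever built.)

        #include <Python.h>

        /* Illustrative only: how plain positional arguments reach a Python
         * callee.  The n topmost value-stack slots are copied directly into
         * the new frame's fastlocals array; boxing them into a tuple first
         * would just add an allocation. */
        static void
        copy_stack_args_to_frame(PyObject **stack_top, Py_ssize_t n,
                                 PyObject **fastlocals)
        {
            PyObject **args = stack_top - n;   /* the stack[-n:] elements */
            Py_ssize_t i;
            for (i = 0; i < n; i++) {
                Py_INCREF(args[i]);
                fastlocals[i] = args[i];       /* becomes the callee's i-th local */
            }
        }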

    @python-dev
    Mannequin

    python-dev mannequin commented Apr 19, 2016

    New changeset 6535481d610e by Victor Stinner in branch 'default':
    Optimize func(*tuple) function call
    https://hg.python.org/cpython/rev/6535481d610e

    @vstinner
    Member

    New changeset 6535481d610e by Victor Stinner in branch 'default':
    Optimize func(*tuple) function call
    https://hg.python.org/cpython/rev/6535481d610e

    Oops, Mercurial is too hard for me /o\ I didn't want to push this change, since Serhiy had already pushed the change itself. I asked Serhiy what he thinks about documenting the optimization; well, now at least it's documented in NEWS. But I'm still interested to know whether it's worth mentioning the change in What's New in Python 3.6.

    @serhiy-storchaka
    Member

    I was not sure about documenting this change in Misc/NEWS, but I think that it is not worth mentioning in What's New. This is a very minor change; its effect on any real code is negligible. I pushed it only because it is trivial, is unlikely to have a negative effect, and there is hope that the accumulated effect of several similar minor optimizations can be noticeable.

    A simple refactoring or bug fix can have a comparable performance effect on a large program, but we don't mention such a regression in What's New.

    @vstinner
    Member

    Serhiy: "I was not sure about documenting this change in Misc/NEWS, but I think that it is not worth mentioning in What's New. This is a very minor change; its effect on any real code is negligible. I pushed it only because it is trivial, is unlikely to have a negative effect, and there is hope that the accumulated effect of several similar minor optimizations can be noticeable."

    I documented other similar optimizations which probably have a negligible effect. I also expect a visible effect when other optimizations are accumulated.

    https://docs.python.org/dev/whatsnew/3.6.html#optimizations

    It is also a deliberate choice to document optimizations: it communicates the fact that we *are working* on performance. It looks like users are happy to see work being done in this area:
    https://twitter.com/TalkPython/status/720704770198126594/photo/1

    The drawback is perhaps that users may have overly high expectations for these optimizations.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022