
Argument Clinic: inline parsing code for functions with only positional parameters #79763

Closed
serhiy-storchaka opened this issue Dec 25, 2018 · 22 comments
Labels
3.8 only security fixes performance Performance or resource usage topic-argument-clinic

Comments

@serhiy-storchaka
Member

BPO 35582
Nosy @pitrou, @scoder, @vstinner, @larryhastings, @serhiy-storchaka, @tirkarthi
PRs
  • bpo-35582: Argument Clinic: inline parsing code for positional parameters. #11313
  • bpo-35664: Optimize operator.itemgetter #11435
  • bpo-35582: Argument Clinic: Optimize the "all boring objects" case. #11520
  • bpo-35582: Inline arguments tuple unpacking in handwritten code. #11524
  • Files
  • bench.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-01-12.06:30:26.507>
    created_at = <Date 2018-12-25.14:40:52.841>
    labels = ['3.8', 'expert-argument-clinic', 'performance']
    title = 'Argument Clinic: inline parsing code for functions with only positional parameters'
    updated_at = <Date 2019-01-16.17:11:25.929>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2019-01-16.17:11:25.929>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-01-12.06:30:26.507>
    closer = 'serhiy.storchaka'
    components = ['Argument Clinic']
    creation = <Date 2018-12-25.14:40:52.841>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['48042']
    hgrepos = []
    issue_num = 35582
    keywords = ['patch', 'patch', 'patch']
    message_count = 22.0
    messages = ['332510', '333446', '333449', '333453', '333455', '333457', '333458', '333463', '333469', '333476', '333478', '333479', '333480', '333482', '333487', '333488', '333491', '333493', '333505', '333513', '333516', '333776']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'scoder', 'vstinner', 'larry', 'serhiy.storchaka', 'xtreak']
    pr_nums = ['11313', '11313', '11313', '11435', '11435', '11520', '11520', '11520', '11524', '11524', '11524']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue35582'
    versions = ['Python 3.8']

    @serhiy-storchaka
    Member Author

    This is a continuation of bpo-23867. The proposed PR makes Argument Clinic inline the parsing code for functions with only positional parameters, i.e. functions that currently use PyArg_ParseTuple() and _PyArg_ParseStack(). This saves the time spent parsing format strings and calling through several levels of functions. It can also save C stack space, because of the smaller number of nested (and potentially recursive) calls, the smaller number of variables, and getting rid of a stack-allocated array of "objects" which would need to be deallocated or cleaned up if overall parsing fails.

    PyArg_ParseTuple() and _PyArg_ParseStack() will still be used if there are parameters for which an inlined converter is not supported. Unsupported converters are the deprecated Py_UNICODE API ("u", "Z"), encoded strings ("es", "et"), obsolete string/bytes converters ("y", "s#", "z#"), and some custom converters (DWORD, HANDLE, pid_t, intptr_t).
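    To illustrate where the savings come from, here is a minimal standalone sketch (not CPython source; all names and types are hypothetical stand-ins) contrasting a format-string-driven parser, which interprets a format string on every call as PyArg_ParseTuple() does, with the equivalent inlined checks for one fixed signature, as the generated code would do:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a Python object: either a string or an integer. */
typedef struct { int is_str; const char *s; long i; } Arg;

/* Generic parser: walks a format string at call time ("s" = str, "i" = int).
 * The per-call dispatch loop is the overhead that inlining removes. */
static int parse_generic(const char *fmt, Arg *args, size_t nargs,
                         const char **out_s, long *out_i)
{
    if (strlen(fmt) != nargs)
        return 0;
    for (size_t k = 0; k < nargs; k++) {
        switch (fmt[k]) {
        case 's':
            if (!args[k].is_str) return 0;
            *out_s = args[k].s;
            break;
        case 'i':
            if (args[k].is_str) return 0;
            *out_i = args[k].i;
            break;
        default:
            return 0;
        }
    }
    return 1;
}

/* Inlined parser for one specific (str, int) signature: no format string,
 * no dispatch loop -- just direct checks generated for this function. */
static int parse_inlined(Arg *args, size_t nargs,
                         const char **out_s, long *out_i)
{
    if (nargs != 2 || !args[0].is_str || args[1].is_str)
        return 0;
    *out_s = args[0].s;
    *out_i = args[1].i;
    return 1;
}
```

Both functions accept the same inputs; the inlined variant simply compiles the knowledge of the signature into straight-line code.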

    @serhiy-storchaka serhiy-storchaka added 3.8 only security fixes topic-argument-clinic performance Performance or resource usage labels Dec 25, 2018
    @serhiy-storchaka
    Member Author

    Some examples:

    $ ./python -m timeit "format('abc')"
    Unpatched:  5000000 loops, best of 5: 65 nsec per loop
    Patched:    5000000 loops, best of 5: 42.4 nsec per loop
    
    $ ./python -m timeit "'abc'.replace('x', 'y')"
    Unpatched:  5000000 loops, best of 5: 101 nsec per loop
    Patched:    5000000 loops, best of 5: 63.8 nsec per loop
    
    $ ./python -m timeit "'abc'.ljust(5)"
    Unpatched:  2000000 loops, best of 5: 120 nsec per loop
    Patched:    5000000 loops, best of 5: 94.4 nsec per loop
    
    $ ./python -m timeit "(1, 2, 3).index(2)"
    Unpatched:  2000000 loops, best of 5: 100 nsec per loop
    Patched:    5000000 loops, best of 5: 62.4 nsec per loop
    
    $ ./python -m timeit -s "a = [1, 2, 3]" "a.index(2)"
    Unpatched:  2000000 loops, best of 5: 93.8 nsec per loop
    Patched:    5000000 loops, best of 5: 70.1 nsec per loop

    $ ./python -m timeit -s "import math" "math.pow(0.5, 2.0)"
    Unpatched:  2000000 loops, best of 5: 112 nsec per loop
    Patched:    5000000 loops, best of 5: 82.3 nsec per loop

    @vstinner
    Member

    $ ./python -m timeit "format('abc')"
    Unpatched:  5000000 loops, best of 5: 65 nsec per loop
    Patched:    5000000 loops, best of 5: 42.4 nsec per loop

    -23 ns on 65 ns: this is very significant! I spent something like 6 months implementing "FASTCALL" just to avoid allocating a single tuple for passing positional arguments, and that made each call only 20 ns faster. An additional 23 ns makes the code way faster compared to Python without FASTCALL! I estimate something like 80 ns => 42 ns: 2x faster!

    $ ./python -m timeit "'abc'.replace('x', 'y')"
    Unpatched:  5000000 loops, best of 5: 101 nsec per loop
    Patched:    5000000 loops, best of 5: 63.8 nsec per loop

    -38 ns on 101 ns: that's even more significant! Wow, that's very impressive!

    Please merge your PR, I want it now :-D

    Can you maybe add a vague sentence in the Optimizations section of What's New in Python 3.8 ? Something like: "Parsing positional arguments in builtin functions has been made more efficient."? I'm not sure if "builtin" is the proper term here. Functions using Argument Clinic to parse their arguments?

    @serhiy-storchaka
    Member Author

    I suppose that my computer is a bit faster than yours, so your 20 ns may be only 15 ns or 10 ns on my computer. Run the microbenchmarks on your computer to get a sense of scale.

    It may be possible to save a few more nanoseconds by inlining a fast path for _PyArg_CheckPositional(), but I am going to try this later.

    This change is one step in a sequence. I will add a What's New note after finishing as many steps as possible. The next large step is to optimize argument parsing for functions with keyword parameters.

    @tirkarthi
    Member

    Is it possible to run custom builds or a benchmark of this on speed.python.org once it is merged? I hope this will show up as a noticeable dip in the benchmark graphs.

    @vstinner
    Member

    I can trigger a benchmark run on speed.python.org once the change is merged.

    @serhiy-storchaka
    Member Author

    Added Stefan because the new C API could be used in Cython after it stabilizes. We should cooperate more closely with the Cython team and provide a (semi-)official stable API for use in Cython.

    I do not expect a large effect on most tests, since this optimization affects only a subset of functions, and can be noticeable only for very fast function calls.

    @scoder
    Contributor

    scoder commented Jan 11, 2019

    It might be worth inlining a fast path of "_PyArg_CheckPositional()" that only tests "nargs < min || nargs > max" (even via a macro), and then branches to the full error checking and reporting code only if that fails. Determining the concrete exception to raise is not time critical, but the good case is. Also, that would immediately collapse into "nargs != minmax" for the cases where "min == max", i.e. we expect an exact number of arguments.

    And yes, a function that raises the expected exception with the expected error message for a handful of common cases would be nice. :)
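    The fast path Stefan describes can be sketched as follows. This is an illustrative standalone example, not CPython's actual macro or _PyArg_CheckPositional() itself; all names here are hypothetical. The hot path is a single range test, and the slow function is entered only to choose the exact error; when min == max, the compiler folds the two comparisons into one equality test:

```c
#include <assert.h>
#include <stdio.h>

/* Slow path: not time-critical, only decides which error to report. */
static int check_positional_slow(const char *name, int nargs, int min, int max)
{
    if (nargs < min)
        fprintf(stderr, "%s expected at least %d arguments, got %d\n",
                name, min, nargs);
    else
        fprintf(stderr, "%s expected at most %d arguments, got %d\n",
                name, max, nargs);
    return 0;
}

/* Fast path: one range test; falls through to the slow path on failure.
 * With min == max this compiles down to a single (nargs != min) test. */
#define CHECK_POSITIONAL_FAST(name, nargs, min, max)          \
    (((nargs) < (min) || (nargs) > (max))                     \
         ? check_positional_slow((name), (nargs), (min), (max)) \
         : 1)
```

The point of the macro form is that the common success case costs only the comparison, with no function-call overhead at all.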

    @serhiy-storchaka
    Member Author

    New changeset 4fa9591 by Serhiy Storchaka in branch 'master':
    bpo-35582: Argument Clinic: inline parsing code for positional parameters. (GH-11313)
    4fa9591

    @vstinner
    Member

    I converted msg333446 into attached bench.py using perf. Results on my laptop:

    vstinner@apu$ ./python -m perf compare_to ref.json inlined.json --table -G
    +-------------------------+---------+------------------------------+
    | Benchmark               | ref     | inlined                      |
    +=========================+=========+==============================+
    | format('abc')           | 74.4 ns | 43.7 ns: 1.70x faster (-41%) |
    +-------------------------+---------+------------------------------+
    | 'abc'.replace('x', 'y') | 93.0 ns | 57.5 ns: 1.62x faster (-38%) |
    +-------------------------+---------+------------------------------+
    | (1, 2, 3).index(2)      | 92.5 ns | 59.2 ns: 1.56x faster (-36%) |
    +-------------------------+---------+------------------------------+
    | a.index(2)              | 93.6 ns | 59.9 ns: 1.56x faster (-36%) |
    +-------------------------+---------+------------------------------+
    | 'abc'.ljust(5)          | 124 ns  | 86.0 ns: 1.44x faster (-30%) |
    +-------------------------+---------+------------------------------+
    | math.pow(0.5, 2.0)      | 121 ns  | 88.1 ns: 1.37x faster (-27%) |
    +-------------------------+---------+------------------------------+

    The speedup on my laptop is between 30.7 and 38.0 ns per function call, on these specific functions.

    1.7x faster on format() is very welcome, well done Serhiy!

    Note: You need the just released perf 1.6.0 version to run this benchmark ;-)

    @vstinner
    Member

    I can trigger a benchmark run on speed.python.org once the change is merged.

    Aha, it seems like Serhiy has more optimizations to come: PR 11520.

    @serhiy: tell me when you are done, so I can trigger a new benchmark run.

    @serhiy-storchaka
    Member Author

    PR 11520 additionally replaces PyArg_UnpackTuple() and _PyArg_UnpackStack() with _PyArg_CheckPositional() and inlined code in Argument Clinic.
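    A standalone sketch of what this replacement amounts to (simplified stand-ins, not CPython's actual types or functions): the generated code checks the argument count once, then indexes the argument array directly, instead of funnelling each value through PyArg_UnpackTuple()'s varargs machinery:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a Python object. */
typedef struct { int id; } Obj;

/* Counterpart of _PyArg_CheckPositional(): one range test up front. */
static int check_positional(ptrdiff_t nargs, ptrdiff_t min, ptrdiff_t max)
{
    return min <= nargs && nargs <= max;
}

/* Inlined unpacking for a fixed two-argument signature, e.g. divmod(a, b):
 * after the single count check, arguments are read straight out of the
 * argument array with no per-argument function calls. */
static int divmod_like(Obj **args, ptrdiff_t nargs, Obj **a, Obj **b)
{
    if (!check_positional(nargs, 2, 2))
        return 0;
    *a = args[0];   /* direct array access, no varargs pointers */
    *b = args[1];
    return 1;
}
```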

    Some examples for PR 11520:

    $ ./python -m timeit "'abc'.strip()"
    Unpatched:  5000000 loops, best of 5: 51.2 nsec per loop
    Patched:    5000000 loops, best of 5: 45.8 nsec per loop
    
    $ ./python -m timeit -s "d = {'a': 1}" "d.get('a')"
    Unpatched:  5000000 loops, best of 5: 55 nsec per loop
    Patched:    5000000 loops, best of 5: 51.1 nsec per loop
    
    $ ./python -m timeit "divmod(5, 2)"
    Unpatched:  5000000 loops, best of 5: 87 nsec per loop
    Patched:    5000000 loops, best of 5: 80.6 nsec per loop
    
    $ ./python -m timeit "hasattr(1, 'numerator')"
    Unpatched:  5000000 loops, best of 5: 62.4 nsec per loop
    Patched:    5000000 loops, best of 5: 54.8 nsec per loop
    
    $ ./python -m timeit "isinstance(1, int)"
    Unpatched:  5000000 loops, best of 5: 62.7 nsec per loop
    Patched:    5000000 loops, best of 5: 54.1 nsec per loop
    
    $ ./python -m timeit -s "from math import gcd" "gcd(6, 10)"
    Unpatched:  2000000 loops, best of 5: 99.6 nsec per loop
    Patched:    5000000 loops, best of 5: 89.9 nsec per loop
    
    $ ./python -m timeit -s "from operator import add" "add(1, 2)"
    Unpatched:  5000000 loops, best of 5: 40.7 nsec per loop
    Patched:    10000000 loops, best of 5: 32.6 nsec per loop

    @vstinner
    Member

    $ ./python -m timeit -s "from operator import add" "add(1, 2)"
    Unpatched:  5000000 loops, best of 5: 40.7 nsec per loop
    Patched:    10000000 loops, best of 5: 32.6 nsec per loop

    We should stop you, or the timing will become negative if you continue!

    @serhiy-storchaka
    Member Author

    New changeset 2a39d25 by Serhiy Storchaka in branch 'master':
    bpo-35582: Argument Clinic: Optimize the "all boring objects" case. (GH-11520)
    2a39d25

    @serhiy-storchaka
    Member Author

    PR 11524 performs the same kind of changes as PR 11520, but for handwritten code (only where this gives a noticeable speed-up). Also, iter() now uses the fast call convention.

    $ ./python -m timeit "iter(())"
    Unpatched:  5000000 loops, best of 5: 82.8 nsec per loop
    Patched:    5000000 loops, best of 5: 56.3 nsec per loop
    
    $ ./python -m timeit -s "it = iter([])" "next(it, None)"
    Unpatched:  5000000 loops, best of 5: 54.1 nsec per loop
    Patched:    5000000 loops, best of 5: 44.9 nsec per loop
    
    $ ./python -m timeit "getattr(1, 'numerator')"
    Unpatched:  5000000 loops, best of 5: 63.6 nsec per loop
    Patched:    5000000 loops, best of 5: 57.5 nsec per loop
    
    $ ./python -m timeit -s "from operator import attrgetter; f = attrgetter('numerator')" "f(1)"
    Unpatched:  5000000 loops, best of 5: 64.1 nsec per loop
    Patched:    5000000 loops, best of 5: 56.8 nsec per loop
    
    $ ./python -m timeit -s "from operator import methodcaller; f = methodcaller('conjugate')" "f(1)"
    Unpatched:  5000000 loops, best of 5: 79.5 nsec per loop
    Patched:    5000000 loops, best of 5: 74.1 nsec per loop

    It is also possible to speed up many math methods, and maybe some contextvar and hamt methods, but that is for other issues.

    @scoder
    Contributor

    scoder commented Jan 11, 2019

    Nice! Well done, Serhiy!

    @vstinner
    Member

    $ ./python -m timeit "iter(())"
    Unpatched:  5000000 loops, best of 5: 82.8 nsec per loop
    Patched:    5000000 loops, best of 5: 56.3 nsec per loop

    That's quite significant. Oh, it's because you converted builtin_iter() from METH_VARARGS to METH_FASTCALL at the same time. Interesting.

    @serhiy-storchaka
    Member Author

    Just inlining the argument tuple unpacking in iter() gives only a 10% speed-up. I would not apply this optimization for such a small difference. But combined with converting it to fast call, it looks more interesting.

    @pitrou
    Member

    pitrou commented Jan 11, 2019

    Are there any numbers on higher-level benchmarks?

    @serhiy-storchaka
    Member Author

    New changeset 7934266 by Serhiy Storchaka in branch 'master':
    bpo-35582: Inline arguments tuple unpacking in handwritten code. (GH-11524)
    7934266

    @serhiy-storchaka
    Member Author

    I do not expect significant changes in higher-level benchmarks. But if there are some, they can be shown on speed.python.org.

    I think all work at this stage is finished.

    @vstinner
    Member

    I ran benchmarks on speed.python.org, it's the 5bb146a dot (Jan 13, 2019). I didn't look at results.

    @ezio-melotti transferred this issue from another repository Apr 10, 2022