
Argument Clinic: inline parsing code for functions with keyword parameters #80308

Closed
serhiy-storchaka opened this issue Feb 26, 2019 · 11 comments
Assignees
Labels
3.8 (only security fixes) · performance (Performance or resource usage) · topic-argument-clinic

Comments

@serhiy-storchaka (Member)

BPO 36127
Nosy @vstinner, @larryhastings, @serhiy-storchaka, @MojoVampire, @ammaraskar, @tirkarthi
PRs
  • bpo-36127: Argument Clinic: inline parsing code for keyword parameters. #12058
  • bpo-36127: Fix _PyArg_UnpackKeywords() warning #12345
  • bpo-36127: Fix compiler warning in _PyArg_UnpackKeywords(). #12353
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2019-08-31.20:28:24.943>
    created_at = <Date 2019-02-26.17:30:36.013>
    labels = ['3.8', 'expert-argument-clinic', 'performance']
    title = 'Argument Clinic: inline parsing code for functions with keyword parameters'
    updated_at = <Date 2019-08-31.20:28:24.943>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2019-08-31.20:28:24.943>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2019-08-31.20:28:24.943>
    closer = 'serhiy.storchaka'
    components = ['Argument Clinic']
    creation = <Date 2019-02-26.17:30:36.013>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36127
    keywords = ['patch']
    message_count = 11.0
    messages = ['336700', '336701', '336703', '336706', '336719', '336748', '336752', '336754', '336756', '337903', '338090']
    nosy_count = 6.0
    nosy_names = ['vstinner', 'larry', 'serhiy.storchaka', 'josh.r', 'ammar2', 'xtreak']
    pr_nums = ['12058', '12345', '12353']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue36127'
    versions = ['Python 3.8']

    @serhiy-storchaka (Member, Author)

    This is a follow-up of bpo-23867 and bpo-35582. The proposed PR makes Argument Clinic inline the parsing code for functions with keyword parameters, i.e. functions that currently use _PyArg_ParseTupleAndKeywordsFast() and _PyArg_ParseStackAndKeywords(). This saves the time spent parsing format strings and calling through several levels of functions.
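    The idea can be illustrated with a simplified, standalone model (hedged: this is not CPython's actual generated code, and `parse_generic`/`parse_inlined` are made-up names). The generic parser interprets a format string at runtime, as the PyArg_Parse* family does; the "inlined" variant emits the conversions directly for one known signature, so there is no format string to walk:

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* Generic parser: walks a format string at runtime, one conversion
       per format code ('i' = int, 's' = string).  Returns 1 on success. */
    static int parse_generic(const char *format, const char **args,
                             int *out_i, const char **out_s)
    {
        for (int k = 0; format[k]; k++) {
            switch (format[k]) {
            case 'i': *out_i = atoi(args[k]); break;
            case 's': *out_s = args[k]; break;
            default:  return 0;   /* unknown format code */
            }
        }
        return 1;
    }

    /* "Inlined" parser: the format string is gone; the conversions for
       the fixed (int, str) signature are written out directly. */
    static int parse_inlined(const char **args, int *out_i, const char **out_s)
    {
        *out_i = atoi(args[0]);
        *out_s = args[1];
        return 1;
    }

    int main(void)
    {
        const char *args[] = {"42", "little"};
        int a, b;
        const char *sa, *sb;
        assert(parse_generic("is", args, &a, &sa));
        assert(parse_inlined(args, &b, &sb));
        assert(a == b && a == 42);          /* same result, no dispatch loop */
        assert(strcmp(sa, sb) == 0);
        return 0;
    }
    ```

    The specialized version trades a little code size per function for skipping the per-call format-string interpretation, which is the trade-off discussed in the rest of this thread.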

    @serhiy-storchaka serhiy-storchaka added the 3.8 only security fixes label Feb 26, 2019
    @serhiy-storchaka serhiy-storchaka self-assigned this Feb 26, 2019
    @serhiy-storchaka serhiy-storchaka added topic-argument-clinic performance Performance or resource usage labels Feb 26, 2019
    @serhiy-storchaka (Member, Author)

    Some examples:

    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "round_ = round" "round_(4.2)"
    Mean +- std dev: [...] 110 ns +- 3 ns -> [...] 81.4 ns +- 2.2 ns: 1.35x faster (-26%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "sum_ = sum" "sum_(())"
    Mean +- std dev: [...] 88.0 ns +- 6.5 ns -> [...] 57.6 ns +- 1.1 ns: 1.53x faster (-35%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "sum_ = sum; a = [1, 2]"  "sum_(a)"
    Mean +- std dev: [...] 95.9 ns +- 2.1 ns -> [...] 70.6 ns +- 2.0 ns: 1.36x faster (-26%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "'abc'.split()"          
    Mean +- std dev: [...] 102 ns +- 3 ns -> [...] 80.5 ns +- 2.1 ns: 1.26x faster (-21%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "b'abc'.split()"
    Mean +- std dev: [...] 91.8 ns +- 2.3 ns -> [...] 75.1 ns +- 1.4 ns: 1.22x faster (-18%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "'abc'.split('-')"
    Mean +- std dev: [...] 118 ns +- 2 ns -> [...] 89.2 ns +- 1.8 ns: 1.32x faster (-24%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "b'abc'.decode()"
    Mean +- std dev: [...] 96.1 ns +- 3.6 ns -> [...] 78.9 ns +- 2.2 ns: 1.22x faster (-18%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "'abc'.encode()"
    Mean +- std dev: [...] 72.4 ns +- 1.9 ns -> [...] 55.1 ns +- 1.8 ns: 1.31x faster (-24%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "int_= int"  "int(4.2)"
    Mean +- std dev: [...] 105 ns +- 4 ns -> [...] 78.8 ns +- 1.9 ns: 1.33x faster (-25%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "int_= int"  "int('5')"
    Mean +- std dev: [...] 154 ns +- 5 ns -> [...] 122 ns +- 4 ns: 1.26x faster (-21%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "42 .to_bytes(2, 'little')"
    Mean +- std dev: [...] 109 ns +- 3 ns -> [...] 72.4 ns +- 1.9 ns: 1.51x faster (-34%)
    
    $ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "from_bytes = int.from_bytes" "from_bytes(b'ab', 'little')"
    Mean +- std dev: [...] 138 ns +- 3 ns -> [...] 96.3 ns +- 3.0 ns: 1.43x faster (-30%)

    @tirkarthi (Member)

    That's a lot faster, and it will be great if this makes it into the next alpha :) Adding Ammar Askar, since they reviewed the inlining for positional arguments.

    @MojoVampire (Mannequin) commented Feb 26, 2019

    How much bigger does the core interpreter + built-in extension modules get when you make this change? How much more memory is used by real world programs?

    I'm a little concerned when I see individual functions growing by 140 lines in the first file of the diff. Clearly the increase in memory usage wasn't enough to slow down the microbenchmarks (which can probably fit the entire hot code path in CPU on die cache), but I'd be worried about the aggregate effect on real world programs if we go from a handful of argument parsing code paths reused by every function (and therefore kept constantly hot) to hundreds (or thousands?) of unique argument parsing code paths, each one unique to a single function. It could easily look great when a single function is being called repeatedly, while looking much less great (possibly worse) when the many varied function calls are interleaved.

    It should be tested on a number of systems too; any losses to cache unfriendliness would be highly dependent on the size of the CPU cache.

    I'll grant that inlining positional argument parsing doesn't seem to have caused problems, and it looks like we're still using _PyArg_UnpackKeywords, so argument parsing isn't completely devolved to each function, but I think we need something more than microbenchmarks before we jump on this.

    If my worries end up being unfounded, I'll be happy. Looks very cool if we can really get that sort of speed up for all function calls, not just positional args only functions. :-)

    @vstinner (Member)

    How much bigger does the core interpreter + built-in extension modules get when you make this change? How much more memory is used by real world programs?

    Well, any optimization is a trade-off between memory and CPU. In recent years CPUs have not been getting much faster (especially when you consider that Python is usually only able to use a single CPU thread), whereas computers are getting more and more RAM.

    It should be tested on a number of systems too; any losses to cache unfriendliness would be highly dependent on the size of the CPU cache.

    I prefer not to pay too much attention to assumptions about the hardware. I prefer to only trust benchmarks :-)

    @serhiy-storchaka (Member, Author)

    Good questions Josh!

    The size of the python binary has increased from 17494944 bytes (17085 KiB) to 17657112 bytes (17243 KiB) -- by about 1%.

    I think this change cannot increase heap memory consumption, because the new code does not use the heap (while the old code could allocate additional memory dynamically).

    As for stack memory, it is not so clear. On one hand, the new code allocates an array on the stack for references to all parameters, and this memory stays in use until the function returns. On the other hand, the old code allocates many local variables and statically sized buffers and creates several function frames, but that memory is released once argument parsing ends. I think that for non-recursive functions the new code has smaller stack memory consumption (as long as the function has fewer than several tens of parameters), but for recursive functions it can increase stack memory consumption. Although I do not know whether any of the affected functions is recursive.
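    The stack-versus-heap trade-off described above can be sketched in standalone C (a model only; `SMALL_STACK` and `sum_args` are illustrative names, not CPython code). A fixed on-stack buffer serves the common case and stays allocated until the function returns, while a large argument count falls back to the heap, roughly as the old, non-inlined path could:

    ```c
    #include <assert.h>
    #include <stdlib.h>

    #define SMALL_STACK 8   /* illustrative size for the common case */

    /* Copies "converted" argument values into a buffer and sums them.
       The on-stack buffer lives for the whole call; larger counts use
       the heap, which is released before the caller continues. */
    static int sum_args(const int *args, size_t n, int *result)
    {
        int stackbuf[SMALL_STACK];
        int *buf = stackbuf;
        if (n > SMALL_STACK) {
            buf = malloc(n * sizeof(int));  /* heap fallback */
            if (buf == NULL)
                return 0;
        }
        int total = 0;
        for (size_t i = 0; i < n; i++) {
            buf[i] = args[i];   /* stand-in for a converted parameter */
            total += buf[i];
        }
        if (buf != stackbuf)
            free(buf);
        *result = total;
        return 1;
    }

    int main(void)
    {
        int small[] = {1, 2, 3};
        int r;
        assert(sum_args(small, 3, &r) && r == 6);    /* stack path */

        int big[20] = {0};
        big[0] = 5;
        big[19] = 7;
        assert(sum_args(big, 20, &r) && r == 12);    /* heap path */
        return 0;
    }
    ```

    For a recursive function, every live frame would hold its own `stackbuf`, which is why recursion is the case where the inlined approach could cost more stack.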

    @serhiy-storchaka (Member, Author)

    As for the optimization's dependence on the size of the CPU cache, I have repeated the microbenchmarks on a computer with a 6 MiB cache and on two computers with 512 KiB caches (64- and 32-bit).

    Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (cache size: 6144 KB):

    +---------------------------------+----------+------------------------------+
    | Benchmark | baseline | inline |
    +=================================+==========+==============================+
    | round_(4.2) | 113 ns | 81.3 ns: 1.39x faster (-28%) |
    +---------------------------------+----------+------------------------------+
    | sum_(()) | 83.8 ns | 56.7 ns: 1.48x faster (-32%) |
    +---------------------------------+----------+------------------------------+
    | sum_(a) | 98.0 ns | 72.1 ns: 1.36x faster (-26%) |
    +---------------------------------+----------+------------------------------+
    | 'abc'.split() | 107 ns | 83.1 ns: 1.29x faster (-22%) |
    +---------------------------------+----------+------------------------------+
    | b'abc'.split() | 101 ns | 75.4 ns: 1.34x faster (-25%) |
    +---------------------------------+----------+------------------------------+
    | 'abc'.split('-') | 123 ns | 89.9 ns: 1.37x faster (-27%) |
    +---------------------------------+----------+------------------------------+
    | 'abc'.encode() | 79.6 ns | 59.2 ns: 1.34x faster (-26%) |
    +---------------------------------+----------+------------------------------+
    | b'abc'.decode() | 105 ns | 84.7 ns: 1.24x faster (-20%) |
    +---------------------------------+----------+------------------------------+
    | int_(4.2) | 88.9 ns | 64.1 ns: 1.39x faster (-28%) |
    +---------------------------------+----------+------------------------------+
    | int_('5') | 137 ns | 108 ns: 1.28x faster (-22%) |
    +---------------------------------+----------+------------------------------+
    | 42 .to_bytes(2, 'little') | 113 ns | 77.6 ns: 1.45x faster (-31%) |
    +---------------------------------+----------+------------------------------+
    | int_from_bytes(b'ab', 'little') | 83.4 ns | 51.5 ns: 1.62x faster (-38%) |
    +---------------------------------+----------+------------------------------+
    | struct_i32_unpack_from(b'abcd') | 96.0 ns | 71.6 ns: 1.34x faster (-25%) |
    +---------------------------------+----------+------------------------------+
    | re_word_match('a') | 221 ns | 180 ns: 1.22x faster (-18%) |
    +---------------------------------+----------+------------------------------+
    | datetime_now() | 282 ns | 248 ns: 1.14x faster (-12%) |
    +---------------------------------+----------+------------------------------+

    Not significant (1): zlib_compress(b'abc')

    AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (cache size: 512 KB):

    +---------------------------------+----------+-----------------------------+
    | Benchmark | baseline | inline |
    +=================================+==========+=============================+
    | round_(4.2) | 391 ns | 272 ns: 1.44x faster (-31%) |
    +---------------------------------+----------+-----------------------------+
    | sum_(()) | 212 ns | 160 ns: 1.32x faster (-24%) |
    +---------------------------------+----------+-----------------------------+
    | sum_(a) | 256 ns | 211 ns: 1.21x faster (-18%) |
    +---------------------------------+----------+-----------------------------+
    | 'abc'.split() | 290 ns | 233 ns: 1.25x faster (-20%) |
    +---------------------------------+----------+-----------------------------+
    | b'abc'.split() | 263 ns | 226 ns: 1.16x faster (-14%) |
    +---------------------------------+----------+-----------------------------+
    | 'abc'.split('-') | 316 ns | 262 ns: 1.21x faster (-17%) |
    +---------------------------------+----------+-----------------------------+
    | 'abc'.encode() | 197 ns | 154 ns: 1.28x faster (-22%) |
    +---------------------------------+----------+-----------------------------+
    | b'abc'.decode() | 303 ns | 250 ns: 1.21x faster (-18%) |
    +---------------------------------+----------+-----------------------------+
    | int_(4.2) | 234 ns | 171 ns: 1.37x faster (-27%) |
    +---------------------------------+----------+-----------------------------+
    | int_('5') | 372 ns | 310 ns: 1.20x faster (-17%) |
    +---------------------------------+----------+-----------------------------+
    | 42 .to_bytes(2, 'little') | 370 ns | 245 ns: 1.51x faster (-34%) |
    +---------------------------------+----------+-----------------------------+
    | int_from_bytes(b'ab', 'little') | 251 ns | 167 ns: 1.50x faster (-33%) |
    +---------------------------------+----------+-----------------------------+
    | struct_i32_unpack_from(b'abcd') | 252 ns | 202 ns: 1.24x faster (-20%) |
    +---------------------------------+----------+-----------------------------+
    | re_word_match('a') | 625 ns | 524 ns: 1.19x faster (-16%) |
    +---------------------------------+----------+-----------------------------+
    | datetime_now() | 2.05 us | 1.99 us: 1.03x faster (-3%) |
    +---------------------------------+----------+-----------------------------+
    | zlib_compress(b'abc') | 28.6 us | 28.0 us: 1.02x faster (-2%) |
    +---------------------------------+----------+-----------------------------+

    Intel(R) Atom(TM) CPU N570 @ 1.66GHz (cache size: 512 KB), 32-bit:

    +---------------------------------+----------+------------------------------+
    | Benchmark | baseline | inline |
    +=================================+==========+==============================+
    | round_(4.2) | 1.95 us | 1.29 us: 1.51x faster (-34%) |
    +---------------------------------+----------+------------------------------+
    | sum_(()) | 1.15 us | 821 ns: 1.40x faster (-29%) |
    +---------------------------------+----------+------------------------------+
    | sum_(a) | 1.32 us | 1.02 us: 1.30x faster (-23%) |
    +---------------------------------+----------+------------------------------+
    | 'abc'.split() | 1.32 us | 1.11 us: 1.19x faster (-16%) |
    +---------------------------------+----------+------------------------------+
    | b'abc'.split() | 1.22 us | 1.03 us: 1.18x faster (-15%) |
    +---------------------------------+----------+------------------------------+
    | 'abc'.split('-') | 1.78 us | 1.15 us: 1.54x faster (-35%) |
    +---------------------------------+----------+------------------------------+
    | 'abc'.encode() | 1.05 us | 883 ns: 1.19x faster (-16%) |
    +---------------------------------+----------+------------------------------+
    | b'abc'.decode() | 1.34 us | 1.17 us: 1.15x faster (-13%) |
    +---------------------------------+----------+------------------------------+
    | int_(4.2) | 1.23 us | 859 ns: 1.43x faster (-30%) |
    +---------------------------------+----------+------------------------------+
    | int_('5') | 2.20 us | 1.41 us: 1.56x faster (-36%) |
    +---------------------------------+----------+------------------------------+
    | 42 .to_bytes(2, 'little') | 1.45 us | 1.09 us: 1.33x faster (-25%) |
    +---------------------------------+----------+------------------------------+
    | int_from_bytes(b'ab', 'little') | 1.07 us | 737 ns: 1.45x faster (-31%) |
    +---------------------------------+----------+------------------------------+
    | struct_i32_unpack_from(b'abcd') | 1.31 us | 1.08 us: 1.21x faster (-18%) |
    +---------------------------------+----------+------------------------------+
    | re_word_match('a') | 2.85 us | 2.06 us: 1.39x faster (-28%) |
    +---------------------------------+----------+------------------------------+
    | datetime_now() | 6.20 us | 5.92 us: 1.05x faster (-4%) |
    +---------------------------------+----------+------------------------------+
    | zlib_compress(b'abc') | 28.7 us | 26.9 us: 1.07x faster (-6%) |
    +---------------------------------+----------+------------------------------+

    The speed up is significant on all computers.

    @vstinner (Member)

    About the stack memory usage: in the past, I used a subfunction tagged with _Py_NO_INLINE to do the work on a temporary stack buffer and then release it:

    void function()
    {
       subfunction(); /* use temporary stack */
       /* don't waste stack memory */
       ...
    }

    I'm not sure if such pattern could be used here for things like " PyObject *argsbuf[12];".

    The problem is that argument parsing uses a lot of local variables allocated on the stack. In practice, it's more like:

    void function(args)
    {
       int x;
       parse_args(args, &x); /* use temporary stack */
       /* don't waste stack memory */
       ...
    }

    I expect a long list of "&arg" where arg is a local variable of function(). Well, that's basically the design of the current PyArg_ParseTuple() function family :-)

    PyArg_ParseTuple() does its stuff in private and uses more stack memory, but once PyArg_ParseTuple() returns, the memory on the stack is "released" just because we exited the function.
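    The pattern above can be made concrete with a compilable sketch (hedged: _Py_NO_INLINE is CPython's macro; a GCC/Clang attribute stands in for it here, and `parse_args`/`function` are the illustrative names from the comment, not real CPython functions). The large temporary buffer lives only in the non-inlined helper, so its stack space is released before the caller does further work:

    ```c
    #include <assert.h>

    /* Stand-in for CPython's _Py_NO_INLINE: forbid inlining so the
       helper's stack frame really is separate from the caller's. */
    #if defined(__GNUC__)
    #  define NO_INLINE __attribute__((noinline))
    #else
    #  define NO_INLINE
    #endif

    /* The helper owns the large temporary buffer; it is freed when the
       helper returns, not when the caller returns. */
    static NO_INLINE int parse_args(int raw, int *x)
    {
        char scratch[256];       /* temporary stack, released on return */
        scratch[0] = 0;
        *x = raw * 2 + scratch[0];
        return 1;
    }

    static int function(int raw)
    {
        int x;
        if (!parse_args(raw, &x))   /* temporary stack used only here */
            return -1;
        /* ... the rest of function() runs without the 256-byte buffer,
           as in the "don't waste stack memory" comment above ... */
        return x;
    }

    int main(void)
    {
        assert(function(21) == 42);
        return 0;
    }
    ```

    Outputs are passed back through `&x` pointers, which is exactly the long list of "&arg" out-parameters Victor expects, and basically the PyArg_ParseTuple() calling convention.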

    @serhiy-storchaka (Member, Author)

    In addition, this change can allow us to get rid of the large and complex functions _PyArg_ParseTupleAndKeywordsFast() and _PyArg_ParseStackAndKeywords(). The former is no longer used in CPython, and the latter is still used in a few places to support some deprecated format codes for which I intentionally did not implement inlining. After getting rid of the uses of such codes (a patch is in progress) we could remove both of these functions.

    @serhiy-storchaka (Member, Author)

    New changeset 3191391 by Serhiy Storchaka in branch 'master':
    bpo-36127: Argument Clinic: inline parsing code for keyword parameters. (GH-12058)

    @serhiy-storchaka (Member, Author)

    New changeset 1b0393d by Serhiy Storchaka in branch 'master':
    bpo-36127: Fix compiler warning in _PyArg_UnpackKeywords(). (GH-12353)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022