classification
Title: Argument Clinic: inline parsing code for functions with keyword parameters
Type: performance Stage: resolved
Components: Argument Clinic Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ammar2, josh.r, larry, serhiy.storchaka, vstinner, xtreak
Priority: normal Keywords: patch

Created on 2019-02-26 17:30 by serhiy.storchaka, last changed 2019-08-31 20:28 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12058 merged serhiy.storchaka, 2019-02-26 17:32
PR 12345 closed vstinner, 2019-03-15 14:03
PR 12353 merged serhiy.storchaka, 2019-03-15 16:00
Messages (11)
msg336700 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-26 17:30
This is a follow-up of issue23867 and issue35582. The proposed PR makes Argument Clinic inline the parsing code for functions with keyword parameters, i.e. functions that currently use _PyArg_ParseTupleAndKeywordsFast() and _PyArg_ParseStackAndKeywords(). This saves the time spent parsing format strings and calling through several levels of functions.
msg336701 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-26 17:34
Some examples:

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "round_ = round" "round_(4.2)"
Mean +- std dev: [...] 110 ns +- 3 ns -> [...] 81.4 ns +- 2.2 ns: 1.35x faster (-26%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "sum_ = sum" "sum_(())"
Mean +- std dev: [...] 88.0 ns +- 6.5 ns -> [...] 57.6 ns +- 1.1 ns: 1.53x faster (-35%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "sum_ = sum; a = [1, 2]"  "sum_(a)"
Mean +- std dev: [...] 95.9 ns +- 2.1 ns -> [...] 70.6 ns +- 2.0 ns: 1.36x faster (-26%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "'abc'.split()"          
Mean +- std dev: [...] 102 ns +- 3 ns -> [...] 80.5 ns +- 2.1 ns: 1.26x faster (-21%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "b'abc'.split()"
Mean +- std dev: [...] 91.8 ns +- 2.3 ns -> [...] 75.1 ns +- 1.4 ns: 1.22x faster (-18%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "'abc'.split('-')"
Mean +- std dev: [...] 118 ns +- 2 ns -> [...] 89.2 ns +- 1.8 ns: 1.32x faster (-24%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "b'abc'.decode()"
Mean +- std dev: [...] 96.1 ns +- 3.6 ns -> [...] 78.9 ns +- 2.2 ns: 1.22x faster (-18%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "'abc'.encode()"
Mean +- std dev: [...] 72.4 ns +- 1.9 ns -> [...] 55.1 ns +- 1.8 ns: 1.31x faster (-24%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "int_ = int"  "int_(4.2)"
Mean +- std dev: [...] 105 ns +- 4 ns -> [...] 78.8 ns +- 1.9 ns: 1.33x faster (-25%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 -s "int_ = int"  "int_('5')"
Mean +- std dev: [...] 154 ns +- 5 ns -> [...] 122 ns +- 4 ns: 1.26x faster (-21%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000  "42 .to_bytes(2, 'little')"
Mean +- std dev: [...] 109 ns +- 3 ns -> [...] 72.4 ns +- 1.9 ns: 1.51x faster (-34%)

$ ./python -m perf timeit --compare-to=../cpython-release-baseline/python --duplicate=1000 "from_bytes = int.from_bytes" "from_bytes(b'ab', 'little')"
Mean +- std dev: [...] 138 ns +- 3 ns -> [...] 96.3 ns +- 3.0 ns: 1.43x faster (-30%)
msg336703 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2019-02-26 17:44
That's a lot faster and it will be great if this makes it into the next alpha :) Adding Ammar Askar since they reviewed the positional-arguments change.
msg336706 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-02-26 19:12
How much bigger does the core interpreter + built-in extension modules get when you make this change? How much more memory is used by real world programs?

I'm a little concerned when I see individual functions growing by 140 lines in the first file of the diff. Clearly the increase in memory usage wasn't enough to slow down the microbenchmarks (which can probably fit the entire hot code path in the CPU's on-die cache), but I'd be worried about the aggregate effect on real-world programs if we go from a handful of argument parsing code paths reused by every function (and therefore kept constantly hot) to hundreds (or thousands?) of unique argument parsing code paths, each one unique to a single function. It could easily look great when a single function is being called repeatedly, while looking much less great (possibly worse) when many varied function calls are interleaved.

It should be tested on a number of systems too; any losses to cache unfriendliness would be highly dependent on the size of the CPU cache.

I'll grant, it doesn't seem like inlining positional argument parsing has caused problems, and it looks like we're still using _PyArg_UnpackKeywords, so argument parsing isn't completely devolved to each function, but I think we need something more than microbenchmarks before we jump on this.

If my worries end up being unfounded, I'll be happy. Looks very cool if we can really get that sort of speed up for all function calls, not just positional args only functions. :-)
msg336719 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-26 22:41
> How much bigger does the core interpreter + built-in extension modules get when you make this change? How much more memory is used by real world programs?

Well, any optimization is a matter of trade-off between memory and CPU. In recent years, CPUs have not been getting much faster (especially when you consider that Python is usually only able to use a single CPU thread), whereas computers are getting more and more RAM.


> It should be tested on a number of systems too; any losses to cache unfriendliness would be highly dependent on the size of the CPU cache.

I prefer not to pay too much attention to assumptions about the hardware. I prefer to only trust benchmarks :-)
msg336748 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-27 12:46
Good questions, Josh!

The size of the python binary has increased from 17494944 bytes (17085 KiB) to 17657112 bytes (17243 KiB) -- by about 1%.

I think that this change cannot increase heap memory consumption, because the new code does not use the heap (while the old code could allocate additional memory dynamically).

As for stack memory, it is not so clear. On one hand, the new code allocates an array on the stack for references to all parameters, and this memory stays in use until the function returns. On the other hand, the old code allocates a lot of variables and statically-sized buffers, and creates several function frames, but that memory is released as soon as argument parsing ends. I think that for non-recursive functions the new code has smaller stack memory consumption (as long as the function has fewer than several tens of parameters), but for recursive functions it can increase stack memory consumption. Although I do not know whether any of the affected functions is recursive.
msg336752 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-27 13:00
As for the optimization's dependence on the size of the CPU cache, I have repeated the microbenchmarks on a computer with a 6 MiB cache and on two computers with 512 KiB caches (64- and 32-bit).

Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (cache size: 6144 KB):

+---------------------------------+----------+------------------------------+
| Benchmark                       | baseline | inline                       |
+=================================+==========+==============================+
| round_(4.2)                     | 113 ns   | 81.3 ns: 1.39x faster (-28%) |
+---------------------------------+----------+------------------------------+
| sum_(())                        | 83.8 ns  | 56.7 ns: 1.48x faster (-32%) |
+---------------------------------+----------+------------------------------+
| sum_(a)                         | 98.0 ns  | 72.1 ns: 1.36x faster (-26%) |
+---------------------------------+----------+------------------------------+
| 'abc'.split()                   | 107 ns   | 83.1 ns: 1.29x faster (-22%) |
+---------------------------------+----------+------------------------------+
| b'abc'.split()                  | 101 ns   | 75.4 ns: 1.34x faster (-25%) |
+---------------------------------+----------+------------------------------+
| 'abc'.split('-')                | 123 ns   | 89.9 ns: 1.37x faster (-27%) |
+---------------------------------+----------+------------------------------+
| 'abc'.encode()                  | 79.6 ns  | 59.2 ns: 1.34x faster (-26%) |
+---------------------------------+----------+------------------------------+
| b'abc'.decode()                 | 105 ns   | 84.7 ns: 1.24x faster (-20%) |
+---------------------------------+----------+------------------------------+
| int_(4.2)                       | 88.9 ns  | 64.1 ns: 1.39x faster (-28%) |
+---------------------------------+----------+------------------------------+
| int_('5')                       | 137 ns   | 108 ns: 1.28x faster (-22%)  |
+---------------------------------+----------+------------------------------+
| 42 .to_bytes(2, 'little')       | 113 ns   | 77.6 ns: 1.45x faster (-31%) |
+---------------------------------+----------+------------------------------+
| int_from_bytes(b'ab', 'little') | 83.4 ns  | 51.5 ns: 1.62x faster (-38%) |
+---------------------------------+----------+------------------------------+
| struct_i32_unpack_from(b'abcd') | 96.0 ns  | 71.6 ns: 1.34x faster (-25%) |
+---------------------------------+----------+------------------------------+
| re_word_match('a')              | 221 ns   | 180 ns: 1.22x faster (-18%)  |
+---------------------------------+----------+------------------------------+
| datetime_now()                  | 282 ns   | 248 ns: 1.14x faster (-12%)  |
+---------------------------------+----------+------------------------------+

Not significant (1): zlib_compress(b'abc')

AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (cache size: 512 KB):

+---------------------------------+----------+-----------------------------+
| Benchmark                       | baseline | inline                      |
+=================================+==========+=============================+
| round_(4.2)                     | 391 ns   | 272 ns: 1.44x faster (-31%) |
+---------------------------------+----------+-----------------------------+
| sum_(())                        | 212 ns   | 160 ns: 1.32x faster (-24%) |
+---------------------------------+----------+-----------------------------+
| sum_(a)                         | 256 ns   | 211 ns: 1.21x faster (-18%) |
+---------------------------------+----------+-----------------------------+
| 'abc'.split()                   | 290 ns   | 233 ns: 1.25x faster (-20%) |
+---------------------------------+----------+-----------------------------+
| b'abc'.split()                  | 263 ns   | 226 ns: 1.16x faster (-14%) |
+---------------------------------+----------+-----------------------------+
| 'abc'.split('-')                | 316 ns   | 262 ns: 1.21x faster (-17%) |
+---------------------------------+----------+-----------------------------+
| 'abc'.encode()                  | 197 ns   | 154 ns: 1.28x faster (-22%) |
+---------------------------------+----------+-----------------------------+
| b'abc'.decode()                 | 303 ns   | 250 ns: 1.21x faster (-18%) |
+---------------------------------+----------+-----------------------------+
| int_(4.2)                       | 234 ns   | 171 ns: 1.37x faster (-27%) |
+---------------------------------+----------+-----------------------------+
| int_('5')                       | 372 ns   | 310 ns: 1.20x faster (-17%) |
+---------------------------------+----------+-----------------------------+
| 42 .to_bytes(2, 'little')       | 370 ns   | 245 ns: 1.51x faster (-34%) |
+---------------------------------+----------+-----------------------------+
| int_from_bytes(b'ab', 'little') | 251 ns   | 167 ns: 1.50x faster (-33%) |
+---------------------------------+----------+-----------------------------+
| struct_i32_unpack_from(b'abcd') | 252 ns   | 202 ns: 1.24x faster (-20%) |
+---------------------------------+----------+-----------------------------+
| re_word_match('a')              | 625 ns   | 524 ns: 1.19x faster (-16%) |
+---------------------------------+----------+-----------------------------+
| datetime_now()                  | 2.05 us  | 1.99 us: 1.03x faster (-3%) |
+---------------------------------+----------+-----------------------------+
| zlib_compress(b'abc')           | 28.6 us  | 28.0 us: 1.02x faster (-2%) |
+---------------------------------+----------+-----------------------------+

Intel(R) Atom(TM) CPU N570   @ 1.66GHz (cache size: 512 KB), 32-bit:

+---------------------------------+----------+------------------------------+
| Benchmark                       | baseline | inline                       |
+=================================+==========+==============================+
| round_(4.2)                     | 1.95 us  | 1.29 us: 1.51x faster (-34%) |
+---------------------------------+----------+------------------------------+
| sum_(())                        | 1.15 us  | 821 ns: 1.40x faster (-29%)  |
+---------------------------------+----------+------------------------------+
| sum_(a)                         | 1.32 us  | 1.02 us: 1.30x faster (-23%) |
+---------------------------------+----------+------------------------------+
| 'abc'.split()                   | 1.32 us  | 1.11 us: 1.19x faster (-16%) |
+---------------------------------+----------+------------------------------+
| b'abc'.split()                  | 1.22 us  | 1.03 us: 1.18x faster (-15%) |
+---------------------------------+----------+------------------------------+
| 'abc'.split('-')                | 1.78 us  | 1.15 us: 1.54x faster (-35%) |
+---------------------------------+----------+------------------------------+
| 'abc'.encode()                  | 1.05 us  | 883 ns: 1.19x faster (-16%)  |
+---------------------------------+----------+------------------------------+
| b'abc'.decode()                 | 1.34 us  | 1.17 us: 1.15x faster (-13%) |
+---------------------------------+----------+------------------------------+
| int_(4.2)                       | 1.23 us  | 859 ns: 1.43x faster (-30%)  |
+---------------------------------+----------+------------------------------+
| int_('5')                       | 2.20 us  | 1.41 us: 1.56x faster (-36%) |
+---------------------------------+----------+------------------------------+
| 42 .to_bytes(2, 'little')       | 1.45 us  | 1.09 us: 1.33x faster (-25%) |
+---------------------------------+----------+------------------------------+
| int_from_bytes(b'ab', 'little') | 1.07 us  | 737 ns: 1.45x faster (-31%)  |
+---------------------------------+----------+------------------------------+
| struct_i32_unpack_from(b'abcd') | 1.31 us  | 1.08 us: 1.21x faster (-18%) |
+---------------------------------+----------+------------------------------+
| re_word_match('a')              | 2.85 us  | 2.06 us: 1.39x faster (-28%) |
+---------------------------------+----------+------------------------------+
| datetime_now()                  | 6.20 us  | 5.92 us: 1.05x faster (-4%)  |
+---------------------------------+----------+------------------------------+
| zlib_compress(b'abc')           | 28.7 us  | 26.9 us: 1.07x faster (-6%)  |
+---------------------------------+----------+------------------------------+

The speed up is significant on all computers.
msg336754 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-27 13:13
About the stack memory usage: in the past, I used a subfunction tagged with _Py_NO_INLINE to do the work on a temporary stack, but later dropped that approach:

void function()
{
   subfunction(); /* use temporary stack */
   /* don't waste stack memory */
   ...
}

I'm not sure whether such a pattern could be used here for things like "PyObject *argsbuf[12];".

The problem is that argument parsing uses a lot of local variables allocated on the stack. In practice, it's more like:

void function(args)
{
   int x;
   parse_args(args, &x); /* use temporary stack */
   /* don't waste stack memory */
   ...
}

I expect a long list of "&arg" where arg is a local variable of function(). Well, that's basically the design of the current PyArg_ParseTuple() function family :-)

PyArg_ParseTuple() does its stuff in private and uses more stack memory, but once PyArg_ParseTuple() returns, the memory on the stack is "released" just because we exited the function.
msg336756 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-27 13:20
In addition, this change can allow us to get rid of the large and complex functions _PyArg_ParseTupleAndKeywordsFast() and _PyArg_ParseStackAndKeywords(). The former is no longer used in CPython, and the latter is still used in a few places to support some deprecated format codes for which I intentionally did not implement inlining. After getting rid of the uses of such codes (a patch is in progress) we could remove both functions.
msg337903 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-14 08:32
New changeset 3191391515824fa7f3c573d807f1034c6a28fab3 by Serhiy Storchaka in branch 'master':
bpo-36127: Argument Clinic: inline parsing code for keyword parameters. (GH-12058)
https://github.com/python/cpython/commit/3191391515824fa7f3c573d807f1034c6a28fab3
msg338090 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-16 17:45
New changeset 1b0393d5b7842dcd9e933117d2d5404d15e2ad00 by Serhiy Storchaka in branch 'master':
bpo-36127: Fix compiler warning in _PyArg_UnpackKeywords(). (GH-12353)
https://github.com/python/cpython/commit/1b0393d5b7842dcd9e933117d2d5404d15e2ad00
History
Date User Action Args
2019-08-31 20:28:24 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-03-16 17:45:06 serhiy.storchaka set messages: + msg338090
2019-03-15 16:00:57 serhiy.storchaka set pull_requests: + pull_request12318
2019-03-15 14:03:41 vstinner set pull_requests: + pull_request12311
2019-03-14 08:32:25 serhiy.storchaka set messages: + msg337903
2019-02-27 13:20:55 serhiy.storchaka set messages: + msg336756
2019-02-27 13:13:29 vstinner set messages: + msg336754
2019-02-27 13:00:34 serhiy.storchaka set messages: + msg336752
2019-02-27 12:46:41 serhiy.storchaka set messages: + msg336748
2019-02-26 22:41:32 vstinner set messages: + msg336719
2019-02-26 19:12:27 josh.r set messages: + msg336706
2019-02-26 18:53:22 josh.r set nosy: + josh.r
2019-02-26 17:44:32 xtreak set nosy: + xtreak, ammar2
messages: + msg336703
2019-02-26 17:34:00 serhiy.storchaka set messages: + msg336701
2019-02-26 17:32:46 serhiy.storchaka set keywords: + patch
stage: patch review
pull_requests: + pull_request12083
2019-02-26 17:30:36 serhiy.storchaka create