classification
Title: Packing constant call arguments
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: BTaskaya Nosy List: BTaskaya, Mark.Shannon, pablogsal, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2021-06-23 18:13 by BTaskaya, last changed 2021-07-03 15:14 by BTaskaya.

Pull Requests
URL Status Linked Edit
PR 26901 closed BTaskaya, 2021-06-24 15:20
Messages (9)
msg396437 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2021-06-23 18:13
It is a common scenario to make calls with only constant arguments (e.g to datetime.datetime/os.path.join/re.match.group/nox.session.run etc) and the bytecode that we currently generate looks like this;
f(1,2,3,4,5,6)
  1           0 LOAD_NAME                0 (f)
              2 LOAD_CONST               0 (1)
              4 LOAD_CONST               1 (2)
              6 LOAD_CONST               2 (3)
              8 LOAD_CONST               3 (4)
             10 LOAD_CONST               4 (5)
             12 LOAD_CONST               5 (6)
             14 CALL_FUNCTION            6
             16 POP_TOP
             18 LOAD_CONST               6 (None)
             20 RETURN_VALUE

But if we are sure that all arguments to a function is positional* (it is also possible to support keyword arguments to some extent, needs more research, but out of the scope for this particular optimization) and constant, then we could simply pack everything together and use CALL_FUNCTION_EX (we also need to set some limits, since when it is too little might prevent constant cache, and when it is too high might create giant tuples in the code object, perhaps 75 > N > 4)

  1           0 LOAD_NAME                0 (f)
              2 LOAD_CONST               0 ((1, 2, 3, 4, 5, 6))
              4 CALL_FUNCTION_EX         0
              6 POP_TOP
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE

The implementation is also very simple, and doesn't even touch anywhere beside the ast optimizer itself. It is possible to do this in the compiler, but that might complicate the logic so I'd say it is best to keep it as isolated as it can be.

(debug builds)

-s 'foo = lambda *args: None' 'foo("yyyyy", 123, 123321321312, (1,2,3), "yyyyy", 1.0, (1,2,3), "yyyyy", "yyyyy", (1,2,3), 5, 6, 7)'
Mean +- std dev: [master_artificial] 251 ns +- 2 ns -> [optimized_artificial] 185 ns +- 1 ns: 1.36x faster

-s 'from datetime import datetime' 'datetime(1997, 7, 27, 12, 10, 0, 0)'
Mean +- std dev: [master_datetime] 461 ns +- 1 ns -> [optimized_datetime] 386 ns +- 2 ns: 1.19x faster

One other potential candidate to this optimization is doing something similar in the CFG optimizer, and folding all contiguous LOAD_CONSTs (within some sort of limit ofc) into a single tuple load and then adding an UNPACK_SEQUENCE (which would replicate the effect). This is a poorer form, and I was only able to observe a speedup of 1.13x / 1.03x respectively on the benchmarks. The good thing about that optimization was that, first it was able to work with mixed parameters (so if you have some other types of expressions besides constants, but all constants follow each other, then it was able to optimize that case as well) and also it wasn't only for calls but rather all compiler cases where LOAD_CONST blocks were generated.
msg396440 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-23 18:34
> folding all contiguous LOAD_CONSTs (within some sort of limit ofc) into a single tuple load and then adding an UNPACK_SEQUENCE

I already experimented with this in issue33325. It does not worth an effort.

For optimizing calls with constant arguments, it looks more interesting. Do you have any statistic about what percentage of calls is with constant arguments, what is the average number of constant arguments, what are microbenchmark results in debug mode for the average number of constant arguments? Please test for trivial Python function with variable and fixed number of parameters.
msg396444 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2021-06-23 19:40
> I already experimented with this in issue33325. It does not worth an effort.

Yea, that is what the results for the CFG optimization shows to me (1.13x at best, ~1.03x in real cases). 

> For optimizing calls with constant arguments, it looks more interesting. Do you have any statistic about what percentage of calls is with constant arguments, what is the average number of constant arguments, what are microbenchmark results in debug mode for the average number of constant arguments? Please test for trivial Python function with variable and fixed number of parameters.

Okay, addressing multiple points here (with the common dataset of a couple thousand pypi projects)

>> Do you have any statistic about what percentage of calls is with constant arguments

Since calls are the most common operation (?) in the language, the percentage might seem low since the total number is just absurd: 0.6840838575699993% (96170/14058218) of all calls are made with positional only constant arguments (LEN(args) >= 3, not including 1 or 2 arguments). 

>> what is the average number of constant arguments


> stats = {3: 69533, 4: 10867, 5: 9071, 6: 4323, 7: 1504, 8: 252, 9: 167, ...}
> np.average(list(stats.keys()), weights=list(stats.values()))
> 3.5765623375272955

points me out the result of 3.57 arguments, which is rounded to 4.

>> what are microbenchmark results in debug mode for the average number of constant arguments?

-s 'f = lambda *s: None' 'f(1,2,3,4)'
3  arguments: 1.01x faster
4  arguments: 1.05x faster
5  arguments: 1.08x faster
6  arguments: 1.11x faster
7  arguments: 1.13x faster
8  arguments: 1.16x faster
9  arguments: 1.30x faster
10 arguments: 1.32x faster

(there seems to be something special about 9, since it breaks the chart about the speedups)

An interesting (and potentially something related to the nature of datetime.datetime being implemented in C) benchmark is that instead of calling an empty python function, if we switch to datetime.datetime the early speed-up goes 1.13x for even 3 arguments!

-s 'from datetime import datetime' 'datetime(1,2,3)'

3 arguments: 1.13x faster
4 arguments: 1.15x faster
5 arguments: 1.16x faster
6 arguments: 1.17x faster
7 arguments: 1.20x faster

Another really good example is re.match().group() (I assume this is also implemented in C, so some sort of CALL_FUNCTION_EX magic)

-s 'import re; group = re.match("(a)(a)(a)(a)(a)(a)(a)(a)", "a" * 8).group' 'group(0,1,2)'

3 arguments: 1.25x faster (seems to be the general speed-up)
4 arguments: 1.26x faster
5 arguments: 1.24x faster
6 arguments: 1.23x faster
7 arguments: 1.24x faster
8 arguments: 1.24x faster
9 arguments: 1.27x faster

These all can be verified by simply running the same code but wrapping all arguments as a tuple and expanding them (generates the same code) (e.g: -s 'import re; group = re.match("(a)(a)(a)(a)(a)(a)(a)(a)", "a" * 8).group' 'group(*(0,1,2,3,4,5,6,7,8))' would perform as fast as the optimized one) 

>> Please test for trivial Python function with variable and fixed number of parameters.

I already shared the variable one, so this is the results for a fixed one;

-s 'def foo(a, b, c): pass' 'foo(1,2,3)'

3 arguments: 1.03x faster
4 arguments: 1.07x faster
5 arguments: 1.10x faster
6 arguments: 1.13x faster
7 arguments: 1.17x faster
8 arguments: 1.20x faster
9 arguments: 1.36x faster (consistent with the other jump)

Here is the list of functions that I've found to be called with only constant arguments (very raw!): https://gist.github.com/isidentical/099da256f2a46d13cef8570e839a2cea

Overall I believe it shows a great improvement for little-to-none change on the codebase (especially for C-level functions).
msg396445 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2021-06-23 19:48
Here is my branch for the reference (I didn't worked on about it much, so still lacks some error handling etc); https://github.com/isidentical/cpython/blob/b556d1172b08c65b88093f7ff1dadc985ce72f62/Python/ast_opt.c#L634-L698
msg396494 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2021-06-24 15:21
I've tested this with pyperformance, and no significant (1.02 faster on some, 1.01x slower on others, nothing significant) change on those so this is a targeted optimization.
msg396898 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-07-03 12:51
I am still not sure that it is worth to add 50 lines of the C code to optimize 0.7% of calls.

You have found 70000 examples of function calls with 3 or 4 constant arguments. Does it include tests? Could you please show several (or several hundreds) non-test examples? I am wondering how much of them are in tight loops and are not just executed once at module initialization time.
msg396904 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2021-07-03 14:03
> I am still not sure that it is worth to add 50 lines of the C code to optimize 0.7% of calls.

As I stated, the reason that this is 'relatively' lower is that there are just too many calls, so it just allocates a rather small part of the total pie. 

> Does it include tests?

Yes, to some extent (some packages include tests, some don't). 

> function calls with 3 or 4 constant arguments. Does it include tests? Could you please show several (or several hundreds) non-test examples?

Here are 2 gists (attached together) that contain the locations where such function calls are performed within a loop (the first one contains your first query, 3 and 4 arguments and the second one contains 5+). 

I added an extra guard to check whether filename starts with test_ or not, so it should be generally free of tests (at least the ones that follow the general convention). I didn't deal with doing an extreme analysis, it just checks whether there call is made under any for/while's loop so there might be some false positives. 

https://gist.github.com/isidentical/8743aca3f7815cd19ee71579ca5ba974
msg396908 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-07-03 15:00
Great!

Common examples are str.replace(), re.Match.groups() and __exit__(None, None, None). Common anti-patterns include slice(None, None, None) (why not slice(None)?) and datetime(1970, 1, 1) (it should be a global constant). I suspect that in most of other examples the function execution time is too large to make the optimization meaningful.

BTW, why do you not fold 2-argument calls? It would add 2-argument str.replace() and str.split().
msg396909 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2021-07-03 15:14
> BTW, why do you not fold 2-argument calls?

Seems like it is making stuff worse. 

> I suspect that in most of other examples the function execution time is too large to make the optimization meaningful.

This is (to some extent) a constant folding pass, so I don't think it makes sense to expect it to optimize the general workload of the function.
History
Date User Action Args
2021-07-03 15:14:12BTaskayasetmessages: + msg396909
2021-07-03 15:00:50serhiy.storchakasetmessages: + msg396908
2021-07-03 14:03:05BTaskayasetmessages: + msg396904
2021-07-03 12:51:37serhiy.storchakasetmessages: + msg396898
2021-06-24 15:21:54BTaskayasetmessages: + msg396494
2021-06-24 15:20:14BTaskayasetkeywords: + patch
stage: patch review
pull_requests: + pull_request25477
2021-06-23 19:48:46BTaskayasetmessages: + msg396445
2021-06-23 19:40:42BTaskayasetmessages: + msg396444
2021-06-23 18:34:26serhiy.storchakasetmessages: + msg396440
2021-06-23 18:13:42BTaskayacreate