Issue 44501: Packing constant call arguments

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/88667

classification

Title:	Packing constant call arguments
Type:	performance	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.11

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	BTaskaya	Nosy List:	BTaskaya, Mark.Shannon, pablogsal, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2021-06-23 18:13 by BTaskaya, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 26901	closed	BTaskaya, 2021-06-24 15:20

Messages (9)
msg396437 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2021-06-23 18:13
It is a common scenario to make calls with only constant arguments (e.g to datetime.datetime/os.path.join/re.match.group/nox.session.run etc) and the bytecode that we currently generate looks like this; f(1,2,3,4,5,6) 1 0 LOAD_NAME 0 (f) 2 LOAD_CONST 0 (1) 4 LOAD_CONST 1 (2) 6 LOAD_CONST 2 (3) 8 LOAD_CONST 3 (4) 10 LOAD_CONST 4 (5) 12 LOAD_CONST 5 (6) 14 CALL_FUNCTION 6 16 POP_TOP 18 LOAD_CONST 6 (None) 20 RETURN_VALUE But if we are sure that all arguments to a function is positional* (it is also possible to support keyword arguments to some extent, needs more research, but out of the scope for this particular optimization) and constant, then we could simply pack everything together and use CALL_FUNCTION_EX (we also need to set some limits, since when it is too little might prevent constant cache, and when it is too high might create giant tuples in the code object, perhaps 75 > N > 4) 1 0 LOAD_NAME 0 (f) 2 LOAD_CONST 0 ((1, 2, 3, 4, 5, 6)) 4 CALL_FUNCTION_EX 0 6 POP_TOP 8 LOAD_CONST 1 (None) 10 RETURN_VALUE The implementation is also very simple, and doesn't even touch anywhere beside the ast optimizer itself. It is possible to do this in the compiler, but that might complicate the logic so I'd say it is best to keep it as isolated as it can be. (debug builds) -s 'foo = lambda *args: None' 'foo("yyyyy", 123, 123321321312, (1,2,3), "yyyyy", 1.0, (1,2,3), "yyyyy", "yyyyy", (1,2,3), 5, 6, 7)' Mean +- std dev: [master_artificial] 251 ns +- 2 ns -> [optimized_artificial] 185 ns +- 1 ns: 1.36x faster -s 'from datetime import datetime' 'datetime(1997, 7, 27, 12, 10, 0, 0)' Mean +- std dev: [master_datetime] 461 ns +- 1 ns -> [optimized_datetime] 386 ns +- 2 ns: 1.19x faster One other potential candidate to this optimization is doing something similar in the CFG optimizer, and folding all contiguous LOAD_CONSTs (within some sort of limit ofc) into a single tuple load and then adding an UNPACK_SEQUENCE (which would replicate the effect). This is a poorer form, and I was only able to observe a speedup of 1.13x / 1.03x respectively on the benchmarks. The good thing about that optimization was that, first it was able to work with mixed parameters (so if you have some other types of expressions besides constants, but all constants follow each other, then it was able to optimize that case as well) and also it wasn't only for calls but rather all compiler cases where LOAD_CONST blocks were generated.
msg396440 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-06-23 18:34
> folding all contiguous LOAD_CONSTs (within some sort of limit ofc) into a single tuple load and then adding an UNPACK_SEQUENCE I already experimented with this in issue33325. It does not worth an effort. For optimizing calls with constant arguments, it looks more interesting. Do you have any statistic about what percentage of calls is with constant arguments, what is the average number of constant arguments, what are microbenchmark results in debug mode for the average number of constant arguments? Please test for trivial Python function with variable and fixed number of parameters.
msg396444 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2021-06-23 19:40
> I already experimented with this in issue33325. It does not worth an effort. Yea, that is what the results for the CFG optimization shows to me (1.13x at best, ~1.03x in real cases). > For optimizing calls with constant arguments, it looks more interesting. Do you have any statistic about what percentage of calls is with constant arguments, what is the average number of constant arguments, what are microbenchmark results in debug mode for the average number of constant arguments? Please test for trivial Python function with variable and fixed number of parameters. Okay, addressing multiple points here (with the common dataset of a couple thousand pypi projects) >> Do you have any statistic about what percentage of calls is with constant arguments Since calls are the most common operation (?) in the language, the percentage might seem low since the total number is just absurd: 0.6840838575699993% (96170/14058218) of all calls are made with positional only constant arguments (LEN(args) >= 3, not including 1 or 2 arguments). >> what is the average number of constant arguments > stats = {3: 69533, 4: 10867, 5: 9071, 6: 4323, 7: 1504, 8: 252, 9: 167, ...} > np.average(list(stats.keys()), weights=list(stats.values())) > 3.5765623375272955 points me out the result of 3.57 arguments, which is rounded to 4. >> what are microbenchmark results in debug mode for the average number of constant arguments? -s 'f = lambda s: None' 'f(1,2,3,4)' 3 arguments: 1.01x faster 4 arguments: 1.05x faster 5 arguments: 1.08x faster 6 arguments: 1.11x faster 7 arguments: 1.13x faster 8 arguments: 1.16x faster 9 arguments: 1.30x faster 10 arguments: 1.32x faster (there seems to be something special about 9, since it breaks the chart about the speedups) An interesting (and potentially something related to the nature of datetime.datetime being implemented in C) benchmark is that instead of calling an empty python function, if we switch to datetime.datetime the early speed-up goes 1.13x for even 3 arguments! -s 'from datetime import datetime' 'datetime(1,2,3)' 3 arguments: 1.13x faster 4 arguments: 1.15x faster 5 arguments: 1.16x faster 6 arguments: 1.17x faster 7 arguments: 1.20x faster Another really good example is re.match().group() (I assume this is also implemented in C, so some sort of CALL_FUNCTION_EX magic) -s 'import re; group = re.match("(a)(a)(a)(a)(a)(a)(a)(a)", "a" 8).group' 'group(0,1,2)' 3 arguments: 1.25x faster (seems to be the general speed-up) 4 arguments: 1.26x faster 5 arguments: 1.24x faster 6 arguments: 1.23x faster 7 arguments: 1.24x faster 8 arguments: 1.24x faster 9 arguments: 1.27x faster These all can be verified by simply running the same code but wrapping all arguments as a tuple and expanding them (generates the same code) (e.g: -s 'import re; group = re.match("(a)(a)(a)(a)(a)(a)(a)(a)", "a" * 8).group' 'group(*(0,1,2,3,4,5,6,7,8))' would perform as fast as the optimized one) >> Please test for trivial Python function with variable and fixed number of parameters. I already shared the variable one, so this is the results for a fixed one; -s 'def foo(a, b, c): pass' 'foo(1,2,3)' 3 arguments: 1.03x faster 4 arguments: 1.07x faster 5 arguments: 1.10x faster 6 arguments: 1.13x faster 7 arguments: 1.17x faster 8 arguments: 1.20x faster 9 arguments: 1.36x faster (consistent with the other jump) Here is the list of functions that I've found to be called with only constant arguments (very raw!): https://gist.github.com/isidentical/099da256f2a46d13cef8570e839a2cea Overall I believe it shows a great improvement for little-to-none change on the codebase (especially for C-level functions).
msg396445 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2021-06-23 19:48
Here is my branch for the reference (I didn't worked on about it much, so still lacks some error handling etc); https://github.com/isidentical/cpython/blob/b556d1172b08c65b88093f7ff1dadc985ce72f62/Python/ast_opt.c#L634-L698
msg396494 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2021-06-24 15:21
I've tested this with pyperformance, and no significant (1.02 faster on some, 1.01x slower on others, nothing significant) change on those so this is a targeted optimization.
msg396898 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-07-03 12:51
I am still not sure that it is worth to add 50 lines of the C code to optimize 0.7% of calls. You have found 70000 examples of function calls with 3 or 4 constant arguments. Does it include tests? Could you please show several (or several hundreds) non-test examples? I am wondering how much of them are in tight loops and are not just executed once at module initialization time.
msg396904 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2021-07-03 14:03
> I am still not sure that it is worth to add 50 lines of the C code to optimize 0.7% of calls. As I stated, the reason that this is 'relatively' lower is that there are just too many calls, so it just allocates a rather small part of the total pie. > Does it include tests? Yes, to some extent (some packages include tests, some don't). > function calls with 3 or 4 constant arguments. Does it include tests? Could you please show several (or several hundreds) non-test examples? Here are 2 gists (attached together) that contain the locations where such function calls are performed within a loop (the first one contains your first query, 3 and 4 arguments and the second one contains 5+). I added an extra guard to check whether filename starts with test_ or not, so it should be generally free of tests (at least the ones that follow the general convention). I didn't deal with doing an extreme analysis, it just checks whether there call is made under any for/while's loop so there might be some false positives. https://gist.github.com/isidentical/8743aca3f7815cd19ee71579ca5ba974
msg396908 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-07-03 15:00
Great! Common examples are str.replace(), re.Match.groups() and __exit__(None, None, None). Common anti-patterns include slice(None, None, None) (why not slice(None)?) and datetime(1970, 1, 1) (it should be a global constant). I suspect that in most of other examples the function execution time is too large to make the optimization meaningful. BTW, why do you not fold 2-argument calls? It would add 2-argument str.replace() and str.split().
msg396909 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2021-07-03 15:14
> BTW, why do you not fold 2-argument calls? Seems like it is making stuff worse. > I suspect that in most of other examples the function execution time is too large to make the optimization meaningful. This is (to some extent) a constant folding pass, so I don't think it makes sense to expect it to optimize the general workload of the function.

History
Date	User	Action	Args
2022-04-11 14:59:47	admin	set	github: 88667
2022-04-06 09:28:46	BTaskaya	set	status: open -> closed resolution: rejected stage: patch review -> resolved
2021-07-03 15:14:12	BTaskaya	set	messages: + msg396909
2021-07-03 15:00:50	serhiy.storchaka	set	messages: + msg396908
2021-07-03 14:03:05	BTaskaya	set	messages: + msg396904
2021-07-03 12:51:37	serhiy.storchaka	set	messages: + msg396898
2021-06-24 15:21:54	BTaskaya	set	messages: + msg396494
2021-06-24 15:20:14	BTaskaya	set	keywords: + patch stage: patch review pull_requests: + pull_request25477
2021-06-23 19:48:46	BTaskaya	set	messages: + msg396445
2021-06-23 19:40:42	BTaskaya	set	messages: + msg396444
2021-06-23 18:34:26	serhiy.storchaka	set	messages: + msg396440
2021-06-23 18:13:42	BTaskaya	create