Issue 46020: Optimize long_pow for the common case

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90178

classification

Title:	Optimize long_pow for the common case
Type:	performance	Stage:	patch review
Components:	Interpreter Core	Versions:

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Dennis Sweeney, kj, mark.dickinson, rhettinger, tim.peters
Priority:	normal	Keywords:	patch

Created on 2021-12-09 06:32 by rhettinger, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 30555	merged	tim.peters, 2022-01-12 02:41

Messages (8)
msg408076 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-12-09 06:32
The expression 'x * x' is faster than 'x 2'. In Python3.10, the speed difference was enormous. Due to ceval optimizations, the difference in Python3.11 is tighter; however, there is still room for improvement. The code for long_pow() doesn't currently have a fast path for squaring which is by far the most important case. $ python3.10 -m timeit -r 11 -s 'x = 10' 'x 2' 1000000 loops, best of 11: 201 nsec per loop $ python3.10 -m timeit -r 11 -s 'x = 10' 'x * x' 10000000 loops, best of 11: 21.9 nsec per loop $ python3.11 -m timeit -r 11 -s 'x = 10' 'x ** 2' 10000000 loops, best of 11: 32 nsec per loop $ python3.11 -m timeit -r 11 -s 'x = 10' 'x * x' 20000000 loops, best of 11: 17.6 nsec per loop
msg408079 - (view)	Author: Dennis Sweeney (Dennis Sweeney) *	Date: 2021-12-09 08:01
I believe https://bugs.python.org/issue44376 added a special case for 2nd and 3rd powers, and that's the 3.10/3.11 difference in the speed of x**2, not ceval optimizations.
msg408094 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-12-09 09:37
Hmm, I had just looked at that code and it wasn't at all obvious that an optimization had been added. I expected something like: if (exp==2) return PyNumber_Multiply(x, x); I wonder where the extra clock cycles are going. Looking at the ceval.c dispatch and the bodies of PyNumber_Multiply(), _PyNumber_PowerNoMod(), and PyNumber_Power(), it looks like the power dispatch code should be about the same as or slightly cheaper than the multiply dispatch. I'm surprised there is still almost a two to one speed difference. ISTM there is still too much overhead for squaring.
msg408099 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-12-09 09:50
The situation for floats is also disappointing: $ python3.11 -m timeit -s 'x=1.1' 'x 2' 5000000 loops, best of 5: 60.8 nsec per loop $ python3.11 -m timeit -s 'x=1.1' 'x 2.0' 5000000 loops, best of 5: 51.5 nsec per loop $ python3.11 -m timeit -s 'x=1.1' 'x * x' 20000000 loops, best of 5: 17.7 nsec per loop Complex numbers are more balanced. Surprisingly, some of the complex cases are faster than their float counterparts: $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x 2' 5000000 loops, best of 5: 42.4 nsec per loop $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x 2.0' 5000000 loops, best of 5: 43.3 nsec per loop $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0j' 2000000 loops, best of 5: 107 nsec per loop $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x * x' 10000000 loops, best of 5: 30.6 nsec per loop Decimal still shows a large difference: $ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x 2' 1000000 loops, best of 5: 206 nsec per loop $ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x Decimal(2)' 1000000 loops, best of 5: 281 nsec per loop $ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x * x' 5000000 loops, best of 5: 60.9 nsec per loop
msg408116 - (view)	Author: Ken Jin (kj) *	Date: 2021-12-09 13:12
I'm not sure about the original 10:1 difference in 3.10, but in 3.11, the 2:1 difference might be due to the PEP 659 machinery optimizing for int * int, and float * float cases (see BINARY_OP_MULTIPLY_INT and BINARY_OP_MULTIPLY_FLOAT in ceval). Last I recall, these were slightly faster than their unspecialized counterparts as they have less overhead in C function calls. Meanwhile, none of the power operations are optimized via PEP 659 machinery at the moment.
msg409609 - (view)	Author: Tim Peters (tim.peters) *	Date: 2022-01-03 19:25
I was suprised that https://bugs.python.org/issue44376 managed to get i*2 to within a factor of 2 of ii's speed. The overheads of running long_pow() at all are high! Don't overlook that initialization of stack variables at the start, like PyLongObject z = NULL; / accumulated result / isn't free - code has to be generated to force zeroes into those variables. The initialization of `table[]` alone requires code to fill 256 memory bytes with zeroes (down to 128 on current main branch). Nothing is free. We can't sanely move the `table` initialization expense into the "giant k-ary window exponentiation" block either, because every bigint operation can fail ("out of memory"), and the macros for doing the common ones (MULT and REDUCE) can do "goto Error;", and that common exit code has no way to know what is or isn't initialized. We can't let it see uninitialized stack trash. The exit code in turn has a string of things like Py_DECREF(a); Py_DECREF(b); Py_XDECREF(c); and those cost cycles too, including tests and branches. So the real "outrage" to me is why xx took 17.6 nsec for x == 10 in the original report. That's many times longer than the HW takes to do the actual multiply. Whether it's spelled xx or x2, we're overwhelming timing overheads. `pow()` has many because it's a kind of Swiss army knife doing all sorts of things; what's `xx`'s excuse? ;-)
msg410381 - (view)	Author: Tim Peters (tim.peters) *	Date: 2022-01-12 02:47
GH_30555 helps a bit by leaving the giant-exponent table of small odd powers as uninitialized stack trash unless it's actually used.
msg410422 - (view)	Author: Tim Peters (tim.peters) *	Date: 2022-01-12 18:55
New changeset fc05e6bfce5d5dfc23859e6f7862c1e707a12e42 by Tim Peters in branch 'main': bpo-46020: Optimize long_pow for the common case (GH-30555) https://github.com/python/cpython/commit/fc05e6bfce5d5dfc23859e6f7862c1e707a12e42

History
Date	User	Action	Args
2022-04-11 14:59:53	admin	set	github: 90178
2022-01-12 18:55:05	tim.peters	set	messages: + msg410422
2022-01-12 02:47:49	tim.peters	set	messages: + msg410381
2022-01-12 02:41:39	tim.peters	set	keywords: + patch stage: patch review pull_requests: + pull_request28756
2022-01-03 19:25:09	tim.peters	set	messages: + msg409609
2022-01-02 16:38:38	mark.dickinson	set	nosy: + tim.peters
2021-12-09 13:12:12	kj	set	nosy: + kj messages: + msg408116
2021-12-09 09:50:28	rhettinger	set	messages: + msg408099
2021-12-09 09:37:00	rhettinger	set	messages: + msg408094
2021-12-09 08:38:22	mark.dickinson	set	nosy: + mark.dickinson
2021-12-09 08:01:11	Dennis Sweeney	set	nosy: + Dennis Sweeney messages: + msg408079
2021-12-09 06:32:23	rhettinger	create