This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Optimize long_pow for the common case
Type: performance Stage: patch review
Components: Interpreter Core Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Dennis Sweeney, kj, mark.dickinson, rhettinger, tim.peters
Priority: normal Keywords: patch

Created on 2021-12-09 06:32 by rhettinger, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 30555 merged tim.peters, 2022-01-12 02:41
Messages (8)
msg408076 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-09 06:32
The expression 'x * x' is faster than 'x ** 2'.

In Python3.10, the speed difference was enormous.  Due to ceval optimizations, the difference in Python3.11 is tighter; however, there is still room for improvement.

The code for long_pow() doesn't currently have a fast path for squaring which is by far the most important case.

$ python3.10 -m timeit -r 11 -s 'x = 10' 'x ** 2'
1000000 loops, best of 11: 201 nsec per loop
$ python3.10 -m timeit -r 11 -s 'x = 10' 'x * x'
10000000 loops, best of 11: 21.9 nsec per loop

$ python3.11 -m timeit -r 11 -s 'x = 10' 'x ** 2'
10000000 loops, best of 11: 32 nsec per loop
$ python3.11 -m timeit -r 11 -s 'x = 10' 'x * x'
20000000 loops, best of 11: 17.6 nsec per loop
msg408079 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-12-09 08:01
I believe https://bugs.python.org/issue44376 added a special case for 2nd and 3rd powers, and that's the 3.10/3.11 difference in the speed of x**2, not ceval optimizations.
msg408094 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-09 09:37
Hmm, I had just looked at that code and it wasn't at all obvious that an optimization had been added.  I expected something like:

   if (exp==2) return PyNumber_Multiply(x, x);

I wonder where the extra clock cycles are going.  Looking at the ceval.c dispatch and the bodies of PyNumber_Multiply(), _PyNumber_PowerNoMod(), and PyNumber_Power(), it looks like the power dispatch code should be about the same as or slightly cheaper than the multiply dispatch.

I'm surprised there is still almost a two to one speed difference.   ISTM there is still too much overhead for squaring.
msg408099 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-12-09 09:50
The situation for floats is also disappointing:

$ python3.11 -m timeit -s 'x=1.1' 'x ** 2'
5000000 loops, best of 5: 60.8 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x ** 2.0'
5000000 loops, best of 5: 51.5 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x * x'
20000000 loops, best of 5: 17.7 nsec per loop

Complex numbers are more balanced.  Surprisingly, some of the complex cases are faster than their float counterparts:

$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2'
5000000 loops, best of 5: 42.4 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0'
5000000 loops, best of 5: 43.3 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0j'
2000000 loops, best of 5: 107 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x * x'
10000000 loops, best of 5: 30.6 nsec per loop

Decimal still shows a large difference:

$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** 2'
1000000 loops, best of 5: 206 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** Decimal(2)'
1000000 loops, best of 5: 281 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x * x'
5000000 loops, best of 5: 60.9 nsec per loop
msg408116 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-12-09 13:12
I'm not sure about the original 10:1 difference in 3.10, but in 3.11, the 2:1 difference might be due to the PEP 659 machinery optimizing for int * int, and float * float cases (see BINARY_OP_MULTIPLY_INT and BINARY_OP_MULTIPLY_FLOAT in ceval).

Last I recall, these were slightly faster than their unspecialized counterparts as they have less overhead in C function calls. Meanwhile, none of the power operations are optimized via PEP 659 machinery at the moment.
msg409609 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-01-03 19:25
I was suprised that

https://bugs.python.org/issue44376 

managed to get i**2 to within a factor of 2 of i*i's speed. The overheads of running long_pow() at all are high! Don't overlook that initialization of stack variables at the start, like

    PyLongObject *z = NULL;  /* accumulated result */

isn't free - code has to be generated to force zeroes into those variables. The initialization of `table[]` alone requires code to fill 256 memory bytes with zeroes (down to 128 on current main branch). Nothing is free.

We can't sanely move the `table` initialization expense into the "giant k-ary window exponentiation" block either, because every bigint operation can fail ("out of memory"), and the macros for doing the common ones (MULT and REDUCE) can do "goto Error;", and that common exit code has no way to know what is or isn't initialized. We can't let it see uninitialized stack trash.

The exit code in turn has a string of things like

    Py_DECREF(a);
    Py_DECREF(b);
    Py_XDECREF(c);

and those cost cycles too, including tests and branches.

So the real "outrage" to me is why x*x took 17.6 nsec for x == 10 in the original report. That's many times longer than the HW takes to do the actual multiply. Whether it's spelled x*x or x**2, we're overwhelming timing overheads. `pow()` has many because it's a kind of Swiss army knife doing all sorts of things; what's `x*x`'s excuse? ;-)
msg410381 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-01-12 02:47
GH_30555 helps a bit by leaving the giant-exponent table of small odd powers as uninitialized stack trash unless it's actually used.
msg410422 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-01-12 18:55
New changeset fc05e6bfce5d5dfc23859e6f7862c1e707a12e42 by Tim Peters in branch 'main':
bpo-46020: Optimize long_pow for the common case (GH-30555)
https://github.com/python/cpython/commit/fc05e6bfce5d5dfc23859e6f7862c1e707a12e42
History
Date User Action Args
2022-04-11 14:59:53adminsetgithub: 90178
2022-01-12 18:55:05tim.peterssetmessages: + msg410422
2022-01-12 02:47:49tim.peterssetmessages: + msg410381
2022-01-12 02:41:39tim.peterssetkeywords: + patch
stage: patch review
pull_requests: + pull_request28756
2022-01-03 19:25:09tim.peterssetmessages: + msg409609
2022-01-02 16:38:38mark.dickinsonsetnosy: + tim.peters
2021-12-09 13:12:12kjsetnosy: + kj
messages: + msg408116
2021-12-09 09:50:28rhettingersetmessages: + msg408099
2021-12-09 09:37:00rhettingersetmessages: + msg408094
2021-12-09 08:38:22mark.dickinsonsetnosy: + mark.dickinson
2021-12-09 08:01:11Dennis Sweeneysetnosy: + Dennis Sweeney
messages: + msg408079
2021-12-09 06:32:23rhettingercreate