Classification
Title: Addition chains for pow saves 5-20% time for pow(int,int)
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.10
Process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: jneb, mark.dickinson, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2021-01-12 14:52 by jneb, last changed 2021-02-01 20:08 by mark.dickinson. This issue is now closed.

Files
File name Uploaded Description
longobject.c jneb, 2021-01-12 14:52 Proof of concept for faster integer powers
longobject.py jneb, 2021-01-12 18:56
Pull Requests
URL Status Linked
PR 24206 closed jneb, 2021-01-13 10:55
Messages (7)
msg384949 - Author: Jurjen N.E. Bos (jneb) * Date: 2021-01-12 14:52
When looking at the code of pow() with an integer exponent, I noticed there is a hard boundary between the binary and "fiveary" (actually 32-ary) computations. Also, the fiveary code wasn't really optimal.

So I wrote a proof of concept version of long_pow that dynamically uses addition chains!
It saves over 10% of the multiplications for exponents from 20 to a few hundred bits, and the savings then drop to a few percent for very long numbers. It does not take significantly more memory or time for any argument combination I checked.
I tested it on 3.8rc1, but I am planning to port it to 3.10. This is a bit difficult, since *lots* of code around it has changed, and I only have Windows 7. However, I'll keep you posted.
See https://github.com/jneb/cpython/tree/38_fast_pow
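The binary and fixed-window ("fiveary") methods being contrasted here can be sketched as follows. This is an illustration of the two classic schemes only, not the addition-chain code from the patch; the function names and the multiplication counters are mine.

```python
import operator

def binary_pow(a, n, mul=operator.mul):
    """Plain square-and-multiply. Returns (a**n, multiplications used)."""
    result, mults = a, 0
    for bit in bin(n)[3:]:              # bits of n below the leading 1
        result, mults = mul(result, result), mults + 1
        if bit == "1":
            result, mults = mul(result, a), mults + 1
    return result, mults

def window_pow(a, n, k=5, mul=operator.mul):
    """Fixed 2**k-ary windowing (k=5 is the 32-ary "fiveary" scheme).
    Returns (a**n, multiplications used)."""
    table, mults = [1, a], 0            # table[i] == a**i for i < 2**k
    for _ in range(2 ** k - 2):
        table.append(mul(table[-1], a))
        mults += 1
    result, started = 1, False
    top = -(-n.bit_length() // k) * k   # bit length rounded up to a multiple of k
    for shift in range(top - k, -1, -k):
        digit = (n >> shift) & (2 ** k - 1)
        if started:
            for _ in range(k):          # k squarings shift the result left
                result, mults = mul(result, result), mults + 1
            if digit:
                result, mults = mul(result, table[digit]), mults + 1
        elif digit:
            result, started = table[digit], True
    return result, mults
```

For an all-ones 300-bit exponent, square-and-multiply needs 598 multiplications (299 squarings plus 299 multiplies), while the 32-ary window needs 384 (30 for the table, 295 squarings, 59 window multiplies); the hard FIVEARY_CUTOFF between the two regimes is what the patch replaces with a smooth choice.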
msg384975 - Author: Jurjen N.E. Bos (jneb) * Date: 2021-01-12 18:56
Some more information for the interested:
The algorithm I made tries to smoothly change the "chunk size" with the growing length of the exponent. So the exponents that win the most (up to 14% fewer multiplications) are the long exponents that are just shorter than the FIVEARY_CUTOFF.
But I worked hard to make an algorithm that also saves multiplications for shorter numbers. Exponents of up to about 20 bits use the optimal chunk size.
And, of course, the decision must be made quickly because for some frequently occurring parameters (e.g., 3**25), the routine doesn't take that long anyway.
This is achieved by checking two properties of the exponent that strongly influence the addition chain: the highest four bits, and (surprise!) the number of pairs of bits at distance 2; in other words, (n & n >> 2).bit_count().
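Concretely, the two features could be computed like this. This is only an illustration of the metric described above, not the patch's actual decision code; the function name is mine, and bin(...).count("1") stands in for 3.10's int.bit_count().

```python
def chain_features(n):
    """Top four bits of the exponent, and the number of bit pairs at
    distance 2 -- i.e. (n & n >> 2).bit_count() in Python 3.10+ syntax."""
    top4 = n >> max(n.bit_length() - 4, 0)
    pairs = bin(n & (n >> 2)).count("1")
    return top4, pairs
```

For example, n = 25 (binary 11001) gives (12, 0), while n = 15 (binary 1111) gives (15, 2).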
After days of trying out all kinds of heuristics, and days of crunching, I measured the optimal parameters. I added the code I used to do that.
Guido may remember that I wrote a chapter in my Ph.D. on the subject of addition chains. The interesting thing is that I then used Python for that too: that was almost 30 years ago!
When implementing, I discovered that lots of the code around it had been in flux, so I didn't manage to "cherry-pick" it into 3.10 yet. (One example: the bit_length and bit_count routines were renamed and moved around.) And I don't have Windows 10 :-(
But anyway, have a look and let me know what you think of it. I also want to test and measure it a bit more, but I am sure it is quite stable.
msg385118 - Author: Jurjen N.E. Bos (jneb) * Date: 2021-01-15 16:30
Well, measurements show that it saves more time than I thought (sometimes over 20%!), because there are other ways in which it saves time; I am quite happy with that.

In the code I needed functions _Py_bit_length64 and _Py_bit_count64.
I thought these would fit better in bitutils, but I am not sure what names to use, since there are probably other places where they are used too (I know of at least one in hashtable.c).
The 32-bit versions in bitutils are called _Py_bit_length and _Py_popcount32 (not the most logical names).
So it would also be more logical to give all four of these consistent names, everywhere. But that's probably better left for a later time, right?
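For reference, the Python-level counterparts of these C helpers are the int methods (bit_count() was added in 3.10; bin(...).count("1") works on earlier versions):

```python
n = 0b1011_0010
assert n.bit_length() == 8       # position of the highest set bit, plus one
assert bin(n).count("1") == 4    # popcount; n.bit_count() in Python 3.10+
```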
msg385121 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-01-15 17:25
Thank you for the proposal and PR!

There are some tradeoffs to be considered here, between simplicity and performance; it's not always trivial to find the sweet spot.  Python's int implementation mostly favours the simplicity end of the spectrum. It's also focused on the sort of integer sizes that turn up in everyday problems, rather than crypto-sized (or worse, number-theoretic-sized) integers - those are more the province of libraries like cryptlib and GMP. That's why Python still has quadratic-time division and base-conversion algorithms.

For this particular case, my feeling is that the added complexity (~300 lines of C code) isn't worth the payoff.

That said, the existing integer powering implementation, like much else in longobject.c, goes back to Tim Peters. If Tim wants to champion this change and push it forward, I'm certainly not going to oppose that.
msg385423 - Author: Jurjen N.E. Bos (jneb) * Date: 2021-01-21 14:19
Well, I would argue that there is already quite a bit of work going into crypto-sized computations in the integer code, as well as the crypto-oriented .bit_count() function that was recently added.

For starters, the arguably crypto-oriented three-argument pow() has been there since Python 0.1, where I used it :-).
There's Karatsuba multiplication, five-ary powering, and quite a few optimizations to the speed of number conversion.

And then of course the incredible implementation of Decimal, which does include a subquadratic division. I would say this would fit there.

And maybe I'll make a subquadratic division for ints someday...

Tim, your vote please...
msg385424 - Author: Jurjen N.E. Bos (jneb) * Date: 2021-01-21 14:29
...not to mention the new gcd and lcm functions, the fact that number conversion is linear for power-of-two bases, and the negative power modulo a prime number!
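The features listed here are all in the stdlib; as a quick illustration (math.lcm needs Python 3.9+, negative pow() exponents need 3.8+):

```python
import math

assert math.gcd(12, 18) == 6
assert math.lcm(4, 6) == 12      # added in Python 3.9

# A negative exponent with a modulus computes a modular inverse (3.8+):
assert pow(3, -1, 7) == 5        # since 3 * 5 % 7 == 1
# For a prime modulus this agrees with Fermat's little theorem:
assert pow(3, 7 - 2, 7) == 5
```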
msg386098 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-02-01 20:08
Looks like Tim isn't taking the bait. :-)

@Jurjen: Thank you for the suggestion and PR, but I'm going to close here; for me this ends up on the wrong side of the performance / complexity trade-off.
History
Date User Action Args
2021-02-01 20:08:53  mark.dickinson  set  status: open -> closed; resolution: rejected; messages: + msg386098; stage: patch review -> resolved
2021-01-21 14:29:57  jneb  set  messages: + msg385424
2021-01-21 14:19:29  jneb  set  messages: + msg385423
2021-01-15 17:25:38  mark.dickinson  set  messages: + msg385121
2021-01-15 16:32:52  jneb  set  title: Addition chains for pow saves 10 % time! -> Addition chains for pow saves 5-20% time for pow(int,int)
2021-01-15 16:30:36  jneb  set  messages: + msg385118
2021-01-13 10:55:50  jneb  set  keywords: + patch; stage: patch review; pull_requests: + pull_request23032
2021-01-12 20:43:32  serhiy.storchaka  set  nosy: + serhiy.storchaka
2021-01-12 18:56:30  jneb  set  files: + longobject.py; messages: + msg384975
2021-01-12 15:50:54  gvanrossum  set  nosy: + tim.peters, mark.dickinson
2021-01-12 14:52:20  jneb  create