classification
Title: fast modular exponentiation
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: christian.heimes, gregory.p.smith, mark.dickinson, tim.peters, trevp
Priority: normal Keywords: patch

Created on 2004-04-17 08:16 by trevp, last changed 2009-12-31 19:39 by mark.dickinson. This issue is now closed.

Files
File name Uploaded Description Edit
long_pow.diff nobody, 2004-07-19 10:52
longobject_diffs.tar.gz trevp, 2004-07-22 08:39
long_mont.diff trevp, 2004-09-13 08:20
long_mont2.diff trevp, 2004-10-04 05:43
long_mont3.diff trevp, 2004-10-04 07:48
long_mont4.diff trevp, 2004-10-05 07:25
long_pow_2005_9_28.diff trevp, 2005-09-29 06:29 update to CVS head
long_pow_2009_12_30.diff mark.dickinson, 2009-12-30 19:02
long_pow_2009_12_31.diff mark.dickinson, 2009-12-31 15:20
time_powmod.py mark.dickinson, 2009-12-31 16:32
Messages (27)
msg45791 - (view) Author: Trevor Perrin (trevp) Date: 2004-04-17 08:16
For crypto-sized numbers, Python mod-exp is several
times slower than GMP or OpenSSL (6x or more).  Those
libraries do crazy special-case stuff, + assembly,
platform-specific tuning, and so on.

However, there's some low-hanging fruit: this patch has
a few basic optimizations giving a 2-3x speedup for
numbers in the 1000-8000 bit range (that's what I've
mostly tested; but the patch should improve, or at
least not hurt, everything else):

 - x_mul() is special-cased for squaring, which is
almost twice as fast as general multiplication.
 
 - x_mul() uses pointers instead of indices for
iteration, giving ~10% speedup (under VC6).
 
 - the right-to-left square-and-multiply exponentiation
algorithm is replaced with a left-to-right
square-and-multiply, which takes advantage of small bases.
 
 - when the exponent is above a certain size, "k-ary"
exponentiation is used to reduce the number of
multiplications via precalculation.
 
 - when the modulus is odd, Montgomery reduction is used.

 - the Karatsuba cutoff seems too low.  For
multiplicands in the range of 500-5000 bits, Karatsuba
slows multiplication by around ~25% (VC6sp4, Intel P4M
1.7 Ghz).  For larger numbers, the benefits of
Karatsuba are less than they could be.
 
 Currently, the cutoff is 35 digits (525 bits).  I've
tried 70, 140, 280, and 560.  70, 140, and 280 are
roughly the same: they don't slow down small values,
and they have good speedup on large ones.  560 is not
quite as good for large values, but at least it doesn't
hurt small ones.
 
I know this is platform-dependent, but I think we
should err on the side of making the cutoff too high
and losing some optimization, instead of putting it too
low and slowing things down.  I suggest 70.
 

A couple misc. things:

 - Negative exponents with a modulus no longer give an
error, when the base is coprime with the modulus. 
Instead, it calculates the multiplicative inverse of
the base with respect to the modulus, using the
extended euclidean algorithm, and exponentiates that.
 
 Libraries like GMP and LibTomMath work the same way. 
Being able to take inverses mod a number is useful for
cryptography (e.g. RSA, DSA, and Elgamal).
 
 - The diff includes patch 923643, which supports
converting longs to byte-strings.  Ignore the last few
diff entries, if you don't want this.
 
 - I haven't looked into harmonizing with int_pow(). 
Something may have to be done.
msg45792 - (view) Author: Trevor Perrin (trevp) Date: 2004-07-13 08:04
Logged In: YES 
user_id=973611

Uploading 2nd version of longobject.diff - the only change
is that patch 923643 is removed from this diff.  That was a
diff for converting longs to byte-strings, which I
unnecessarily left in.
msg45793 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-07-17 03:06
Logged In: YES 
user_id=31435

Notes after a brief eyeball scan:

Note that the expression

a & 1 == 1

groups as

a & (1 == 1)

in C -- comparisons have higher precedence in C than bit-
fiddling operators.  Stuff like that is usually best resolved by 
explicitly parenthesizing any "impure" expression fiddling with 
bits.  In this case, in a boolean expression plain

a & 1

has the hoped-for effect. and is clearer anyway.

Would be better to use "**" than "^" in comments when 
exponentiation is intended, since "^" means xor in both 
Python and C.

Doc changes are needed, because you're changing visible 
semantics in some cases.

Tests are needed, especially for new semantics.

l_invmod can return NULL for more than one reason, but one 
of its callers ignores this, assuming that all NULL returns are 
due to lack of coprimality.  It's unreasonable to, e.g., replace 
a MemoryError with a complaint about coprimality; this needs 
reworking.  l_invmod should probably set an exception in 
the "not coprime" case.  As is, it's a weird function, 
sometimes setting an exception when it returns NULL, but not 
setting one when coprimality doesn't obtain.  That makes life 
difficult for callers (which isn't apparent in the patch, 
because its callers are currently ignoring this issue).

The Montgomery reduction gimmicks grossly complicate this 
patch -- they're long-winded and hard to follow.  That may 
come with the territory, but it's the only part of the patch 
that made me want to vomit <wink>.  I'd be happier if it 
weren't there, for aesthetic, clarity, and maintainability 
reasons.   How much of a speedup does it actually buy?

You're right that int pow must deliver the same results as 
long pow, so code is needed for that too.  "short int" 
versus "unbounded int" is increasingly meant to be an invisible 
internal implementation detail in Python.  I'm also in favor of 
giving this meaning to modular negative exponents, btw, so 
no problem with that.  An easy way would be to change int 
pow to delegate to long pow when this is needed.

Pragmatics:  there's a better chance of making 2.4 if the 
patch were done in bite-size stages.  For example, no doc 
changes are needed to switch to 5-ary left-to-right 
exponentation, and that has no effect on the int 
implementation either, etc.  A patch that did just that much 
probably would have gone in a long time ago.
msg45794 - (view) Author: Trevor Perrin (trevp) Date: 2004-07-19 11:00
Logged In: YES 
user_id=973611


Tim, thanks for the feedback.  I'm uploading a new patch
against CVS latest that fixes those issues, and adds docs
and tests.  Also, I cleaned up the code quite a bit, and got
it properly handling (I hope) all the varied combinations of
ints/longs, positives/negatives/zeros etc..

Unfortunately, Montgomery is the bulk of the speedup:
http://trevp.net/long_pow/

But I could split out the negative exponent handling into a
separate patch, if that would be easier.

Anyways, I'd like to add more tests for the exponentiation
stuff.  Aside from that, I think the patch is complete.  And
more robust than previously, though I still wouldn't trust
it until another person or two gives it a serious
looking-over....
msg45795 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-07-21 19:29
Logged In: YES 
user_id=31435

Pragmatics are a real problem here, Trevor.  I don't foresee 
being able to make a solid block of sufficient hours to give to 
reviewing this before Python 2.4 is history (which is why I've 
left this patch unassigned, BTW -- I just can't promise to 
make enough time).  So if nobody else can volunteer to 
review it, that alone is likely to leave the patch sitting here 
unapplied.

But there are several independent changes in this patch, and 
it *could* be broken into several smaller patches.  I tossed 
that bait out before, but you didn't bite.  You should <wink>.
msg45796 - (view) Author: Trevor Perrin (trevp) Date: 2004-07-22 08:39
Logged In: YES 
user_id=973611

Pragmatics isn't my strong suit... but I get your drift :-).
 I split it into 3 diffs:
 1) x_mul optimizations: (pointers instead of indices,
special-case squaring, changing Karatsuba cutoff)
 2) rewriting long_pow() for left-to-right 5-ary
 3) Montgomery reduction.  This also includes l_invmod(),
since it's necessary for Montgomery.

I've left out the code which exposes l_invmod() to the user
(and associated docs, tests, and intobject changes).  We
could slap that on afterwards or not...

Anyways, these are applied sequentially:
longobject.c + longobject1.diff = longobject1.c
longobject1.c + longobject2.diff = longobject2.c
longobject2.c + longobject2.diff = longobject3.c

Should I open new tracker items for them?
msg45797 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-08-29 22:21
Logged In: YES 
user_id=31435

Checked in the first part of the patch, with major format 
changes (Python's C coding standard is hard 8-column tabs), 
and minor code changes:

Include/longintrepr.h 2.15
Misc/ACKS 1.280
Misc/NEWS 1.1119
Objects/longobject.c 1.162

I don't know whether it's possible for me to get to part 2 of 
the patch before 2.4a3, but I would like to.  It seems plainly 
impossible that I'll be able to get to part 3 before 2.4a3.
msg45798 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-08-30 02:47
Logged In: YES 
user_id=31435

Same deal with the 2nd part of the patch (major format 
changes, minor code changes).  Incidentally fixed an old leak 
bug in long_pow() during the review.  Added code to raise a 
compile-time error (C) if SHIFT isn't divisible by 5, and 
removed long_pow's new hardcoded assumption that SHIFT is 
exactly 15.

Include/longintrepr.h 2.16
Misc/NEWS 1.1120
Objects/longobject.c 1.163

This is cool stuff (& thank you!), but I'm sorry to say I can't 
foresee making time for the 3rd part of the patch for weeks.
msg45799 - (view) Author: Trevor Perrin (trevp) Date: 2004-09-13 08:20
Logged In: YES 
user_id=973611

Here's the 3rd part of the patch (long_mont.diff; Montgomery
Reduction), diff'd against 2.4a3 and cleaned up a bit.

Note that this doesn't include negative exponent handling. 
If this patch is accepted, I'll make a new tracker item for
that, since it's not an optimization, just an "opportunistic
feature" (it builds on one of the helper functions needed
for Montgomery).
msg45800 - (view) Author: Trevor Perrin (trevp) Date: 2004-10-04 05:43
Logged In: YES 
user_id=973611

I did more code review, testing, and timing.  The only
change in this new patch (long_mont2.diff) is a couple
"int"s were changed to "digits"s, and it's against CVS head.

As far as testing, I used the random module and GMPY to
check it on ~3 million random input values.  That's about an
hour of testing.  I'll leave the tests running for a few
days and see if anything crops up.

As far as timing, I updated the benchmarks with a new
machine (OpenBSD):
http://trevp.net/long_pow/
On 3 different machines, Montgomery gives a speedup of 2x,
3x, and 4x.  That dwarfs what we've done so far, so I'm
crossing my fingers for 2.4.  Let me know if I can explain
or improve the code, or anything..  

(The below crypto library comes with a "book" which has an 
explanation of Montgomery I found helpful):
http://math.libtomcrypt.org/download.html
msg45801 - (view) Author: Trevor Perrin (trevp) Date: 2004-10-04 07:48
Logged In: YES 
user_id=973611


oops.  Good thing for random testing, carry propagation was
buggy.  Submitting long_mont3.diff.
msg45802 - (view) Author: Trevor Perrin (trevp) Date: 2004-10-05 07:25
Logged In: YES 
user_id=973611

Montgomery has a fixed cost, so it slows down small
exponents. For example modular squaring is slowed ~5x.  I
added a MONTGOMERY_CUTOFF to take care of this.  Submitting
long_mont4.diff.
msg45803 - (view) Author: Trevor Perrin (trevp) Date: 2005-09-29 06:29
Logged In: YES 
user_id=973611

I updated this patch to CVS head, but didn't change it
otherwise.  It's still a bit hairy.  However, it's also
still a big speedup (see benchmarks from 2004-10-03).

If I can do anything to help this make it in 2.5, let me know.
msg59785 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-12 04:31
Re-targeting for 2.6
We should discuss it at the bug day.
msg59825 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-12 17:05
Mark, as the second math guru in our team, you are probably interested
in these patches. Trevor has written an interesting patch to optimize
longs for cryptographic problems. In fact it might be two patches now,
one for the Montgomery Reduction and one containing other optimizations.
The 2005 patch applies almost cleanly.
msg82063 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-14 14:18
I'll see if I can find time to look at this;  I'm currently looking at 
various ways to improve long integer arithmetic in 2.7/3.1.
msg82498 - (view) Author: Trevor Perrin (trevp) Date: 2009-02-19 20:39
Hi Mark,

Let me know if I can give you any help with this.  The original patch
was split into 3 parts.  The only part remaining unapplied is the
Montgomery Reduction.  

It appeared to be a significant speedup when I was last testing, and is
frequently used in other modular exponentiation libraries, but it does,
admittedly, complicate the code.
msg82535 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-20 16:16
Thanks, Trevor.  I'm currently working on the 15-bit -> 30-bit digit 
change (issue 4258), since it seems sensible to get that in before 
considering other optimizations, but I certainly hope to find time to look 
at this before the 3.1 release cycle gets underway.
msg82536 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-20 16:25
By the way, I'd be interested to know if you (Trevor) have any thoughts on 
the multiplication optimizations that are in the patch

30bit_longdigit13+optimizations.patch

in the issue 4258 discussion.  These have been giving me some quite
spectacular speedups (4 or 5 times faster) for multiplication on
64-bit machines; smaller speedups on 32-bit.  Currently, those speedups
render your earlier special-case-squaring patch obsolete;  I wonder
whether there's a way to get the same speedup for squaring.
msg97003 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-29 21:07
This patch still(!) applies almost perfectly cleanly to trunk.  On a 64-
bit machine, I'm getting a failure in test_auto_overflow, coming from:

>>> pow(0L, 0, 9223372036854775807)
28051505152L

I haven't looked hard to figure out where this is coming from, but my 
guess is that the 15-bitness of digits is hard-coded in the patch 
somewhere.

My general feeling is that three-argument pow is such a little-used 
operation in Python that it's not worth the extra code to speed it up.
msg97031 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-30 13:55
Looks like the test failure is a result of a misplaced (twodigits) cast:  replacing the line

   carry += (twodigits) ( (*aptr) + (u * (*mptr++)) );

in function mont_reduce with

   carry += *aptr + (twodigits)u * *mptr++;

fixes this.
msg97054 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-30 19:02
Here's a slightly modified version of Trevor's patch:

 - update to apply cleanly to trunk
 - fix misplaced twodigits cast described above
 - add 'PyLong_' prefix to BASE, MASK and SHIFT
 - no need for _PyLong_Copy in l_invmod:  just copy and INCREF the
   pointers
 - don't calculate d - q*c by hand in l_invmod, since l_divmod computes
   the remainder already
 - simplify reference counting in l_invmod
 - add a '* 1U' to a digit-by-digit multiplication, to avoid possible
   (in theory;  unlikely in practice) undefined behaviour resulting
   from signed integer overflow.

The rest of the patch looks fine to me, modulo some minor style issues.

Two points:

- it seems that l_invmod is only ever used to compute the inverse of a 
single-digit long modulo PyLong_BASE.  Mightn't it be more efficient to 
simply do a twodigit/digit-based calculation here instead, to reduce the 
overhead of setting up the Montgomery reductions?

- the current algorithm for three-argument pow involves many allocations 
and deallocations of integers;  it seems to me that these could almost 
all be eliminated:  for pow(a, b, c) with c an n-digit PyLong, just 
allocate 2*n (or possibly 2*n+1) digits of workspace to begin with and 
use that to store intermediate products;  the Montgomery reduction could 
then be done more-or-less in place.  This doesn't work with the non-
Montgomery method, though, since l_divmod doesn't operate in place.
msg97098 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-31 15:20
Here's a second revision of Trevor's patch:

 - factor out the code for creating Montgomery representatives;  this
   simplifies the changes to the main long_pow function
 - get rid of l_invmod and use a simple function for computing the
   negation of an inverse of a single digit modulo PyLong_BASE instead
 - Montgomery reduction wasn't being used for multidigit exponent b if
   the last digit of the exponent is small.  Fix this.
 - add 'static' qualifier to mont_reduce function
 - use type Py_ssize_t instead of int for digit indices in mont_reduce
 - various other unimportant style fixes
msg97101 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-31 16:32
Some timings on my machine (OS X 10.6, 64-bit nondebug build, trunk r77157).  These are just 
doing an RSA-like powmod pow(c, d, n), with n the product of two similarly-sized primes, d the 
inverse of 7 modulo eulerPhi(n), and c of similar magnitude to n.

Without the patch:

Mark-Dickinsons-MacBook-Pro:trunk dickinsm$ ./python.exe ../time_powmod.py
64-bit modulus: 0.000031
253-bit modulus: 0.000274
1023-bit modulus: 0.008032

With the patch:

Mark-Dickinsons-MacBook-Pro:trunk-issue936813 dickinsm$ ./python.exe ../time_powmod.py
64-bit modulus: 0.000025
253-bit modulus: 0.000209
1023-bit modulus: 0.006717

So I'm seeing a speedup of 20-30%.

I've attached the (rather primtive) timing script. Anyone else want to contribute timings?
msg97102 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-31 16:39
Hmm.  For smaller inputs, I'm actually getting significant slowdowns:

Unpatched:

>>> timeit('pow(123, 123456789, 123456789L)')
7.355183839797974

Patched:

>>> timeit('pow(123, 123456789, 123456789L)')
8.873976945877075
msg97104 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-31 16:59
One more lot of timings, from Trevor's pow_benchmark.txt:

Unpatched
---------
1024 bits: 0.008256
2048 bits: 0.052324
3072 bits: 0.159689
4096 bits: 0.357264

Patched (percent speedup)
-------
1024 bits: 0.006576  (+25.5%)
2048 bits: 0.045878  (+14.1%)
3072 bits: 0.135740  (+17.6%)
4096 bits: 0.310756  (+15.0%)

I'm not quite sure why I'm not seeing the same level of speedup that Trevor originally 
reported.  Perhaps this a result of some of the other optimizations that have been 
applied to the long implementation (30-bit digits, a reworked division algorithm), or 
perhaps the same level of speedup isn't available on 64-bit machines for some reason.  
I hope I haven't somehow negated the effects of the patch in my refactoring.
msg97108 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-12-31 19:39
Okay, I retested the original patch without any of my refactoring (besides 
fixing the twodigits cast), and got pretty much the same numbers.

On a 32-bit non-debug trunk build (still on OS X 10.6), I get:

Unpatched
---------

Mark-Dickinsons-MacBook-Pro:trunk dickinsm$ ./python.exe 
../pow_benchmark.py 
1024 bits: 0.033691
2048 bits: 0.224796
3072 bits: 0.712510
4096 bits: 1.691484
Mark-Dickinsons-MacBook-Pro:trunk dickinsm$ ./python.exe ../time_powmod.py
64-bit modulus: 0.000054
253-bit modulus: 0.000981
1023-bit modulus: 0.034314

Patched
-------

Mark-Dickinsons-MacBook-Pro:trunk-issue936813 dickinsm$ ./python.exe 
../pow_benchmark.py 
1024 bits: 0.027317  (+23.3%)
2048 bits: 0.181053  (+24.2%)
3072 bits: 0.571688  (+24.6%)
4096 bits: 1.251051  (+35.2%)
Mark-Dickinsons-MacBook-Pro:trunk-issue936813 dickinsm$ ./python.exe 
../time_powmod.py
64-bit modulus: 0.000045 (+20.0%)
253-bit modulus: 0.000773 (+26.9%)
1023-bit modulus: 0.026983 (+27.2%)

I must admit I was hoping for a bit more than this.  IMO, these speedups 
aren't big enough to justify the extra complexity for what's really a bit 
of a niche function, so I'm going to reject this 3rd part and close this 
issue (but marking it as 'accepted' because most of the original 2004 
patch went in).
History
Date User Action Args
2009-12-31 19:39:35mark.dickinsonsetstatus: open -> closed
resolution: accepted
messages: + msg97108

stage: patch review -> resolved
2009-12-31 16:59:39mark.dickinsonsetmessages: + msg97104
2009-12-31 16:39:22mark.dickinsonsetmessages: + msg97102
2009-12-31 16:32:25mark.dickinsonsetfiles: + time_powmod.py

messages: + msg97101
2009-12-31 15:20:26mark.dickinsonsetfiles: + long_pow_2009_12_31.diff

messages: + msg97098
2009-12-30 19:02:41mark.dickinsonsetfiles: + long_pow_2009_12_30.diff

messages: + msg97054
2009-12-30 13:55:37mark.dickinsonsetmessages: + msg97031
2009-12-29 21:07:11mark.dickinsonsetmessages: + msg97003
2009-02-20 16:25:58mark.dickinsonsetmessages: + msg82536
2009-02-20 16:16:53mark.dickinsonsetmessages: + msg82535
2009-02-19 21:02:04gregory.p.smithsetnosy: + gregory.p.smith
2009-02-19 20:39:49trevpsetmessages: + msg82498
2009-02-14 14:18:07mark.dickinsonsetassignee: tim.peters -> mark.dickinson
messages: + msg82063
2009-02-14 13:53:47ajaksu2setstage: patch review
versions: + Python 3.1, Python 2.7, - Python 2.6, Python 3.0
2008-01-12 17:05:02christian.heimessetnosy: + mark.dickinson
messages: + msg59825
versions: + Python 3.0
2008-01-12 04:31:15christian.heimessetnosy: + christian.heimes
type: enhancement
messages: + msg59785
versions: + Python 2.6, - Python 2.5
2004-04-17 08:16:31trevpcreate