classification
Title: long int bitwise ops speedup (patch included)
Type: performance Stage: test needed
Components: Interpreter Core Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: tim.peters Nosy List: christian.heimes, gregsmith, loewis, mark.dickinson, pitrou, rhettinger, sopoforic, tim.peters
Priority: normal Keywords: patch

Created on 2004-12-18 05:22 by gregsmith, last changed 2009-10-25 20:46 by mark.dickinson. This issue is now closed.

Files
File name Uploaded Description Edit
long.diff gregsmith, 2004-12-18 05:22 diffs to longobject.c (rel 2.4 version)
long_bitwise_ops_metd.patch mark.dickinson, 2009-10-25 14:00
Messages (17)
msg47377 - (view) Author: Gregory Smith (gregsmith) Date: 2004-12-18 05:22
The 'inner loop' for applying bitwise ops to longs is quite
inefficient.

The improvement in the attached diff is
 - 'a' is never shorter than 'b' (result: only test 1
   loop index condition instead of 3)
 - each operation ( & | ^ ) has its own loop, instead
   of switch inside loop
- I found that, when this is done, a lot
of things can be simplified, resulting in further speedup,
and the resulting code is not very much longer than
before (my libpython2.4.dll  .text got 140 bytes longer).

Operations on longs of a few thousand bits appear
to be 2 ... 2.5 times faster with this patch.
I'm not 100% sure the code is right, but it passes
test_long.py, anyway.

msg47378 - (view) Author: Gregory Smith (gregsmith) Date: 2005-01-03 19:54
Logged In: YES 
user_id=292741

I originally timed this on a cygwin system, I've since found
that cygwin timings tend to be strange and possibly
misleading. On a RH8 system, I'm seeing speedup of x3.5 with
longs of ~1500 bits and larger, and x1.5 speedup with only
about 300 bits. Times were measured with timeit.Timer(
'a|b', 'a=...; b=...')
Increase in .text size is likewise about 120 bytes.

msg47379 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-01-07 06:54
Logged In: YES 
user_id=80475

Patch Review
------------

On Windows using MSC 6.0, I could only reproduce about a
small speedup at around 300 bits. 

While the patch is short, it adds quite a bit of complexity
to the routine.  Its correctness is not self-evident or
certain.  Even if correct, it is likely to encumber future
maintenance.

Unless you have important use cases and feel strongly about
it, I think this one should probably not go in.

An alternative to submit a patch that limits its scope to
factoring  out the innermost switch/case.  I tried that and
found that the speedup is microscopic.  I suspect that that
one unpredictable branch is not much of a bottleneck.  More
time is likely spent on creating z.
msg47380 - (view) Author: Gregory Smith (gregsmith) Date: 2005-02-11 03:45
Logged In: YES 
user_id=292741

I started by just factoring out the inner switch loop. But then
it becomes evident that when op = '^', you always have
maska == maskb, so there's no point in doing the ^mask at all.
And when op == '|', then maska==maskb==0. So likewise.
And if you put a check in so that len(a) >= len(b), then the
calculation of len_z can be simplified. It also becomes easy
to break the end off the loops, so that, say, or'ing a small
number with a really long becomes mostly a copy. etc.
It's was just a series of small simple changes following
from the refactoring of the loop/switch. 

I see a repeatable 1.5 x speedup at 300 bits, which
I think is significant (I wasn't using negative #s, which
of course have their own extra overhead). The difference
should be even higher on CPUs that don't have several
100 mW of branch-prediction circuitry.

One use case is that you can simulate an array
of hundreds or thousands of simple 1-bit processors
in pure python using long operations, and get very
good performance, even better with this fix. This app
involves all logical ops, with the occasional shift.


IMHO, I don't think the changed code is more complex; it's a
little longer, but it's more explicit in what is really
being done, and it doesn't roll together 3 cases, which
don't really have that much in common, for the sake of
brevity.  It wasn't obvious to
me about the masks being redundant until after I did the
factoring, and this is my point - rolling it together hides
that.
The original author may not have noticed the redundancy.

 I see a lot of effort being expended on very complex
multiply operations, why should the logical ops be left
behind for
the sake of a few lines?







msg47381 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-05-25 19:54
Logged In: YES 
user_id=31435

Assigned to me, changed Category to Performance.
msg47382 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-12-11 04:36
Tim, what is your view on this patch?
msg59323 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-05 19:56
It's probably useful for Python 3.0 since the old int type is gone.
msg94400 - (view) Author: Gregory Smith (gregsmith) Date: 2009-10-24 01:56
Actually, my view for 3.x is this: I do agree hugely with the 'top
level' decision to only have one type that handles everything, and
obviously the easiest way to get there is to just use the existing long
type. However, though the rad-2^15 implementation of the 'long' type was
a fine thing for the 2.x python where it's only used for relatively
unusual cases, I think it's too inefficient to use as the primary
integer type, so I'm wondering if there's a plan to replace it with
something more efficient (thus rendering this particular issue irrelevant).

I did spend some time delving into python internals years ago, but
haven't really looked into 3.x, so bear with me. I am aware of the
following 'speed tweaks' in Python 2.x integers, aren't these lost now?

   (1) integers all same size, allocated from linked-list pool instead
of from malloc
   (2) 'add' and 'sub' byte codes do special check in interpreter core
to see if ints being added, and will go direct to PyInt_FromLong where
possible

It would seem to me that a more suitable implementation would be to
store numbers as 32*k-bit 2's complement integers; I've used this in C++
libraries. Compared to the rad-15 sign/magnitude approach, it may seem
trickier to do carries but in practice it's not that big a deal. It also
basically assumes you have a decently fast 64-bit operation to do
multiplies and divides with, which I think is now reasonable. This
implementation will be much easier to optimize for the (now) extremely
common case where numbers fit into 32 bits.
It could also be possible to store 'wee' integers in a special pool and
only use malloc for larger ones. 

I know there's a lot of code in that module, virtually all of which
would be affected by this. But isn't this type now a big performance
factor in 3.x?
msg94402 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-24 06:50
The type is an important performance factor but most uses of it are for
small ints (< 2**32 or 2**64), so your approach wouldn't make much of a
difference.
Besides, there are already some improvements in the py3k branch (for
example, longs now use 30-bit "digits" on 64-bit systems).

> I am aware of the
> following 'speed tweaks' in Python 2.x integers, aren't these lost now?

Yes, they are. As a result, calculations on small ints have become a bit
slower.
msg94404 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-10-24 09:10
[Greg]
> It would seem to me that a more suitable implementation would be to
> store numbers as 32*k-bit 2's complement integers; I've used this in 
> C++ libraries. Compared to the rad-15 sign/magnitude approach, it may
> seem trickier to do carries but in practice it's not that big a deal.

Do you know of any publicly available code that takes this approach, 
while remaining simple and portable?

For 32-bit limbs (on a 32-bit platform, say, with no C99 support and no 
uint64_t available), how *do* you deal with carries?  You can write 
portable C89 that does the correct thing, but you then have to cross 
your fingers and hope that the compiler can turn it into sensible 
assembler, and it often can't (I'll post an example of this that I ran 
into just yesterday in a second).  And even if your compiler can 
optimize this effectively, what about all the other compilers out there?  
The alternative is to use inline assembler for selected platforms;  at 
that point, you lose simplicity.

The sign-magnitude versus two's complement is an orthogonal issue, I 
think.  I'm not convinced of the value of two's complement.  Certainly 
two's complement would be more convenient for the bitwise operations 
under discussion, and probably also works well for addition and 
subtraction.  But it's less convenient for negation, multiplication, 
division, power, conversion to and from human-readable strings, etc.  
It's worth noting that GMP uses sign-magnitude, so I can't believe there 
could be much performance gain in moving to two's complement.  
(Actually, I'm not sure I've seen any arbitrary-precision arithmetic 
package that uses two's complement.  Has anyone here?)

Here's the example promised earlier:  yesterday I wanted to add two 128-
bit unsigned integers a and b, each represented as a pair of type 
uint64_t integers, on a 64-bit machine.  In portable C99, the code looks 
something like:

#include <stdint.h>

/* *sumhigh:*sumlow = ahigh:alow + bhigh:blow */

void
add_128(uint64_t *sumhigh, uint64_t *sumlow,
           uint64_t ahigh, uint64_t alow,
           uint64_t bhigh, uint64_t blow)
{
  alow += blow;
  ahigh += bhigh + (alow < blow);
  *sumlow = alow;
  *sumhigh = ahigh;
}

Ideally, the compiler would manage to optimize this to a simple 'addq, 
adcq' pair of instructions.  But gcc-4.4 -O3, on OS X 10.6/x86_64 gives 
me:

_add_128:
LFB0:
        leaq    (%r9,%rcx), %rcx
        xorl    %eax, %eax
        leaq    (%r8,%rdx), %rdx
        pushq   %rbp
LCFI0:
        cmpq    %r9, %rcx
        movq    %rcx, (%rsi)
        setb    %al
        movq    %rsp, %rbp
LCFI1:
        addq    %rax, %rdx
        movq    %rdx, (%rdi)
        leave
        ret

(Here it looks like alow and blow are in r9 and rcx, ahigh and bhigh are 
in r8 and rdx, and rsi and rdi are sumlow and sumhigh.)

How do you write the C code in such a way that gcc produces the right 
result?  I eventually gave up and just put the assembler in directly.
msg94405 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-10-24 09:23
Here's the inline assembly version, for comparison:

#define SUM2_BIN64(sumhigh, sumlow, ahigh, alow, bhigh, blow)    \
  __asm__ ("addq\t%5, %1\n\t"                                    \
           "adcq\t%3, %0"                                        \
           : "=r" (sumhigh), "=&r" (sumlow)                      \
           : "0" ((uint64_t)(ahigh)), "rm" ((uint64_t)(bhigh)),  \
             "%1" ((uint64_t)(alow)), "rm" ((uint64_t)(blow))    \
           : "cc")

void
add_128_asm(uint64_t *sumhigh, uint64_t *sumlow,
           uint64_t ahigh, uint64_t alow,
           uint64_t bhigh, uint64_t blow)
{
  SUM2_BIN64(ahigh, alow, ahigh, alow, bhigh, blow);
  *sumlow = alow;
  *sumhigh = ahigh;
}

And the generated output (again gcc-4.4 with -O3):

_add_128_asm:
LFB1:
        pushq   %rbp
LCFI2:
# 29 "test.c" 1
        addq    %r9, %rcx
        adcq    %r8, %rdx
# 0 "" 2
        movq    %rsp, %rbp
LCFI3:
        movq    %rcx, (%rsi)
        movq    %rdx, (%rdi)
        leave
        ret
msg94416 - (view) Author: Gregory Smith (gregsmith) Date: 2009-10-24 14:05
Antoine, "most uses of it are for small ints (< 2**32 or 2**64), ",
that's my point exactly, the current long type is quite slow for those
cases because of sign-magnitude implementation.

I don't see a problem with sign-magnitude for old long ints, or for GMP,
since in those cases the assumption is that you have a fair percentage
of large #s, this is not valid in Py3.0 where most are small. Most
operations are add and subtract, and in each such operation you need to
look at both signs, and decide if you want to really add or subtract,
and if you are subtracting, you then have to do a magnitude test to see
which way - all of that before you do any actual computation. That's a
lot of overhead for e.g. 'i -= 1'. Especially given that the sign &
length information are rolled together into a single value; you need a
fair number of tests & conditional branches. With a 2's complement
implementation you would test for the most common case where both values
are 1 digit (which would include 0 value) and in that case do an add (or
subtract) followed by a check for overflow, and that's the whole thing.
I'm guessing this is 10% - 25% as much time as in the current
implementation - this is one of those cases where the time for tests and
setup dominate over the actual computation.

Mark, what you've written looks fine to me. It would be a bit faster
with an add-with-carry, but there's no conditional branches in the
generated code, so it's not bad, it's still faster than it would be if
you were extracting carries from the msb of each result and masking them
afterwards. Bear in mind that some RISC processors don't have an
add-with-carry (lacking condition codes altogether) so the method of
comparing the sum to the inputs is the only method available. Given that
in Python the operation is in a loop, I think it's unreasonable to to
expect a compiler to identify an add-with-carry application, but at the
same time it's not necessary. Now that the long int is the *only* int
type, multi-digit ops will be relatively uncommon, and you want to focus
on reducing the overhead before and after the operation.

(And, for speed freaks who want to use long ints to implement large bits
arrays and want fast bit ops - my original motivation for this issue -
it would be possible to use SSE2 vector instructions on specific
processors. That can also be done with the 15-bit implementation, but it
works much better with 32).


The 'algorithmic C' package
(http://www.mentor.com/products/esl/high_level_synthesis/ac_datatypes)
is a C++ template class for arbitrary-length signed/unsigned integers.
It's not suitable for use in Python because of license issues and
because its integers are controlled by templates and are fixed-size at
compile time - but it uses Kx32-bit 2's complement format, and
implements adds, shifts, multiplies, logic ops, and limited division
ops. It works very well, with extremely low overhead on small values -
often the code is exactly the same as if you used int type  - that would
not be possible with a sign/magnitude approach.

As for multiplies and divides - it's certainly possible to proceed
through a multiply entirely in 2's complement, so no overhead is needed
for abs value, or changing sign at the end. For division, it's possible
if the denominator is > 0, and it may be possible if <0 (but probably
not worth the hassle).

I would be happy to do a design doc for this and write some of the inner
loops, but if (a) it's already being done or (b) there's no chance of it
being deployed then it would be a waste of time...
if there was definite interest in it (and reasonable schedule
expectations) I could write a lot more of it.
msg94420 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-10-24 14:25
> I would be happy to do a design doc for this and write some of the inner
> loops, but if (a) it's already being done or (b) there's no chance of it
> being deployed then it would be a waste of time...
> if there was definite interest in it (and reasonable schedule
> expectations) I could write a lot more of it.

I think the only realistic chance in which something may change is that
somebody sits down and does all the work, from beginning to end. I doubt
that a design doc and some inner loops alone will make any difference.

It's not being done (to my knowledge), for the reason alone that it
would be a lot of work. It has a chance to be deployed only if a)
it is significantly faster, for small numbers, b) correct, and c)
only slightly more complicated than the current code.

I notice that such discussion is off-topic for this bug report, which
is about your original patch. All we have to decide here is whether to
accept it (perhaps with modifications), or to reject it.
msg94435 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-10-24 19:25
> Most operations are add and subtract, and in each such operation you
> need to look at both signs, and decide if you want to really add or
> subtract, and if you are subtracting, you then have to do a magnitude
> test to see which way - all of that before you do any actual
> computation. That's a lot of overhead for e.g. 'i -= 1'.

Hmm.  I agree this isn't ideal, and I now see the attraction of
two's complement.  Thanks.

It would be interesting to see timings from such an approach.
Maybe one could just implement the basic operations (+, -, *, /)
to get an idea of whether it's worth considering more seriously.
msg94451 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-10-25 14:00
I think Greg's patch looks fine, modulo updating it to apply cleanly to 
py3k.

I couldn't resist tinkering a bit, though:  factoring out the complement 
operations (for a, b and z) into a separate function gives (IMO) cleaner 
and more direct code that's free of mask operations.

Here's the patch.
msg94457 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-10-25 18:29
Mark, if you want to get reviews, it might be useful to upload the patch
to Rietveld (as the diff is difficult to read). However, I would trust
you to get it right, anyway, so feel free to go ahead and apply it.
msg94461 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-10-25 20:46
Applied in r75697 (trunk) and r75698 (py3k).
History
Date User Action Args
2009-10-25 20:46:06mark.dickinsonsetstatus: open -> closed
resolution: accepted
messages: + msg94461
2009-10-25 18:29:20loewissetmessages: + msg94457
2009-10-25 14:00:40mark.dickinsonsetfiles: + long_bitwise_ops_metd.patch

messages: + msg94451
2009-10-24 19:25:41mark.dickinsonsetmessages: + msg94435
2009-10-24 14:25:49loewissetmessages: + msg94420
2009-10-24 14:05:27gregsmithsetmessages: + msg94416
2009-10-24 09:23:13mark.dickinsonsetmessages: + msg94405
2009-10-24 09:10:15mark.dickinsonsetmessages: + msg94404
2009-10-24 06:50:14pitrousetnosy: + pitrou

messages: + msg94402
versions: + Python 3.2, - Python 2.6, Python 3.1
2009-10-24 01:56:09gregsmithsetmessages: + msg94400
2009-10-23 22:37:39sopoforicsetnosy: + sopoforic
2009-04-28 17:26:18ajaksu2unlinkissue1492860 dependencies
2009-04-26 01:10:40ajaksu2setnosy: + mark.dickinson
versions: + Python 3.1, Python 2.7, - Python 3.0

type: performance
stage: test needed
2009-04-26 01:08:53ajaksu2linkissue1492860 dependencies
2008-01-05 19:56:59christian.heimessetpriority: low -> normal
nosy: + christian.heimes
messages: + msg59323
versions: + Python 2.6, Python 3.0
2004-12-18 05:22:50gregsmithcreate