Issue 1087418: long int bitwise ops speedup (patch included)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/41337

classification

Title:	long int bitwise ops speedup (patch included)
Type:	performance	Stage:	test needed
Components:	Interpreter Core	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	tim.peters	Nosy List:	christian.heimes, gregsmith, loewis, mark.dickinson, pitrou, rhettinger, sopoforic, tim.peters
Priority:	normal	Keywords:	patch

Created on 2004-12-18 05:22 by gregsmith, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
long.diff	gregsmith, 2004-12-18 05:22	diffs to longobject.c (rel 2.4 version)
long_bitwise_ops_metd.patch	mark.dickinson, 2009-10-25 14:00

Messages (17)
msg47377 - (view)	Author: Gregory Smith (gregsmith)	Date: 2004-12-18 05:22
The 'inner loop' for applying bitwise ops to longs is quite inefficient. The improvement in the attached diff is - 'a' is never shorter than 'b' (result: only test 1 loop index condition instead of 3) - each operation ( & \| ^ ) has its own loop, instead of switch inside loop - I found that, when this is done, a lot of things can be simplified, resulting in further speedup, and the resulting code is not very much longer than before (my libpython2.4.dll .text got 140 bytes longer). Operations on longs of a few thousand bits appear to be 2 ... 2.5 times faster with this patch. I'm not 100% sure the code is right, but it passes test_long.py, anyway.
msg47378 - (view)	Author: Gregory Smith (gregsmith)	Date: 2005-01-03 19:54
Logged In: YES user_id=292741 I originally timed this on a cygwin system, I've since found that cygwin timings tend to be strange and possibly misleading. On a RH8 system, I'm seeing speedup of x3.5 with longs of ~1500 bits and larger, and x1.5 speedup with only about 300 bits. Times were measured with timeit.Timer( 'a\|b', 'a=...; b=...') Increase in .text size is likewise about 120 bytes.
msg47379 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2005-01-07 06:54
Logged In: YES user_id=80475 Patch Review ------------ On Windows using MSC 6.0, I could only reproduce about a small speedup at around 300 bits. While the patch is short, it adds quite a bit of complexity to the routine. Its correctness is not self-evident or certain. Even if correct, it is likely to encumber future maintenance. Unless you have important use cases and feel strongly about it, I think this one should probably not go in. An alternative to submit a patch that limits its scope to factoring out the innermost switch/case. I tried that and found that the speedup is microscopic. I suspect that that one unpredictable branch is not much of a bottleneck. More time is likely spent on creating z.
msg47380 - (view)	Author: Gregory Smith (gregsmith)	Date: 2005-02-11 03:45
Logged In: YES user_id=292741 I started by just factoring out the inner switch loop. But then it becomes evident that when op = '^', you always have maska == maskb, so there's no point in doing the ^mask at all. And when op == '\|', then maska==maskb==0. So likewise. And if you put a check in so that len(a) >= len(b), then the calculation of len_z can be simplified. It also becomes easy to break the end off the loops, so that, say, or'ing a small number with a really long becomes mostly a copy. etc. It's was just a series of small simple changes following from the refactoring of the loop/switch. I see a repeatable 1.5 x speedup at 300 bits, which I think is significant (I wasn't using negative #s, which of course have their own extra overhead). The difference should be even higher on CPUs that don't have several 100 mW of branch-prediction circuitry. One use case is that you can simulate an array of hundreds or thousands of simple 1-bit processors in pure python using long operations, and get very good performance, even better with this fix. This app involves all logical ops, with the occasional shift. IMHO, I don't think the changed code is more complex; it's a little longer, but it's more explicit in what is really being done, and it doesn't roll together 3 cases, which don't really have that much in common, for the sake of brevity. It wasn't obvious to me about the masks being redundant until after I did the factoring, and this is my point - rolling it together hides that. The original author may not have noticed the redundancy. I see a lot of effort being expended on very complex multiply operations, why should the logical ops be left behind for the sake of a few lines?
msg47381 - (view)	Author: Tim Peters (tim.peters) *	Date: 2006-05-25 19:54
Logged In: YES user_id=31435 Assigned to me, changed Category to Performance.
msg47382 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-12-11 04:36
Tim, what is your view on this patch?
msg59323 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2008-01-05 19:56
It's probably useful for Python 3.0 since the old int type is gone.
msg94400 - (view)	Author: Gregory Smith (gregsmith)	Date: 2009-10-24 01:56
Actually, my view for 3.x is this: I do agree hugely with the 'top level' decision to only have one type that handles everything, and obviously the easiest way to get there is to just use the existing long type. However, though the rad-2^15 implementation of the 'long' type was a fine thing for the 2.x python where it's only used for relatively unusual cases, I think it's too inefficient to use as the primary integer type, so I'm wondering if there's a plan to replace it with something more efficient (thus rendering this particular issue irrelevant). I did spend some time delving into python internals years ago, but haven't really looked into 3.x, so bear with me. I am aware of the following 'speed tweaks' in Python 2.x integers, aren't these lost now? (1) integers all same size, allocated from linked-list pool instead of from malloc (2) 'add' and 'sub' byte codes do special check in interpreter core to see if ints being added, and will go direct to PyInt_FromLong where possible It would seem to me that a more suitable implementation would be to store numbers as 32*k-bit 2's complement integers; I've used this in C++ libraries. Compared to the rad-15 sign/magnitude approach, it may seem trickier to do carries but in practice it's not that big a deal. It also basically assumes you have a decently fast 64-bit operation to do multiplies and divides with, which I think is now reasonable. This implementation will be much easier to optimize for the (now) extremely common case where numbers fit into 32 bits. It could also be possible to store 'wee' integers in a special pool and only use malloc for larger ones. I know there's a lot of code in that module, virtually all of which would be affected by this. But isn't this type now a big performance factor in 3.x?
msg94402 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-10-24 06:50
The type is an important performance factor but most uses of it are for small ints (< 232 or 264), so your approach wouldn't make much of a difference. Besides, there are already some improvements in the py3k branch (for example, longs now use 30-bit "digits" on 64-bit systems). > I am aware of the > following 'speed tweaks' in Python 2.x integers, aren't these lost now? Yes, they are. As a result, calculations on small ints have become a bit slower.
msg94404 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-10-24 09:10
[Greg] > It would seem to me that a more suitable implementation would be to > store numbers as 32k-bit 2's complement integers; I've used this in > C++ libraries. Compared to the rad-15 sign/magnitude approach, it may > seem trickier to do carries but in practice it's not that big a deal. Do you know of any publicly available code that takes this approach, while remaining simple and portable? For 32-bit limbs (on a 32-bit platform, say, with no C99 support and no uint64_t available), how do* you deal with carries? You can write portable C89 that does the correct thing, but you then have to cross your fingers and hope that the compiler can turn it into sensible assembler, and it often can't (I'll post an example of this that I ran into just yesterday in a second). And even if your compiler can optimize this effectively, what about all the other compilers out there? The alternative is to use inline assembler for selected platforms; at that point, you lose simplicity. The sign-magnitude versus two's complement is an orthogonal issue, I think. I'm not convinced of the value of two's complement. Certainly two's complement would be more convenient for the bitwise operations under discussion, and probably also works well for addition and subtraction. But it's less convenient for negation, multiplication, division, power, conversion to and from human-readable strings, etc. It's worth noting that GMP uses sign-magnitude, so I can't believe there could be much performance gain in moving to two's complement. (Actually, I'm not sure I've seen any arbitrary-precision arithmetic package that uses two's complement. Has anyone here?) Here's the example promised earlier: yesterday I wanted to add two 128- bit unsigned integers a and b, each represented as a pair of type uint64_t integers, on a 64-bit machine. In portable C99, the code looks something like: #include <stdint.h> /* sumhigh:sumlow = ahigh:alow + bhigh:blow / void add_128(uint64_t sumhigh, uint64_t sumlow, uint64_t ahigh, uint64_t alow, uint64_t bhigh, uint64_t blow) { alow += blow; ahigh += bhigh + (alow < blow); sumlow = alow; *sumhigh = ahigh; } Ideally, the compiler would manage to optimize this to a simple 'addq, adcq' pair of instructions. But gcc-4.4 -O3, on OS X 10.6/x86_64 gives me: _add_128: LFB0: leaq (%r9,%rcx), %rcx xorl %eax, %eax leaq (%r8,%rdx), %rdx pushq %rbp LCFI0: cmpq %r9, %rcx movq %rcx, (%rsi) setb %al movq %rsp, %rbp LCFI1: addq %rax, %rdx movq %rdx, (%rdi) leave ret (Here it looks like alow and blow are in r9 and rcx, ahigh and bhigh are in r8 and rdx, and rsi and rdi are sumlow and sumhigh.) How do you write the C code in such a way that gcc produces the right result? I eventually gave up and just put the assembler in directly.
msg94405 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-10-24 09:23
Here's the inline assembly version, for comparison: #define SUM2_BIN64(sumhigh, sumlow, ahigh, alow, bhigh, blow) \ __asm__ ("addq\t%5, %1\n\t" \ "adcq\t%3, %0" \ : "=r" (sumhigh), "=&r" (sumlow) \ : "0" ((uint64_t)(ahigh)), "rm" ((uint64_t)(bhigh)), \ "%1" ((uint64_t)(alow)), "rm" ((uint64_t)(blow)) \ : "cc") void add_128_asm(uint64_t sumhigh, uint64_t sumlow, uint64_t ahigh, uint64_t alow, uint64_t bhigh, uint64_t blow) { SUM2_BIN64(ahigh, alow, ahigh, alow, bhigh, blow); sumlow = alow; sumhigh = ahigh; } And the generated output (again gcc-4.4 with -O3): _add_128_asm: LFB1: pushq %rbp LCFI2: # 29 "test.c" 1 addq %r9, %rcx adcq %r8, %rdx # 0 "" 2 movq %rsp, %rbp LCFI3: movq %rcx, (%rsi) movq %rdx, (%rdi) leave ret
msg94416 - (view)	Author: Gregory Smith (gregsmith)	Date: 2009-10-24 14:05
Antoine, "most uses of it are for small ints (< 232 or 264), ", that's my point exactly, the current long type is quite slow for those cases because of sign-magnitude implementation. I don't see a problem with sign-magnitude for old long ints, or for GMP, since in those cases the assumption is that you have a fair percentage of large #s, this is not valid in Py3.0 where most are small. Most operations are add and subtract, and in each such operation you need to look at both signs, and decide if you want to really add or subtract, and if you are subtracting, you then have to do a magnitude test to see which way - all of that before you do any actual computation. That's a lot of overhead for e.g. 'i -= 1'. Especially given that the sign & length information are rolled together into a single value; you need a fair number of tests & conditional branches. With a 2's complement implementation you would test for the most common case where both values are 1 digit (which would include 0 value) and in that case do an add (or subtract) followed by a check for overflow, and that's the whole thing. I'm guessing this is 10% - 25% as much time as in the current implementation - this is one of those cases where the time for tests and setup dominate over the actual computation. Mark, what you've written looks fine to me. It would be a bit faster with an add-with-carry, but there's no conditional branches in the generated code, so it's not bad, it's still faster than it would be if you were extracting carries from the msb of each result and masking them afterwards. Bear in mind that some RISC processors don't have an add-with-carry (lacking condition codes altogether) so the method of comparing the sum to the inputs is the only method available. Given that in Python the operation is in a loop, I think it's unreasonable to to expect a compiler to identify an add-with-carry application, but at the same time it's not necessary. Now that the long int is the only int type, multi-digit ops will be relatively uncommon, and you want to focus on reducing the overhead before and after the operation. (And, for speed freaks who want to use long ints to implement large bits arrays and want fast bit ops - my original motivation for this issue - it would be possible to use SSE2 vector instructions on specific processors. That can also be done with the 15-bit implementation, but it works much better with 32). The 'algorithmic C' package (http://www.mentor.com/products/esl/high_level_synthesis/ac_datatypes) is a C++ template class for arbitrary-length signed/unsigned integers. It's not suitable for use in Python because of license issues and because its integers are controlled by templates and are fixed-size at compile time - but it uses Kx32-bit 2's complement format, and implements adds, shifts, multiplies, logic ops, and limited division ops. It works very well, with extremely low overhead on small values - often the code is exactly the same as if you used int type - that would not be possible with a sign/magnitude approach. As for multiplies and divides - it's certainly possible to proceed through a multiply entirely in 2's complement, so no overhead is needed for abs value, or changing sign at the end. For division, it's possible if the denominator is > 0, and it may be possible if <0 (but probably not worth the hassle). I would be happy to do a design doc for this and write some of the inner loops, but if (a) it's already being done or (b) there's no chance of it being deployed then it would be a waste of time... if there was definite interest in it (and reasonable schedule expectations) I could write a lot more of it.
msg94420 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-10-24 14:25
> I would be happy to do a design doc for this and write some of the inner > loops, but if (a) it's already being done or (b) there's no chance of it > being deployed then it would be a waste of time... > if there was definite interest in it (and reasonable schedule > expectations) I could write a lot more of it. I think the only realistic chance in which something may change is that somebody sits down and does all the work, from beginning to end. I doubt that a design doc and some inner loops alone will make any difference. It's not being done (to my knowledge), for the reason alone that it would be a lot of work. It has a chance to be deployed only if a) it is significantly faster, for small numbers, b) correct, and c) only slightly more complicated than the current code. I notice that such discussion is off-topic for this bug report, which is about your original patch. All we have to decide here is whether to accept it (perhaps with modifications), or to reject it.
msg94435 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-10-24 19:25
> Most operations are add and subtract, and in each such operation you > need to look at both signs, and decide if you want to really add or > subtract, and if you are subtracting, you then have to do a magnitude > test to see which way - all of that before you do any actual > computation. That's a lot of overhead for e.g. 'i -= 1'. Hmm. I agree this isn't ideal, and I now see the attraction of two's complement. Thanks. It would be interesting to see timings from such an approach. Maybe one could just implement the basic operations (+, -, *, /) to get an idea of whether it's worth considering more seriously.
msg94451 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-10-25 14:00
I think Greg's patch looks fine, modulo updating it to apply cleanly to py3k. I couldn't resist tinkering a bit, though: factoring out the complement operations (for a, b and z) into a separate function gives (IMO) cleaner and more direct code that's free of mask operations. Here's the patch.
msg94457 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-10-25 18:29
Mark, if you want to get reviews, it might be useful to upload the patch to Rietveld (as the diff is difficult to read). However, I would trust you to get it right, anyway, so feel free to go ahead and apply it.
msg94461 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-10-25 20:46
Applied in r75697 (trunk) and r75698 (py3k).

History
Date	User	Action	Args
2022-04-11 14:56:08	admin	set	github: 41337
2009-10-25 20:46:06	mark.dickinson	set	status: open -> closed resolution: accepted messages: + msg94461
2009-10-25 18:29:20	loewis	set	messages: + msg94457
2009-10-25 14:00:40	mark.dickinson	set	files: + long_bitwise_ops_metd.patch messages: + msg94451
2009-10-24 19:25:41	mark.dickinson	set	messages: + msg94435
2009-10-24 14:25:49	loewis	set	messages: + msg94420
2009-10-24 14:05:27	gregsmith	set	messages: + msg94416
2009-10-24 09:23:13	mark.dickinson	set	messages: + msg94405
2009-10-24 09:10:15	mark.dickinson	set	messages: + msg94404
2009-10-24 06:50:14	pitrou	set	nosy: + pitrou messages: + msg94402 versions: + Python 3.2, - Python 2.6, Python 3.1
2009-10-24 01:56:09	gregsmith	set	messages: + msg94400
2009-10-23 22:37:39	sopoforic	set	nosy: + sopoforic
2009-04-28 17:26:18	ajaksu2	unlink	issue1492860 dependencies
2009-04-26 01:10:40	ajaksu2	set	nosy: + mark.dickinson versions: + Python 3.1, Python 2.7, - Python 3.0 type: performance stage: test needed
2009-04-26 01:08:53	ajaksu2	link	issue1492860 dependencies
2008-01-05 19:56:59	christian.heimes	set	priority: low -> normal nosy: + christian.heimes messages: + msg59323 versions: + Python 2.6, Python 3.0
2004-12-18 05:22:50	gregsmith	create