
Use 30-bit digits instead of 15-bit digits for Python integers. #48508

Closed
mdickinson opened this issue Nov 4, 2008 · 78 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@mdickinson
Member

BPO 4258
Nosy @tim-one, @loewis, @gpshead, @mdickinson, @pitrou, @vstinner, @tiran
Dependencies
  • bpo-5260: longobject.c: minor fixes, cleanups and optimizations
Files
  • pybench_results.txt: pybench results
  • optimize.patch: Improve Mark's patch
  • long_stat.patch
  • 30bit_longdigit13.patch
  • 30bit_longdigit13+optimizations.patch
  • pidigits.py
  • 30bit_longdigit14.patch
  • pidigits_noprint.py
  • pidigits_bestof.py: pidigits benchmark w/ warmup and 5 runs
  • 30bit_longdigit13+optimizations1.patch: Improve Mark's patch
  • 30bit_longdigit17.patch
  • 30bit_longdigit17+asm.patch
  • 30bit_longdigit20.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/mdickinson'
    closed_at = <Date 2009-03-20.16:08:05.934>
    created_at = <Date 2008-11-04.10:45:10.677>
    labels = ['interpreter-core', 'performance']
    title = 'Use 30-bit digits instead of 15-bit digits for Python integers.'
    updated_at = <Date 2009-03-20.16:08:05.899>
    user = 'https://github.com/mdickinson'

    bugs.python.org fields:

    activity = <Date 2009-03-20.16:08:05.899>
    actor = 'mark.dickinson'
    assignee = 'mark.dickinson'
    closed = True
    closed_date = <Date 2009-03-20.16:08:05.934>
    closer = 'mark.dickinson'
    components = ['Interpreter Core']
    creation = <Date 2008-11-04.10:45:10.677>
    creator = 'mark.dickinson'
    dependencies = ['5260']
    files = ['11950', '11951', '11952', '13108', '13118', '13119', '13122', '13127', '13132', '13137', '13142', '13149', '13164']
    hgrepos = []
    issue_num = 4258
    keywords = ['patch']
    message_count = 78.0
    messages = ['75491', '75492', '75494', '75512', '75513', '75518', '75520', '75522', '75534', '75536', '75539', '75540', '75551', '75553', '75559', '75560', '75561', '75562', '75695', '75718', '75720', '75746', '77769', '77770', '82116', '82250', '82251', '82318', '82319', '82320', '82321', '82325', '82326', '82342', '82344', '82345', '82346', '82347', '82348', '82350', '82362', '82364', '82368', '82374', '82375', '82385', '82386', '82404', '82407', '82424', '82425', '82426', '82428', '82429', '82430', '82431', '82432', '82433', '82434', '82444', '82445', '82448', '82458', '82466', '82468', '82469', '82533', '82538', '82563', '82599', '82669', '82792', '83389', '83467', '83772', '83773', '83774', '83865']
    nosy_count = 11.0
    nosy_names = ['tim.peters', 'loewis', 'collinwinter', 'gregory.p.smith', 'mark.dickinson', 'pitrou', 'pernici', 'vstinner', 'christian.heimes', 'jyasskin', 'schuppenies']
    pr_nums = []
    priority = 'low'
    resolution = 'accepted'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue4258'
    versions = ['Python 2.7']

    @mdickinson
    Member Author

    Here's an experimental patch, against the py3k branch, that makes Python
    represent its long integers internally in base 2**30 instead of base
    2**15, on platforms that have 32-bit and 64-bit integers available.

    On platforms for which autoconf is unable to find both 32-bit and 64-bit
    integers, the representation falls back to the usual one.

    See also bpo-1814 (GMP for longs), and the discussion at

    http://mail.python.org/pipermail/python-dev/2008-November/083315.html

    (note particularly Tim Peters's message at:

    http://mail.python.org/pipermail/python-dev/2008-November/083355.html
    )
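
    To make the representation change concrete, here is a small
    self-contained sketch of the two digit layouts. The names mirror
    longintrepr.h, but the snippet is illustrative rather than a copy of
    the patch:

        #include <stdint.h>

        #ifndef PYLONG_BITS_IN_DIGIT
        #define PYLONG_BITS_IN_DIGIT 30  /* as if configure chose big digits */
        #endif

        #if PYLONG_BITS_IN_DIGIT == 30
        typedef uint32_t digit;          /* one 30-bit digit in a 32-bit word */
        typedef uint64_t twodigits;      /* wide enough for digit * digit */
        #define PyLong_SHIFT 30
        #else
        typedef unsigned short digit;    /* traditional 15-bit digit */
        typedef unsigned long twodigits;
        #define PyLong_SHIFT 15
        #endif

        #define PyLong_BASE ((twodigits)1 << PyLong_SHIFT)
        #define PyLong_MASK ((digit)(PyLong_BASE - 1))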

    @mdickinson mdickinson added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Nov 4, 2008
    @mdickinson
    Member Author

    Note that to avoid "bad marshal data" errors, you'll probably need to do a
    'make distclean' before rebuilding with this patch.

    @mdickinson
    Member Author

    Here's an updated patch, with the following changes from the original:

    • make the size of a digit (both the conceptual size
      in bits and actual size in bytes) available to Python
      via a new structseq sys.int_info. This information
      is useful for the sys.getsizeof tests.

    • fix a missing cast in long_hash

    • better fast path for 1-by-1 multiplication that
      doesn't go via PyLong_FromLongLong (a sketch follows below).
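
    For orientation, here's a minimal standalone sketch of what such a
    single-digit fast path can look like; the typedefs and helper name
    are illustrative, not the patch's actual code:

        #include <stdint.h>

        typedef uint32_t digit;
        typedef uint64_t twodigits;
        #define SHIFT 30
        #define MASK  ((digit)(((twodigits)1 << SHIFT) - 1))

        /* Multiply two single-digit magnitudes. With 30-bit digits the
           product fits in a twodigits (uint64_t), so it splits directly
           into at most two result digits, with no trip through a
           long-long conversion routine. */
        static void mul1x1(digit out[2], digit a, digit b)
        {
            twodigits prod = (twodigits)a * b;
            out[0] = (digit)(prod & MASK);
            out[1] = (digit)(prod >> SHIFT);
        }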

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    Note that to avoid "bad marshal data" errors,
    you'll probably need to do a 'make distclean'
    before rebuilding with this patch.

    I saw that you chose to use base 2^30 for marshal. For better
    portability (being able to load .pyc files generated without your
    patch), you might keep base 2^15. I implemented that in my GMP patch
    (manual conversion from/to base 2^15).

    If we change the marshal format of longs, the magic number should be
    different (we might use a tag like the "full unicode" tag used in the
    Python 3 magic number) and/or the type code (the current type code is
    'l'). The base should be independent of the implementation, just as
    Python does with text: UTF-8 for files and UCS-4 in memory. We could
    use base 2^8 (256) or another power of 2^8 (2^16, 2^32, 2^64?). Base
    256 sounds interesting because any CPU can process 8-bit digits.

    Con: using a different base makes Python slower at loading/writing
    longs from/to .pyc files.
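
    A rough sketch of the implementation-independent export described
    above, streaming 30-bit digits out as base-256 bytes (the function
    name and signature are invented for the example):

        #include <stdint.h>

        /* Serialize n 30-bit digits (little-endian) as little-endian
           base-256 bytes, stripping leading zero bytes. Returns the
           byte count; 'out' needs room for ceil(30*n/8) bytes. The same
           bit-streaming works for 15-bit digits, since both digit sizes
           are powers of two. */
        static int to_base256(uint8_t *out, const uint32_t *d, int n)
        {
            uint64_t acc = 0;   /* bit buffer; never holds > 37 bits */
            int bits = 0, m = 0;
            for (int i = 0; i < n; i++) {
                acc |= (uint64_t)d[i] << bits;
                bits += 30;
                while (bits >= 8) {
                    out[m++] = (uint8_t)(acc & 0xff);
                    acc >>= 8;
                    bits -= 8;
                }
            }
            while (bits > 0) {
                out[m++] = (uint8_t)(acc & 0xff);
                acc >>= 8;
                bits -= 8;
            }
            while (m > 1 && out[m - 1] == 0)
                m--;            /* strip leading zero bytes */
            return m;
        }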

    @gpshead
    Member

    gpshead commented Nov 5, 2008

    oh yay, thanks. it looks like you did approximately what i had started
    testing a while back, but you've gone much further and added autoconf
    magic to try to determine which digit size should be used. good.

    (i haven't reviewed your autoconf stuff yet)

    As for marshalled data and pyc compatibility, yes, that is important
    to consider.

    We should probably base the decision about which digit size to use
    internally on benchmarks, not just on whether the platform can support
    64-bit ints. Many archs support 64-bit numbers as a native C type but
    need multiple instructions for them, or are very slow when doing it.

    (embedded arm, mips or ppc come to mind as obvious things to test that on)

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    As for marshalled data and pyc compatibility, yes that is important to
    consider.

    The problem is also that with the 30-bit digit patch, some Pythons
    will use 15 bits whereas others will use 30 bits. The base in marshal
    should be the same in both cases.

    And why 30 bits and not 31 bits, or 63 bits, or 120 bits? We may use
    other bases in the future. That's why I prefer to use a common base
    like base 256. And GMP has functions (mpz_import) to load data in base
    256, but it's more complicated to load data in base 2^15.

    @mdickinson
    Member Author

    [Victor Stinner]

    I saw that you choosed to use the base 2^30 for marshal. For a better
    portability (be able to use .pyc generated without your patch), you
    may keep the base 2^15. I implemented that in my GMP patch (manual
    conversion from/to base 2^15).

    Agreed. I'll fix this so that the .pyc format is unchanged. Thanks!

    And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

    Mostly laziness: the change from 15 to 30 bits turned out to be extraordinarily easy. Note that the longobject.c part of
    the patch does almost nothing except adding a bunch of casts here and there.

    31 bits would involve rewriting the powering algorithm, which assumes that PyLong_SHIFT is divisible by 5. It would gain
    very little over 30 bits, and if Mario Pernici's optimizations are considered (bpo-3944), multiplication would likely be
    slower with 31-bit digits than with 30-bit digits.
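
    For a feel of why divisibility by 5 matters, here is a toy
    left-to-right powering loop that consumes each 30-bit exponent digit
    in 5-bit windows; the "modulus" is just uint64 wraparound, and this is
    an illustration, not long_pow itself:

        #include <stdint.h>

        #define SHIFT 30   /* bits per digit; must be a multiple of 5 */

        /* Compute base**e (mod 2**64), where e is given as n 30-bit
           digits, least significant first. Each digit is consumed in
           5-bit windows, so SHIFT must split evenly into chunks of 5. */
        static uint64_t pow_windowed(uint64_t base, const uint32_t *e, int n)
        {
            uint64_t table[32], result = 1;
            table[0] = 1;
            for (int w = 1; w < 32; w++)
                table[w] = table[w - 1] * base;           /* base**w */
            for (int i = n - 1; i >= 0; i--) {
                for (int bit = SHIFT - 5; bit >= 0; bit -= 5) {
                    for (int k = 0; k < 5; k++)
                        result *= result;                 /* 5 squarings */
                    result *= table[(e[i] >> bit) & 0x1f]; /* window multiply */
                }
            }
            return result;
        }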

    63 (or 62, or 60) bits is simply too large right now: you'd need access to a hardware 64 x 64 -> 128 bit multiply to make
    this worth it, and I'd guess there are many fewer platforms that make this easy, though I don't really know. I know it's
    possible on gcc/x86_64 by making use of the (undocumented) __uint128_t type. But even where this type is available, base
    2**63 might still be slower than base 2**30. I've done some experiments with multiprecision *decimal* arithmetic in C that
    showed that even on a 64-bit machine, using base 10**9 (i.e. approx 30 bits) was significantly faster than using base
    10**18.

    120 bits? Does GMP even go this far? I guess part of the attraction is that it's a multiple of 8...

    The other obvious choices to consider would be 32 bits (or 16 bits, or 64 bits). This is possible, and may even be worth
    it, but it would be a *much* bigger effort; most of the algorithms would need to be rewritten. One problem is again the
    mismatch between C and assembler: detecting overflow when adding two 32-bit unsigned integers is trivial in x86 assembler,
    but it's not so obvious how to write portable C code that has a decent chance of being compiled optimally on a wide variety
    of compilers.

    I guess my feeling is simply that the 15-bit to 30-bit change seems incredibly easy and cheap: very little code change, and
    hence low risk of accidental breakage. So if there are indeed significant efficiency benefits (still to be determined) then
    it looks like a clear win to me.

    [Gregory Smith]

    (i haven't reviewed your autoconf stuff yet)

    The changes to configure and pyconfig.h are just from rerunning autoconf and autoheader; it's only configure.in that should
    need looking at.

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    > And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

    Mostly laziness (...)

    It was an argument for changing the base used by marshal :-)

    31 bits would involve rewriting the powering algorithm, which assumes that
    PyLong_SHIFT is divisible by 5

    Powering is a simple algorithm. If it were division, it would be much
    harder :-) Python stores the sign of the number in the first digit.
    Because of that, we are limited to 15/30 bits. Storing the sign in the
    size (which size? no idea yet) would allow using a bigger base (31
    bits? 63 bits?).

    One problem is again the mismatch between C and assembler: detecting
    overflow when adding two 32-bit unsigned integers is trivial in x86
    assembler, but it's not so obvious how to write portable C code that has a
    decent chance of being compiled optimally on a wide variety of compilers.

    I wrote an example to detect overflows in C on the mailing list. Copy of my
    email:
    ------------------------------- 8< ----------------------
    About 31, 32, 63 or 64 bits: I guess that you want to avoid integer
    overflow. Intel has an "overflow" flag, updated by all instructions.
    For other CPUs, you can emulate this flag. Example with the int type
    that I used in my GMP patch:

    Add:
      int a, b, sum;
      sum = a + b;
      exact = ((a < 0) ^ (b < 0)) || ((a < 0) == (sum < 0));

    Subtract:
      int a, b, diff;
      diff = a - b;
      exact = ((a < 0) == (b < 0)) || ((a < 0) == (diff < 0));

    Multiply:
      int a, b, product;
      if (a == 0 || b == 0) {
         product = 0;  /* exact */
      } else if (a != INT_MIN || b != -1) {
         product = a * b;
         exact = ((product / b) == a);
      } else {
         /* INT_MIN * -1 = -INT_MIN: overflow */
      }

    Divide:
      int a, b, q;
      if (a != INT_MIN || b != -1) {
         q = a / b;   /* exact */
      } else {
         /* INT_MIN / -1 = -INT_MIN: overflow */
      }

    Checking for overflow may cost more than using a smaller base. Only a
    benchmark can answer ;-)
    ------------------------------- 8< ----------------------

    I guess my feeling is simply that the 15-bit to 30-bit change seems
    incredibly easy and cheap: very little code change, and hence low risk of
    accidental breakage.

    Python has an amazing regression test suite! I used it to fix my GMP
    patch. We can experiment with new bases using this suite.

    Anyway, I love the idea of using 30 bits instead of 15! Most computers
    are now 32- or 64-bit! But it's safe to keep the 15-bit version to
    support older computers or buggy compilers.

    I started working with Git. You might be interested in working
    together using Git: it's much easier to exchange changesets and play
    with branches. I will try to publish my Git tree somewhere.

    @mdickinson
    Member Author

    Following Victor's suggestion, here's an updated patch; same as before,
    except that marshal now uses base 2**15 for reading and writing,
    independently of whether PyLong_SHIFT is 15 or 30.
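
    To illustrate the conversion (each 30-bit digit written as two 15-bit
    marshal digits, with the size adjusted when the top half of the most
    significant digit is zero), here is a hedged standalone sketch; the
    helper name is made up:

        #include <stdint.h>

        #define MSHIFT 15
        #define MMASK  ((1u << MSHIFT) - 1)

        /* Split n 30-bit digits (little-endian, normalized so the top
           digit is nonzero) into 15-bit marshal digits. Returns the
           count written: 2n, or 2n-1 when the top 15 bits of the most
           significant digit are zero. 'out' needs room for 2n entries. */
        static int to_marshal15(uint16_t *out, const uint32_t *d, int n)
        {
            int m = 0;
            for (int i = 0; i < n; i++) {
                out[m++] = (uint16_t)(d[i] & MMASK);
                out[m++] = (uint16_t)(d[i] >> MSHIFT);
            }
            if (m > 0 && out[m - 1] == 0)
                m--;            /* drop the zero top half */
            return m;
        }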

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    Mark: would it be possible to keep the "2 digits" hack in
    PyLong_FromLong, especially with base 2^15? E.g. "#if PyLong_SHIFT ==
    15". Base 2^15 is slow, so don't make it slower :-)

    /* 2 digits */
    if (!(ival >> 2*PyLong_SHIFT)) {
        v = _PyLong_New(2);
        if (v) {
            Py_SIZE(v) = 2*sign;
            v->ob_digit[0] = (digit)ival & PyLong_MASK;
            v->ob_digit[1] = ival >> PyLong_SHIFT;
        }
        return (PyObject*)v;
    }

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    marshal now uses base 2**15 for reading and writing

    Yes, it uses base 2**15, but it's not the correct conversion to base
    2**15. You convert each PyLong digit to base 2**15 but not the whole
    number. As a result, the format is different from the current marshal
    version.

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    PyLong_FromLong() doesn't go into the 1-digit special case for
    negative numbers. You should use:
    	/* Fast path for single-digits ints */
    	if (!(abs_ival>>PyLong_SHIFT)) {
    		v = _PyLong_New(1);
    		if (v) {
    			Py_SIZE(v) = sign;
    			v->ob_digit[0] = abs_ival;
    		}
    		return (PyObject*)v;
    	}

    @mdickinson
    Member Author

    Yes, it uses base 2**15 but it's not the correct conversion to base
    2**15. You convert each PyLong digit to base 2**15 but not the whole
    number.

    I don't understand: yes, each base 2**30 digit is converted to a pair
    of base 2**15 digits, and if necessary (i.e., if the top 15 bits of the
    most significant base 2**30 digit are zero) the size is adjusted. How
    is this not converting the whole number?

    As a result, the format is different from the current marshal version.

    Can you give an example of an integer n such that marshal.dumps(n) gives
    you different results with and without the patch? As far as I can tell,
    I'm getting the same marshal results both with the unpatched version and
    with the patch applied.

    @mdickinson
    Member Author

    Other responses...

    It was an argument for changing the base used by marshal :-)

    Ah. I think I'm with you now. You're saying that ideally, marshal
    shouldn't have to care about how Python stores its longs: it should
    just ask some function in longobject.c to provide an already-converted-
    to-base-256 representation. Is that right?

    I also find the idea of making the marshal representation base 256 quite
    attractive. There are already functions in longobject.c that could be
    used for this: _PyLong_{From,As}ByteArray. And then you wouldn't need
    to touch marshal.c when swapping the GMP version of longobject.c in and
    out.

    Python stores the sign of the number in the first digit. Because
    of that, we are limited to 15/30 bits.

    No: the sign is stored in the size: if v is a PyLongObject then
    ABS(Py_SIZE(v)) gives the number of digits in the absolute value of the
    integer represented by v, and the sign of Py_SIZE(v) gives the sign of
    the integer.
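
    For readers unfamiliar with the layout, a compact illustration of that
    convention (the struct is simplified from longintrepr.h; illustrative
    only):

        #include <stddef.h>
        #include <stdint.h>

        typedef uint32_t digit;

        /* The sign lives in the size field, not in the digits: the
           digits hold the absolute value, least significant first.
           ptrdiff_t stands in for Py_ssize_t here. */
        typedef struct {
            ptrdiff_t ob_size;      /* < 0 for negative ints, 0 for zero */
            digit     ob_digit[1];  /* ABS(ob_size) digits of magnitude */
        } toy_long;

        /* e.g. -42 with 30-bit digits: ob_size == -1, ob_digit[0] == 42 */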

    would it be possible to keep the "2 digits" hack in
    PyLong_FromLong, especially with base 2^15? E.g. "#if PyLong_SHIFT ==
    15". Base 2^15 is slow, so don't make it slower :-)

    Why don't we leave this kind of micro-optimization out until we've got
    some benchmarks? (I'm also tempted to get rid of the long_mul fast
    path for now.)

    PyLong_FromLong() doesn't go into the 1 digit special case for
    negative number.

    Well spotted! Yes, this should be fixed. I have a nasty feeling that I
    was responsible for introducing this bug some time ago...

    @mdickinson
    Member Author

    Here's a pybench comparison, on OS X 10.5/Core 2 Duo/gcc 4.0.1 (32-bit
    non-debug build of the py3k branch). I got this by doing:

    [create clean build of py3k branch]
    dickinsm$ ./python.exe Tools/pybench/pybench.py -f bench_unpatched
    [apply 30bit patch and rebuild]
    dickinsm$ ./python.exe Tools/pybench/pybench.py -c bench_unpatched

    Highlights: SimpleLongArithmetic: around 10% faster.
    SimpleComplexArithmetic: around 16% slower!
    CompareFloatsIntegers: around 20% slower.

    I'll investigate the slowdowns.

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    I'll investigate the slowdowns

    The problem may come from int64_t on 32-bit CPUs: a 32x32 -> 64
    multiply may be emulated on your CPU, and so be slower. I improved
    your patch to make it faster, but I lost all my work because of a
    misuse of Git... As I remember:

    • I fixed PyLong_FromLong() for small negative integers
    • I unrolled the loop in PyLong_FromLong(): the loop is always
      called at least twice (the number has 2 digits or more)
    • I added special code for operations on two numbers of 1 digit
      (each) for long_add(), long_mul(), long_div(), long_bitwise(), etc.
    • and I don't remember the other changes...

    Oh, I have an old patch. I will attach it to this message. About
    speed, it was:

    • unpatched: 20600..20900 pystones
    • your patch: 19900..20100 pystones
    • my changes: 20200..20400 pystones

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    I wrote a patch to gather statistics about PyLong function calls.

    make (use setup.py):

    PyLong_FromLong: 168572 calls, min=( 0, ), avg=(1.4, ), max=( 3, )
    long_bool: 48682 calls, min=( 0, ), avg=(0.2, ), max=( 2, )
    long_add: 39527 calls, min=( 0, 0), avg=(0.9, 1.0), max=( 2, 3)
    long_compare: 39145 calls, min=( 0, 0), avg=(1.2, 1.1), max=( 3, 3)
    PyLong_AsLong: 33689 calls, min=( 0, ), avg=(0.9, ), max=( 45, )
    long_sub: 13091 calls, min=( 0, 0), avg=(0.9, 0.8), max=( 1, 1)
    long_bitwise: 4636 calls, min=( 0, 0), avg=(0.8, 0.6), max=( 2, 2)
    long_hash: 1097 calls, min=( 0, ), avg=(0.9, ), max=( 3, )
    long_mul: 221 calls, min=( 0, 0), avg=(0.8, 1.1), max=( 2, 2)
    long_invert: 204 calls, min=( 0, ), avg=(1.0, ), max=( 1, )
    long_neg: 35 calls, min=( 1, ), avg=(1.0, ), max=( 1, )
    long_format: 3 calls, min=( 0, ), avg=(0.7, ), max=( 1, )
    long_mod: 3 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)
    long_pow: 1 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)

    pystone:

    PyLong_FromLong:1587652 calls, min=( 0, ), avg=(1.0, ), max=( 3, )
    long_add: 902487 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 2, 2)
    long_compare: 651165 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 3, 3)
    PyLong_AsLong: 252476 calls, min=( 0, ), avg=(1.0, ), max=( 2, )
    long_sub: 250032 calls, min=( 1, 0), avg=(1.0, 1.0), max=( 1, 1)
    long_bool: 102655 calls, min=( 0, ), avg=(0.5, ), max=( 1, )
    long_mul: 100015 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2)
    long_div: 50000 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)
    long_hash: 382 calls, min=( 0, ), avg=(1.1, ), max=( 2, )
    long_bitwise: 117 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2)
    long_format: 1 calls, min=( 2, ), avg=(2.0, ), max=( 2, )

    min/avg/max are the integer digit count (minimum, average, maximum).

    What can we learn from these numbers?

    PyLong_FromLong(), long_add() and long_compare() are the 3 most common
    operations on integers.

    Except for PyLong_FromLong(), long_compare() and long_format(), the
    arguments of these functions are mostly in the range [-2^15; 2^15].

    The biggest number has 45 digits: maybe just one call to long_add().
    Apart from that number/call, the biggest numbers have between 2 and 3
    digits.

    long_bool() is never called with numbers bigger than 2 digits.

    long_sub() is never called with numbers bigger than 1 digit!

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    And now the stats for Python patched with 30bit_longdigit3.patch.

    min/avg/max are now in bits, which gives better information. "bigger"
    counts the arguments that don't fit in 1 digit (i.e. are outside
    [-2^30; 2^30]).

    make
    ====

    _FromLong: 169734 calls, min=( 0, ), avg=(11.6, ), max=( 32, )
    --> bigger=31086
    long_bool: 48772 calls, min=( 0, ), avg=( 0.3, ), max=( 24, )
    long_add: 39685 calls, min=( 0, 0), avg=( 6.5, 3.5), max=( 19, 32)
    --> bigger=1
    long_compare: 39445 calls, min=( 0, 0), avg=( 9.3, 8.4), max=( 31, 33)
    --> bigger=10438
    _AsLong: 33726 calls, min=( 0, ), avg=( 4.9, ), max=(1321, )
    --> bigger=10
    long_sub: 13285 calls, min=( 0, 0), avg=( 7.6, 5.6), max=( 13, 13)
    long_bitwise: 4690 calls, min=( 0, 0), avg=( 1.7, 1.9), max=( 16, 16)
    long_hash: 1097 calls, min=( 0, ), avg=( 8.1, ), max=( 33, )
    --> bigger=4
    long_mul: 236 calls, min=( 0, 0), avg=( 1.3, 5.4), max=( 17, 17)
    long_invert: 204 calls, min=( 0, ), avg=( 2.4, ), max=( 3, )
    long_neg: 35 calls, min=( 1, ), avg=( 4.3, ), max=( 7, )
    long_format: 3 calls, min=( 0, ), avg=( 2.0, ), max=( 4, )
    long_mod: 3 calls, min=( 1, 2), avg=( 1.7, 2.0), max=( 2, 2)
    long_pow: 1 calls, min=( 2, 6), avg=( 2.0, 6.0), max=( 2, 6)

    Notes about make:

    • PyLong_FromLong(), long_compare(), PyLong_AsLong() and long_hash()
      get integers outside [-2^30; 2^30], which means that all the other
      functions are only called with arguments of 1 digit!
    • PyLong_FromLong() gets ~30,000 (18%) integers of 32 bits
    • the global average integer size is between 0.3 and 11.6 bits (~6.0 bits?)
    • there are 41,500 (12%) big integers out of ~350,000 integers

    pystone
    =======

    _FromLong: 1504983 calls, min=( 0, ), avg=( 5.1, ), max=( 31, )
    --> bigger=14
    long_add: 902487 calls, min=( 0, 0), avg=( 3.9, 2.4), max=( 17, 17)
    long_compare: 651165 calls, min=( 0, 0), avg=( 1.7, 1.4), max=( 31, 31)
    --> bigger=27
    _AsLong: 252477 calls, min=( 0, ), avg=( 4.6, ), max=( 16, )
    long_sub: 250032 calls, min=( 1, 0), avg=( 4.0, 1.6), max=( 7, 7)
    long_bool: 102655 calls, min=( 0, ), avg=( 0.5, ), max=( 7, )
    long_mul: 100015 calls, min=( 0, 0), avg=( 2.5, 2.0), max=( 4, 16)
    long_truediv: 50000 calls, min=( 4, 2), avg=( 4.0, 2.0), max=( 4, 2)
    long_hash: 382 calls, min=( 0, ), avg=( 8.1, ), max=( 28, )
    long_bitwise: 117 calls, min=( 0, 0), avg=( 6.7, 6.6), max=( 15, 16)
    long_format: 1 calls, min=(16, ), avg=(16.0, ), max=( 16, )

    Notes about pystone:

    • very, very few numbers are bigger than one digit: 41 out of ~4,000,000
    • the global average integer size is between 0.5 and 6.7 bits (~3.0 bits?)
    • the biggest number has only 31 bits (see long_compare)

    Short summary:

    • pystone doesn't use big integers (1 big integer per 100,000 integers)
      => don't use pystone!
    • the average integer size is around 3 to 6 bits, which means that most
      integers can be stored in 8 bits (-255..255)
      => we need to focus on the very small numbers
      => base 2^30 doesn't help for common Python code; it only helps
      programs using really big numbers (128 bits or more?)

    @mdickinson
    Member Author

    Here's a minor update to the patch, that does some extra cleanup:

    • don't include longintrepr.h in Objects/abstract.c or
      Objects/boolobject.c --- it's not needed.

    • fix several places in longobject.c where int should have been size_t
      or Py_ssize_t

    • remove some unnecessary forward declarations in longobject.c.

    • fix PyLong_FromLong for small negative integers

    At some point I'll try to separate the pure bugfixes (missing casts, int
    vs Py_ssize_t, etc.) from the 15-bit to 30-bit conversion.

    @vstinner
    Member

    Using 30bit_longdigit4.patch, I get this
    error: "Objects/longobject.c:700: error: 'SIZE_T_MAX' undeclared
    (first use in this function)". You might use the type Py_ssize_t with
    PY_SSIZE_T_MAX. I used INT_MAX to compile the code.

    @vstinner
    Member

    I like the idea of sys.int_info, but I would prefer a "base" attribute
    rather than "bits_per_digit". A base different from 2^n might be used
    (e.g. a base like 10^n for fast conversion from/to strings).

    @mdickinson
    Member Author

    Here's a version of the 15-bit to 30-bit patch
    that adds in a souped-up version of Mario Pernici's
    faster multiplication.

    I did some testing of 100x100 digit and 1000x1000 digit
    multiplies. On 32-bit x86:
    100 x 100 digits : around 2.5 times faster
    1000 x 1000 digits : around 3 times faster.

    On x86_64, I'm getting more spectacular results:
    100 x 100 digits : around 5 times faster
    1000 x 1000 digits: around 7 times faster!

    The idea of the faster multiplication is quite simple:
    with 30-bit digits, one can fit a sum of 16 30-bit by
    30-bit products in a uint64_t. This means that the
    inner loop for the basecase grade-school multiplication
    contains one fewer addition and no mask and shift.
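
    As a rough illustration (a sketch under assumed names, not the patch's
    code): each 30x30-bit product is below 2**60, and even sixteen of them
    plus an incoming carry still fit in a uint64_t, so a column-oriented
    basecase multiply can skip the per-product mask and shift entirely:

        #include <stdint.h>

        typedef uint32_t digit;
        typedef uint64_t twodigits;
        #define SHIFT 30
        #define MASK  ((digit)(((twodigits)1 << SHIFT) - 1))

        /* Column-oriented schoolbook multiply, assuming min(xn, yn) <= 16
           so that one column's products plus the carry cannot overflow
           the 64-bit accumulator. z must have xn+yn zeroed digits. */
        static void mul_columns(digit *z, const digit *x, int xn,
                                const digit *y, int yn)
        {
            twodigits carry = 0;
            for (int k = 0; k < xn + yn - 1; k++) {
                twodigits acc = carry;
                int lo = k < yn ? 0 : k - yn + 1;
                int hi = k < xn ? k : xn - 1;
                for (int i = lo; i <= hi; i++)
                    acc += (twodigits)x[i] * y[k - i]; /* no mask/shift here */
                z[k] = (digit)(acc & MASK);
                carry = acc >> SHIFT;
            }
            z[xn + yn - 1] = (digit)carry;
        }

    Operands longer than 16 digits would need the accumulator normalized
    every 16 products; a full implementation handles that with periodic
    carry propagation.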

    [Victor, please don't delete the old longdigit4.patch!]

    @pitrou
    Member

    pitrou commented Dec 14, 2008

    Just tested the patch, here are some benchmarks:

    ./python -m timeit -s "a=100000000;b=777777" "a//b"
    -> 2.6: 0.253 usec per loop
    -> 3.1: 0.61 usec per loop
    -> 3.1 + patch: 0.331 usec per loop

    ./python -m timeit -s "a=100000000;b=777777" "a*b"
    -> 2.6: 0.431 usec per loop
    -> 3.1: 0.302 usec per loop
    -> 3.1 + patch: 0.225 usec per loop

    ./python -m timeit -s "a=100000000;b=777777" "a+b"
    -> 2.6: 0.173 usec per loop
    -> 3.1: 0.229 usec per loop
    -> 3.1 + patch: 0.217 usec per loop

    But it seems there are some outliers:

    ./python -m timeit -s "a=100000000**5+1;b=777777**3" "a//b"
    -> 2.6: 1.13 usec per loop
    -> 3.1: 1.12 usec per loop
    -> 3.1 + patch: 1.2 usec per loop

    @vstinner
    Member

    I wrote a small benchmark tool dedicated to integer operations (+ - *
    / etc.): bench_int.py, attached to bpo-4294. See also Message75715 and
    Message75719 for my benchmark results. Short summary: base 2^30 helps
    a lot on 64-bit CPUs (+26%) whereas the speedup is small (4%) on
    32-bit CPUs. But don't trust benchmarks :-p

    @mdickinson
    Member Author

    The most recent patch is out of date and no longer applies cleanly. I'm
    working on an update.

    @mdickinson
    Member Author

    Updated patch against py3k. I'm interested in getting this into the trunk
    as well, but py3k is more important (because *all* integers are long
    integers). It's also a little more complicated to do this for py3k
    (mostly because of all the small integer caching), so backporting to
    2.7 is easier than trying to forward port a patch from 2.7 to 3.1.

    Notes:

    • I've added a configure option --enable-big-digits (there are probably
      better names), enabled by default. So you can use --disable-big-digits
      to get the old 15-bit behaviour.

    • I *think* this patch should work on Windows; confirmation would be
      appreciated.

    • I've removed the fast multiplication code, in the interests of keeping
      the patch simple. If this patch goes in, we can concentrate on speeding
      up multiplication afterwards. For now, note that 30-bit digits give
      the *potential* for significant speedups in multiplication and division
      (see next item).

    • There's a nasty 'feature' in x_divrem: the multiplication in the
      innermost loop is digit-by-twodigits -> twodigits, when it should be
      digit-by-digit -> twodigits; this probably causes significant slowdown
      with 30-bit digits. This may explain Antoine's outlier.
      Again, if this patch goes in I'll work on fixing x_divrem (see the
      sketch after this list).

    • Re: Victor's comment about a 'base' attribute: I tried this, but
      quickly discovered that we still need the 'bits_per_digit' for tests.
      I think that binaryness is so ingrained that it's not really worth
      worrying about the possibility of the base changing from a power of 2 to a
      power of 10. So in the end I left base out.

    • It did occur to me that NSMALLPOSINTS and NSMALLNEGINTS might usefully
      be exposed in sys.int_info, mostly for the purposes of testing. Thoughts?
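
    As mentioned in the x_divrem note above, the width of the inner-loop
    multiply matters; a tiny illustration (typedefs invented for the
    sketch):

        #include <stdint.h>

        typedef uint32_t digit;
        typedef uint64_t twodigits;

        /* On a 32-bit machine the first form is a single 32x32->64
           multiply; the second forces a full 64-bit multiply, typically
           several instructions or a library call. */
        twodigits mul_digit_digit(digit a, digit b)         { return (twodigits)a * b; }
        twodigits mul_digit_twodigits(digit a, twodigits b) { return a * b; }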

    @mdickinson
    Member Author

    Forgot to mention: you'll need to rerun autoconf and autoheader after
    applying the patch and before doing ./configure

    @mdickinson
    Member Author

    Antoine, were your posted results on a 64-bit or a 32-bit system?

    @gpshead
    Member

    gpshead commented Feb 18, 2009

    On 32-bit x86 (1.4Ghz Efficeon) using gcc 4.3.2-1ubuntu12 I see the
    following perf with pidigits_noprint 2000:

    py3k:
    baseline    longdigit14    longdigit13+optimizations
    3709 ms     3664 ms        4545 ms

    Those were from the best of five runs after a warmup loop while the
    system was idle. --enable-big-digits was passed to configure and
    sys.int_info confirmed it was 30bits on the longdigit versions.

    baseline is entirely unpatched.

    @mdickinson
    Member Author

    Gregory, are you sure you didn't swap the 30-bit and 30-bit+opt
    results? On OS X/Core 2 Duo my timings are the other way around:
    30-bit is significantly slower than unpatched, 30-bit+opt is a little
    faster than unpatched. Here are sample numbers:

    Macintosh-3:py3k-30bit-opt dickinsm$ ./python.exe ../pidigits_noprint.py
    Time; 2181.3 ms
    Macintosh-3:py3k-30bit dickinsm$ ./python.exe ../pidigits_noprint.py 2000
    Time; 2987.9 ms
    Macintosh-3:py3k dickinsm$ ./python.exe ../pidigits_noprint.py 2000
    Time; 2216.2 ms

    @mdickinson
    Member Author

    And here are results from 64-bit builds on the same machine as above (OS X
    10.5.6/Core 2 Duo, gcc 4.0.1 from Apple).

    ./python.exe ../pidigits_noprint.py 2000 gives the following timings:

    30-bit digits: Time; 1245.9 ms
    30-bit digits + optimizations: Time; 1184.4 ms
    unpatched py3k: Time; 2479.9 ms

    @gpshead
    Member

    gpshead commented Feb 18, 2009

    hmm yes, ignore my 13+optimize result. apparently that used 15-bit
    digits despite --enable-big-digits on configure. attempting to fix
    that now and rerun.

    @mdickinson
    Member Author

    apparently that used 15-bit
    digits despite --enable-big-digits on configure

    Maybe configure didn't get updated properly? I'm almost sure I found a
    case where autoconf and autoheader just reused the stuff in the
    autom4te.cache directory even though configure.in had been changed.
    I've been doing: rm -fr autom4te.cache; autoconf; autoheader
    each time.

    @loewis
    Mannequin

    loewis mannequin commented Feb 18, 2009

    On all other follow-ups I agree, so no further comments there.

    http://codereview.appspot.com/14105/diff/1/2
    File Python/marshal.c (right):

    http://codereview.appspot.com/14105/diff/1/2#newcode160
    Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);

    Presumably all of these should be fixed.

    Ok, so I'd waive this for this patch; please do create a separate
    report.

    http://codereview.appspot.com/14105

    @loewis
    Mannequin

    loewis mannequin commented Feb 18, 2009

    Maybe configure didn't get updated properly? I'm almost sure I found a
    case where autoconf and autoheader just reused the stuff in the
    autom4te.cache directory even though configure.in had been changed.
    I've been doing: rm -fr autom4te.cache; autoconf; autoheader
    each time.

    I think the docs say to run autoheader first, then autoconf.

    @pitrou
    Member

    pitrou commented Feb 18, 2009

    Maybe configure didn't get updated properly? I'm almost sure I found a
    case where autoconf and autoheader just reused the stuff in the
    autom4te.cache directory even though configure.in had been changed.
    I've been doing: rm -fr autom4te.cache; autoconf; autoheader
    each time.

    Perhaps by doing "autoreconf" instead?

    @gpshead
    Member

    gpshead commented Feb 18, 2009

    new results after fixing my longdigit13 build to use 30 bits instead of
    15 (the configure script in longdigit13+optimizations didn't work right,
    i had to manually add the #define to pyconfig.h)

    py3k:
    baseline    longdigit14    longdigit13+optimizations
    3709 ms     3664 ms        3377 ms

    that's much saner.

    @gpshead
    Member

    gpshead commented Feb 19, 2009

    attaching an updated pidigits benchmark script that does a warmup run
    before reporting the best result of 5.

    @gpshead
    Member

    gpshead commented Feb 19, 2009

    Here are the results from 32-bit x86 on Core 2 Duo, gcc 4.0.1, using
    pidigits_bestof.py 4000:

    30-bit digits (14): 15719 ms
    30-bit digits + optimizations (13+ops): 12490 ms
    unpatched py3k : 13289 ms

    (again, i had to manually add #define PYLONG_DIGIT_SIZE 30 to pyconfig.h
    for the longdigit13+optimizations patch).

    and pybench runs on the same builds vs unpatched:

    30-bit digits (14): -1.4% (slightly slower)
    30-bit digits + optimizations (13+ops): -0.2% (insignificant)

    My votes:

    Obviously use the optimized version (but fix the configure stuff).

    +0 for enabling it by default on 32bit builds.
    +10 for enabling it by default on 64bit builds.

    @loewis
    Mannequin

    loewis mannequin commented Feb 19, 2009

    Obviously use the optimized version (but fix the configure stuff).

    Before such a version gets committed, I'd like to see it on Rietveld
    again.

    @pernici
    Mannequin

    pernici mannequin commented Feb 19, 2009

    The attached patch uses mul1 in long_mul in the version patched with
    30bit_longdigit13+optimizations.patch.

    Comparison between these two patches on an HP Pavilion (Q8200, 2.33GHz):

    pybench                    patch    new patch
    SimpleIntegerArithmetic    89       85
    (other tests equal)

    pidigits_noprint 2000      998      947

    @mdickinson
    Copy link
    Member Author

    Before such a version gets committed, I'd like to see it on Rietveld
    again.

    Sure. My original plan was to get the structural changes in first, and
    then worry about optimizations.

    But now I think the x_divrem fix has to be considered a prerequisite for
    the 30-bit digits patch. So I'll clean up the x_divrem code, include it
    in the basic patch and upload to Rietveld. The other optimization (to
    multiplication) is more optional, potentially more controversial (it
    adds 60 or so lines of extra code), and should wait until after the
    commit and get a separate issue and review.

    The x_divrem work actually simplifies the code (whilst not altering the
    underlying algorithm), as well as speeding it up. Still, it's changing
    a subtle core algorithm, so we need to be *very* sure that the new
    code's correct before it goes in. In particular, I want to add some
    tests to test_long that exercise the rare corner cases in the algorithm
    (which only become rarer when 30-bit digits are used).

    I also want to understand Gregory's problems with configure before
    committing anything.

    None of this is going to happen before the weekend, I'm afraid.

    @mdickinson
    Member Author

    http://codereview.appspot.com/14105/diff/1/2
    File Python/marshal.c (right):

    http://codereview.appspot.com/14105/diff/1/2#newcode160
    Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);
    On 2009/02/18 21:27:04, Martin v. Löwis wrote:

    Ok, so I'd waive this for this patch; please do create a separate
    report.

    Done.

    http://bugs.python.org/issue5308

    http://codereview.appspot.com/14105

    @mdickinson
    Member Author

    Here's an updated patch that includes the x_divrem fixes and
    optimizations. I've also updated the Rietveld issue with these
    fixes.

    I think this is (modulo any requested changes) the version
    that I'd like to get into py3k. After that we can look
    at the possibility of optimizing the multiplication algorithm;
    the discussion for this should probably go back to bpo-3944.

    Here's a detailed list of the changes to x_divrem.

    1. Most importantly, in the inner loop, we make sure that the
      multiplication is digit x digit -> twodigits; previously it was
      digit x twodigits -> twodigits, which is likely to expand to
      several instructions on a 32-bit machine. I suspect that
      this is the main cause of the slowdown that Victor was
      seeing.

    2. Make all variables unsigned. This eliminates the need for
      Py_ARITHMETIC_RIGHT_SHIFT, and also removes the only essential use
      of stwodigits in the entire long object implementation.

    3. Save an iteration of the outer loop when possible by comparing top
      digits.

    4. Remove double tests in the main inner loop and correction loop.

    5. Quotient digit correction step follows Knuth more closely, and uses
      fewer operations. The extra exit condition (r >= PyLong_BASE) will
      be true more than 50% of the time, and should be cheaper than the
      main test.

    6. In the quotient digit estimation step, remove the unnecessary
      special case when vj == w->ob_digit[w_size-1]. Knuth only needs this
      special case because his base is the same as the word size.

    7. There's no need to fill the eliminated digits of v with zero;
      instead, set Py_SIZE(v) at the end of the outer loop.

    8. More comments; some extra asserts.

    There are many other optimizations possible; I've tried only to include
    those that don't significantly complicate the code. An obvious
    remaining inefficiency is that on every iteration of the outer loop we
    divide by the top word of w; on many machines I suspect that it would
    be more efficient to precompute an inverse and multiply by that instead.

    @pitrou
    Member

    pitrou commented Feb 20, 2009

    Updated benchmarks results with 30bit_longdigit17.patch:

    • Victor's bench_int.py:
      • 32-bit with patch: 1178.3 ms (24% speedup)
      • 64-bit with patch: 990.8 ms (27% speedup)
    • Calculating 2000 digits of pi:
      • 32-bit with patch: 2.16 s (25% speedup)
      • 64-bit with patch: 1.5 s (55% speedup)

    This is very good work.

    @mdickinson
    Member Author

    Adding Tim Peters to the nosy list, mainly to give him an opportunity to
    throw up his hands in horror at my rewrite of his (I'm guessing)
    implementation of division.

    @mdickinson
    Member Author

    It finally occurred to me that what might be killing 32-bit performance
    is the divisions, rather than the multiplications.

    To test this, here's a version of 30bit_longdigit17.patch that
    replaces just *two* of the divisions in Objects/longobject.c with the
    appropriate x86 divl assembler instruction. The result for pidigits is
    an astonishing 10% speedup!

    Results of running python pidigits_bestof.py 2000 on 32-bit OS X
    10.5.6/Core 2 Duo:

    unpatched py3k
    -------------
    Best Time; 2212.6 ms

    30bit_longdigit17.patch
    -----------------------
    Best Time; 2283.9 ms (-3.1% relative to py3k)

    30bit_longdigit17+asm.patch
    ---------------------------
    Best Time; 2085.7 ms (+6.1% relative to py3k)

    The problem is that (e.g., in the main loop of x_divrem) we're doing a
    64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit
    remainder. From the analysis of the algorithm, *we* know that the
    quotient will always fit into 32 bits, so that e.g., on x86, a divl
    instruction is appropriate. But unless the compiler is extraordinarily
    clever it doesn't know this, so it produces an expensive library call to
    a function that probably involves multiple divisions and/or some
    branches, that produces the full 64-bit quotient.

    On 32-bit PowerPC things are even worse, since there there isn't even a
    64-by-32 bit divide instruction; only a 32-bit by 32-bit division.
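
    For reference, the kind of replacement the +asm patch makes looks
    roughly like this (a GCC inline-asm sketch, not the patch verbatim;
    the caller must guarantee hi < d, which the division algorithm does,
    or divl faults):

        #include <stdint.h>

        /* 64-by-32 division with a quotient known to fit in 32 bits: on
           x86 this is a single "divl", dividing edx:eax by d, quotient
           in eax, remainder in edx. Plain C "(uint64_t)x / d" on a
           32-bit target usually becomes a library call computing a full
           64-bit quotient. */
        static inline uint32_t div64by32(uint32_t hi, uint32_t lo,
                                         uint32_t d, uint32_t *rem)
        {
            uint32_t q, r;
            __asm__ ("divl %4"
                     : "=a" (q), "=d" (r)
                     : "0" (lo), "1" (hi), "rm" (d));
            *rem = r;
            return q;
        }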

    So I could still be persuaded that 30-bit digits should only be enabled
    by default on 64-bit machines...

    @mdickinson
    Member Author

    Okay, let's abandon 30-bit digits on 32-bit machines: it's still
    unclear whether there's any real performance gain, and it's trivial to
    re-enable 30-bit digits by default later. I'm also going to abandon the
    optimizations for now; it'll be much easier to work on them once the base patch is in.

    Here's a 'release-candidate' version of the patch: it's exactly the
    same as before, except:

    • I removed all x_divrem changes (though I've left the extra division
      tests in, since they're just as relevant to the old x_divrem).
      I'll open a separate issue for these optimizations.

    • enable 30-bit digits by default only on 64-bit platforms, where
      for the purposes of this patch a 64-bit platform is one with all
      the necessary integer types available (signed and unsigned 32-
      and 64-bit integer types) and SIZEOF_VOID_P >= 8.

    I've also updated the version uploaded to Rietveld: the patchset there
    should exactly match 30bit_longdigit20.patch. See:

    http://codereview.appspot.com/14105

    Martin, thank you for all your help with reviewing the previous patch.
    Is it okay with you to check this version in? It's the same as the
    version that you reviewed, except with some extra tests for correct
    division, and with 30-bit digits disabled by default on 32-bit
    platforms.

    Gregory, if you have time, please could you double check that the
    configure stuff is working okay for you with this patch? "configure --
    enable-big-digits" should produce 30-bit digits; "configure --disable-
    big-digits" should produce 15-bit digits, and a plain "configure" should
    give 15-bit or 30-bit depending on your platform.

    Anyone else have any objections to this going in exactly as it is? I'd
    quite like to resolve this issue one way or the other and move on.

    @mdickinson mdickinson self-assigned this Feb 24, 2009
    @vstinner
    Member

    1 comment and 1 question about 30bit_longdigit20.patch:

    • I love fixed-size types: you use them when PYLONG_BITS_IN_DIGIT ==
      30 (e.g. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15
      (e.g. digit=unsigned short). Even if short is always 16 bits, I
      would prefer fixed-size types in all cases.
    • In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is
      it for the Windows world (which doesn't use the configure script)?

    I prefer patch version 20 because it's much simpler than the one with
    the algorithm optimizations. The patch is already huge, so it's better
    to split it into smaller parts ;-)

    @mdickinson
    Member Author

    [Victor]

    In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is
    it for the Windows world (which doesn't use configure script)?

    Yes, but it's also for Unix: PYLONG_BITS_IN_DIGIT is only set when the --
    enable-big-digits option is supplied to configure. So the code in
    pyport.h always gets used for a plain ./configure && make.
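
    Concretely, the fallback looks something like this (a sketch; the
    exact macro conditions in the patch may differ):

        /* pyport.h-style default: if configure (or the Windows build)
           didn't set PYLONG_BITS_IN_DIGIT, choose 30-bit digits only
           when the required fixed-width types exist and pointers are
           64-bit. */
        #ifndef PYLONG_BITS_IN_DIGIT
        #  if defined(HAVE_UINT64_T) && defined(HAVE_INT64_T) && \
              defined(HAVE_UINT32_T) && defined(HAVE_INT32_T) && \
              SIZEOF_VOID_P >= 8
        #    define PYLONG_BITS_IN_DIGIT 30
        #  else
        #    define PYLONG_BITS_IN_DIGIT 15
        #  endif
        #endif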

    I love fixed-size types: you use them when PYLONG_BITS_IN_DIGIT ==
    30 (e.g. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15
    (e.g. digit=unsigned short). Even if short is always 16 bits, I
    would prefer fixed-size types in all cases.

    I agree with you to some extent, but I'd still prefer to leave the 15-bit
    definitions as they are, partly out of a desire not to make unnecessary
    changes. The 'unsigned short' type used for 15-bit digits is both
    theoretically sufficient and not wasteful in practice.

    Barring further objections, I'm planning to get the
    30bit_longdigit20.patch in later this week.

    @pitrou
    Member

    pitrou commented Mar 11, 2009

    I tried the patch on a 64-bit Linux system and it's ok.

    @mdickinson
    Member Author

    Committed 30bit_longdigit20.patch to py3k in r70460, with some minor
    variations (incorporate Nick Coghlan's improved error messages
    for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries).

    Thanks all for your feedback.

    I might get around to backporting this to 2.7 one day...

    @mdickinson
    Member Author

    That should be r70459, of course.

    @pitrou
    Member

    pitrou commented Mar 18, 2009

    Great!

    Committed 30bit_longdigit20.patch to py3k in r70460, with some minor
    variations (incorporate Nick Coghlan's improved error messages
    for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries).

    @mdickinson
    Member Author

    Backported to trunk in r70479.

    Mario, thanks for the long multiplication tweaks you submitted: could you
    possibly regenerate your Feb 19th patch against the trunk (or py3k,
    whichever you prefer) and attach it to bpo-3944?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022