classification
Title: Use 30-bit digits instead of 15-bit digits for Python integers.
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: accepted
Dependencies: 5260 Superseder:
Assigned To: mark.dickinson Nosy List: christian.heimes, collinwinter, gregory.p.smith, haypo, jyasskin, loewis, mark.dickinson, pernici, pitrou, schuppenies, tim.peters
Priority: low Keywords: patch

Created on 2008-11-04 10:45 by mark.dickinson, last changed 2009-03-20 16:08 by mark.dickinson. This issue is now closed.

Files
File name Uploaded Description
pybench_results.txt mark.dickinson, 2008-11-06 11:29 pybench results
optimize.patch haypo, 2008-11-06 11:30 Improve mark's patch
long_stat.patch haypo, 2008-11-06 12:47
30bit_longdigit13.patch mark.dickinson, 2009-02-16 16:47
30bit_longdigit13+optimizations.patch mark.dickinson, 2009-02-17 12:14
pidigits.py pitrou, 2009-02-17 17:25
30bit_longdigit14.patch mark.dickinson, 2009-02-17 18:17
pidigits_noprint.py haypo, 2009-02-17 23:02
pidigits_bestof.py gregory.p.smith, 2009-02-19 01:03 pidigits benchmark w/warmup and 5 runs
30bit_longdigit13+optimizations1.patch pernici, 2009-02-19 09:27 Improve mark's patch
30bit_longdigit17.patch mark.dickinson, 2009-02-20 16:01
30bit_longdigit17+asm.patch mark.dickinson, 2009-02-22 12:19
30bit_longdigit20.patch mark.dickinson, 2009-02-24 18:18
Messages (78)
msg75491 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-04 10:45
Here's an experimental patch, against the py3k branch, that makes Python 
represent its long integers internally in base 2**30 instead of base 
2**15, on platforms that have 32-bit and 64-bit integers available.

On platforms for which autoconf is unable to find both 32-bit and 64-bit 
integers, the representation falls back to the usual one.


See also issue 1814 (GMP for longs), and the discussion at

http://mail.python.org/pipermail/python-dev/2008-November/083315.html

(note particularly Tim Peters's message at:

http://mail.python.org/pipermail/python-dev/2008-November/083355.html
)
msg75492 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-04 11:04
Note that to avoid "bad marshal data" errors, you'll probably need to do a 
'make distclean' before rebuilding with this patch.
msg75494 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-04 15:28
Here's an updated patch, with the following changes from the original:

- make the size of a digit (both the conceptual size
in bits and actual size in bytes) available to Python
via a new structseq sys.int_info.  This information
is useful for the sys.getsizeof tests.

- fix a missing cast in long_hash

- better fast path for 1-by-1 multiplication that
doesn't go via PyLong_FromLongLong.
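The proposed structseq can be poked at from Python; a short sketch assuming the attribute names the patch introduces (bits_per_digit and sizeof_digit, which is what landed in py3k):

```python
import sys

# sys.int_info reports the internal digit size; with this patch it is
# (30, 4) on platforms using 30-bit digits and (15, 2) otherwise.
info = sys.int_info
print(info.bits_per_digit, info.sizeof_digit)

# e.g. how many internal digits a given int needs:
n = 2**100
ndigits = -(-n.bit_length() // info.bits_per_digit)  # ceiling division
```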
msg75512 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-05 00:29
> Note that to avoid "bad marshal data" errors, 
> you'll probably need to do a 'make distclean' 
> before rebuilding with this patch.

I saw that you chose to use base 2^30 for marshal. For better 
portability (being able to use .pyc files generated without your patch), you 
might keep base 2^15. I implemented that in my GMP patch (manual 
conversion from/to base 2^15).

If we change the marshal format of longs, the magic number should be 
different (we might use a tag like the "full unicode" tag used in the 
Python 3 magic number) and/or the type code (the current type code is 'l'). 
The base should be independent of the implementation, like Python does 
with text: UTF-8 for files and UCS-4 in memory. We could use base 
2^8 (256) or another power of 2^8 (2^16, 2^32, 2^64?). Base 256 
sounds interesting because any CPU is able to process 8-bit digits.

Con: using a different base makes Python slower at loading/writing 
longs from/to .pyc files.
msg75513 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-11-05 00:57
oh yay, thanks.  it looks like you did approximately what i had started
working on testing a while back but have gone much further and added
autoconf magic to try to determine which size should be used.  good.

(i haven't reviewed your autoconf stuff yet)

As for marshalled data and pyc compatibility, yes that is important to
consider.

We should probably base the decision on which digit size to use
internally on benchmarks, not just if the platform can support 64bit
ints.  Many archs support 64bit numbers as a native C type but require
multiple instructions or are very slow when doing it.

(embedded arm, mips or ppc come to mind as obvious things to test that on)
msg75518 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-05 08:55
> As for marshalled data and pyc compatibility, yes that is important to
> consider.

The problem is also that with the 30-bit digit patch, some Pythons will use 15 
bits whereas others will use 30 bits. The base used in marshal should be the 
same in both cases.

And why 30 bits and not 31 bits, or 63 bits, or 120 bits? We may use other 
bases in the future. That's why I prefer a common base like base 256. 
GMP also has functions (mpz_import) to load data in base 256, whereas it's more 
complicated to load data in base 2^15.
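The base-256 round trip Victor describes is easy to demonstrate from Python itself, using the int.to_bytes/from_bytes API (available in later py3k releases, and backed by the same _PyLong_{As,From}ByteArray machinery mentioned below):

```python
n = 123456789 ** 5

# Round-trip through a base-256 (byte) representation, little-endian,
# much like GMP's mpz_import/mpz_export.
nbytes = (n.bit_length() + 7) // 8 or 1   # at least one byte, even for 0
data = n.to_bytes(nbytes, "little")
m = int.from_bytes(data, "little")
```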
msg75520 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-05 10:53
[Victor Stinner]
> I saw that you choosed to use the base 2^30 for marshal. For a better 
> portability (be able to use .pyc generated without your patch), you 
> may keep the base 2^15. I implemented that in my GMP patch (manual 
> conversion from/to base 2^15).

Agreed.  I'll fix this so that the .pyc format is unchanged. Thanks!

> And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

Mostly laziness: the change from 15 to 30 bits turned out to be extraordinarily easy.  Note that the longobject.c part of 
the patch does almost nothing except adding a bunch of casts here and there.

31 bits would involve rewriting the powering algorithm, which assumes that PyLong_SHIFT is divisible by 5.  It would gain 
very little over 30 bits, and if Pernici Mario's optimizations are considered (issue 3944) multiplication would likely be 
slower with 31-bit digits than with 30-bit digits.

63 (or 62, or 60) bits is simply too large right now:  you'd need access to a hardware 64 x 64 -> 128 bit multiply to make 
this worth it, and I'd guess there are many fewer platforms that make this easy, though I don't really know.  I know it's 
possible on gcc/x86_64 by making use of the (undocumented) __uint128_t type.  But even where this type is available, base 
2**63 might still be slower than base 2**30. I've done some experiments with multiprecision *decimal* arithmetic in C that 
showed that even on a 64-bit machine, using base 10**9 (i.e. approx 30 bits) was significantly faster than using base 
10**18.

120 bits?  Does GMP even go this far?  I guess part of the attraction is that it's a multiple of 8...

The other obvious choices to consider would be 32 bits (or 16 bits, or 64 bits).  This is possible, and may even be worth 
it, but it would be a *much* bigger effort; most of the algorithms would need to be rewritten.  One problem is again the 
mismatch between C and assembler:  detecting overflow when adding two 32-bit unsigned integers is trivial in x86 assembler, 
but it's not so obvious how to write portable C code that has a decent chance of being compiled optimally on a wide variety 
of compilers. 

I guess my feeling is simply that the 15-bit to 30-bit change seems incredibly easy and cheap: very little code change, and 
hence low risk of accidental breakage.  So if there are indeed significant efficiency benefits (still to be determined) then 
it looks like a clear win to me.

[Gregory Smith]
> (i haven't reviewed your autoconf stuff yet)

The changes to configure and pyconfig.h are just from rerunning autoconf and autoheader;  it's only configure.in that should 
need looking at.
msg75522 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-05 11:24
> > And why 30 bits and not 31 bits, or 63 bits, or 120 bits?
>
> Mostly laziness (...)

It was an argument for changing the base used by the marshal :-)

> 31 bits would involve rewriting the powering algorithm, which assumes that
> PyLong_SHIFT is divisible by 5

Powering is a simple algorithm. If it were division, it would be much 
harder :-) Python stores the sign of the number in the first digit. Because 
of that, we are limited to 15/30 bits. Storing the sign in the size (which 
size? no idea yet) would allow using a bigger base (31 bits? 63 bits?).

> One problem is again the mismatch between C and assembler:  detecting
> overflow when adding two 32-bit unsigned integers is trivial in x86
> assembler, but it's not so obvious how to write portable C code that has a
> decent chance of being compiled optimally on a wide variety of compilers.

I wrote an example to detect overflows in C on the mailing list. Copy of my 
email:
------------------------------- 8< ----------------------
About 31, 32, 63 or 64 bits: I guess that you want to avoid integer overflow. 
Intel has an "overflow" flag, updated by all instructions. For other CPUs, 
you can emulate this flag. Example with the int type that I used in my 
GMP patch:

Add:
  int a, b, sum;
  sum = a + b;
  exact = ((a < 0) ^ (b < 0)) || ((a < 0) == (sum < 0));

Subtract:
  int a, b, diff;
  diff = a - b;
  exact = ((a < 0) == (b < 0)) || ((a < 0) == (diff < 0));

Multiply:
  int a, b, product;
  if (a == 0 || b == 0) {
     product = 0;  /* exact */
  } else if (a != INT_MIN || b != -1) {
     product = a * b;
     exact = ((product / b) == a);
  } else {
     /* INT_MIN * -1 = -INT_MIN: overflow */
  }

Divide:
  int a, b, q;
  if (a != INT_MIN || b != -1) {
     q = a / b;   /* exact */
  } else {
     /* INT_MIN / -1 = -INT_MIN: overflow */
  }

Checking for overflow may cost more than using a smaller base. Only a benchmark 
can answer ;-)
------------------------------- 8< ----------------------
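A side note on the snippets above: signed overflow is undefined behaviour in C, so the portable idiom is to do the test in unsigned arithmetic, where wraparound is well defined (the sum wrapped iff the result is smaller than an operand). A sketch of that idiom, simulated in Python with a hypothetical helper:

```python
MASK32 = 2**32 - 1

def add_overflows_u32(a, b):
    # Simulate C's well-defined unsigned 32-bit wraparound: the addition
    # overflowed exactly when the wrapped sum is smaller than an operand.
    s = (a + b) & MASK32
    return s, s < a
```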

> I guess my feeling is simply that the 15-bit to 30-bit change seems
> incredibly easy and cheap: very little code change, and hence low risk of
> accidental breakage.

Python has an amazing regression test suite! I used it to fix my GMP patch. We 
can experiment with new bases using this suite.

Anyway, I love the idea of using 30 bits instead of 15! Most computers are now 
32 or 64 bits! But it's safe to keep the 15-bit version to support older 
computers or buggy compilers.

I started to work with Git. You may be interested in working together via 
Git: it's much easier to exchange changesets and play with branches. I will 
try to publish my Git tree somewhere.
msg75534 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-05 23:04
Following Victor's suggestion, here's an updated patch; same as before, 
except that marshal now uses base 2**15 for reading and writing, 
independently of whether PyLong_SHIFT is 15 or 30.
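The conversion this describes can be sketched in a few lines of Python (a hypothetical helper for illustration, not the actual marshal.c code): each base-2**30 digit splits into a pair of base-2**15 digits, with a zero top digit stripped afterwards.

```python
def to_marshal_digits(digits30):
    # Re-express base-2**30 digits (least significant first) as
    # base-2**15 digits, so the marshal format stays unchanged.
    out = []
    for d in digits30:
        out.append(d & 0x7FFF)   # low 15 bits
        out.append(d >> 15)      # high 15 bits
    while len(out) > 1 and out[-1] == 0:
        out.pop()                # normalize: strip a zero top digit
    return out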
msg75536 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-05 23:13
Mark: would it be possible to keep the "2 digits" hack in 
PyLong_FromLong, especially with base 2^15? E.g. "#if PyLong_SHIFT == 
15". Base 2^15 is slow, so don't make it slower :-)

-	/* 2 digits */
-	if (!(ival >> 2*PyLong_SHIFT)) {
-		v = _PyLong_New(2);
-		if (v) {
-			Py_SIZE(v) = 2*sign;
-			v->ob_digit[0] = (digit)ival & PyLong_MASK;
-			v->ob_digit[1] = ival >> PyLong_SHIFT;
-		}
-		return (PyObject*)v;
-	}
msg75539 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-06 01:35
> marshal now uses base 2**15 for reading and writing

Yes, it uses base 2**15 but it's not the correct conversion to base 
2**15. You convert each PyLong digit to base 2**15 but not the whole 
number. As a result, the format is different from the current marshal 
version.
msg75540 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-06 02:32
PyLong_FromLong() doesn't go into the 1-digit special case for 
negative numbers. You should use:
	/* Fast path for single-digit ints */
	if (!(abs_ival>>PyLong_SHIFT)) {
		v = _PyLong_New(1);
		if (v) {
			Py_SIZE(v) = sign;
			v->ob_digit[0] = abs_ival;
		}
		return (PyObject*)v;
	}
msg75551 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-06 09:01
> Yes, it uses base 2**15 but it's not the correct conversion to base 
> 2**15. You convert each PyLong digit to base 2**15 but not the whole 
> number.

I don't understand:  yes, each base 2**30 digit is converted to a pair 
of base 2**15 digits, and if necessary (i.e., if the top 15 bits of the 
most significant base 2**30 digit are zero) the size is adjusted.  How 
is this not converting the whole number?

> As a result, the format is different from the current marshal version.

Can you give an example of an integer n such that marshal.dumps(n) gives 
you different results with and without the patch?  As far as I can tell, 
I'm getting the same marshal results both with the unpatched version and 
with the patch applied.
msg75553 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-06 09:21
Other responses...
> It was an argument for changing the base used by the mashal :-)

Ah.  I think I'm with you now.  You're saying that ideally, marshal 
shouldn't have to care about how Python stores its longs:  it should 
just ask some function in longobject.c to provide an already-converted-
to-base-256 representation.  Is that right?

I also find the idea of making the marshal representation base 256 quite 
attractive.  There are already functions in longobject.c that could be 
used for this: _PyLong_{From,As}ByteArray.  And then you wouldn't need 
to touch marshal.c when swapping the GMP version of longobject.c in and 
out.

> Python stores the sign of the number in the first digit. Because 
> of that, we are limited to 15/30 bits.

No: the sign is stored in the size:  if v is a PyLongObject then 
ABS(Py_SIZE(v)) gives the number of digits in the absolute value of the 
integer represented by v, and the sign of Py_SIZE(v) gives the sign of 
the integer.
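The convention Mark describes can be modelled in a couple of lines (an illustrative sketch, not CPython's actual code): the sign lives in the size field, and the digits hold the absolute value in base 2**30, least significant first.

```python
def long_value(digits, size):
    # Model of the PyLong convention: abs(size) is the digit count and
    # the sign of size is the sign of the integer (illustrative only).
    sign = (size > 0) - (size < 0)
    return sign * sum(d << (30 * i)
                      for i, d in enumerate(digits[:abs(size)]))
```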

> would it be possible to keep the "2 digits" hack in 
> PyLong_FromLong, especially with base 2^15? Eg. "#if PyLong_SHIFT == 
> 15". Base 2^15 is slow, so don't make it slower :-)

Why don't we leave this kind of micro-optimization out until we've got 
some benchmarks.  (I'm also tempted to get rid of the long_mul fast path 
for now.)

> PyLong_FromLong() doesn't go into the 1 digit special case for 
> negative number.

Well spotted!  Yes, this should be fixed.  I have a nasty feeling that I 
was responsible for introducing this bug some time ago...
msg75559 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-06 11:17
Here's a pybench comparison, on OS X 10.5/Core 2 Duo/gcc 4.0.1 (32-bit 
non-debug build of the py3k branch).  I got this by doing:

[create clean build of py3k branch]
dickinsm$ ./python.exe Tools/pybench/pybench.py -f bench_unpatched
[apply 30bit patch and rebuild]
dickinsm$ ./python.exe Tools/pybench/pybench.py -c bench_unpatched

Highlights: SimpleLongArithmetic: around 10% faster.
            SimpleComplexArithmetic: around 16% slower!
            CompareFloatsIntegers: around 20% slower.

I'll investigate the slowdowns.
msg75560 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-06 11:30
> I'll investigate the slowdowns

The problem may come from int64_t on a 32-bit CPU. 32x32 -> 64 may be 
emulated on your CPU and is therefore slower. I improved your patch to make 
it faster, but I lost all my work because of a misuse of Git... As I 
remember:
 - I fixed PyLong_FromLong() for small negative integer
 - I unrolled the loop in PyLong_FromLong(): the loop is at least 
called twice (the number has 2 digits or more)
 - I added special code for operations on two numbers of 1 digit 
(each) for long_add(), long_mul(), long_div(), long_bitwise(), etc.
 - and I don't remember the other changes...

Oh, I have an old patch. I will attach it to this message. About 
speed, it was:
 * unpatched: 20600..20900 pystones
 * your patch: 19900..20100 pystones
 * + my changes: 20200..20400 pystones
msg75561 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-06 12:47
I wrote a patch to compute stat about PyLong function calls.

make (use setup.py):

PyLong_FromLong: 168572 calls, min=( 0,  ), avg=(1.4,    ), max=(  3,    )
long_bool:        48682 calls, min=( 0,  ), avg=(0.2,    ), max=(  2,    )
long_add:         39527 calls, min=( 0, 0), avg=(0.9, 1.0), max=(  2,   3)
long_compare:     39145 calls, min=( 0, 0), avg=(1.2, 1.1), max=(  3,   3)
PyLong_AsLong:    33689 calls, min=( 0,  ), avg=(0.9,    ), max=( 45,    )
long_sub:         13091 calls, min=( 0, 0), avg=(0.9, 0.8), max=(  1,   1)
long_bitwise:      4636 calls, min=( 0, 0), avg=(0.8, 0.6), max=(  2,   2)
long_hash:         1097 calls, min=( 0,  ), avg=(0.9,    ), max=(  3,    )
long_mul:           221 calls, min=( 0, 0), avg=(0.8, 1.1), max=(  2,   2)
long_invert:        204 calls, min=( 0,  ), avg=(1.0,    ), max=(  1,    )
long_neg:            35 calls, min=( 1,  ), avg=(1.0,    ), max=(  1,    )
long_format:          3 calls, min=( 0,  ), avg=(0.7,    ), max=(  1,    )
long_mod:             3 calls, min=( 1, 1), avg=(1.0, 1.0), max=(  1,   1)
long_pow:             1 calls, min=( 1, 1), avg=(1.0, 1.0), max=(  1,   1)

pystone:

PyLong_FromLong:1587652 calls, min=( 0,  ), avg=(1.0,    ), max=(  3,    )
long_add:        902487 calls, min=( 0, 0), avg=(1.0, 1.0), max=(  2,   2)
long_compare:    651165 calls, min=( 0, 0), avg=(1.0, 1.0), max=(  3,   3)
PyLong_AsLong:   252476 calls, min=( 0,  ), avg=(1.0,    ), max=(  2,    )
long_sub:        250032 calls, min=( 1, 0), avg=(1.0, 1.0), max=(  1,   1)
long_bool:       102655 calls, min=( 0,  ), avg=(0.5,    ), max=(  1,    )
long_mul:        100015 calls, min=( 0, 0), avg=(1.0, 1.0), max=(  1,   2)
long_div:         50000 calls, min=( 1, 1), avg=(1.0, 1.0), max=(  1,   1)
long_hash:          382 calls, min=( 0,  ), avg=(1.1,    ), max=(  2,    )
long_bitwise:       117 calls, min=( 0, 0), avg=(1.0, 1.0), max=(  1,   2)
long_format:          1 calls, min=( 2,  ), avg=(2.0,    ), max=(  2,    )

min/avg/max are the integer digit count (minimum, average, maximum).

What can we learn from these numbers?

PyLong_FromLong(), long_add() and long_compare() are the 3 most common 
operations on integers. 

Except for PyLong_FromLong(), long_compare() and long_format(), the arguments 
of these functions are mostly in the range [-2^15; 2^15].

The biggest number has 45 digits: maybe just one call to long_add(). 
Apart from that number/call, the biggest numbers have between 2 and 3 digits. 

long_bool() is never called with numbers bigger than 2 digits.

long_sub() is never called with numbers bigger than 1 digit!
msg75562 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-06 13:24
And now the stat of Python patched with 30bit_longdigit3.patch.

min/avg/max are now the number of bits, which gives better 
information. "bigger" is the number of arguments that are bigger than 1 
digit (i.e. not in the range [-2^30; 2^30]).

make
====

_FromLong:   169734 calls, min=( 0,  ), avg=(11.6,     ), max=(  32,     )
  \--> bigger=31086
long_bool:    48772 calls, min=( 0,  ), avg=( 0.3,     ), max=(  24,     )
long_add:     39685 calls, min=( 0, 0), avg=( 6.5,  3.5), max=(  19,   32)
  \--> bigger=1
long_compare: 39445 calls, min=( 0, 0), avg=( 9.3,  8.4), max=(  31,   33)
  \--> bigger=10438
_AsLong:      33726 calls, min=( 0,  ), avg=( 4.9,     ), max=(1321,     )
  \--> bigger=10
long_sub:     13285 calls, min=( 0, 0), avg=( 7.6,  5.6), max=(  13,   13)
long_bitwise:  4690 calls, min=( 0, 0), avg=( 1.7,  1.9), max=(  16,   16)
long_hash:     1097 calls, min=( 0,  ), avg=( 8.1,     ), max=(  33,     )
  \--> bigger=4
long_mul:       236 calls, min=( 0, 0), avg=( 1.3,  5.4), max=(  17,   17)
long_invert:    204 calls, min=( 0,  ), avg=( 2.4,     ), max=(   3,     )
long_neg:        35 calls, min=( 1,  ), avg=( 4.3,     ), max=(   7,     )
long_format:      3 calls, min=( 0,  ), avg=( 2.0,     ), max=(   4,     )
long_mod:         3 calls, min=( 1, 2), avg=( 1.7,  2.0), max=(   2,    2)
long_pow:         1 calls, min=( 2, 6), avg=( 2.0,  6.0), max=(   2,    6)

Notes about make:
 - PyLong_FromLong(), long_compare(), PyLong_AsLong() and long_hash() 
   gets integers not in [-2^30; 2^30] which means that all other functions
   are only called with arguments of 1 digit!
 - PyLong_FromLong() gets ~30.000 (18%) integers of 32 bits
 - global average integer size is between 0.3 and 11.6 (~6.0 bits?)
 - There are 41.500 (12%) big integers on ~350.000 integers

pystone
=======

_FromLong:   1504983 calls, min=( 0,  ), avg=( 5.1,     ), max=(  31,     )
  \--> bigger=14
long_add:     902487 calls, min=( 0, 0), avg=( 3.9,  2.4), max=(  17,   17)
long_compare: 651165 calls, min=( 0, 0), avg=( 1.7,  1.4), max=(  31,   31)
  \--> bigger=27
_AsLong:      252477 calls, min=( 0,  ), avg=( 4.6,     ), max=(  16,     )
long_sub:     250032 calls, min=( 1, 0), avg=( 4.0,  1.6), max=(   7,    7)
long_bool:    102655 calls, min=( 0,  ), avg=( 0.5,     ), max=(   7,     )
long_mul:     100015 calls, min=( 0, 0), avg=( 2.5,  2.0), max=(   4,   16)
long_truediv:  50000 calls, min=( 4, 2), avg=( 4.0,  2.0), max=(   4,    2)
long_hash:       382 calls, min=( 0,  ), avg=( 8.1,     ), max=(  28,     )
long_bitwise:    117 calls, min=( 0, 0), avg=( 6.7,  6.6), max=(  15,   16)
long_format:       1 calls, min=(16,  ), avg=(16.0,     ), max=(  16,     )

Notes about pystone:
 - very very few numbers are bigger than one digit: 41 / ~4.000.000
 - global average integer size is between 0.5 and 6.7 (~3.0 bits?)
 - the biggest number has only 31 bits (see long_compare)

Short summary:
 - pystone doesn't use big integers (1 big integer per 100.000 integers)
   => don't use pystone!
 - the average integer size is around 3 or 6 bits, which means that most
   integers can be stored in 8 bits (-255..255)
   => we need to focus on the very small numbers
   => base 2^30 doesn't help for common Python code, it only helps programs
      using really big numbers (128 bits or more?)
msg75695 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-10 13:34
Here's a minor update to the patch, that does some extra cleanup:

- don't include longintrepr.h in Objects/abstract.c or 
Objects/boolobject.c --- it's not needed.

- fix several places in longobject.c where int should have been size_t 
or Py_ssize_t

- remove some unnecessary forward declarations in longobject.c.

- fix PyLong_FromLong for small negative integers

At some point I'll try to separate the pure bugfixes (missing casts, int 
vs Py_ssize_t, etc.) from the 15-bit to 30-bit conversion.
msg75718 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-11 01:58
Using 30bit_longdigit4.patch, I get this 
error: "Objects/longobject.c:700: erreur: "SIZE_T_MAX" undeclared 
(first use in this function)". You might use the Py_ssize_t type with 
PY_SSIZE_T_MAX. I used INT_MAX to get the code to compile.
msg75720 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-11 02:24
I like the idea of sys.int_info, but I would prefer a "base" attribute 
to "bits_per_digit". A base different from 2^n might be used (e.g. a 
base like 10^n for fast conversion from/to strings).
msg75746 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-11 16:30
Here's a version of the 15-bit to 30-bit patch
that adds in a souped-up version of Mario Pernici's
faster multiplication.

I did some testing of 100x100 digit and 1000x1000 digit
multiplies.  On 32-bit x86:
  100 x 100 digits  : around 2.5 times faster
 1000 x 1000 digits : around 3 times faster.

On x86_64, I'm getting more spectacular results:
  100 x 100 digits : around 5 times faster
 1000 x 1000 digits: around 7 times faster!

The idea of the faster multiplication is quite simple:
with 30-bit digits, one can fit a sum of 16 30-bit by
30-bit products in a uint64_t.  This means that the
inner loop for the basecase grade-school multiplication
contains one fewer addition and no mask and shift.
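A quick sanity check of the headroom claim (MASK below is the largest 30-bit digit, matching the patch's PyLong_MASK for 30-bit digits):

```python
# Each digit is < 2**30, so each product of two digits is < 2**60, and
# a sum of 16 such products still fits in an unsigned 64-bit value.
MASK = 2**30 - 1                  # largest 30-bit digit
worst = 16 * MASK * MASK          # largest possible 16-term sum
assert worst < 2**64
headroom = 2**64 - worst          # spare room left in the accumulator
```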

[Victor, please don't delete the old longdigit4.patch!]
msg77769 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-14 00:28
Just tested the patch, here are some benchmarks:

./python -m timeit -s "a=100000000;b=777777" "a//b"
 -> 2.6: 0.253 usec per loop
 -> 3.1: 0.61 usec per loop
 -> 3.1 + patch: 0.331 usec per loop

./python -m timeit -s "a=100000000;b=777777" "a*b"
 -> 2.6: 0.431 usec per loop
 -> 3.1: 0.302 usec per loop
 -> 3.1 + patch: 0.225 usec per loop

./python -m timeit -s "a=100000000;b=777777" "a+b"
 -> 2.6: 0.173 usec per loop
 -> 3.1: 0.229 usec per loop
 -> 3.1 + patch: 0.217 usec per loop

But it seems there are some outliers:

./python -m timeit -s "a=100000000**5+1;b=777777**3" "a//b"
 -> 2.6: 1.13 usec per loop
 -> 3.1: 1.12 usec per loop
 -> 3.1 + patch: 1.2 usec per loop
msg77770 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-12-14 01:09
I wrote a small benchmark tool dedicated to integer operations (+ - 
* / etc.): bench_int.py attached to issue4294. See also Message75715 
and Message75719 for my benchmark results. Short summary: the 2^30 base 
helps a lot on 64-bit CPUs (+26%) whereas the speedup is small (4%) on 
32-bit CPUs. But don't trust benchmarks :-p
msg82116 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-14 20:32
The most recent patch is out of date and no longer applies cleanly.  I'm 
working on an update.
msg82250 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-16 16:47
Updated patch against py3k.  I'm interested in getting this into the trunk 
as well, but py3k is more important (because *all* integers are long 
integers).  It's also a little more complicated to do this for py3k 
(mostly because of all the small integer caching), so backporting to
2.7 is easier than trying to forward port a patch from 2.7 to 3.1.

Notes:

- I've added a configure option --enable-big-digits (there are probably 
better names), enabled by default.  So you can use --disable-big-digits
to get the old 15-bit behaviour.

- I *think* this patch should work on Windows; confirmation would be 
appreciated.

- I've removed the fast multiplication code, in the interests of keeping 
the patch simple.  If this patch goes in, we can concentrate on speeding 
up multiplication afterwards.  For now, note that 30-bit digits give
the *potential* for significant speedups in multiplication and division 
(see next item).

- There's a nasty 'feature' in x_divmod: the multiplication in the 
innermost loop is digit-by-twodigits -> twodigits, when it should be 
digit-by-digit -> twodigits;  this probably causes significant slowdown 
with 30-bit digits.  This may explain Antoine's outlier.
Again, if this patch goes in I'll work on fixing x_divmod.

- Re: Victor's comment about a 'base' attribute:  I tried this, but
quickly discovered that we still need the 'bits_per_digit' for tests.
I think that binaryness is so ingrained that it's not really worth
worrying about the possibility of the base changing from a power of 2 to a 
power of 10.  So in the end I left base out.

- It did occur to me that NSMALLPOSINTS and NSMALLNEGINTS might usefully 
be exposed in sys.int_info, mostly for the purposes of testing.  Thoughts?
msg82251 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-16 16:51
Forgot to mention:  you'll need to rerun autoconf and autoheader after 
applying the patch and before doing ./configure
msg82318 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 11:30
Antoine, were your posted results on a 64-bit or a 32-bit system?
msg82319 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 11:34
Mark, I think it was 32-bit at the time.
msg82320 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 11:44
Now with the latest patch, and under a 64-bit system (the same one
actually, but with a 64-bit distro):

* pybench is roughly 2% slower 
* timeit -s "a=100000000;b=777777" "a//b"
- before: 0.563 usec per loop
- after: 0.226 usec per loop
* timeit -s "a=100000000;b=777777" "a*b"
- before: 0.179 usec per loop
- after: 0.131 usec per loop
* timeit -s "a=100000000;b=777777" "a+b"
- before: 0.174 usec per loop
- after: 0.134 usec per loop
msg82321 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 11:49
Actually, I think my previous results were in 64-bit mode already.

By the way, I don't think unconditionally using uint64_t is a good thing
on 32-bit CPUs. uint64_t might be an emulated type, and operations will
then be very slow.
It would be better to switch based on sizeof(long) (for LP64 systems) or
sizeof(void *) (for LLP64 systems).
msg82325 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 12:12
Actually, I still get a speedup on a 32-bit build. :)
msg82326 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 12:14
Here's a version of the patch that includes optimizations to
basecase multiplication, and a streamlined x_divrem for faster division.
 With Victor's benchmark, I'm getting 43% speed increase on 64-bit
Linux/Core 2 Duo.

Note: the base patch is stable and ready for review;  in contrast, the
optimizations are still in a state of flux, so the +optimizations
patch is just there as an example of what might be possible.

About using uint64_t:

the 64-bit type isn't really used very much:  its main role is as the
result type of a 32-bit by 32-bit multiplication.  So it might
not matter too much if it's an emulated type;  what's
important is that the 32-bit by 32-bit multiply with 64-bit results
is done in a single CPU instruction.  I don't know how to test for
this.  Do you know of a mainstream system where this isn't true?

I'll test this tonight on 32-bit PPC and 32-bit Intel, and report back.

I don't care very much about trying to *automatically* do the right
thing for small or embedded systems:  they can use the
--disable-big-digits configure option to turn 30-bit digits off.

Antoine, do you think we should be using 30-bit digits by default *only*
on 64-bit machines?  I guess I could go with that, if it
can be manually overridden by the configure option.
msg82342 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 17:09
As I said, I actually see a speedup as well on a 32-bit build on a
64-bit CPU. So the current patch (30bit_longdigit13.patch) is fine.
msg82344 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 17:22
Some more benchmarks results (with 30bit_longdigit13.patch):

* Victor's bench_int.py:
- 32-bit without patch: 1370.1 ms
- 32-bit with patch:    1197.8 ms (23% speedup)
- 64-bit without patch: 1357.6 ms
- 64-bit with patch:    981.6 ms  (28% speedup)

* calculating 2000 digits of pi (*):
- 32-bit without patch: 2.87 s.
- 32-bit with patch:    2.87 s. (0% speedup: ???)
- 64-bit without patch: 3.35 s. 
- 64-bit with patch:    1.68 s. (50% speedup)


(*) using the following script adapted for py3k:
http://shootout.alioth.debian.org/u64q/benchmark.php?test=pidigits&lang=python&id=1
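The big-integer workload such a benchmark exercises can be reproduced with a minimal unbounded spigot (Gibbons' algorithm) — a sketch for illustration, not the shootout script itself:

```python
def pi_digits(n):
    # Gibbons' unbounded spigot: every step is exact big-integer
    # arithmetic, which makes it a decent stress test for long
    # multiplication and division.
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    digits = []
    while len(digits) < n:
        if 4 * q + r - t < m * t:
            digits.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return digits
```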
msg82345 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 17:25
Here's the py3k version of pidigits.py.
msg82346 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 17:27
Thanks, Antoine.  I've reworked the configure stuff anyway:  the decision
about what size digits to use should take place in pyport.h rather than 
Include/longintrepr.h.  Updated patches will arrive shortly!
msg82347 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 17:36
Updated non-optimized patch.  The only real change is that I've moved 
some of the configuration stuff around (so not worth re-benchmarking 
this one);  I hope that I've now got the division of labour correct:

- configure script simply parses the --enable-big-digits option and sets
  PYLONG_DIGIT_SIZE in pyconfig.h to 15 or 30, or leaves it undefined
  if that option wasn't given to configure

- PC/pyconfig.h doesn't define PYLONG_DIGIT_SIZE

- pyport.h chooses a suitable value for PYLONG_DIGIT_SIZE if it's not
  already defined

- Include/longintrepr.h just follows orders: it expects PYLONG_DIGIT_SIZE
  to be defined already, and complains if PYLONG_DIGIT_SIZE=30 but the
  necessary integer types aren't available.

Thanks for all the benchmarking.

I'd probably better check on python-dev before pushing this in, since 
it's a new feature.  I hope no-one wants a PEP. :-)
msg82348 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-17 17:47
The last patch (30bit_longdigit14.patch) is obviously missing some
stuff, but other than that I think everything's fine and you could commit.
msg82350 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 17:55
Oops.  Here's the correct patch.
msg82362 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-17 20:23
Has any conclusion been reached wrt. overhead of 30-bit multiplication
on 32-bit systems? IIUC, the single-digit multiplication is equivalent
to the C program

unsigned long long m(unsigned long long a, unsigned long b)
{
        return a*b;
}

(i.e. one digit is cast into two digits, and multiplied with the other
one). gcc 4.3.3, on x86, compiles this into

        movl    12(%esp), %eax
        movl    8(%esp), %ecx
        imull   %eax, %ecx
        mull    4(%esp)
        leal    (%ecx,%edx), %edx
        ret

In pseudo-code, this is

        tmp = high_a * b;
        high_res:low_res = low_a * b;
        high_res += tmp

So it does use two multiply instructions (plus an add), since one
argument got cast into 64 bits.

VS2008 compiles it into

	push	eax
	push	ecx
	push	0
	push	edx
	call	__allmul

i.e. it widens both arguments to 64 bits, then calls a library routine.
msg82364 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 20:30
> unsigned long long m(unsigned long long a, unsigned long b)
> {
>        return a*b;
> }

I think that's doing a 32 x 64 -> 64 multiplication;  what's being used is 
more like this:

unsigned long long m(unsigned long a, unsigned long b)
{
    return (unsigned long long)a*b;
}

which gcc -O3 compiles to:

	pushl	%ebp
	movl	%esp, %ebp
	movl	12(%ebp), %eax
	mull	8(%ebp)
	leave
	ret
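The reason this single-instruction multiply matters is the shape of the representation itself: a long is a vector of 30-bit digits, and the hot loops do digit x digit -> twodigits multiplies with carry propagation. A Python sketch of that layout (illustrative helper names, not CPython's C implementation):

```python
SHIFT = 30            # bits per digit, as in the patch
BASE = 1 << SHIFT
MASK = BASE - 1

def to_digits(n):
    """Split a nonnegative int into base-2**30 digits, least significant first."""
    digits = []
    while n:
        digits.append(n & MASK)
        n >>= SHIFT
    return digits or [0]

def from_digits(digits):
    """Reassemble an int from a least-significant-first digit list."""
    n = 0
    for d in reversed(digits):
        n = (n << SHIFT) | d
    return n

def mul1(digits, n):
    """Multiply a digit vector by a single digit with carry propagation.

    Each step is one digit x digit -> twodigits multiply plus an add:
    (2**30 - 1)**2 plus a sub-2**30 carry still fits in 64 bits, which is
    the property the 32x32->64 hardware multiply exploits.
    """
    carry = 0
    result = []
    for d in digits:
        carry += d * n
        result.append(carry & MASK)
        carry >>= SHIFT
    if carry:
        result.append(carry)
    return result
```

With 15-bit digits the same loop works in plain 32-bit arithmetic; the whole point of the patch is that modern compilers make the 30-bit version just as cheap per digit while halving the number of digits.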
msg82368 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 20:52
Patch uploaded to Rietveld (assuming that I did it right):

http://codereview.appspot.com/14105
msg82374 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-17 21:56
It looks as though Visual Studio 2008 does the 'right' thing, too, at 
least in some circumstances.  Here's some assembler output (MSVC Express 
Edition, 32-bit Windows XP / Macbook Pro).

; 3    : unsigned long long mul(unsigned long x, unsigned long y) {

	push	ebp
	mov	ebp, esp

; 4    :     return (unsigned long long)x * y;

	mov	eax, DWORD PTR _x$[ebp]
	mov	ecx, DWORD PTR _y$[ebp]
	mul	ecx

; 5    : }

	pop	ebp
	ret	0
msg82375 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-02-17 22:00
> Patch uploaded to Rietveld (assuming that I did it right):
> http://codereview.appspot.com/14105

Hehe, the configure part of your patch is too huge for Rietveld, which 
displays a "MemoryError" :-) Bug reported at:
http://code.google.com/p/rietveld/issues/detail?id=87
msg82385 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-02-17 23:00
Ok, let's try 30bit_longdigit14.patch:

   patch -p0 < 30bit_longdigit14.patch
   autoconf && autoheader
   ./configure && make

I'm using two computers:

 - marge: Pentium4, 32 bits, 3 GHz (32 bits)
 - lisa: Core Quad (Q9300), 64 bits, 2.5 GHz (64 bits)

Both builds use 32-bit digits (and 64-bit twodigits), the py3k trunk and Linux.

* My bench_int.py:

  - 32-bit without patch: 1670.7 ms
  - 32-bit with patch:    1547.8 ms (+7.4%)
  - 64-bit without patch: 885.2 ms
  - 64-bit with patch:    627.1 ms (+29.2%)

* pidigits 2000 (I removed the calls to print): 
  lowest result on 5 runs:

  - 32-bit without patch: 2991.5 ms
  - 32-bit with patch:    3445.4 ms (-15.2%) SLOWER!
  - 64-bit without patch: 1949.9 ms
  - 64-bit with patch:     973.0 ms (+50.1%)

* pybench.py (minimum total)

  - 32-bit without patch: 9209 ms
  - 32-bit with patch:
  - 64-bit without patch: 4430 ms
  - 64-bit with patch:    4330 ms (=)

pybench details:

Test            32 bits (without,patch) | 64 bits (without,patch)
-----------------------------------------------------------------
   CompareFloatsIntegers:  293ms  325ms |  113ms    96ms
         CompareIntegers:  188ms  176ms |  129ms    98ms
     DictWithIntegerKeys:  117ms  119ms |   73ms    69ms
SimpleIntFloatArithmetic:  192ms  204ms |   84ms    80ms
 SimpleIntegerArithmetic:  188ms  196ms |   84ms    80ms
-----------------------------------------------------------------

On 64 bits, all integer related tests are faster. On 32 bits, some
tests are slower.

To sum up: on 64 bits, your patch is between cool (30%) and awesome 
(50%) :-) On 32 bits, it's not a good idea to use 32-bit digits 
because it's a little bit slower.

=> I would suggest using base 2^30 only if sizeof(long) >= 8 (64-bit 
CPU).

Note: I already got similar results (2^30 is slower on 32-bit CPUs) in 
older tests.
msg82386 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-02-17 23:02
New attachment: pidigits_noprint.py, a hacked version of pidigits.py that 
removes the print and adds the computation time in milliseconds. The 
print was useless because we don't want to benchmark int->str 
conversion, especially when the integer is in [0; 9].
msg82404 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-18 03:32
Thanks very much for the timings, Victor.

Just out of interest, could you try the pidigits script with the 
+optimizations patch on 32-bit?

As mentioned above, there's a significant (for 30-bit digits) problem 
with x_divrem:  the inner loop does a 32 x 64-bit multiply when it 
should be doing a 32 x 32-bit multiply  (the variable q is declared as 
twodigits, but always fits into a digit).  This is fixed in the 
+optimizations patch, and pidigits is heavy on the divisions, so I
msg82407 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-18 05:13
> I would suggest to use 2^30 base only if sizeof(long)>=8 (64 bits
CPU).

That's not the correct test.  Test for an actual 64-bit build target. 
sizeof(long) and sizeof(long long) are not usefully related to that in
any sort of cross platform manner.  On windows, we'd define the flag for
15 vs 30 bit longs in the build configs for the various build targets. 
On every thing else (autoconf), we can use a configure test to check the
same things that platform.architecture() checks to return '32bit' vs
'64bit'.
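The check Gregory describes (the build target's pointer width, not sizeof(long)) is what platform.architecture() effectively reports; from within Python it reduces to:

```python
import struct
import platform

# Pointer size of the *running interpreter build*: 4 bytes on a 32-bit
# build, 8 on a 64-bit build, independent of the OS or CPU underneath
# (e.g. a 32-bit Python on a 64-bit kernel still reports 32).
pointer_bits = struct.calcsize("P") * 8
print(pointer_bits, platform.architecture()[0])
```

The configure-time equivalent is the SIZEOF_VOID_P test that the final patch ends up using.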
msg82424 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-18 17:06
Reviewers: Martin v. Löwis,

http://codereview.appspot.com/14105/diff/1/11
File Doc/library/sys.rst (right):

http://codereview.appspot.com/14105/diff/1/11#newcode418
Line 418: A struct sequence that holds information about Python's
Agreed.  All that's important here is the attribute access.

http://codereview.appspot.com/14105/diff/1/6
File Include/longintrepr.h (right):

http://codereview.appspot.com/14105/diff/1/6#newcode24
Line 24: Furthermore, NSMALLNEGINTS and NSMALLPOSINTS should fit in a
digit. */
On 2009/02/17 22:39:18, Martin v. Löwis wrote:
> Merge the comments into a single one. There is no need to preserve
> the evolution of the code in the comment structure.

Done, along with a general rewrite of this set of comments.

http://codereview.appspot.com/14105/diff/1/9
File Objects/longobject.c (right):

http://codereview.appspot.com/14105/diff/1/9#newcode2872
Line 2872: /* XXX benchmark this! Is it worth keeping? */
On 2009/02/17 22:39:18, Martin v. Löwis wrote:
> Why not PyLong_FromLongLong if available (no special case if not)?

Yes, PyLong_FromLongLong would make sense.  If this is not available,
we still need to make sure that CHECK_SMALL_INT gets called.

http://codereview.appspot.com/14105/diff/1/10
File PC/pyconfig.h (right):

http://codereview.appspot.com/14105/diff/1/10#newcode318
Line 318: #define PY_UINT64_T unsigned __int64
On 2009/02/17 22:39:18, Martin v. Löwis wrote:
> I think this should use PY_LONG_LONG, to support MingW32; likewise,
> __int32 shouldn't be used, as it is MSC specific

Ok.  I'll use PY_LONG_LONG for 64-bit, and try int and long for 32-bit.

http://codereview.appspot.com/14105/diff/1/2
File Python/marshal.c (right):

http://codereview.appspot.com/14105/diff/1/2#newcode160
Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);
On 2009/02/17 22:39:18, Martin v. Löwis wrote:
> This needs to deal with overflow (sizeof(size_t) > sizeof(long))

Hmm.  It looks as though there are many places in this file,
particularly in w_object, that do "w_long((long)n, p)", where
n has type Py_ssize_t.  Presumably all of these should
be fixed.

http://codereview.appspot.com/14105/diff/1/2#newcode540
Line 540: if (n < -INT_MAX || n > INT_MAX)
On 2009/02/17 22:39:18, Martin v. Löwis wrote:
> I think this is obsolete now; longs can have up to ssize_t_max digits.

Agreed. Again, this needs to be fixed throughout marshal.c (many
occurrences in r_object).

http://codereview.appspot.com/14105/diff/1/8
File configure.in (right):

http://codereview.appspot.com/14105/diff/1/8#newcode3132
Line 3132: # determine what size digit to use for Python's longs
On 2009/02/17 22:39:18, Martin v. Löwis wrote:
> I'm skeptical (-0) that we really need to have such a configure
> option.

I think it's potentially useful to be able to do --disable-big-digits
on platforms where the compiler isn't smart enough to translate
a 32-bit by 32-bit multiply into the appropriate CPU instruction,
so that using 30-bit digits might hurt performance.

I've also found it handy during debugging and testing.  But I guess
I'm only +0.5 on the configure option;  if others think that it's just
unnecessary clutter then I'll remove it.

http://codereview.appspot.com/14105/diff/1/14
File pyconfig.h.in (left):

http://codereview.appspot.com/14105/diff/1/14#oldcode9
Line 9: #undef AC_APPLE_UNIVERSAL_BUILD
On 2009/02/17 22:39:18, Martin v. Löwis wrote:
> We should find out why this is gone.

Looks like an autoconf 2.63/autoconf 2.61 difference.
Whoever previously ran autoconf and autoheader used 2.63;
I used 2.61. (Which explains the huge configure diff as well.)

Description:
This patchset makes it possible for Python to use base 2**30 instead
of base 2**15 for its internal representation of arbitrary-precision
integers.

The aim is both to improve performance of integer arithmetic, and to
make possible some additional optimizations (not currently included in
this patchset).

The patchset includes:

- a new configure option --enable-big-digits
- a new structseq sys.int_info giving information about the
    internal representation

See http://bugs.python.org/issue4258 for the related tracker discussion.

Please review this at http://codereview.appspot.com/14105

Affected files:
   M     Doc/library/sys.rst
   M     Include/longintrepr.h
   M     Include/longobject.h
   M     Include/pyport.h
   M     Lib/test/test_long.py
   M     Lib/test/test_sys.py
   M     Objects/longobject.c
   M     PC/pyconfig.h
   M     Python/marshal.c
   M     Python/sysmodule.c
   M     configure
   M     configure.in
   M     pyconfig.h.in
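The sys.int_info structseq introduced by this patchset survives unchanged in modern CPython, so the digit size of any build can be inspected directly:

```python
import sys

# bits_per_digit is 15 or 30 depending on the configure-time choice
# (--enable-big-digits / --disable-big-digits in this patch);
# sizeof_digit is the C sizeof of the digit type (2 or 4 bytes).
info = sys.int_info
print(info.bits_per_digit, info.sizeof_digit)
```

This is the same check several posters below use to confirm which digit size their benchmark build actually got.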
msg82425 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-18 18:10
Le mercredi 18 février 2009 à 17:06 +0000, Mark Dickinson a écrit :
> Looks like an autoconf 2.63/autoconf 2.61 difference.
> Whoever previously ran autoconf and autoheader used 2.63;

Sorry, that was me. autoconf seems unable to maintain reasonably similar
output between two different versions...
msg82426 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-18 19:36
On 32-bit x86 (1.4Ghz Efficeon) using gcc 4.3.2-1ubuntu12 I see the 
following perf with pidigits_noprint 2000:

py3k:
baseline   longdigit14    longdigit13+optimizations
3709 ms    3664 ms        4545 ms

Those were from the best of five runs after a warmup loop while the 
system was idle.  --enable-big-digits was passed to configure and 
sys.int_info confirmed it was 30bits on the longdigit versions.

baseline is entirely unpatched.
msg82428 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-18 20:08
Gregory, are you sure you didn't swap the 30-bit and 30-bit+opt 
results?  On OS X/Core 2 Duo my timings are the other way around:  
30-bit is significantly slower than unpatched, 30-bit+opt is a little 
faster than unpatched.  Here are sample numbers:

Macintosh-3:py3k-30bit-opt dickinsm$ ./python.exe ../pidigits_noprint.py 
Time; 2181.3 ms
Macintosh-3:py3k-30bit dickinsm$ ./python.exe ../pidigits_noprint.py 2000
Time; 2987.9 ms
Macintosh-3:py3k dickinsm$ ./python.exe ../pidigits_noprint.py 2000
Time; 2216.2 ms
msg82429 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-18 20:34
And here are results from 64-bit builds on the same machine as above (OS X 
10.5.6/Core 2 Duo, gcc 4.0.1 from Apple).

./python.exe ../pidigits_noprint.py 2000 gives the following timings:

30-bit digits:                 Time; 1245.9 ms
30-bit digits + optimizations: Time; 1184.4 ms
unpatched py3k:                Time; 2479.9 ms
msg82430 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-18 20:50
hmm yes, ignore my 13+optimize result.  apparently that used 15bit 
digits despite --enable-big-digits on configure.  attempting to fix that 
now and rerun.
msg82431 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-18 21:16
> apparently that used 15bit 
> digits despite --enable-big-digits on configure

Maybe configure didn't get updated properly?  I'm almost sure I found a 
case where autoconf and autoheader just reused the stuff in the 
autom4te.cache directory even though configure.in had been changed.
I've been doing:  rm -fr autom4te.cache; autoconf; autoheader
each time.
msg82432 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-18 21:27
On all other follow-ups I agree, so no further comments there.

http://codereview.appspot.com/14105/diff/1/2
File Python/marshal.c (right):

http://codereview.appspot.com/14105/diff/1/2#newcode160
Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);
> Presumably all of these should be fixed.

Ok, so I'd waive this for this patch; please do create a separate
report.

http://codereview.appspot.com/14105
msg82433 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-18 21:33
> Maybe configure didn't get updated properly?  I'm almost sure I found a 
> case where autoconf and autoheader just reused the stuff in the 
> autom4te.cache directory even though configure.in had been changed.
> I've been doing:  rm -fr autom4te.cache; autoconf; autoheader
> each time.

I think the docs say to run autoheader first, then autoconf.
msg82434 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-18 21:35
> Maybe configure didn't get updated properly?  I'm almost sure I found a 
> case where autoconf and autoheader just reused the stuff in the 
> autom4te.cache directory even though configure.in had been changed.
> I've been doing:  rm -fr autom4te.cache; autoconf; autoheader
> each time.

Perhaps by doing "autoreconf" instead?
msg82444 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-18 23:45
new results after fixing my longdigit13 build to use 30 bits instead of 
15 (the configure script in longdigit13+optimizations didn't work right, 
i had to manually add the #define to pyconfig.h)

py3k:
baseline   longdigit14    longdigit13+optimizations
3709 ms    3664 ms        3377 ms

that's much saner.
msg82445 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-19 01:03
attaching an updated pidigits benchmark script that does a warmup run 
before reporting the best result of 5.
msg82448 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-19 02:40
Here are the results from 32-bit x86 on core2 duo gcc 4.0.1 using 
pidigits_bestof.py 4000:

30-bit digits                     (14):  15719 ms
30-bit digits + optimizations (13+ops):  12490 ms
unpatched py3k                        :  13289 ms

(again, i had to manually add #define PYLONG_DIGIT_SIZE 30 to pyconfig.h 
for the longdigit13+optimizations patch).

and pybench runs on the same builds vs unpatched:

30-bit digits                     (14):  -1.4% (slightly slower)
30-bit digits + optimizations (13+ops):  -0.2% (insignificant)


My votes:

Obviously use the optimized version (but fix the configure stuff).

+0  for enabling it by default on 32bit builds.
+10 for enabling it by default on 64bit builds.
msg82458 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-19 06:33
> Obviously use the optimized version (but fix the configure stuff).

Before such a version gets committed, I'd like to see it on Rietveld
again.
msg82466 - (view) Author: Pernici Mario (pernici) Date: 2009-02-19 09:27
The attached patch uses mul1 in long_mul in the version patched with
30bit_longdigit13+optimizations.patch

Comparison between these two patches on hp pavilion Q8200 2.33GHz

pybench                  patch   new patch
SimpleIntegerArithmetic  89      85
other tests equal

pidigits_noprint 2000   998      947
msg82468 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-19 09:37
> Before such a version gets committed, I'd like to see it on Rietveld
> again.

Sure.  My original plan was to get the structural changes in first, and 
then worry about optimizations.

But now I think the x_divrem fix has to be considered a prerequisite for 
the 30-bit digits patch.  So I'll clean up the x_divrem code, include it 
in the basic patch and upload to Rietveld.  The other optimization (to 
multiplication) is more optional, potentially more controversial (it 
adds 60 or so lines of extra code), and should wait until after the 
commit and get a separate issue and review.

The x_divrem work actually simplifies the code (whilst not altering the 
underlying algorithm), as well as speeding it up.  Still, it's changing 
a subtle core algorithm, so we need to be *very* sure that the new 
code's correct before it goes in.  In particular, I want to add some 
tests to test_long that exercise the rare corner cases in the algorithm 
(which only become rarer when 30-bit digits are used).

I also want to understand Gregory's problems with configure before 
committing anything.

None of this is going to happen before the weekend, I'm afraid.
msg82469 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-19 09:42
http://codereview.appspot.com/14105/diff/1/2
File Python/marshal.c (right):

http://codereview.appspot.com/14105/diff/1/2#newcode160
Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);
On 2009/02/18 21:27:04, Martin v. Löwis wrote:
> Ok, so I'd waive this for this patch; please do create a separate
report.

Done.

http://bugs.python.org/issue5308

http://codereview.appspot.com/14105
msg82533 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-20 16:01
Here's an updated patch that includes the x_divrem fixes and
optimizations.  I've also updated the Rietveld issue with these
fixes.

I think this is (modulo any requested changes) the version
that I'd like to get into py3k.  After that we can look
at the possibility of optimizing the multiplication algorithm;
the discussion for this should probably go back to issue 3944.

Here's a detailed list of the changes to x_divrem.

0. Most importantly, in the inner loop, we make sure that the
   multiplication is digit x digit -> twodigits; previously it was
   digit x twodigits -> twodigits, which is likely to expand to
   several instructions on a 32-bit machine.  I suspect that
   this is the main cause of the slowdown that Victor was
   seeing.

1. Make all variables unsigned.  This eliminates the need for
   Py_ARITHMETIC_RIGHT_SHIFT, and also removes the only essential use
   of stwodigits in the entire long object implementation.

2. Save an iteration of the outer loop when possible by comparing top
   digits.

3. Remove double tests in the main inner loop and correction loop.

4. Quotient digit correction step follows Knuth more closely, and uses
   fewer operations.  The extra exit condition (r >= PyLong_BASE) will
   be true more than 50% of the time, and should be cheaper than the
   main test.

5. In quotient digit estimation step, remove unnecessary special case
   when vj == w->ob_digits[w_size-1].  Knuth only needs this special case
   because his base is the same as the wordsize.

6. There's no need to fill the eliminated digits of v with zero;
   instead, set Py_SIZE(v) at the end of the outer loop.

7. More comments; some extra asserts.

There are many other optimizations possible;  I've tried only to include 
those that don't significantly complicate the code.  An obvious 
remaining inefficiency is that on every iteration of the outer loop we 
divide by the top word of w;  on many machines I suspect that it would 
be more efficient to precompute an inverse and multiply by that instead.
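Step 4 above is the classic quotient-digit estimate from Knuth's Algorithm D (TAOCP 4.3.1): divide the top two dividend digits by the top divisor digit, then correct the estimate using the second divisor digit. A Python sketch with base 2**30 (helper names illustrative, not the patch's C code); given a normalized divisor, the corrected estimate is the true quotient digit or exactly one too large:

```python
import random

SHIFT = 30
BASE = 1 << SHIFT

def estimate_qhat(u2, u1, u0, v1, v0):
    """Knuth Algorithm D quotient-digit estimate with correction loop.

    Requires a normalized divisor (v1 >= BASE // 2).  Returns qhat such
    that qhat == q or qhat == q + 1, where q is the true quotient digit
    of (u2*B**2 + u1*B + u0) // (v1*B + v0).
    """
    qhat, rhat = divmod(u2 * BASE + u1, v1)
    # correction step: the extra exit (rhat >= BASE) fires often and is
    # cheaper than the main test, as noted in the message above
    while qhat >= BASE or qhat * v0 > rhat * BASE + u0:
        qhat -= 1
        rhat += v1
        if rhat >= BASE:
            break
    return qhat

# Randomized check that qhat is never off by more than one.
random.seed(4258)
for _ in range(1000):
    v1 = random.randrange(BASE // 2, BASE)   # normalized top digit
    v0 = random.randrange(BASE)
    v = v1 * BASE + v0
    q_true = random.randrange(BASE)          # quotient must fit in one digit
    u = q_true * v + random.randrange(v)
    u2, rest = divmod(u, BASE * BASE)
    u1, u0 = divmod(rest, BASE)
    assert estimate_qhat(u2, u1, u0, v1, v0) - q_true in (0, 1)
```

The off-by-one case is then repaired by the multiply-subtract "add back" step of the full algorithm; at base 2**30 it occurs only about twice per 2**30 digits.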
msg82538 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-20 16:31
Updated benchmarks results with 30bit_longdigit17.patch:

* Victor's bench_int.py:
- 32-bit with patch:    1178.3 ms (24% speedup)
- 64-bit with patch:    990.8 ms  (27% speedup)

* Calculating 2000 digits of pi:
- 32-bit with patch:    2.16 s. (25% speedup)
- 64-bit with patch:    1.5 s.  (55% speedup)

This is very good work.
msg82563 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-21 09:28
Adding Tim Peters to the nosy list, mainly to give him an opportunity to 
throw up his hands in horror at my rewrite of his (I'm guessing) 
implementation of division.
msg82599 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-22 12:19
It finally occurred to me that what might be killing 32-bit performance 
is the divisions, rather than the multiplications.

To test this, here's a version of 30bit_longdigit17.patch that replaces 
just *two* of the divisions in Objects/longobject.c by the appropriate 
x86 divl assembler instruction.  The result for pidigits is an 
astonishing 10% speedup!

Results of running python pidigits_bestof.py 2000 on 32-bit OS X 
10.5.6/Core 2 Duo:

unpatched py3k
-------------
Best Time; 2212.6 ms

30bit_longdigit17.patch
-----------------------
Best Time; 2283.9 ms (-3.1% relative to py3k)

30bit_longdigit17+asm.patch
---------------------------
Best Time; 2085.7 ms (+6.1% relative to py3k)

The problem is that (e.g., in the main loop of x_divrem) we're doing a 
64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit 
remainder.  From the analysis of the algorithm, *we* know that the 
quotient will always fit into 32 bits, so that e.g., on x86, a divl 
instruction is appropriate.  But unless the compiler is extraordinarily 
clever it doesn't know this, so it produces an expensive library call to 
a function that probably involves multiple divisions and/or some 
branches, that produces the full 64-bit quotient.

On 32-bit PowerPC things are even worse, since there isn't even a 
64-by-32 bit divide instruction;  only a 32-bit by 32-bit division.

So I could still be persuaded that 30-bit digits should only be enabled 
by default on 64-bit machines...
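The expensive operation here is one twodigits-by-digit divide per loop iteration, i.e. the pattern of single-digit division in longobject.c. Its shape can be sketched in Python (illustrative helper, not CPython's inplace_divrem1; in C each `//` below is the 64-by-32-bit divide that divl performs in one instruction when the compiler cooperates):

```python
SHIFT = 30
BASE = 1 << SHIFT

def divrem1(digits, n):
    """Divide a base-2**30 digit vector (most significant digit first)
    by a single digit n; return (quotient_digits, remainder)."""
    assert 0 < n < BASE
    quotient = []
    rem = 0
    for d in digits:
        # rem < n < 2**30, so rem*BASE + d fits in 64 bits and the
        # quotient digit fits in 32 bits -- the fact the compiler
        # can't see, hence the library call it emits instead of divl.
        rem = rem * BASE + d
        quotient.append(rem // n)
        rem -= quotient[-1] * n
    return quotient, rem

# Cross-check against Python's own integer division.
x = 2**200 + 12345
digits = []
t = x
while t:
    digits.append(t & (BASE - 1))
    t >>= SHIFT
digits.reverse()                       # most significant first
q_digits, r = divrem1(digits, 999999937)
q = 0
for d in q_digits:
    q = q * BASE + d
assert (q, r) == divmod(x, 999999937)
```

With 15-bit digits the same per-digit divide is a cheap 32-by-16-bit operation, which is why the division-heavy pidigits benchmark can regress on 32-bit builds.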
msg82669 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-02-24 18:18
Okay, let's abandon 30-bit digits on 32-bit machines:  it's still 
unclear whether there's any real performance gain, and it's trivial to 
re-enable 30-bit digits by default later.  I'm also going to abandon the 
optimizations for now;  it'll be much easier to work on them once the base patch is in.

Here's a 'release-candidate' version of the patch:  it's exactly the 
same as before, except:

  - I removed all x_divrem changes (though I've left the extra division
    tests in, since they're just as relevant to the old x_divrem).
    I'll open a separate issue for these optimizations.

  - enable 30-bit digits by default only on 64-bit platforms, where
    for the purposes of this patch a 64-bit platform is one with all
    the necessary integer types available (signed and unsigned 32-
    and 64-bit integer types) and SIZEOF_VOID_P >= 8.

I've also updated the version uploaded to Rietveld:  the patchset there 
should exactly match 30bit_longdigit20.patch.  See:

  http://codereview.appspot.com/14105


Martin, thank you for all your help with reviewing the previous patch.  
Is it okay with you to check this version in?  It's the same as the 
version that you reviewed, except with some extra tests for correct 
division, and with 30-bit digits disabled by default on 32-bit 
platforms.

Gregory, if you have time, please could you double check that the 
configure stuff is working okay for you with this patch?  "configure 
--enable-big-digits" should produce 30-bit digits; "configure 
--disable-big-digits" should produce 15-bit digits, and a plain 
"configure" should give 15-bit or 30-bit depending on your platform.

Anyone else have any objections to this going in exactly as it is?  I'd 
quite like to resolve this issue one way or the other and move on.
msg82792 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-02-26 23:52
1 comment and 1 question about 30bit_longdigit20.patch:
 - I love fixed size type: you use them when PYLONG_BITS_IN_DIGIT == 
30 (eg. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15 
(eg. digit=unsigned short). Even if short is always 16 bits, I would 
prefer fixed size types in any case.
 - In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is 
it for the Windows world (which doesn't use configure script)?

I prefer the patch version 20 because it's much simplier than the 
other with the algorithm optimisations. The patch is already huge, so 
it's better to split it into smaller parts ;-)
msg83389 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-09 15:01
[Victor]
> In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is 
> it for the Windows world (which doesn't use configure script)?

Yes, but it's also for Unix: PYLONG_BITS_IN_DIGIT is only set when the 
--enable-big-digits option is supplied to configure.  So the code in 
pyport.h always gets used for a plain ./configure && make.

> I love fixed size type: you use them when PYLONG_BITS_IN_DIGIT == 
> 30 (eg. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15 
> (eg. digit=unsigned short). Even if short is always 16 bits, I would 
> prefer fixed size types in any case.

I agree with you to some extent, but I'd still prefer to leave the 15-bit 
definitions as they are, partly out of a desire not to make unnecessary 
changes.  The 'unsigned short' type used for 15-bit digits is both 
theoretically sufficient and not wasteful in practice.

Barring further objections, I'm planning to get the 
30bit_longdigit20.patch in later this week.
msg83467 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-11 13:02
I tried the patch on a 64-bit Linux system and it's ok.
msg83772 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 20:18
Committed 30bit_longdigit20.patch to py3k in r70460, with some minor 
variations (incorporate Nick Coghlan's improved error messages
for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries). 

Thanks all for your feedback.

I might get around to backporting this to 2.7 one day...
msg83773 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 20:19
That should be r70459, of course.
msg83774 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-18 20:19
Great!

> Committed 30bit_longdigit20.patch to py3k in r70460, with some minor 
> variations (incorporate Nick Coghlan's improved error messages
> for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries).
msg83865 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-20 16:08
Backported to trunk in r70479.

Mario, thanks for the long multiplication tweaks you submitted:  could you 
possibly regenerate your Feb 19th patch against the trunk (or py3k, 
whichever you prefer) and attach it to issue 3944?
History
Date User Action Args
2009-03-20 16:08:05  mark.dickinson  set  status: open -> closed; resolution: accepted; messages: + msg83865
2009-03-18 20:19:11  pitrou  set  messages: + msg83774
2009-03-18 20:19:06  mark.dickinson  set  messages: + msg83773
2009-03-18 20:18:18  mark.dickinson  set  priority: high -> low; messages: + msg83772; versions: - Python 3.1
2009-03-11 13:02:48  pitrou  set  messages: + msg83467
2009-03-09 15:01:38  mark.dickinson  set  messages: + msg83389
2009-02-26 23:52:50  haypo  set  messages: + msg82792
2009-02-24 18:18:30  mark.dickinson  set  files: + 30bit_longdigit20.patch; assignee: mark.dickinson; messages: + msg82669
2009-02-22 12:19:28  mark.dickinson  set  files: + 30bit_longdigit17+asm.patch; messages: + msg82599
2009-02-21 09:28:12  mark.dickinson  set  nosy: + tim.peters; messages: + msg82563
2009-02-20 16:31:28  pitrou  set  messages: + msg82538
2009-02-20 16:01:52  mark.dickinson  set  files: + 30bit_longdigit17.patch; messages: + msg82533
2009-02-19 09:42:16  mark.dickinson  set  messages: + msg82469
2009-02-19 09:37:59  mark.dickinson  set  messages: + msg82468
2009-02-19 09:27:35  pernici  set  files: + 30bit_longdigit13+optimizations1.patch; nosy: + pernici; messages: + msg82466
2009-02-19 06:33:29  loewis  set  messages: + msg82458
2009-02-19 02:40:16  gregory.p.smith  set  messages: + msg82448
2009-02-19 01:03:17  gregory.p.smith  set  files: + pidigits_bestof.py; messages: + msg82445
2009-02-18 23:45:31  gregory.p.smith  set  messages: + msg82444
2009-02-18 21:35:47  pitrou  set  messages: + msg82434
2009-02-18 21:33:19  loewis  set  messages: + msg82433
2009-02-18 21:27:06  loewis  set  messages: + msg82432
2009-02-18 21:16:04  mark.dickinson  set  messages: + msg82431
2009-02-18 20:50:44  gregory.p.smith  set  messages: + msg82430
2009-02-18 20:34:05  mark.dickinson  set  messages: + msg82429
2009-02-18 20:08:18  mark.dickinson  set  messages: + msg82428
2009-02-18 19:36:20  gregory.p.smith  set  messages: + msg82426
2009-02-18 18:10:28  pitrou  set  messages: + msg82425
2009-02-18 17:06:41  mark.dickinson  set  messages: + msg82424
2009-02-18 05:13:30  gregory.p.smith  set  messages: + msg82407
2009-02-18 03:32:11  mark.dickinson  set  messages: + msg82404
2009-02-17 23:02:28  haypo  set  files: + pidigits_noprint.py; messages: + msg82386
2009-02-17 23:00:20  haypo  set  messages: + msg82385
2009-02-17 22:00:46  haypo  set  messages: + msg82375
2009-02-17 21:56:36  mark.dickinson  set  messages: + msg82374
2009-02-17 20:52:40  mark.dickinson  set  messages: + msg82368
2009-02-17 20:30:57  mark.dickinson  set  messages: + msg82364
2009-02-17 20:23:47  loewis  set  nosy: + loewis; messages: + msg82362
2009-02-17 20:10:55  schuppenies  set  nosy: + schuppenies
2009-02-17 18:43:23  jyasskin  set  nosy: + collinwinter, jyasskin
2009-02-17 18:17:21  mark.dickinson  set  files: - 30bit_longdigit14.patch
2009-02-17 18:17:13  mark.dickinson  set  files: + 30bit_longdigit14.patch
2009-02-17 17:56:12  mark.dickinson  set  files: - 30bit_longdigit14.patch
2009-02-17 17:55:20  mark.dickinson  set  files: + 30bit_longdigit14.patch; messages: + msg82350
2009-02-17 17:47:03  pitrou  set  messages: + msg82348
2009-02-17 17:36:53  mark.dickinson  set  files: + 30bit_longdigit14.patch; messages: + msg82347
2009-02-17 17:27:48  mark.dickinson  set  messages: + msg82346
2009-02-17 17:25:57  pitrou  set  files: + pidigits.py; messages: + msg82345
2009-02-17 17:22:27  pitrou  set  messages: + msg82344
2009-02-17 17:09:54  pitrou  set  messages: + msg82342
2009-02-17 12:15:36  mark.dickinson  set  files: - 30bit_longdigit6.patch
2009-02-17 12:14:16  mark.dickinson  set  files: + 30bit_longdigit13+optimizations.patch; messages: + msg82326
2009-02-17 12:12:07  pitrou  set  messages: + msg82325
2009-02-17 11:49:49  pitrou  set  messages: + msg82321
2009-02-17 11:44:25  pitrou  set  messages: + msg82320
2009-02-17 11:34:45  pitrou  set  messages: + msg82319
2009-02-17 11:30:20  mark.dickinson  set  messages: + msg82318
2009-02-16 16:51:11  mark.dickinson  set  messages: + msg82251
2009-02-16 16:48:50  mark.dickinson  set  files: - 30bit_longdigit4.patch
2009-02-16 16:47:37  mark.dickinson  set  files: + 30bit_longdigit13.patch; priority: high; versions: + Python 2.7; messages: + msg82250; stage: patch review
2009-02-14 20:32:47  mark.dickinson  set  messages: + msg82116
2009-02-14 19:11:47  gregory.p.smith  set  dependencies: + longobject.c: minor fixes, cleanups and optimizations
2008-12-14 01:09:28  haypo  set  messages: + msg77770
2008-12-14 00:28:50  pitrou  set  nosy: + pitrou; messages: + msg77769
2008-11-11 16:31:00  mark.dickinson  set  files: + 30bit_longdigit6.patch; messages: + msg75746
2008-11-11 02:24:17  haypo  set  messages: + msg75720
2008-11-11 01:58:56  haypo  set  messages: + msg75718
2008-11-11 01:55:15  haypo  set  files: - 30bit_longdigit3.patch
2008-11-10 13:34:21  mark.dickinson  set  files: + 30bit_longdigit4.patch; messages: + msg75695
2008-11-06 13:24:24  haypo  set  messages: + msg75562
2008-11-06 12:47:18  haypo  set  files: + long_stat.patch; messages: + msg75561
2008-11-06 11:30:58hayposetfiles: + optimize.patch
messages: + msg75560
2008-11-06 11:29:05mark.dickinsonsetfiles: + pybench_results.txt
2008-11-06 11:17:32mark.dickinsonsetmessages: + msg75559
2008-11-06 09:21:04mark.dickinsonsetmessages: + msg75553
2008-11-06 09:01:55mark.dickinsonsetmessages: + msg75551
2008-11-06 02:32:19hayposetmessages: + msg75540
2008-11-06 01:35:19hayposetmessages: + msg75539
2008-11-05 23:13:27hayposetmessages: + msg75536
2008-11-05 23:09:22hayposetfiles: - 30bit_longdigit2.patch
2008-11-05 23:04:44mark.dickinsonsetfiles: + 30bit_longdigit3.patch
messages: + msg75534
2008-11-05 11:24:27hayposetmessages: + msg75522
2008-11-05 10:53:51mark.dickinsonsetmessages: + msg75520
2008-11-05 08:55:58hayposetmessages: + msg75518
2008-11-05 01:12:19christian.heimessetpriority: normal -> (no value)
nosy: + christian.heimes
2008-11-05 00:57:50gregory.p.smithsetpriority: normal
nosy: + gregory.p.smith
messages: + msg75513
2008-11-05 00:29:14hayposetfiles: - 30bit_longdigit.patch
2008-11-05 00:29:08hayposetnosy: + haypo
messages: + msg75512
2008-11-04 15:29:02mark.dickinsonsetfiles: + 30bit_longdigit2.patch
messages: + msg75494
2008-11-04 11:04:16mark.dickinsonsetmessages: + msg75492
2008-11-04 10:45:10mark.dickinsoncreate