
Use 30-bit digits instead of 15-bit digits for Python integers. #48508

Closed
mdickinson opened this issue Nov 4, 2008 · 78 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@mdickinson
Member

BPO 4258
Nosy @tim-one, @loewis, @gpshead, @mdickinson, @pitrou, @vstinner, @tiran
Dependencies
  • bpo-5260: longobject.c: minor fixes, cleanups and optimizations
Files
  • pybench_results.txt: pybench results
  • optimize.patch: Improve Mark's patch
  • long_stat.patch
  • 30bit_longdigit13.patch
  • 30bit_longdigit13+optimizations.patch
  • pidigits.py
  • 30bit_longdigit14.patch
  • pidigits_noprint.py
  • pidigits_bestof.py: pidigits benchmark w/ warmup and 5 runs
  • 30bit_longdigit13+optimizations1.patch: Improve Mark's patch
  • 30bit_longdigit17.patch
  • 30bit_longdigit17+asm.patch
  • 30bit_longdigit20.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/mdickinson'
    closed_at = <Date 2009-03-20.16:08:05.934>
    created_at = <Date 2008-11-04.10:45:10.677>
    labels = ['interpreter-core', 'performance']
    title = 'Use 30-bit digits instead of 15-bit digits for Python integers.'
    updated_at = <Date 2009-03-20.16:08:05.899>
    user = 'https://github.com/mdickinson'

    bugs.python.org fields:

    activity = <Date 2009-03-20.16:08:05.899>
    actor = 'mark.dickinson'
    assignee = 'mark.dickinson'
    closed = True
    closed_date = <Date 2009-03-20.16:08:05.934>
    closer = 'mark.dickinson'
    components = ['Interpreter Core']
    creation = <Date 2008-11-04.10:45:10.677>
    creator = 'mark.dickinson'
    dependencies = ['5260']
    files = ['11950', '11951', '11952', '13108', '13118', '13119', '13122', '13127', '13132', '13137', '13142', '13149', '13164']
    hgrepos = []
    issue_num = 4258
    keywords = ['patch']
    message_count = 78.0
    messages = ['75491', '75492', '75494', '75512', '75513', '75518', '75520', '75522', '75534', '75536', '75539', '75540', '75551', '75553', '75559', '75560', '75561', '75562', '75695', '75718', '75720', '75746', '77769', '77770', '82116', '82250', '82251', '82318', '82319', '82320', '82321', '82325', '82326', '82342', '82344', '82345', '82346', '82347', '82348', '82350', '82362', '82364', '82368', '82374', '82375', '82385', '82386', '82404', '82407', '82424', '82425', '82426', '82428', '82429', '82430', '82431', '82432', '82433', '82434', '82444', '82445', '82448', '82458', '82466', '82468', '82469', '82533', '82538', '82563', '82599', '82669', '82792', '83389', '83467', '83772', '83773', '83774', '83865']
    nosy_count = 11.0
    nosy_names = ['tim.peters', 'loewis', 'collinwinter', 'gregory.p.smith', 'mark.dickinson', 'pitrou', 'pernici', 'vstinner', 'christian.heimes', 'jyasskin', 'schuppenies']
    pr_nums = []
    priority = 'low'
    resolution = 'accepted'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue4258'
    versions = ['Python 2.7']

    @mdickinson
    Member Author

    Here's an experimental patch, against the py3k branch, that makes Python
    represent its long integers internally in base 2**30 instead of base
    2**15, on platforms that have 32-bit and 64-bit integers available.

    On platforms for which autoconf is unable to find both 32-bit and 64-bit
    integers, the representation falls back to the usual one.

    See also bpo-1814 (GMP for longs), and the discussion at

    http://mail.python.org/pipermail/python-dev/2008-November/083315.html

    (note particularly Tim Peters's message at:

    http://mail.python.org/pipermail/python-dev/2008-November/083355.html
    )
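
    To make the representation change concrete, here is a small
    self-contained sketch of the two digit layouts. The names mirror
    longintrepr.h, but the snippet is illustrative rather than a copy of
    the patch:

        #include <stdint.h>

        #ifndef PYLONG_BITS_IN_DIGIT
        #define PYLONG_BITS_IN_DIGIT 30  /* as if configure chose big digits */
        #endif

        #if PYLONG_BITS_IN_DIGIT == 30
        typedef uint32_t digit;          /* one 30-bit digit in a 32-bit word */
        typedef uint64_t twodigits;      /* wide enough for digit * digit */
        #define PyLong_SHIFT 30
        #else
        typedef unsigned short digit;    /* traditional 15-bit digit */
        typedef unsigned long twodigits;
        #define PyLong_SHIFT 15
        #endif

        #define PyLong_BASE ((twodigits)1 << PyLong_SHIFT)
        #define PyLong_MASK ((digit)(PyLong_BASE - 1))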

    @mdickinson mdickinson added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Nov 4, 2008
    @mdickinson
    Member Author

    Note that to avoid "bad marshal data" errors, you'll probably need to do a
    'make distclean' before rebuilding with this patch.

    @mdickinson
    Member Author

    Here's an updated patch, with the following changes from the original:

    • make the size of a digit (both the conceptual size
      in bits and actual size in bytes) available to Python
      via a new structseq sys.int_info. This information
      is useful for the sys.getsizeof tests.

    • fix a missing cast in long_hash

    • better fast path for 1-by-1 multiplication that
      doesn't go via PyLong_FromLongLong (a sketch follows below).
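
    For orientation, here's a minimal standalone sketch of what such a
    single-digit fast path can look like; the typedefs and helper name
    are illustrative, not the patch's actual code:

        #include <stdint.h>

        typedef uint32_t digit;
        typedef uint64_t twodigits;
        #define SHIFT 30
        #define MASK  ((digit)(((twodigits)1 << SHIFT) - 1))

        /* Multiply two single-digit magnitudes. With 30-bit digits the
           product fits in a twodigits (uint64_t), so it splits directly
           into at most two result digits, with no trip through a
           long-long conversion routine. */
        static void mul1x1(digit out[2], digit a, digit b)
        {
            twodigits prod = (twodigits)a * b;
            out[0] = (digit)(prod & MASK);
            out[1] = (digit)(prod >> SHIFT);
        }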

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    Note that to avoid "bad marshal data" errors,
    you'll probably need to do a 'make distclean'
    before rebuilding with this patch.

    I saw that you chose to use base 2^30 for marshal. For better
    portability (being able to load .pyc files generated without your
    patch), you might keep base 2^15. I implemented that in my GMP patch
    (manual conversion from/to base 2^15).

    If we change the marshal format of longs, the magic number should be
    different (we might use a tag like the "full unicode" tag used in the
    Python 3 magic number) and/or the type code (the current type code is
    'l'). The base should be independent of the implementation, just as
    Python does with text: UTF-8 for files and UCS-4 in memory. We could
    use base 2^8 (256) or another power of 2^8 (2^16, 2^32, 2^64?). Base
    256 sounds interesting because any CPU can process 8-bit digits.

    Con: using a different base makes Python slower at loading/writing
    longs from/to .pyc files.
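
    A rough sketch of the implementation-independent export described
    above, streaming 30-bit digits out as base-256 bytes (the function
    name and signature are invented for the example):

        #include <stdint.h>

        /* Serialize n 30-bit digits (little-endian) as little-endian
           base-256 bytes, stripping leading zero bytes. Returns the
           byte count; 'out' needs room for ceil(30*n/8) bytes. The same
           bit-streaming works for 15-bit digits, since both digit sizes
           are powers of two. */
        static int to_base256(uint8_t *out, const uint32_t *d, int n)
        {
            uint64_t acc = 0;   /* bit buffer; never holds > 37 bits */
            int bits = 0, m = 0;
            for (int i = 0; i < n; i++) {
                acc |= (uint64_t)d[i] << bits;
                bits += 30;
                while (bits >= 8) {
                    out[m++] = (uint8_t)(acc & 0xff);
                    acc >>= 8;
                    bits -= 8;
                }
            }
            while (bits > 0) {
                out[m++] = (uint8_t)(acc & 0xff);
                acc >>= 8;
                bits -= 8;
            }
            while (m > 1 && out[m - 1] == 0)
                m--;            /* strip leading zero bytes */
            return m;
        }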

    @gpshead
    Member

    gpshead commented Nov 5, 2008

    oh yay, thanks. it looks like you did approximately what i had started
    testing a while back, but you've gone much further and added autoconf
    magic to try to determine which digit size should be used. good.

    (i haven't reviewed your autoconf stuff yet)

    As for marshalled data and pyc compatibility, yes, that is important
    to consider.

    We should probably base the decision about which digit size to use
    internally on benchmarks, not just on whether the platform can support
    64-bit ints. Many archs support 64-bit numbers as a native C type but
    need multiple instructions for them, or are very slow when doing it.

    (embedded arm, mips or ppc come to mind as obvious things to test that on)

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    As for marshalled data and pyc compatibility, yes that is important to
    consider.

    The problem is also that with the 30-bit digit patch, some Pythons
    will use 15 bits whereas others will use 30 bits. The base in marshal
    should be the same in both cases.

    And why 30 bits and not 31 bits, or 63 bits, or 120 bits? We may use
    other bases in the future. That's why I prefer to use a common base
    like base 256. And GMP has functions (mpz_import) to load data in base
    256, but it's more complicated to load data in base 2^15.

    @mdickinson
    Member Author

    [Victor Stinner]

    I saw that you choosed to use the base 2^30 for marshal. For a better
    portability (be able to use .pyc generated without your patch), you
    may keep the base 2^15. I implemented that in my GMP patch (manual
    conversion from/to base 2^15).

    Agreed. I'll fix this so that the .pyc format is unchanged. Thanks!

    And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

    Mostly laziness: the change from 15 to 30 bits turned out to be extraordinarily easy. Note that the longobject.c part of
    the patch does almost nothing except adding a bunch of casts here and there.

    31 bits would involve rewriting the powering algorithm, which assumes that PyLong_SHIFT is divisible by 5. It would gain
    very little over 30 bits, and if Mario Pernici's optimizations are considered (bpo-3944), multiplication would likely be
    slower with 31-bit digits than with 30-bit digits.
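
    For a feel of why divisibility by 5 matters, here is a toy
    left-to-right powering loop that consumes each 30-bit exponent digit
    in 5-bit windows; the "modulus" is just uint64 wraparound, and this is
    an illustration, not long_pow itself:

        #include <stdint.h>

        #define SHIFT 30   /* bits per digit; must be a multiple of 5 */

        /* Compute base**e (mod 2**64), where e is given as n 30-bit
           digits, least significant first. Each digit is consumed in
           5-bit windows, so SHIFT must split evenly into chunks of 5. */
        static uint64_t pow_windowed(uint64_t base, const uint32_t *e, int n)
        {
            uint64_t table[32], result = 1;
            table[0] = 1;
            for (int w = 1; w < 32; w++)
                table[w] = table[w - 1] * base;           /* base**w */
            for (int i = n - 1; i >= 0; i--) {
                for (int bit = SHIFT - 5; bit >= 0; bit -= 5) {
                    for (int k = 0; k < 5; k++)
                        result *= result;                 /* 5 squarings */
                    result *= table[(e[i] >> bit) & 0x1f]; /* window multiply */
                }
            }
            return result;
        }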

    63 (or 62, or 60) bits is simply too large right now: you'd need access to a hardware 64 x 64 -> 128 bit multiply to make
    this worth it, and I'd guess there are many fewer platforms that make this easy, though I don't really know. I know it's
    possible on gcc/x86_64 by making use of the (undocumented) __uint128_t type. But even where this type is available, base
    2**63 might still be slower than base 2**30. I've done some experiments with multiprecision *decimal* arithmetic in C that
    showed that even on a 64-bit machine, using base 10**9 (i.e. approx 30 bits) was significantly faster than using base
    10**18.

    120 bits? Does GMP even go this far? I guess part of the attraction is that it's a multiple of 8...

    The other obvious choices to consider would be 32 bits (or 16 bits, or 64 bits). This is possible, and may even be worth
    it, but it would be a *much* bigger effort; most of the algorithms would need to be rewritten. One problem is again the
    mismatch between C and assembler: detecting overflow when adding two 32-bit unsigned integers is trivial in x86 assembler,
    but it's not so obvious how to write portable C code that has a decent chance of being compiled optimally on a wide variety
    of compilers.

    I guess my feeling is simply that the 15-bit to 30-bit change seems incredibly easy and cheap: very little code change, and
    hence low risk of accidental breakage. So if there are indeed significant efficiency benefits (still to be determined) then
    it looks like a clear win to me.

    [Gregory Smith]

    (i haven't reviewed your autoconf stuff yet)

    The changes to configure and pyconfig.h are just from rerunning autoconf and autoheader; it's only configure.in that should
    need looking at.

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    > And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

    Mostly laziness (...)

    It was an argument for changing the base used by marshal :-)

    31 bits would involve rewriting the powering algorithm, which assumes that
    PyLong_SHIFT is divisible by 5

    Powering is a simple algorithm. If it were division, it would be much
    harder :-) Python stores the sign of the number in the first digit.
    Because of that, we are limited to 15/30 bits. Storing the sign in the
    size (which size? no idea yet) would allow using a bigger base (31
    bits? 63 bits?).

    One problem is again the mismatch between C and assembler: detecting
    overflow when adding two 32-bit unsigned integers is trivial in x86
    assembler, but it's not so obvious how to write portable C code that has a
    decent chance of being compiled optimally on a wide variety of compilers.

    I wrote an example to detect overflows in C on the mailing list. Copy of my
    email:
    ------------------------------- 8< ----------------------
    About 31, 32, 63 or 64 bits: I guess that you want to avoid integer
    overflow. Intel has an "overflow" flag, updated by all instructions.
    For other CPUs, you can emulate this flag. Example with the int type
    that I used in my GMP patch:

    Add:
      int a, b, sum;
      sum = a + b;
      exact = ((a < 0) ^ (b < 0)) || ((a < 0) == (sum < 0));

    Subtract:
      int a, b, diff;
      diff = a - b;
      exact = ((a < 0) == (b < 0)) || ((a < 0) == (diff < 0));

    Multiply:
      int a, b, product;
      if (a == 0 || b == 0) {
         product = 0;  /* exact */
      } else if (a != INT_MIN || b != -1) {
         product = a * b;
         exact = ((product / b) == a);
      } else {
         /* INT_MIN * -1 = -INT_MIN: overflow */
      }

    Divide:
      int a, b, q;
      if (a != INT_MIN || b != -1) {
         q = a / b;   /* exact */
      } else {
         /* INT_MIN / -1 = -INT_MIN: overflow */
      }

    Checking for overflow may cost more than using a smaller base. Only a
    benchmark can answer ;-)
    ------------------------------- 8< ----------------------

    I guess my feeling is simply that the 15-bit to 30-bit change seems
    incredibly easy and cheap: very little code change, and hence low risk of
    accidental breakage.

    Python has an amazing regression test suite! I used it to fix my GMP
    patch. We can experiment with new bases using this suite.

    Anyway, I love the idea of using 30 bits instead of 15! Most computers
    are now 32- or 64-bit! But it's safe to keep the 15-bit version to
    support older computers or buggy compilers.

    I started working with Git. You might be interested in working
    together using Git: it's much easier to exchange changesets and play
    with branches. I will try to publish my Git tree somewhere.

    @mdickinson
    Member Author

    Following Victor's suggestion, here's an updated patch; same as before,
    except that marshal now uses base 2**15 for reading and writing,
    independently of whether PyLong_SHIFT is 15 or 30.
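
    To illustrate the conversion (each 30-bit digit written as two 15-bit
    marshal digits, with the size adjusted when the top half of the most
    significant digit is zero), here is a hedged standalone sketch; the
    helper name is made up:

        #include <stdint.h>

        #define MSHIFT 15
        #define MMASK  ((1u << MSHIFT) - 1)

        /* Split n 30-bit digits (little-endian, normalized so the top
           digit is nonzero) into 15-bit marshal digits. Returns the
           count written: 2n, or 2n-1 when the top 15 bits of the most
           significant digit are zero. 'out' needs room for 2n entries. */
        static int to_marshal15(uint16_t *out, const uint32_t *d, int n)
        {
            int m = 0;
            for (int i = 0; i < n; i++) {
                out[m++] = (uint16_t)(d[i] & MMASK);
                out[m++] = (uint16_t)(d[i] >> MSHIFT);
            }
            if (m > 0 && out[m - 1] == 0)
                m--;            /* drop the zero top half */
            return m;
        }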

    @vstinner
    Member

    vstinner commented Nov 5, 2008

    Mark: would it be possible to keep the "2 digits" hack in
    PyLong_FromLong, especially with base 2^15? E.g. "#if PyLong_SHIFT ==
    15". Base 2^15 is slow, so don't make it slower :-)

    /* 2 digits */
    if (!(ival >> 2*PyLong_SHIFT)) {
        v = _PyLong_New(2);
        if (v) {
            Py_SIZE(v) = 2*sign;
            v->ob_digit[0] = (digit)ival & PyLong_MASK;
            v->ob_digit[1] = ival >> PyLong_SHIFT;
        }
        return (PyObject*)v;
    }

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    marshal now uses base 2**15 for reading and writing

    Yes, it uses base 2**15, but it's not the correct conversion to base
    2**15. You convert each PyLong digit to base 2**15 but not the whole
    number. As a result, the format is different from the current marshal
    version.

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    PyLong_FromLong() doesn't go into the 1-digit special case for
    negative numbers. You should use:
    	/* Fast path for single-digits ints */
    	if (!(abs_ival>>PyLong_SHIFT)) {
    		v = _PyLong_New(1);
    		if (v) {
    			Py_SIZE(v) = sign;
    			v->ob_digit[0] = abs_ival;
    		}
    		return (PyObject*)v;
    	}

    @mdickinson
    Member Author

    Yes, it uses base 2**15 but it's not the correct conversion to base
    2**15. You convert each PyLong digit to base 2**15 but not the whole
    number.

    I don't understand: yes, each base 2**30 digit is converted to a pair
    of base 2**15 digits, and if necessary (i.e., if the top 15 bits of the
    most significant base 2**30 digit are zero) the size is adjusted. How
    is this not converting the whole number?

    As a result, the format is different from the current marshal version.

    Can you give an example of an integer n such that marshal.dumps(n) gives
    you different results with and without the patch? As far as I can tell,
    I'm getting the same marshal results both with the unpatched version and
    with the patch applied.

    @mdickinson
    Member Author

    Other responses...

    It was an argument for changing the base used by marshal :-)

    Ah. I think I'm with you now. You're saying that ideally, marshal
    shouldn't have to care about how Python stores its longs: it should
    just ask some function in longobject.c to provide an already-converted-
    to-base-256 representation. Is that right?

    I also find the idea of making the marshal representation base 256 quite
    attractive. There are already functions in longobject.c that could be
    used for this: _PyLong_{From,As}ByteArray. And then you wouldn't need
    to touch marshal.c when swapping the GMP version of longobject.c in and
    out.

    Python stores the sign of the number in the first digit. Because
    of that, we are limited to 15/30 bits.

    No: the sign is stored in the size: if v is a PyLongObject then
    ABS(Py_SIZE(v)) gives the number of digits in the absolute value of the
    integer represented by v, and the sign of Py_SIZE(v) gives the sign of
    the integer.
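
    For readers unfamiliar with the layout, a compact illustration of that
    convention (the struct is simplified from longintrepr.h; illustrative
    only):

        #include <stddef.h>
        #include <stdint.h>

        typedef uint32_t digit;

        /* The sign lives in the size field, not in the digits: the
           digits hold the absolute value, least significant first.
           ptrdiff_t stands in for Py_ssize_t here. */
        typedef struct {
            ptrdiff_t ob_size;      /* < 0 for negative ints, 0 for zero */
            digit     ob_digit[1];  /* ABS(ob_size) digits of magnitude */
        } toy_long;

        /* e.g. -42 with 30-bit digits: ob_size == -1, ob_digit[0] == 42 */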

    would it be possible to keep the "2 digits" hack in
    PyLong_FromLong, especially with base 2^15? E.g. "#if PyLong_SHIFT ==
    15". Base 2^15 is slow, so don't make it slower :-)

    Why don't we leave this kind of micro-optimization out until we've got
    some benchmarks? (I'm also tempted to get rid of the long_mul fast
    path for now.)

    PyLong_FromLong() doesn't go into the 1 digit special case for
    negative number.

    Well spotted! Yes, this should be fixed. I have a nasty feeling that I
    was responsible for introducing this bug some time ago...

    @mdickinson
    Member Author

    Here's a pybench comparison, on OS X 10.5/Core 2 Duo/gcc 4.0.1 (32-bit
    non-debug build of the py3k branch). I got this by doing:

    [create clean build of py3k branch]
    dickinsm$ ./python.exe Tools/pybench/pybench.py -f bench_unpatched
    [apply 30bit patch and rebuild]
    dickinsm$ ./python.exe Tools/pybench/pybench.py -c bench_unpatched

    Highlights: SimpleLongArithmetic: around 10% faster.
    SimpleComplexArithmetic: around 16% slower!
    CompareFloatsIntegers: around 20% slower.

    I'll investigate the slowdowns.

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    I'll investigate the slowdowns

    The problem may come from int64_t on 32-bit CPUs: a 32x32 -> 64
    multiply may be emulated on your CPU, and so be slower. I improved
    your patch to make it faster, but I lost all my work because of a
    misuse of Git... As I remember:

    • I fixed PyLong_FromLong() for small negative integers
    • I unrolled the loop in PyLong_FromLong(): the loop is always
      called at least twice (the number has 2 digits or more)
    • I added special code for operations on two numbers of 1 digit
      (each) for long_add(), long_mul(), long_div(), long_bitwise(), etc.
    • and I don't remember the other changes...

    Oh, I have an old patch. I will attach it to this message. About
    speed, it was:

    • unpatched: 20600..20900 pystones
    • your patch: 19900..20100 pystones
    • my changes: 20200..20400 pystones

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    I wrote a patch to gather statistics about PyLong function calls.

    make (use setup.py):

    PyLong_FromLong: 168572 calls, min=( 0, ), avg=(1.4, ), max=( 3, )
    long_bool: 48682 calls, min=( 0, ), avg=(0.2, ), max=( 2, )
    long_add: 39527 calls, min=( 0, 0), avg=(0.9, 1.0), max=( 2, 3)
    long_compare: 39145 calls, min=( 0, 0), avg=(1.2, 1.1), max=( 3, 3)
    PyLong_AsLong: 33689 calls, min=( 0, ), avg=(0.9, ), max=( 45, )
    long_sub: 13091 calls, min=( 0, 0), avg=(0.9, 0.8), max=( 1, 1)
    long_bitwise: 4636 calls, min=( 0, 0), avg=(0.8, 0.6), max=( 2, 2)
    long_hash: 1097 calls, min=( 0, ), avg=(0.9, ), max=( 3, )
    long_mul: 221 calls, min=( 0, 0), avg=(0.8, 1.1), max=( 2, 2)
    long_invert: 204 calls, min=( 0, ), avg=(1.0, ), max=( 1, )
    long_neg: 35 calls, min=( 1, ), avg=(1.0, ), max=( 1, )
    long_format: 3 calls, min=( 0, ), avg=(0.7, ), max=( 1, )
    long_mod: 3 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)
    long_pow: 1 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)

    pystone:

    PyLong_FromLong:1587652 calls, min=( 0, ), avg=(1.0, ), max=( 3, )
    long_add: 902487 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 2, 2)
    long_compare: 651165 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 3, 3)
    PyLong_AsLong: 252476 calls, min=( 0, ), avg=(1.0, ), max=( 2, )
    long_sub: 250032 calls, min=( 1, 0), avg=(1.0, 1.0), max=( 1, 1)
    long_bool: 102655 calls, min=( 0, ), avg=(0.5, ), max=( 1, )
    long_mul: 100015 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2)
    long_div: 50000 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)
    long_hash: 382 calls, min=( 0, ), avg=(1.1, ), max=( 2, )
    long_bitwise: 117 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2)
    long_format: 1 calls, min=( 2, ), avg=(2.0, ), max=( 2, )

    min/avg/max are the integer digit count (minimum, average, maximum).

    What can we learn from these numbers?

    PyLong_FromLong(), long_add() and long_compare() are the 3 most common
    operations on integers.

    Except for PyLong_FromLong(), long_compare() and long_format(), the
    arguments of these functions are mostly in the range [-2^15; 2^15].

    The biggest number has 45 digits: maybe just one call to long_add().
    Apart from that number/call, the biggest numbers have between 2 and 3
    digits.

    long_bool() is never called with numbers bigger than 2 digits.

    long_sub() is never called with numbers bigger than 1 digit!

    @vstinner
    Member

    vstinner commented Nov 6, 2008

    And now the stats for Python patched with 30bit_longdigit3.patch.

    min/avg/max are now in bits, which gives better information. "bigger"
    counts the arguments that don't fit in 1 digit (i.e. are outside
    [-2^30; 2^30]).

    make
    ====

    _FromLong: 169734 calls, min=( 0, ), avg=(11.6, ), max=( 32, )
    --> bigger=31086
    long_bool: 48772 calls, min=( 0, ), avg=( 0.3, ), max=( 24, )
    long_add: 39685 calls, min=( 0, 0), avg=( 6.5, 3.5), max=( 19, 32)
    --> bigger=1
    long_compare: 39445 calls, min=( 0, 0), avg=( 9.3, 8.4), max=( 31, 33)
    --> bigger=10438
    _AsLong: 33726 calls, min=( 0, ), avg=( 4.9, ), max=(1321, )
    --> bigger=10
    long_sub: 13285 calls, min=( 0, 0), avg=( 7.6, 5.6), max=( 13, 13)
    long_bitwise: 4690 calls, min=( 0, 0), avg=( 1.7, 1.9), max=( 16, 16)
    long_hash: 1097 calls, min=( 0, ), avg=( 8.1, ), max=( 33, )
    --> bigger=4
    long_mul: 236 calls, min=( 0, 0), avg=( 1.3, 5.4), max=( 17, 17)
    long_invert: 204 calls, min=( 0, ), avg=( 2.4, ), max=( 3, )
    long_neg: 35 calls, min=( 1, ), avg=( 4.3, ), max=( 7, )
    long_format: 3 calls, min=( 0, ), avg=( 2.0, ), max=( 4, )
    long_mod: 3 calls, min=( 1, 2), avg=( 1.7, 2.0), max=( 2, 2)
    long_pow: 1 calls, min=( 2, 6), avg=( 2.0, 6.0), max=( 2, 6)

    Notes about make:

    • PyLong_FromLong(), long_compare(), PyLong_AsLong() and long_hash()
      get integers outside [-2^30; 2^30], which means that all the other
      functions are only called with arguments of 1 digit!
    • PyLong_FromLong() gets ~30,000 (18%) integers of 32 bits
    • the global average integer size is between 0.3 and 11.6 bits (~6.0 bits?)
    • there are 41,500 (12%) big integers out of ~350,000 integers

    pystone
    =======

    _FromLong: 1504983 calls, min=( 0, ), avg=( 5.1, ), max=( 31, )
    --> bigger=14
    long_add: 902487 calls, min=( 0, 0), avg=( 3.9, 2.4), max=( 17, 17)
    long_compare: 651165 calls, min=( 0, 0), avg=( 1.7, 1.4), max=( 31, 31)
    --> bigger=27
    _AsLong: 252477 calls, min=( 0, ), avg=( 4.6, ), max=( 16, )
    long_sub: 250032 calls, min=( 1, 0), avg=( 4.0, 1.6), max=( 7, 7)
    long_bool: 102655 calls, min=( 0, ), avg=( 0.5, ), max=( 7, )
    long_mul: 100015 calls, min=( 0, 0), avg=( 2.5, 2.0), max=( 4, 16)
    long_truediv: 50000 calls, min=( 4, 2), avg=( 4.0, 2.0), max=( 4, 2)
    long_hash: 382 calls, min=( 0, ), avg=( 8.1, ), max=( 28, )
    long_bitwise: 117 calls, min=( 0, 0), avg=( 6.7, 6.6), max=( 15, 16)
    long_format: 1 calls, min=(16, ), avg=(16.0, ), max=( 16, )

    Notes about pystone:

    • very, very few numbers are bigger than one digit: 41 out of ~4,000,000
    • the global average integer size is between 0.5 and 6.7 bits (~3.0 bits?)
    • the biggest number has only 31 bits (see long_compare)

    Short summary:

    • pystone doesn't use big integers (1 big integer per 100,000 integers)
      => don't use pystone!
    • the average integer size is around 3 to 6 bits, which means that most
      integers can be stored in 8 bits (-255..255)
      => we need to focus on the very small numbers
      => base 2^30 doesn't help for common Python code; it only helps
      programs using really big numbers (128 bits or more?)

    @mdickinson
    Member Author

    Here's a minor update to the patch, that does some extra cleanup:

    • don't include longintrepr.h in Objects/abstract.c or
      Objects/boolobject.c --- it's not needed.

    • fix several places in longobject.c where int should have been size_t
      or Py_ssize_t

    • remove some unnecessary forward declarations in longobject.c.

    • fix PyLong_FromLong for small negative integers

    At some point I'll try to separate the pure bugfixes (missing casts, int
    vs Py_ssize_t, etc.) from the 15-bit to 30-bit conversion.

    @vstinner
    Member

    Using 30bit_longdigit4.patch, I get this
    error: "Objects/longobject.c:700: error: 'SIZE_T_MAX' undeclared
    (first use in this function)". You might use the type Py_ssize_t with
    PY_SSIZE_T_MAX. I used INT_MAX to compile the code.

    @vstinner
    Member

    I like the idea of sys.int_info, but I would prefer a "base" attribute
    rather than "bits_per_digit". A base different from 2^n might be used
    (e.g. a base like 10^n for fast conversion from/to strings).

    @mdickinson
    Member Author

    Here's a version of the 15-bit to 30-bit patch
    that adds in a souped-up version of Mario Pernici's
    faster multiplication.

    I did some testing of 100x100 digit and 1000x1000 digit
    multiplies. On 32-bit x86:
    100 x 100 digits : around 2.5 times faster
    1000 x 1000 digits : around 3 times faster.

    On x86_64, I'm getting more spectacular results:
    100 x 100 digits : around 5 times faster
    1000 x 1000 digits: around 7 times faster!

    The idea of the faster multiplication is quite simple:
    with 30-bit digits, one can fit a sum of 16 30-bit by
    30-bit products in a uint64_t. This means that the
    inner loop for the basecase grade-school multiplication
    contains one fewer addition and no mask and shift.
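
    As a rough illustration (a sketch under assumed names, not the patch's
    code): each 30x30-bit product is below 2**60, and even sixteen of them
    plus an incoming carry still fit in a uint64_t, so a column-oriented
    basecase multiply can skip the per-product mask and shift entirely:

        #include <stdint.h>

        typedef uint32_t digit;
        typedef uint64_t twodigits;
        #define SHIFT 30
        #define MASK  ((digit)(((twodigits)1 << SHIFT) - 1))

        /* Column-oriented schoolbook multiply, assuming min(xn, yn) <= 16
           so that one column's products plus the carry cannot overflow
           the 64-bit accumulator. z must have xn+yn zeroed digits. */
        static void mul_columns(digit *z, const digit *x, int xn,
                                const digit *y, int yn)
        {
            twodigits carry = 0;
            for (int k = 0; k < xn + yn - 1; k++) {
                twodigits acc = carry;
                int lo = k < yn ? 0 : k - yn + 1;
                int hi = k < xn ? k : xn - 1;
                for (int i = lo; i <= hi; i++)
                    acc += (twodigits)x[i] * y[k - i]; /* no mask/shift here */
                z[k] = (digit)(acc & MASK);
                carry = acc >> SHIFT;
            }
            z[xn + yn - 1] = (digit)carry;
        }

    Operands longer than 16 digits would need the accumulator normalized
    every 16 products; a full implementation handles that with periodic
    carry propagation.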

    [Victor, please don't delete the old longdigit4.patch!]

    @pitrou
    Member

    pitrou commented Dec 14, 2008

    Just tested the patch, here are some benchmarks:

    ./python -m timeit -s "a=100000000;b=777777" "a//b"
    -> 2.6: 0.253 usec per loop
    -> 3.1: 0.61 usec per loop
    -> 3.1 + patch: 0.331 usec per loop

    ./python -m timeit -s "a=100000000;b=777777" "a*b"
    -> 2.6: 0.431 usec per loop
    -> 3.1: 0.302 usec per loop
    -> 3.1 + patch: 0.225 usec per loop

    ./python -m timeit -s "a=100000000;b=777777" "a+b"
    -> 2.6: 0.173 usec per loop
    -> 3.1: 0.229 usec per loop
    -> 3.1 + patch: 0.217 usec per loop

    But it seems there are some outliers:

    ./python -m timeit -s "a=100000000**5+1;b=777777**3" "a//b"
    -> 2.6: 1.13 usec per loop
    -> 3.1: 1.12 usec per loop
    -> 3.1 + patch: 1.2 usec per loop

    @vstinner
    Member

    I wrote a small benchmark tool dedicated to integer operations (+ - *
    / etc.): bench_int.py, attached to bpo-4294. See also Message75715 and
    Message75719 for my benchmark results. Short summary: base 2^30 helps
    a lot on 64-bit CPUs (+26%) whereas the speedup is small (4%) on
    32-bit CPUs. But don't trust benchmarks :-p

    @mdickinson
    Member Author

    The most recent patch is out of date and no longer applies cleanly. I'm
    working on an update.

    @mdickinson
    Member Author

    Updated patch against py3k. I'm interested in getting this into the trunk
    as well, but py3k is more important (because *all* integers are long
    integers). It's also a little more complicated to do this for py3k
    (mostly because of all the small integer caching), so backporting to
    2.7 is easier than trying to forward port a patch from 2.7 to 3.1.

    Notes:

    • I've added a configure option --enable-big-digits (there are probably
      better names), enabled by default. So you can use --disable-big-digits
      to get the old 15-bit behaviour.

    • I *think* this patch should work on Windows; confirmation would be
      appreciated.

    • I've removed the fast multiplication code, in the interests of keeping
      the patch simple. If this patch goes in, we can concentrate on speeding
      up multiplication afterwards. For now, note that 30-bit digits give
      the *potential* for significant speedups in multiplication and division
      (see next item).

    • There's a nasty 'feature' in x_divrem: the multiplication in the
      innermost loop is digit-by-twodigits -> twodigits, when it should be
      digit-by-digit -> twodigits; this probably causes significant slowdown
      with 30-bit digits. This may explain Antoine's outlier.
      Again, if this patch goes in I'll work on fixing x_divrem (see the
      sketch after this list).

    • Re: Victor's comment about a 'base' attribute: I tried this, but
      quickly discovered that we still need the 'bits_per_digit' for tests.
      I think that binaryness is so ingrained that it's not really worth
      worrying about the possibility of the base changing from a power of 2 to a
      power of 10. So in the end I left base out.

    • It did occur to me that NSMALLPOSINTS and NSMALLNEGINTS might usefully
      be exposed in sys.int_info, mostly for the purposes of testing. Thoughts?
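
    As mentioned in the x_divrem note above, the width of the inner-loop
    multiply matters; a tiny illustration (typedefs invented for the
    sketch):

        #include <stdint.h>

        typedef uint32_t digit;
        typedef uint64_t twodigits;

        /* On a 32-bit machine the first form is a single 32x32->64
           multiply; the second forces a full 64-bit multiply, typically
           several instructions or a library call. */
        twodigits mul_digit_digit(digit a, digit b)         { return (twodigits)a * b; }
        twodigits mul_digit_twodigits(digit a, twodigits b) { return a * b; }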

    @mdickinson
    Member Author

    Forgot to mention: you'll need to rerun autoconf and autoheader after
    applying the patch and before doing ./configure

    @mdickinson
    Member Author

    Antoine, were your posted results on a 64-bit or a 32-bit system?

    @gpshead
    Member

    gpshead commented Feb 18, 2009

    On 32-bit x86 (1.4Ghz Efficeon) using gcc 4.3.2-1ubuntu12 I see the
    following perf with pidigits_noprint 2000:

    py3k:
    baseline    longdigit14    longdigit13+optimizations
    3709 ms     3664 ms        4545 ms

    Those were from the best of five runs after a warmup loop while the
    system was idle. --enable-big-digits was passed to configure and
    sys.int_info confirmed it was 30bits on the longdigit versions.

    baseline is entirely unpatched.

    @mdickinson
    Member Author

    Gregory, are you sure you didn't swap the 30-bit and 30-bit+opt
    results? On OS X/Core 2 Duo my timings are the other way around:
    30-bit is significantly slower than unpatched, 30-bit+opt is a little
    faster than unpatched. Here are sample numbers:

    Macintosh-3:py3k-30bit-opt dickinsm$ ./python.exe ../pidigits_noprint.py
    Time; 2181.3 ms
    Macintosh-3:py3k-30bit dickinsm$ ./python.exe ../pidigits_noprint.py 2000
    Time; 2987.9 ms
    Macintosh-3:py3k dickinsm$ ./python.exe ../pidigits_noprint.py 2000
    Time; 2216.2 ms

    @mdickinson
    Member Author

    And here are results from 64-bit builds on the same machine as above (OS X
    10.5.6/Core 2 Duo, gcc 4.0.1 from Apple).

    ./python.exe ../pidigits_noprint.py 2000 gives the following timings:

    30-bit digits: Time; 1245.9 ms
    30-bit digits + optimizations: Time; 1184.4 ms
    unpatched py3k: Time; 2479.9 ms

    @gpshead
    Member

    gpshead commented Feb 18, 2009

    hmm yes, ignore my 13+optimize result. apparently that used 15-bit
    digits despite --enable-big-digits on configure. attempting to fix
    that now and rerun.

    @mdickinson
    Member Author

    apparently that used 15-bit
    digits despite --enable-big-digits on configure

    Maybe configure didn't get updated properly? I'm almost sure I found a
    case where autoconf and autoheader just reused the stuff in the
    autom4te.cache directory even though configure.in had been changed.
    I've been doing: rm -fr autom4te.cache; autoconf; autoheader
    each time.

    @loewis
    Mannequin

    loewis mannequin commented Feb 18, 2009

    On all other follow-ups I agree, so no further comments there.

    http://codereview.appspot.com/14105/diff/1/2
    File Python/marshal.c (right):

    http://codereview.appspot.com/14105/diff/1/2#newcode160
    Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);

    Presumably all of these should be fixed.

    Ok, so I'd waive this for this patch; please do create a separate
    report.

    http://codereview.appspot.com/14105

    @loewis
    Mannequin

    loewis mannequin commented Feb 18, 2009

    Maybe configure didn't get updated properly? I'm almost sure I found a
    case where autoconf and autoheader just reused the stuff in the
    autom4te.cache directory even though configure.in had been changed.
    I've been doing: rm -fr autom4te.cache; autoconf; autoheader
    each time.

    I think the docs say to run autoheader first, then autoconf.

    @pitrou
    Member

    pitrou commented Feb 18, 2009

    Maybe configure didn't get updated properly? I'm almost sure I found a
    case where autoconf and autoheader just reused the stuff in the
    autom4te.cache directory even though configure.in had been changed.
    I've been doing: rm -fr autom4te.cache; autoconf; autoheader
    each time.

    Perhaps by doing "autoreconf" instead?

    @gpshead
    Member

    gpshead commented Feb 18, 2009

    new results after fixing my longdigit13 build to use 30 bits instead of
    15 (the configure script in longdigit13+optimizations didn't work right,
    i had to manually add the #define to pyconfig.h)

    py3k:
    baseline    longdigit14    longdigit13+optimizations
    3709 ms     3664 ms        3377 ms

    that's much saner.

    @gpshead
    Member

    gpshead commented Feb 19, 2009

    attaching an updated pidigits benchmark script that does a warmup run
    before reporting the best result of 5.

    @gpshead
    Member

    gpshead commented Feb 19, 2009

    Here are the results from 32-bit x86 on Core 2 Duo, gcc 4.0.1, using
    pidigits_bestof.py 4000:

    30-bit digits (14): 15719 ms
    30-bit digits + optimizations (13+ops): 12490 ms
    unpatched py3k : 13289 ms

    (again, i had to manually add #define PYLONG_DIGIT_SIZE 30 to pyconfig.h
    for the longdigit13+optimizations patch).

    and pybench runs on the same builds vs unpatched:

    30-bit digits (14): -1.4% (slightly slower)
    30-bit digits + optimizations (13+ops): -0.2% (insignificant)

    My votes:

    Obviously use the optimized version (but fix the configure stuff).

    +0 for enabling it by default on 32bit builds.
    +10 for enabling it by default on 64bit builds.

    @loewis
    Mannequin

    loewis mannequin commented Feb 19, 2009

    Obviously use the optimized version (but fix the configure stuff).

    Before such a version gets committed, I'd like to see it on Rietveld
    again.

    @pernici
    Mannequin

    pernici mannequin commented Feb 19, 2009

    The attached patch uses mul1 in long_mul in the version patched with
    30bit_longdigit13+optimizations.patch.

    Comparison between these two patches on an HP Pavilion (Q8200, 2.33GHz):

    pybench                    patch    new patch
    SimpleIntegerArithmetic    89       85
    (other tests equal)

    pidigits_noprint 2000      998      947

    @mdickinson
    Copy link
    Member Author

    Before such a version gets committed, I'd like to see it on Rietveld
    again.

    Sure. My original plan was to get the structural changes in first, and
    then worry about optimizations.

    But now I think the x_divrem fix has to be considered a prerequisite for
    the 30-bit digits patch. So I'll clean up the x_divrem code, include it
    in the basic patch and upload to Rietveld. The other optimization (to
    multiplication) is more optional, potentially more controversial (it
    adds 60 or so lines of extra code), and should wait until after the
    commit and get a separate issue and review.

    The x_divrem work actually simplifies the code (whilst not altering the
    underlying algorithm), as well as speeding it up. Still, it's changing
    a subtle core algorithm, so we need to be *very* sure that the new
    code's correct before it goes in. In particular, I want to add some
    tests to test_long that exercise the rare corner cases in the algorithm
    (which only become rarer when 30-bit digits are used).

    I also want to understand Gregory's problems with configure before
    committing anything.

    None of this is going to happen before the weekend, I'm afraid.

    @mdickinson
    Member Author

    http://codereview.appspot.com/14105/diff/1/2
    File Python/marshal.c (right):

    http://codereview.appspot.com/14105/diff/1/2#newcode160
    Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);
    On 2009/02/18 21:27:04, Martin v. Löwis wrote:

    Ok, so I'd waive this for this patch; please do create a separate
    report.

    Done.

    http://bugs.python.org/issue5308

    http://codereview.appspot.com/14105

    @mdickinson
    Member Author

    Here's an updated patch that includes the x_divrem fixes and
    optimizations. I've also updated the Rietveld issue with these
    fixes.

    I think this is (modulo any requested changes) the version
    that I'd like to get into py3k. After that we can look
    at the possibility of optimizing the multiplication algorithm;
    the discussion for this should probably go back to bpo-3944.

    Here's a detailed list of the changes to x_divrem.

    1. Most importantly, in the inner loop, we make sure that the
      multiplication is digit x digit -> twodigits; previously it was
      digit x twodigits -> twodigits, which is likely to expand to
      several instructions on a 32-bit machine. I suspect that
      this is the main cause of the slowdown that Victor was
      seeing.

    2. Make all variables unsigned. This eliminates the need for
      Py_ARITHMETIC_RIGHT_SHIFT, and also removes the only essential use
      of stwodigits in the entire long object implementation.

    3. Save an iteration of the outer loop when possible by comparing top
      digits.

    4. Remove double tests in the main inner loop and correction loop.

    5. Quotient digit correction step follows Knuth more closely, and uses
      fewer operations. The extra exit condition (r >= PyLong_BASE) will
      be true more than 50% of the time, and should be cheaper than the
      main test.

    6. In the quotient digit estimation step, remove the unnecessary
      special case when vj == w->ob_digit[w_size-1]. Knuth only needs this
      special case because his base is the same as the word size.

    7. There's no need to fill the eliminated digits of v with zero;
      instead, set Py_SIZE(v) at the end of the outer loop.

    8. More comments; some extra asserts.

    There are many other optimizations possible; I've tried only to include
    those that don't significantly complicate the code. An obvious
    remaining inefficiency is that on every iteration of the outer loop we
    divide by the top word of w; on many machines I suspect that it would
    be more efficient to precompute an inverse and multiply by that instead.

    @pitrou
    Member

    pitrou commented Feb 20, 2009

    Updated benchmarks results with 30bit_longdigit17.patch:

    • Victor's bench_int.py:
      • 32-bit with patch: 1178.3 ms (24% speedup)
      • 64-bit with patch: 990.8 ms (27% speedup)
    • Calculating 2000 digits of pi:
      • 32-bit with patch: 2.16 s (25% speedup)
      • 64-bit with patch: 1.5 s (55% speedup)

    This is very good work.

    @mdickinson
    Member Author

    Adding Tim Peters to the nosy list, mainly to give him an opportunity to
    throw up his hands in horror at my rewrite of his (I'm guessing)
    implementation of division.

    @mdickinson
    Member Author

    It finally occurred to me that what might be killing 32-bit performance
    is the divisions, rather than the multiplications.

    To test this, here's a version of 30bit_longdigit17.patch that
    replaces just *two* of the divisions in Objects/longobject.c with the
    appropriate x86 divl assembler instruction. The result for pidigits is
    an astonishing 10% speedup!

    Results of running python pidigits_bestof.py 2000 on 32-bit OS X
    10.5.6/Core 2 Duo:

    unpatched py3k
    -------------
    Best Time; 2212.6 ms

    30bit_longdigit17.patch
    -----------------------
    Best Time; 2283.9 ms (-3.1% relative to py3k)

    30bit_longdigit17+asm.patch
    ---------------------------
    Best Time; 2085.7 ms (+6.1% relative to py3k)

    The problem is that (e.g., in the main loop of x_divrem) we're doing a
    64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit
    remainder. From the analysis of the algorithm, *we* know that the
    quotient will always fit into 32 bits, so that e.g., on x86, a divl
    instruction is appropriate. But unless the compiler is extraordinarily
    clever it doesn't know this, so it produces an expensive library call to
    a function that probably involves multiple divisions and/or some
    branches, that produces the full 64-bit quotient.

    On 32-bit PowerPC things are even worse, since there there isn't even a
    64-by-32 bit divide instruction; only a 32-bit by 32-bit division.
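
    For reference, the kind of replacement the +asm patch makes looks
    roughly like this (a GCC inline-asm sketch, not the patch verbatim;
    the caller must guarantee hi < d, which the division algorithm does,
    or divl faults):

        #include <stdint.h>

        /* 64-by-32 division with a quotient known to fit in 32 bits: on
           x86 this is a single "divl", dividing edx:eax by d, quotient
           in eax, remainder in edx. Plain C "(uint64_t)x / d" on a
           32-bit target usually becomes a library call computing a full
           64-bit quotient. */
        static inline uint32_t div64by32(uint32_t hi, uint32_t lo,
                                         uint32_t d, uint32_t *rem)
        {
            uint32_t q, r;
            __asm__ ("divl %4"
                     : "=a" (q), "=d" (r)
                     : "0" (lo), "1" (hi), "rm" (d));
            *rem = r;
            return q;
        }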

    So I could still be persuaded that 30-bit digits should only be enabled
    by default on 64-bit machines...

    @mdickinson
    Member Author

    Okay, let's abandon 30-bit digits on 32-bit machines: it's still
    unclear whether there's any real performance gain, and it's trivial to
    re-enable 30-bit digits by default later. I'm also going to abandon the
    optimizations for now; it'll be much easier to work on them once the base patch is in.

    Here's a 'release-candidate' version of the patch: it's exactly the
    same as before, except:

    • I removed all x_divrem changes (though I've left the extra division
      tests in, since they're just as relevant to the old x_divrem).
      I'll open a separate issue for these optimizations.

    • enable 30-bit digits by default only on 64-bit platforms, where
      for the purposes of this patch a 64-bit platform is one with all
      the necessary integer types available (signed and unsigned 32-
      and 64-bit integer types) and SIZEOF_VOID_P >= 8.

    I've also updated the version uploaded to Rietveld: the patchset there
    should exactly match 30bit_longdigit20.patch. See:

    http://codereview.appspot.com/14105

    Martin, thank you for all your help with reviewing the previous patch.
    Is it okay with you to check this version in? It's the same as the
    version that you reviewed, except with some extra tests for correct
    division, and with 30-bit digits disabled by default on 32-bit
    platforms.

    Gregory, if you have time, please could you double check that the
    configure stuff is working okay for you with this patch? "configure --
    enable-big-digits" should produce 30-bit digits; "configure --disable-
    big-digits" should produce 15-bit digits, and a plain "configure" should
    give 15-bit or 30-bit depending on your platform.

    Anyone else have any objections to this going in exactly as it is? I'd
    quite like to resolve this issue one way or the other and move on.

    @mdickinson mdickinson self-assigned this Feb 24, 2009
    @vstinner
    Member

    1 comment and 1 question about 30bit_longdigit20.patch:

    • I love fixed-size types: you use them when PYLONG_BITS_IN_DIGIT ==
      30 (e.g. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15
      (e.g. digit=unsigned short). Even if short is always 16 bits, I
      would prefer fixed-size types in all cases.
    • In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is
      it for the Windows world (which doesn't use the configure script)?

    I prefer patch version 20 because it's much simpler than the one with
    the algorithm optimizations. The patch is already huge, so it's better
    to split it into smaller parts ;-)

    @mdickinson
    Member Author

    [Victor]

    In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is
    it for the Windows world (which doesn't use configure script)?

    Yes, but it's also for Unix: PYLONG_BITS_IN_DIGIT is only set when the --
    enable-big-digits option is supplied to configure. So the code in
    pyport.h always gets used for a plain ./configure && make.
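
    Concretely, the fallback looks something like this (a sketch; the
    exact macro conditions in the patch may differ):

        /* pyport.h-style default: if configure (or the Windows build)
           didn't set PYLONG_BITS_IN_DIGIT, choose 30-bit digits only
           when the required fixed-width types exist and pointers are
           64-bit. */
        #ifndef PYLONG_BITS_IN_DIGIT
        #  if defined(HAVE_UINT64_T) && defined(HAVE_INT64_T) && \
              defined(HAVE_UINT32_T) && defined(HAVE_INT32_T) && \
              SIZEOF_VOID_P >= 8
        #    define PYLONG_BITS_IN_DIGIT 30
        #  else
        #    define PYLONG_BITS_IN_DIGIT 15
        #  endif
        #endif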

    I love fixed-size types: you use them when PYLONG_BITS_IN_DIGIT ==
    30 (e.g. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15
    (e.g. digit=unsigned short). Even if short is always 16 bits, I
    would prefer fixed-size types in all cases.

    I agree with you to some extent, but I'd still prefer to leave the 15-bit
    definitions as they are, partly out of a desire not to make unnecessary
    changes. The 'unsigned short' type used for 15-bit digits is both
    theoretically sufficient and not wasteful in practice.

    Barring further objections, I'm planning to get the
    30bit_longdigit20.patch in later this week.

    @pitrou
    Member

    pitrou commented Mar 11, 2009

    I tried the patch on a 64-bit Linux system and it's ok.

    @mdickinson
    Member Author

    Committed 30bit_longdigit20.patch to py3k in r70460, with some minor
    variations (incorporate Nick Coghlan's improved error messages
    for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries).

    Thanks all for your feedback.

    I might get around to backporting this to 2.7 one day...

    @mdickinson
    Member Author

    That should be r70459, of course.

    @pitrou
    Member

    pitrou commented Mar 18, 2009

    Great!

    Committed 30bit_longdigit20.patch to py3k in r70460, with some minor
    variations (incorporate Nick Coghlan's improved error messages
    for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries).

    @mdickinson
    Member Author

    Backported to trunk in r70479.

    Mario, thanks for the long multiplication tweaks you submitted: could you
    possibly regenerate your Feb 19th patch against the trunk (or py3k,
    whichever you prefer) and attach it to bpo-3944?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022