"""
Proof of correctness for crsum
==============================
Terminology
-----------
Call a list p of floats *nonoverlapping* if all its elements are
nonzero and nonspecial, if the elements are increasing in magnitude,
and if p[i-1] does not overlap p[i] for any 0 < i < len(p).
Notation
--------
Write Sum(p) for the (exact) sum of the values in a list p, as a real
number. Note that Sum(p) may not be exactly representable as a
floating-point number.
Lemma
-----
Let e be a power of 2. If p is nonoverlapping and nonempty, and
0 < p[-1] < e
then
0 < Sum(p) < e.
Similarly, if 0 > p[-1] > -e then 0 > Sum(p) > -e.
Proof
-----
The second part follows from the first by negating. We prove the
first part.
First inequality: Since p is nonoverlapping, p[i-1] <= p[i]/2 for all
0 < i < len(p). Hence
Sum(p) >= p[-1] - 1/2 p[-1] - 1/4 p[-1] - ... > 0.
Second inequality: if p[-1] < 2^n then 2^n-p[-1] is larger than, and
nonoverlapping with, p[0] through p[-2]. (The least significant bit
of 2^n - p[-1] is the same as that of p[-1].) By the first inequality
applied to -p[0], ..., -p[-2], 2^n - p[-1], it follows that
0 < -p[0] + -p[1] + ... + -p[-2] + 2^n - p[-1] == 2^n - Sum(p)
hence Sum(p) < 2^n, as claimed. The second part of the lemma follows
by symmetry.
Lemma
-----
Suppose that p is nonoverlapping and len(p) >= 2. Suppose furthermore
that p[-1] is the correctly rounded value of p[-1] + p[-2]. Then
p[-1] is the correctly rounded value of Sum(p) unless:
p[-1] + 2*p[-2] is exactly representable, and
len(p) >= 3, and
p[-3] and p[-2] have the same sign.
If all of these conditions are satisfied then the correctly rounded
value of Sum(p) is p[-1] + 2*p[-2].
Proof
-----
Since, by assumption, p[-2] is nonzero, p[-1] + p[-2] is not exactly
representable. Let p[-1] and p[-1]+e be the two floating-point
numbers closest to p[-1]+p[-2]. Then abs(e) is a power of 2, and
either 0 < p[-2] <= e/2 or 0 > p[-2] >= e/2. We give the proof in the
former case; the proof in the latter is virtually identical.
So 0 < p[-2] <= e/2. Divide into cases:
Case 1: p[-2] < e/2. Then by the previous lemma, 0 < Sum(p[:-1]) <
e/2, so p[-1] < Sum(p) < p[-1] + e/2, hence the sum is closer to p[-1]
than p[-1] + e and p[-1] gives the correctly rounded value.
Case 2: len(p) == 2. Then by assumption, p[-1] is the correctly
rounded value of p[-1] + p[-2] == Sum(p).
Case 3: p[-2] == e/2, len(p) >= 3, and p[-2] and p[-3] have opposite
signs. Then -e/2 < p[-3] < 0, so by the preceding lemma,
-e/2 < Sum(p[:-2]) < 0.
Adding p[-1] + p[-2] == p[-1] + e/2 gives:
p[-1] < Sum(p) < p[-1] + e/2
Hence p[-1] is the closest floating-point number to Sum(p).
Case 4: p[-2] == e/2, len(p) >= 3, and p[-2] and p[-3] have the same
sign. Then 0 < p[-3] < e/2, so by the preceding lemma, 0 <
Sum(p[:-2]) < e/2. Hence
p[-1] + e/2 < Sum(p) < p[-1] + e
So p[-1] + e is the closest floating-point number to Sum(p), and the
correctly rounded result is p[-1] + e = p[-1] + 2*p[-2].
Finally, the case where p[-1] is +- the largest representable float
and p[-2] has the same sign as p[-1] needs special treatment. But if
p[-1] is the largest representable float and p[-2] is positive then,
letting e be the difference between p[-1] and 2**exp_max,
p[-1] < p[-1] + p[-2] < p[-1] + e/2
where the second inequality is strict, because p[-1]+e/2 would be
rounded up to 2**exp_max (and hence would overflow), contradicting the
assumption that p[-1] + p[-2] rounds to p[-1]. Now as in case 1
above, it follows that Sum(p) rounds to p[-1].
Notes on correctness of msum
==============================
(Assuming the proof of correctness for the original algorithm.)
Divide into two stages:
Stage 1: accumulating partials
Stage 2: summing partials
Stage 1: we follow the original algorithm, but whenever an
overflow occurs in computing x+y == hi + lo, we instead compute
x+y-2**1024 = hi+lo. As before, hi is the correctly rounded value
of x+y-2**1024, and abs(lo) <= min(abs(x), abs(y)). From these
properties it follows as in the original proof that the elements
of partials[1:] are nonzero, nonadjacent, and increasing in
magnitude.
We keep track of the extra multiples of 2.**1024 in partials[0].
It's theoretically possible for partials[0] itself to overflow,
but only if there are more than 2**53 overflows, which seems
exceedingly unlikely in practice.
Lemma
-----
Suppose that x and y are floats with abs(y) <= abs(x). If x+y
overflows and x is positive then x-2**1024 is exactly
representable, and can be exactly computed as x-2**1023-2**1023.
If x+y overflows and x is negative then x+2**1024 is exactly
representable, and can be computed as x+2**1023+2**1023.
Proof
-----
If x+y overflows and x is positive then 2**1023 <= x < 2**1024,
and it follows that both x-2**1023 and x-2**1024 are exactly
representable.
At the end of stage 1, the value of the sum of the original
iterable is Sum(partials[1:]) + 2**1024*partials[0].
(Special values: if a NaN is encountered the function exits
immediately, returning that NaN. Otherwise, partials[0] will be
nan if the iterable contained both +inf and -inf, +inf if the
iterable contained +inf but not -inf, and -inf if the iterable
contained -inf but not +inf.)
Stage 2: if partials[0] == 0.0, we just sum partials[1:], using
crsum. If abs(partials[0]) >= 2.0 then the sum overflowed. The
difficult case is when abs(partials[0]) == 1.0.
For notational simplicity, assume that partials[0] == 1.0, so that
we want to compute 2**1024 + Sum(partials[1:]). Write p for
partials. Then we have the following:
Lemma
-----
Assume len(p) > 1. Assume IEEE 754 format and semantics
(float_info.mant_dig = 53, float_info.max_exp = 1024)
If 2**1024 + p[-1] > 2**1024 - 2**970 then 2**1024 + Sum(p[1:])
overflows.
If 2**1024 + p[-1] < 2**1024 - 2**970 then 2**1024 + Sum(p[1:]) is
representable.
If 2**1024 + p[-1] == 2**1024 - 2**970 then 2**1024 + Sum(p[1:])
rounds to 2**1024-2**971 if len(p) > 2 and p[-2] < 0, and
overflows otherwise.
Proof
-----
Similar to the proof for crsum above.
To detect whether 2**1024 + p[-1] >= 2**1024 - 2**970, we test
whether 2.*(2**1023 + p[-1]/2.0) overflows. (p[-1]/2.0 is exact
except possibly when p[-1] is subnormal, but when p[-1] is
subnormal the test still produces the right result.)
To detect whether 2**1024 + p[-1] == 2**1024 - 2**970, check whether
2**1023 + p[-1]/2 == 2**1023 and 2**1023 + p[-1] - 2**1023 == p[-1].
"""
from math import isinf, isnan
from sys import float_info
twopow = 2.0**(float_info.max_exp - 1)
def samesign(x, y):
return (x >= 0.0) == (y >= 0.0)
def twosum(x, y):
# assumes that abs(x) >= abs(y)
hi = x + y
lo = y - (hi - x)
return hi, lo
def crsum(partials):
"""Compute the sum of a list of nonoverlapping floats.
On input, partials is a list of nonzero, nonspecial,
nonoverlapping floats, strictly increasing in magnitude, but
possibly not all having the same sign.
On output, the sum of partials gives the error in the returned
result, which is correctly rounded (using the round-half-to-even
rule). The elements of partials remain nonzero, nonspecial,
nonoverlapping, and increasing in magnitude.
Assumes IEEE 754 float format and semantics.
"""
if not partials:
return 0.0
# sum from the top, stopping as soon as the sum is inexact.
total_so_far = partials.pop()
while partials:
total_so_far, lo = twosum(total_so_far, partials.pop())
if lo:
partials.append(lo)
break
# adjust for correct rounding if necessary
if len(partials) >= 2 and samesign(partials[-1], partials[-2]) and \
total_so_far + 2*partials[-1] - total_so_far == 2*partials[-1]:
total_so_far += 2*partials[-1]
partials[-1] = -partials[-1]
return total_so_far
def msum(iterable):
"""Full precision sum of values in iterable. Returns the value of the
sum, rounded to the nearest representable floating-point number
using the round-half-to-even rule.
"""
# Stage 1: accumulate partials
partials = [0.0]
for x in iterable:
if isnan(x):
return x
elif isinf(x):
partials[0] += x
else:
i = 1
for y in partials[1:]:
if abs(x) < abs(y):
x, y = y, x
hi, lo = twosum(x, y)
if isinf(hi):
sign = 1 if hi > 0 else -1
x = x - twopow*sign - twopow*sign
partials[0] += sign
if abs(x) < abs(y):
x, y = y, x
hi, lo = twosum(x, y)
if lo:
partials[i] = lo
i += 1
x = hi
partials[i:] = [x] if x else []
# special cases arising from infinities
if isinf(partials[0]):
return partials[0]
elif isnan(partials[0]):
raise ValueError('infinities of both signs in summands')
# Stage 2: sum partials[1:] + 2**exp_max * partials[0]
if abs(partials[0]) == 1.0 and len(partials) > 1 and \
not samesign(partials[-1], partials[0]):
# problem case: decide whether result is representable
hi, lo = twosum(partials[0]*twopow, partials[-1]/2)
if isinf(2*hi):
# overflow, except in edge case...
if hi+2*lo-hi == 2*lo and \
len(partials) > 2 and samesign(lo, partials[-2]):
return 2*(hi+2*lo)
else:
partials[-1:] = [2*lo, 2*hi] if lo else [2*hi]
partials[0] = 0.0
if not partials[0]:
return crsum(partials[1:])
raise OverflowError('overflow in msum')
def test():
inf = float('inf')
nan = float('nan')
test_values = [
([], 0.0),
([0.0], 0.0),
([1e100, 1.0, -1e100, 1e-100, 1e50, -1.0, -1e50], 1e-100),
([1e308, 1e308, -1e308], 1e308),
([-1e308, 1e308, 1e308], 1e308),
([1e308, -1e308, 1e308], 1e308),
([2.0**1023, 2.0**1023, -2.0**1000], 1.7976930277114552e+308),
([twopow, twopow, twopow, twopow, -twopow, -twopow, -twopow], 8.9884656743115795e+307),
([2.0**53, -0.5, -2.0**-54], 2.0**53-1.0),
([2.0**53, 1.0, 2.0**-100], 2.0**53+2.0),
([2.0**53+10.0, 1.0, 2.0**-100], 2.0**53+12.0),
([2.0**53-4.0, 0.5, 2.0**-54], 2.0**53-3.0),
([2.0**1023-2.0**970, -1.0, 2.0**1023], 1.7976931348623157e+308),
([float_info.max, float_info.max*2.**-54], float_info.max),
([float_info.max, float_info.max*2.**-53], OverflowError),
([1./n for n in range(1, 1001)], 7.4854708605503451),
([(-1.)**n/n for n in range(1, 1001)], -0.69264743055982025),
([1.7**(i+1)-1.7**i for i in range(1000)] + [-1.7**1000], -1.0),
([inf, -inf, nan], nan),
([nan, inf, -inf], nan),
([inf, nan, inf], nan),
([inf, inf], inf),
([inf, -inf], ValueError),
([-inf, 1e308, 1e308, -inf], -inf),
([2.0**1023-2.0**970, 0.0, 2.0**1023], OverflowError),
([2.0**1023-2.0**970, 1.0, 2.0**1023], OverflowError),
([2.0**1023, 2.0**1023], OverflowError),
([2.0**1023, 2.0**1023, -1.0], OverflowError),
([twopow, twopow, twopow, twopow, -twopow, -twopow], OverflowError),
([twopow, twopow, twopow, twopow, -twopow, twopow], OverflowError),
([-twopow, -twopow, -twopow, -twopow], OverflowError),
([2.**1023, 2.**1023, -2.**971], float_info.max),
([2.**1023, 2.**1023, -2.**970], OverflowError),
([-2.**970, 2.**1023, 2.**1023, -2.**-1074], float_info.max),
([2.**1023, 2.**1023, -2.**970, 2.**-1074], OverflowError),
([-2.**1023, 2.**971, -2.**1023], -float_info.max),
([-2.**1023, -2.**1023, 2.**970], OverflowError),
([-2.**1023, -2.**1023, 2.**970, 2.**-1074], -float_info.max),
([-2.**-1074, -2.**1023, -2.**1023, 2.**970], OverflowError),
([2.**930, -2.**980, 2.**1023, 2.**1023, twopow, -twopow], 1.7976931348622137e+308),
([2.**1023, 2.**1023, -1e307], 1.6976931348623159e+308),
([1e16, 1., 1e-16], 10000000000000002.0),
]
for i, (vals, s) in enumerate(test_values):
if isinstance(s, type) and issubclass(s, Exception):
try:
m = msum(vals)
except s:
pass
else:
print "Test %d failed: expected %s" % (i, s)
else:
m = msum(vals)
if not (m == s or isnan(m) and isnan(s)):
print "Test %d failed: got %r, expected %r for msum(%r)." % (
i, m, s, vals)
if __name__ == '__main__':
test()