
classification
Title: Improve accuracy of math.hypot() and math.dist()
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.8
process
Status: closed
Resolution: fixed
Assigned To: rhettinger
Nosy List: mark.dickinson, rhettinger, scoder, serhiy.storchaka, tim.peters
Priority: low
Keywords: patch

Created on 2018-08-10 18:19 by rhettinger, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
measure_hypot_alternatives.py rhettinger, 2018-08-10 18:19
math_hypot.s rhettinger, 2018-08-10 20:40 GCC-8 and Clang Disassembly of the patched portion of hypot()
hypot_accuracy.py rhettinger, 2018-08-10 22:35 Improved accuracy measurement tool
Pull Requests
URL Status Linked
PR 8727 merged rhettinger, 2018-08-10 18:22
Messages (12)
msg323378 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 18:19
Apply two accuracy improvements that work well together and that are both cheap (less than 5% difference in speed).  

1. Moving the max value to the end saves an iteration and can improve accuracy (especially when len(args) <= 3, where the summation order is always optimal).

2. Use Kahan summation, which works especially well here because all the inputs are non-negative (a Python sketch of both ideas follows below).
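
For reference, a minimal Python sketch of the two ideas combined (the merged change itself is C code in the math module; the function name hypot_sketch is illustrative only, and Inf/NaN handling is omitted):

from math import fabs, sqrt

def hypot_sketch(*coords):
    coords = [fabs(x) for x in coords]
    if not coords:
        return 0.0
    largest = max(coords)
    if largest == 0.0:
        return 0.0
    # (1) Swap the largest value to the end: it scales to exactly 1.0,
    #     so its square can be folded into the starting sum.
    i = coords.index(largest)
    coords[i], coords[-1] = coords[-1], coords[i]
    csum = 1.0      # (largest / largest) ** 2 == 1.0, added exactly
    frac = 0.0      # Kahan compensation: accumulated lost low-order bits
    for x in coords[:-1]:
        x /= largest            # scaled into [0.0, 1.0]
        x *= x
        t = csum + x            # (2) compensated (Kahan) summation step
        frac += (csum - t) + x  # recover the low-order bits lost in the add
        csum = t
    return largest * sqrt(csum + frac)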

The patched version generally gets the error down to within 1 ULP (tested with 100,000 trials using 100 arguments to hypot() where the arguments are generated using random.expovariate(1.0) and arranged in the least favorable order, largest-to-smallest):

Patched:
 [(0.0, 67276),
  (1.0, 16700),
  (-0.5, 14702),
  (-1.0, 1322)]

Baseline:
[(1.0, 30364),
 (0.0, 25328),
 (-0.5, 17407),
 (-1.0, 10554),
 (2.0, 6274),
 (-1.5, 4890),
 (-2.0, 3027),
 (-2.5, 897),
 (3.0, 752),
 (-3.0, 380),
 (-3.5, 61),
 (4.0, 51),
 (-4.0, 13),
 (-4.5, 1),
 (5.0, 1)]

The impact on performance is minimal (well under 5%).

Patched:
$ py -c 'import sys; print(sys.version)'
3.8.0a0 (heads/hypot-kahan-late-swap-2:e1d89184f0, Aug 10 2018, 11:06:21)
[Clang 9.1.0 (clang-902.0.39.2)]
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(15.0, 14.0, 13.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
1000000 loops, best of 7: 230 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 170 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 119 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(3.0, 2.0, 1.0)'
5000000 loops, best of 7: 95.7 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(2.0, 1.0)'
5000000 loops, best of 7: 81.5 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(1.0)'
5000000 loops, best of 7: 67.4 nsec per loop

Baseline:
$ py -c 'import sys; print(sys.version)'
3.8.0a0 (heads/master:077059e0f0, Aug 10 2018, 11:02:47)
[Clang 9.1.0 (clang-902.0.39.2)]
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(15.0, 14.0, 13.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
1000000 loops, best of 7: 237 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 173 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 120 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(3.0, 2.0, 1.0)'
5000000 loops, best of 7: 94.8 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(2.0, 1.0)'
5000000 loops, best of 7: 80.3 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(1.0)'
5000000 loops, best of 7: 67.1 nsec per loop
msg323387 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-10 19:14
What are the results when all components are equal? hypot(1.0, 1.0, ..., 1.0)
msg323388 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-10 19:28
Would it give a performance benefit to get rid of the multiplication and division, and instead scale by a power-of-two approximation of the max using ldexp()?
msg323390 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 20:40
> What are the results when all components are equal? hypot(1.0, 1.0, ..., 1.0)

The scaled summation will be exact (all elements scale to 1.0 and frac is always 0.0).
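
A quick way to check that claim on a patched build (hedged sketch; variadic hypot(), as added in 3.8, is assumed):

from math import hypot

# Every component scales to exactly 1.0, each square is exactly 1.0, the sum
# 100.0 is exactly representable, and sqrt(100.0) == 10.0 is exact, so no
# rounding occurs anywhere along the way.
assert hypot(*[1.0] * 100) == 10.0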

> Would it give a performance benefit to get rid of the multiplication
> and division, and instead scale by a power-of-two approximation of
> the max using ldexp()?

You're right that the multiplication and division are the most expensive part (the adds and subtracts are cheaper and can be done in parallel with subsequent memory fetches).  See the attached Clang and GCC-8 disassemblies.

I've tried a number of variants and couldn't get any performance boost without breaking some of the test cases.  This patch is the best of a dozen attempts.
msg323392 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-08-10 21:18
Not that it matters:  "ulp" is a measure of absolute error, but the script is computing some notion of relative error and _calling_ that "ulp".  It can understate the true ulp error by up to a factor of 2 (the "wobble" of base 2 fp).

Staying away from denorms, this is an easy way to get one ulp with respect to a specific 754 double:

    def ulp(x):
        import math
        mant, exp = math.frexp(x)
        return math.ldexp(0.5, exp - 52)

Then, e.g.,

>>> x
1.9999999999999991
>>> y
1.9999999999999996
>>> y - x
4.440892098500626e-16
>>> oneulp = ulp(x)
>>> oneulp # the same as sys.float_info.epsilon for this x
2.220446049250313e-16
>>> (y - x) / oneulp
2.0

which is the true absolute error of y wrt x.

>>> x + 2 * oneulp == y
True

But:

>>> (y - x) / x
2.220446049250314e-16
>>> _ / oneulp
1.0000000000000004

understates the true ulp error by nearly a factor of 2, while the mathematically (but not numerically) equivalent spelling used in the script:

>>> (y/x - 1.0) / oneulp
1.0

understates it by exactly a factor of 2.
msg323394 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 21:51
Here's a little more performance data that might suggest where possible speed optimizations may lie (I was mostly going for accuracy improvements in this patch).

On my 2.6 GHz (3.6 GHz burst) Haswell, the hypot() function for n arguments takes about 11*n + 60 ns per call (as a sanity check against the timings above, that predicts roughly 11*15 + 60 = 225 ns for the 15-argument case, in line with the measured 230 ns).

The 60 ns fixed portion goes to function call overhead, manipulating native Python objects scattered all over memory, Inf/NaN handling, and in the external calls to __PyArg_ParseStack(), PyObject_Malloc(), PyFloat_AsDouble(), PyObject_Free(), and PyFloat_FromDouble().

The inlined summation routine accesses native C doubles at consecutive memory addresses.  Per Agner Fog's instruction timing tables, the DIVSD takes 10-13 cycles (about 3 ns), the MULSD takes 5 cycles (about 2 ns), and ADDSD/SUBSD each have a 3-cycle latency (about 1 ns each).  That accounts for most of the 11 ns per-argument variable portion of the running time.
msg323397 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 22:35
Retested using Tim's ulp(x) function (see attached script).  The accuracy results are for 1,000 trials using 10,000 arguments to hypot() where the arguments are generated using triangular(0.999, 1.001) and arranged in the least favorable order, largest-to-smallest:
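
For context, here is a rough (and slow, but simple) sketch of how such an error-in-ulps histogram can be tallied; the actual measurement tool is the attached hypot_accuracy.py, and the Decimal-based reference used here is just one plausible approach, not necessarily what that script does:

from collections import Counter
from decimal import Decimal, getcontext
from math import frexp, hypot, ldexp
from random import triangular

getcontext().prec = 50          # plenty of headroom for a near-exact reference

def ulp(x):                     # Tim's one-ulp helper from msg323392
    _, exp = frexp(x)
    return ldexp(0.5, exp - 52)

def ref_hypot(coords):
    # Decimal(float) is exact, so the sum of squares is carried to ~50
    # significant digits before rounding back to a float.
    return float(sum(Decimal(x) ** 2 for x in coords).sqrt())

tally = Counter()
for _ in range(1000):
    coords = sorted((triangular(0.999, 1.001) for _ in range(10_000)), reverse=True)
    exact = ref_hypot(coords)
    tally[round((hypot(*coords) - exact) / ulp(exact))] += 1   # whole-ulp bins
print(sorted(tally.items()))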

Patched:  
[(-1.0, 129),
 (0.0, 723),
 (1.0, 148)]

Baseline:
[(-33.0, 2),
 (-32.0, 1),
 (-31.0, 1),
 (-28.0, 5),
 (-27.0, 2),
 (-26.0, 4),
 (-25.0, 3),
 (-24.0, 4),
 (-23.0, 1),
 (-21.0, 9),
 (-20.0, 7),
 (-19.0, 6),
 (-18.0, 15),
 (-17.0, 12),
 (-16.0, 6),
 (-15.0, 12),
 (-14.0, 14),
 (-13.0, 15),
 (-12.0, 15),
 (-11.0, 25),
 (-10.0, 21),
 (-9.0, 24),
 (-8.0, 26),
 (-7.0, 29),
 (-6.0, 36),
 (-5.0, 33),
 (-4.0, 37),
 (-3.0, 31),
 (-2.0, 39),
 (-1.0, 43),
 (0.0, 48),
 (1.0, 45),
 (2.0, 32),
 (3.0, 37),
 (4.0, 34),
 (5.0, 25),
 (6.0, 36),
 (7.0, 29),
 (8.0, 35),
 (9.0, 27),
 (10.0, 24),
 (11.0, 17),
 (12.0, 18),
 (13.0, 18),
 (14.0, 18),
 (15.0, 11),
 (16.0, 8),
 (17.0, 6),
 (18.0, 9),
 (19.0, 15),
 (20.0, 8),
 (21.0, 5),
 (22.0, 6),
 (23.0, 4),
 (24.0, 2),
 (25.0, 1),
 (28.0, 2),
 (30.0, 1),
 (33.0, 1)]
msg323420 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-11 18:32
Applied in: https://mail.python.org/pipermail/python-checkins/2018-August/156572.html   Note: the BPO number was left off the checkin message.
msg323423 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-08-11 21:56
Thanks for doing the "real ulp" calc, Raymond!  It was intended to make the Kahan gimmick look better, and it succeeded ;-)  I don't personally care whether adding 10K things ends up with 50 ulp error, but to each their own.

Division can be mostly removed from scaling, but it's a bit delicate.  The obvious stab would be to multiply by 1/max instead of dividing by max.  That introduces another rounding error, but as you've already discovered this computation as a whole is relatively insensitive to rounding errors in scaling.  The real problem is that 1/max can overflow (when max is subnormal - the IEEE range isn't symmetric).

Instead a power of 2 could be used, chosen so that it and its reciprocal are both representable.  There's no particular virtue in scaling the largest value to 1.0 - the real point is to avoid spurious overflow/underflow when squaring the scaled x.  Code like the following could work.  Scaling would introduce essentially no rounding errors then.

But I bet a call to `ldexp()` is even more expensive than division on most platforms.  So it may well be slower when there are few points.

from math import frexp, fsum, ldexp, sqrt

def hyp_ps(xs):
    b = max(map(abs, xs))
    _, exp = frexp(b)
    exp = -3 * exp // 4
    # scale and invscale are exact.
    # Multiplying by scale is exact, except when
    # the result becomes subnormal (in which case
    # |x| is insignificant compared to |b|).
    # invscale isn't needed until the loop ends,
    # so the division time should overlap with the
    # loop body.
    scale = ldexp(1.0, exp) # exact
    invscale = 1.0 / scale  # exact
    # and `x * scale` is exact except for underflow
    s = fsum((x * scale)**2 for x in xs)
    return invscale * sqrt(s)
msg323430 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-12 05:39
I think that in most real cases (unless max is too large or too small) we can get rid of scaling altogether.
msg323434 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2018-08-12 07:43
Could we maybe make an educated guess based on absmin and absmax whether scaling is needed or not?
msg323449 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-08-12 16:00
Sure, if we make more assumptions.  For 754 doubles, e.g., scaling isn't needed if `1e-100 < absmax < 1e100` unless there are a truly ludicrous number of points.  Because, if that holds, the true sum is between 1e-200 and number_of_points*1e200, both far from being near trouble.

Then the summation loop gets mostly duplicated (with and without scaling), a platform-dependent assumption is introduced, and we need two test-and-branches to determine which to run.  In the common two-argument cases, it saves one division in return.

Note that without the current form of scaling, we lose the guarantee that the sum is exact when all the arguments are the same (because they're all scaled to exactly 1.0 then, but in general each x*x loses half the product bits without scaling).  I don't care about that myself, but Serhiy seems to.
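
A hedged Python sketch of that trade-off (illustrative only; the real code is C, and fsum() here stands in for the patched compensated-summation loop):

from math import fsum, sqrt

def hypot_maybe_unscaled(*coords):
    absmax = max(map(abs, coords), default=0.0)
    if absmax == 0.0:
        return 0.0
    if 1e-100 < absmax < 1e100:
        # Safe band for 754 doubles: no square can overflow or underflow,
        # so the division by absmax can be skipped entirely.
        return sqrt(fsum(x * x for x in coords))
    # Near the extremes, fall back to scaling to avoid spurious
    # overflow/underflow when squaring.
    return absmax * sqrt(fsum((x / absmax) ** 2 for x in coords))
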
History
Date                 User              Action  Args
2022-04-11 14:59:04  admin             set     github: 78557
2018-08-12 16:00:36  tim.peters        set     messages: + msg323449
2018-08-12 07:43:30  scoder            set     nosy: + scoder; messages: + msg323434
2018-08-12 05:39:39  serhiy.storchaka  set     messages: + msg323430
2018-08-11 21:56:50  tim.peters        set     messages: + msg323423
2018-08-11 18:32:59  rhettinger        set     status: open -> closed; resolution: fixed; messages: + msg323420; stage: patch review -> resolved
2018-08-10 22:35:27  rhettinger        set     files: + hypot_accuracy.py; messages: + msg323397
2018-08-10 21:51:16  rhettinger        set     messages: + msg323394
2018-08-10 21:18:48  tim.peters        set     messages: + msg323392
2018-08-10 20:40:32  rhettinger        set     files: + math_hypot.s; messages: + msg323390
2018-08-10 19:28:31  serhiy.storchaka  set     messages: + msg323388
2018-08-10 19:14:13  serhiy.storchaka  set     messages: + msg323387
2018-08-10 18:22:40  rhettinger        set     keywords: + patch; stage: patch review; pull_requests: + pull_request8211
2018-08-10 18:19:58  rhettinger        create