
classification
Title: Improve accuracy of math.hypot() and math.dist()
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.8
process
Status: closed
Resolution: fixed
Assigned To: rhettinger
Nosy List: mark.dickinson, rhettinger, scoder, serhiy.storchaka, tim.peters
Priority: low
Keywords: patch

Created on 2018-08-10 18:19 by rhettinger, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
measure_hypot_alternatives.py rhettinger, 2018-08-10 18:19
math_hypot.s rhettinger, 2018-08-10 20:40 GCC-8 and Clang Disassembly of the patched portion of hypot()
hypot_accuracy.py rhettinger, 2018-08-10 22:35 Improved accuracy measurement tool
Pull Requests
URL Status Linked
PR 8727 merged rhettinger, 2018-08-10 18:22
Messages (12)
msg323378 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 18:19
Apply two accuracy improvements that work well together and that are both cheap (less than 5% difference in speed).  

1. Moving the max value to the end saves an iteration and can improve accuracy (especially when len(args) <= 3, where the summation order is always optimal).

2. Use Kahan summation, which works especially well here because all the inputs are non-negative (a Python sketch of both ideas follows below).
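
For reference, a minimal Python sketch of the two ideas combined (the merged change itself is C code in the math module; the function name hypot_sketch is illustrative only, and Inf/NaN handling is omitted):

from math import fabs, sqrt

def hypot_sketch(*coords):
    coords = [fabs(x) for x in coords]
    if not coords:
        return 0.0
    largest = max(coords)
    if largest == 0.0:
        return 0.0
    # (1) Swap the largest value to the end: it scales to exactly 1.0,
    #     so its square can be folded into the starting sum.
    i = coords.index(largest)
    coords[i], coords[-1] = coords[-1], coords[i]
    csum = 1.0      # (largest / largest) ** 2 == 1.0, added exactly
    frac = 0.0      # Kahan compensation: accumulated lost low-order bits
    for x in coords[:-1]:
        x /= largest            # scaled into [0.0, 1.0]
        x *= x
        t = csum + x            # (2) compensated (Kahan) summation step
        frac += (csum - t) + x  # recover the low-order bits lost in the add
        csum = t
    return largest * sqrt(csum + frac)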

The patched version generally gets the error down to within 1 ULP (tested with 100,000 trials using 100 arguments to hypot() where the arguments are generated using random.expovariate(1.0) and arranged in the least favorable order, largest-to-smallest):

Patched:
 [(0.0, 67276),
  (1.0, 16700),
  (-0.5, 14702),
  (-1.0, 1322)]

Baseline:
[(1.0, 30364),
 (0.0, 25328),
 (-0.5, 17407),
 (-1.0, 10554),
 (2.0, 6274),
 (-1.5, 4890),
 (-2.0, 3027),
 (-2.5, 897),
 (3.0, 752),
 (-3.0, 380),
 (-3.5, 61),
 (4.0, 51),
 (-4.0, 13),
 (-4.5, 1),
 (5.0, 1)]

The impact on performance is minimal (well under 5%).

Patched:
$ py -c 'import sys; print(sys.version)'
3.8.0a0 (heads/hypot-kahan-late-swap-2:e1d89184f0, Aug 10 2018, 11:06:21)
[Clang 9.1.0 (clang-902.0.39.2)]
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(15.0, 14.0, 13.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
1000000 loops, best of 7: 230 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 170 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 119 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(3.0, 2.0, 1.0)'
5000000 loops, best of 7: 95.7 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(2.0, 1.0)'
5000000 loops, best of 7: 81.5 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(1.0)'
5000000 loops, best of 7: 67.4 nsec per loop

Baseline:
$ py -c 'import sys; print(sys.version)'
3.8.0a0 (heads/master:077059e0f0, Aug 10 2018, 11:02:47)
[Clang 9.1.0 (clang-902.0.39.2)]
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(15.0, 14.0, 13.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
1000000 loops, best of 7: 237 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 173 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(5.0, 4.0, 3.0, 2.0, 1.0)'
2000000 loops, best of 7: 120 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(3.0, 2.0, 1.0)'
5000000 loops, best of 7: 94.8 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(2.0, 1.0)'
5000000 loops, best of 7: 80.3 nsec per loop
$ py -m timeit -r 7 -s 'from math import hypot' 'hypot(1.0)'
5000000 loops, best of 7: 67.1 nsec per loop
msg323387 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-10 19:14
What are the results when all components are equal? hypot(1.0, 1.0, ..., 1.0)
msg323388 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-10 19:28
Would it give a performance benefit to get rid of the multiplication and division, and instead scale by a power-of-two approximation of the max using ldexp()?
msg323390 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 20:40
> What are the results when all components are equal? hypot(1.0, 1.0, ..., 1.0)

The scaled summation will be exact (all elements scale to 1.0 and frac is always 0.0).
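
A quick way to check that claim on a patched build (hedged sketch; variadic hypot(), as added in 3.8, is assumed):

from math import hypot

# Every component scales to exactly 1.0, each square is exactly 1.0, the sum
# 100.0 is exactly representable, and sqrt(100.0) == 10.0 is exact, so no
# rounding occurs anywhere along the way.
assert hypot(*[1.0] * 100) == 10.0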

> Would it give a performance benefit to get rid of the multiplication
> and division, and instead scale by a power-of-two approximation of
> the max using ldexp()?

You're right that the multiplication and division are the most expensive part (the adds and subtracts are cheaper and can be done in parallel with subsequent memory fetches).  See the attached Clang and GCC-8 disassemblies.

I've tried a number of variants and couldn't get any performance boost without breaking some of the test cases.  This patch is the best of a dozen attempts.
msg323392 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-08-10 21:18
Not that it matters:  "ulp" is a measure of absolute error, but the script is computing some notion of relative error and _calling_ that "ulp".  It can understate the true ulp error by up to a factor of 2 (the "wobble" of base 2 fp).

Staying away from denorms, this is an easy way to get one ulp with respect to a specific 754 double:

    def ulp(x):
        import math
        mant, exp = math.frexp(x)
        return math.ldexp(0.5, exp - 52)

Then, e.g.,

>>> x
1.9999999999999991
>>> y
1.9999999999999996
>>> y - x
4.440892098500626e-16
>>> oneulp = ulp(x)
>>> oneulp # the same as sys.float_info.epsilon for this x
2.220446049250313e-16
>>> (y - x) / oneulp
2.0

which is the true absolute error of y wrt x.

>>> x + 2 * oneulp == y
True

But:

>>> (y - x) / x
2.220446049250314e-16
>>> _ / oneulp
1.0000000000000004

understates the true ulp error by nearly a factor of 2, while the mathematically (but not numerically) equivalent spelling used in the script:

>>> (y/x - 1.0) / oneulp
1.0

understates it by exactly a factor of 2.
msg323394 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 21:51
Here's a little more performance data that might suggest where possible speed optimizations may lie (I was mostly going for accuracy improvements in this patch).

On my 2.6 GHz (3.6 GHz burst) Haswell, the hypot() function for n arguments takes about 11*n + 60 ns per call (as a sanity check against the timings above, that predicts roughly 11*15 + 60 = 225 ns for the 15-argument case, in line with the measured 230 ns).

The 60 ns fixed portion goes to function call overhead, manipulating native Python objects scattered all over memory, Inf/NaN handling, and in the external calls to __PyArg_ParseStack(), PyObject_Malloc(), PyFloat_AsDouble(), PyObject_Free(), and PyFloat_FromDouble().

The inlined summation routine accesses native C doubles at consecutive memory addresses.  Per Agner Fog's instruction timing tables, the DIVSD takes 10-13 cycles (about 3 ns), the MULSD takes 5 cycles (about 2 ns), and ADDSD/SUBSD each have a 3-cycle latency (about 1 ns each).  That accounts for most of the 11 ns per-argument variable portion of the running time.
msg323397 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-10 22:35
Retested using Tim's ulp(x) function (see attached script).  The accuracy results are for 1,000 trials using 10,000 arguments to hypot() where the arguments are generated using triangular(0.999, 1.001) and arranged in the least favorable order, largest-to-smallest:
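
For context, here is a rough (and slow, but simple) sketch of how such an error-in-ulps histogram can be tallied; the actual measurement tool is the attached hypot_accuracy.py, and the Decimal-based reference used here is just one plausible approach, not necessarily what that script does:

from collections import Counter
from decimal import Decimal, getcontext
from math import frexp, hypot, ldexp
from random import triangular

getcontext().prec = 50          # plenty of headroom for a near-exact reference

def ulp(x):                     # Tim's one-ulp helper from msg323392
    _, exp = frexp(x)
    return ldexp(0.5, exp - 52)

def ref_hypot(coords):
    # Decimal(float) is exact, so the sum of squares is carried to ~50
    # significant digits before rounding back to a float.
    return float(sum(Decimal(x) ** 2 for x in coords).sqrt())

tally = Counter()
for _ in range(1000):
    coords = sorted((triangular(0.999, 1.001) for _ in range(10_000)), reverse=True)
    exact = ref_hypot(coords)
    tally[round((hypot(*coords) - exact) / ulp(exact))] += 1   # whole-ulp bins
print(sorted(tally.items()))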

Patched:  
[(-1.0, 129),
 (0.0, 723),
 (1.0, 148)]

Baseline:
[(-33.0, 2),
 (-32.0, 1),
 (-31.0, 1),
 (-28.0, 5),
 (-27.0, 2),
 (-26.0, 4),
 (-25.0, 3),
 (-24.0, 4),
 (-23.0, 1),
 (-21.0, 9),
 (-20.0, 7),
 (-19.0, 6),
 (-18.0, 15),
 (-17.0, 12),
 (-16.0, 6),
 (-15.0, 12),
 (-14.0, 14),
 (-13.0, 15),
 (-12.0, 15),
 (-11.0, 25),
 (-10.0, 21),
 (-9.0, 24),
 (-8.0, 26),
 (-7.0, 29),
 (-6.0, 36),
 (-5.0, 33),
 (-4.0, 37),
 (-3.0, 31),
 (-2.0, 39),
 (-1.0, 43),
 (0.0, 48),
 (1.0, 45),
 (2.0, 32),
 (3.0, 37),
 (4.0, 34),
 (5.0, 25),
 (6.0, 36),
 (7.0, 29),
 (8.0, 35),
 (9.0, 27),
 (10.0, 24),
 (11.0, 17),
 (12.0, 18),
 (13.0, 18),
 (14.0, 18),
 (15.0, 11),
 (16.0, 8),
 (17.0, 6),
 (18.0, 9),
 (19.0, 15),
 (20.0, 8),
 (21.0, 5),
 (22.0, 6),
 (23.0, 4),
 (24.0, 2),
 (25.0, 1),
 (28.0, 2),
 (30.0, 1),
 (33.0, 1)]
msg323420 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-11 18:32
Applied in: https://mail.python.org/pipermail/python-checkins/2018-August/156572.html   Note: the BPO number was left off the checkin message.
msg323423 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-08-11 21:56
Thanks for doing the "real ulp" calc, Raymond!  It was intended to make the Kahan gimmick look better, and it succeeded ;-)  I don't personally care whether adding 10K things ends up with 50 ulp error, but to each their own.

Division can be mostly removed from scaling, but it's a bit delicate.  The obvious stab would be to multiply by 1/max instead of dividing by max.  That introduces another rounding error, but as you've already discovered this computation as a whole is relatively insensitive to rounding errors in scaling.  The real problem is that 1/max can overflow (when max is subnormal - the IEEE range isn't symmetric).

Instead a power of 2 could be used, chosen so that it and its reciprocal are both representable.  There's no particular virtue in scaling the largest value to 1.0 - the real point is to avoid spurious overflow/underflow when squaring the scaled x.  Code like the following could work.  Scaling would introduce essentially no rounding errors then.

But I bet a call to `ldexp()` is even more expensive than division on most platforms.  So it may well be slower when there are few points.

from math import frexp, fsum, ldexp, sqrt

def hyp_ps(xs):
    b = max(map(abs, xs))
    _, exp = frexp(b)
    exp = -3 * exp // 4
    # scale and invscale are exact.
    # Multiplying by scale is exact, except when
    # the result becomes subnormal (in which case
    # |x| is insignificant compared to |b|).
    # invscale isn't needed until the loop ends,
    # so the division time should overlap with the
    # loop body.
    scale = ldexp(1.0, exp) # exact
    invscale = 1.0 / scale  # exact
    # and `x * scale` is exact except for underflow
    s = fsum((x * scale)**2 for x in xs)
    return invscale * sqrt(s)
msg323430 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-12 05:39
I think that in most real cases (unless max is too large or too small) we can get rid of scaling altogether.
msg323434 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2018-08-12 07:43
Could we maybe make an educated guess based on absmin and absmax whether scaling is needed or not?
msg323449 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-08-12 16:00
Sure, if we make more assumptions.  For 754 doubles, e.g., scaling isn't needed if `1e-100 < absmax < 1e100` unless there are a truly ludicrous number of points.  Because, if that holds, the true sum is between 1e-200 and number_of_points*1e200, both far from being near trouble.

Then the summation loop gets mostly duplicated (with and without scaling), a platform-dependent assumption is introduced, and we need two test-and-branches to determine which to run.  In the common two-argument cases, it saves one division in return.

Note that without the current form of scaling, we lose the guarantee that the sum is exact when all the arguments are the same (because they're all scaled to exactly 1.0 then, but in general each x*x loses half the product bits without scaling).  I don't care about that myself, but Serhiy seems to.
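
A hedged Python sketch of that trade-off (illustrative only; the real code is C, and fsum() here stands in for the patched compensated-summation loop):

from math import fsum, sqrt

def hypot_maybe_unscaled(*coords):
    absmax = max(map(abs, coords), default=0.0)
    if absmax == 0.0:
        return 0.0
    if 1e-100 < absmax < 1e100:
        # Safe band for 754 doubles: no square can overflow or underflow,
        # so the division by absmax can be skipped entirely.
        return sqrt(fsum(x * x for x in coords))
    # Near the extremes, fall back to scaling to avoid spurious
    # overflow/underflow when squaring.
    return absmax * sqrt(fsum((x / absmax) ** 2 for x in coords))
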
History
Date                 User              Action  Args
2022-04-11 14:59:04  admin             set     github: 78557
2018-08-12 16:00:36  tim.peters        set     messages: + msg323449
2018-08-12 07:43:30  scoder            set     nosy: + scoder; messages: + msg323434
2018-08-12 05:39:39  serhiy.storchaka  set     messages: + msg323430
2018-08-11 21:56:50  tim.peters        set     messages: + msg323423
2018-08-11 18:32:59  rhettinger        set     status: open -> closed; resolution: fixed; messages: + msg323420; stage: patch review -> resolved
2018-08-10 22:35:27  rhettinger        set     files: + hypot_accuracy.py; messages: + msg323397
2018-08-10 21:51:16  rhettinger        set     messages: + msg323394
2018-08-10 21:18:48  tim.peters        set     messages: + msg323392
2018-08-10 20:40:32  rhettinger        set     files: + math_hypot.s; messages: + msg323390
2018-08-10 19:28:31  serhiy.storchaka  set     messages: + msg323388
2018-08-10 19:14:13  serhiy.storchaka  set     messages: + msg323387
2018-08-10 18:22:40  rhettinger        set     keywords: + patch; stage: patch review; pull_requests: + pull_request8211
2018-08-10 18:19:58  rhettinger        create