OverflowError in statistics.mean when summing large floats #69364

DavidMacIver · 2015-09-19T11:40:40Z

BPO	25177
Nosy	@mdickinson, @larryhastings, @benjaminp, @stevendaprano, @wm75, @MojoVampire, @bharel
Files	issue25177.patch Issue25177_rev2.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/stevendaprano'
closed_at = <Date 2016-05-05.19:23:25.858>
created_at = <Date 2015-09-19.11:40:40.207>
labels = ['type-bug', 'library']
title = 'OverflowError in statistics.mean when summing large floats'
updated_at = <Date 2018-04-08.20:02:24.810>
user = 'https://bugs.python.org/DavidMacIver'

bugs.python.org fields:

activity = <Date 2018-04-08.20:02:24.810>
actor = 'wolma'
assignee = 'steven.daprano'
closed = True
closed_date = <Date 2016-05-05.19:23:25.858>
closer = 'berker.peksag'
components = ['Library (Lib)']
creation = <Date 2015-09-19.11:40:40.207>
creator = 'David MacIver'
dependencies = []
files = ['40519', '40597']
hgrepos = []
issue_num = 25177
keywords = ['patch']
message_count = 24.0
messages = ['251076', '251080', '251081', '251084', '251085', '251088', '251103', '251108', '251331', '251697', '251702', '252726', '252748', '253201', '253211', '254154', '254157', '254208', '254209', '254312', '255651', '255687', '255714', '315094']
nosy_count = 9.0
nosy_names = ['mark.dickinson', 'larry', 'benjamin.peterson', 'steven.daprano', 'python-dev', 'wolma', 'josh.r', 'David MacIver', 'bar.harel']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25177'
versions = ['Python 3.4', 'Python 3.5']

DavidMacIver · 2015-09-19T11:40:40Z

The following code produces an OverflowError:

import statistics
statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])

The error is:

File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 293, in mean
return _sum(data)/n
File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 184, in _sum
return T(total)
File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/numbers.py", line 291, in __float__
return self.numerator / self.denominator
OverflowError: integer division result too large for a float

If this is intended behaviour then it is not documented: https://docs.python.org/3/library/statistics.html#statistics.mean only specifies that it may raise StatisticsError.

bharel · 2015-09-19T13:23:21Z

I believe it's an intended behavior as python's float has a limit after all. It's hard to reach it but definitely possible.
A workaround is to internally use Decimal (also take the advantage that it's implementation is now way faster) but it will cause a precision loss, and somewhat unexpected behavior.
Last thing we can do is catch that OverflowError and raise StatisticsError from exc.
I believe catching and re-raising as Statistics or just writing that it may also cause an Overflow error are the most viable ways.
Either way, I'd like another opinion before implementing either method as a patch.

DavidMacIver · 2015-09-19T13:58:54Z

I'm not sure what you mean by float having a limit here. It's certainly finite precision, but there is still a representable value with that finite precision closest to the mean.

As an example where there is an obvious correct answer that will trigger this error: statistics.mean([sys.float_info.max, sys.float_info.max]), this should return sys.float_info.max (which is definitely representable!), but instead raises this overflow error.

bharel · 2015-09-19T15:08:45Z

Whoop! I see the reason for it now. By limit I don't mean the precision limit, I mean the top limit in which float converts to "inf".
Seems like this bug is due to the change of python 3's division operator.
Under numbers it states:
"It's important that this conversion use the integer's "true"
division rather than casting one side to float before dividing
so that ratios of huge integers convert without overflowing."

It is meant to not to overflow but the "/" operator is now "//".
Lemme patch it up and see if it fixes the problem. If it states the integer's "true" division, I believe this small fix will be sufficient as it has been tested before.

bharel · 2015-09-19T15:25:32Z

Yup, it indeed fixes the problem. Sorry for thinking it's intended.
Seems like it's a small problem but it does affects all the uses of Integral or Rational.
I'll test it with the suite hoping it has a sufficient coverage.

bharel · 2015-09-19T16:11:48Z

Alright,
Seems like the problem is bigger than I thought. PEP-238 (https://www.python.org/dev/peps/pep-0238/) mentions that problem.
Using the new division, // will cause a floor while / will cause an OverflowError.
An ugly way around it involves type checking or reverting back to the classic division.

bharel · 2015-09-19T18:58:01Z

Seems like this is the only viable option. It fixes the OverflowError but comes at the cost of precision and time.
Another option would be to try/except the overflow error and only then return the slower method.
Regarding the data [8.988465674311579e+307, 8.98846567431158e+307], float has lesser precision than other types like Decimal so the patched method would return 8.98846567431158e+307.
A dataset to test the fix is [8.99e+307, 8.989e+307] which gives a correct result.

mdickinson · 2015-09-19T19:29:15Z

That patch doesn't really help, I'm afraid, since it introduces problems at the other end of the floating-point range: for example, mean([5e-324, 5e-324]) would give 0.0 with that patch (instead of the 5e-324 it currently gives).

So currently, when computing the mean of a sequence of floats (possibly mixed with ints), the code:

Computes the sum as a Fraction.
Converts that Fraction to a float.
Divides the result by the number of elements.

Reversing steps 2 and 3 here would solve the issue, but would require some refactoring. It's really up to Steven whether he thinks that that refactoring is worth it for these corner cases at the extremes of the floating-point range. (It's difficult to imagine that such numbers would turn up frequently in practical applications.)

stevendaprano · 2015-09-22T16:58:44Z

Bar, thanks for the time you put into diagnosing this error, it is
definitely a bug. The intention is for mean([huge, huge]) to return
huge, not raise OverflowError.

I'm reluctant to say that mean() will *never* raise OverflowError, but
it certainly shouldn't do so for this case if it can be avoided. I think
Mark's diagnosis is probably correct that refactoring the code will fix
this.

bharel · 2015-09-27T10:38:41Z

Alright, I issued a fix, now testing it

bharel · 2015-09-27T14:03:55Z

Alright, this patch passed all tests.
I've changed the function sum to include a new parameter called fraction which decides whether to return a fraction or not. It might be useful in more scenarios due to the fact fractions are more accurate.
I'm still unsure why not to support mixed types, as we can just again, return a mixed type as fraction but I left it like this in order not to cause unexpected behavior. I still think however that supporting mixed types should be fine as the bottleneck regarding precision is fraction, which is used as the base for conversion anyway.
I've added a few doctests on the way including the float max and min, which did not previously pass on the current implementation, and changed the unittest to match.

bharel · 2015-10-10T16:28:22Z

Any comments on the patch?

stevendaprano · 2015-10-10T22:05:02Z

On Sat, Oct 10, 2015 at 04:28:22PM +0000, Bar Harel wrote:

Any comments on the patch?

Not yet, I've been unable to look at it, but thank you. If I haven't
responded by the 17th of this month, please ping me again.

MojoVampire · 2015-10-20T02:18:41Z

Do you have any benchmarks on the before and after? I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case, particularly for longer input sequences, but that's a hunch only.

mdickinson · 2015-10-20T06:44:40Z

I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case

The existing code already converts each of the input items to Fraction; the only difference is that the old code converts the sum of those Fractions to float (or whatever the target type is) *before* dividing by the count, while the new code performs the sum/count division in Fraction-land, and only *then* converts to float. That is, it's the difference between "float(exact_sum) / count" and "float(exact_sum / count)".

IOW, the performance is already dead. Or rather, it's just resting: IIUC, the module design prioritises correctness over speed. I'm sure Steven would be open to suggestions for faster algorithms that maintain the current accuracy.

stevendaprano · 2015-11-06T01:06:43Z

Has anyone confirmed that this bug actually exists? I'm afraid that I cannot verify it. I get these results on three different computers:

py> x = 8.988465674311579e+307
py> statistics.mean([x, x])
8.988465674311579e+307
py> statistics.mean([x, x]) == x
True

running Python 3.4.3, a backport on 3.3.0rc3, and the default branch in the repo (3.6.0a).

stevendaprano · 2015-11-06T01:16:07Z

Has anyone confirmed that this bug actually exists?

Confirmed. The initial report is not quite correct: you need three
values to trigger the overflow, not two:

py> x = 8.988465674311579e+307
py> statistics.mean([x]*2) == x
True
py> statistics.mean([x]*3) == x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./statistics.py", line 289, in mean
    return _sum(data)/n
  File "./statistics.py", line 184, in _sum
    return T(total)
  File "/usr/local/lib/python3.3/numbers.py", line 296, in __float__
    return self.numerator / self.denominator
OverflowError: integer division result too large for a float

mdickinson · 2015-11-06T19:38:27Z

I can reproduce here (OS X 10.9, Python 3.5), exactly as described in the original post.

Python 3.5.0 (default, Sep 22 2015, 18:26:54) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import statistics
>>> statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 293, in mean
    return _sum(data)/n
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 184, in _sum
    return T(total)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/numbers.py", line 291, in __float__
    return self.numerator / self.denominator
OverflowError: integer division result too large for a float

mdickinson · 2015-11-06T19:40:55Z

Note that the two input values given in the original report are not the same:

[8.988465674311579e+307, 8.98846567431158e+307] != [8.988465674311579e+307] * 2.

bharel · 2015-11-07T22:48:01Z

Anyway, yes, it should be quite the same. I can provide some benchmarks tomorrow if you wish.

python-dev · 2015-12-01T14:11:18Z

New changeset 4bc9405c4f7b by Steven D'Aprano in branch '3.4':
Fix for issue bpo-25177 with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/4bc9405c4f7b

New changeset ed45a09e5a69 by Steven D'Aprano in branch '3.5':
Fixed issue bpo-25177, problems with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/ed45a09e5a69

New changeset 0eeb39fc8ff5 by Steven D'Aprano in branch 'default':
Issue bpo-25177: Fixed problem with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/0eeb39fc8ff5

stevendaprano · 2015-12-02T01:18:32Z

Larry,

Is it too late to get this into 3.5rc1?

changeset 99407:ed45a09e5a69

Thanks.

python-dev · 2015-12-02T13:36:32Z

New changeset a7d2307055e7 by Victor Stinner in branch 'default':
Null merge 3.5, patch was already applied to default (isuse bpo-25177)
https://hg.python.org/cpython/rev/a7d2307055e7

wm75 · 2018-04-08T20:02:25Z

Steven's commit here also fixed bpo-24068.

DavidMacIver mannequin added the stdlib Python modules in the Lib dir label Sep 19, 2015

stevendaprano self-assigned this Oct 20, 2015

berkerpeksag closed this as completed May 5, 2016

berkerpeksag added the type-bug An unexpected behavior, bug, or error label May 5, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OverflowError in statistics.mean when summing large floats #69364

OverflowError in statistics.mean when summing large floats #69364

DavidMacIver mannequin commented Sep 19, 2015

DavidMacIver mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

DavidMacIver mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

mdickinson commented Sep 19, 2015

stevendaprano commented Sep 22, 2015

bharel mannequin commented Sep 27, 2015

bharel mannequin commented Sep 27, 2015

bharel mannequin commented Oct 10, 2015

stevendaprano commented Oct 10, 2015

MojoVampire mannequin commented Oct 20, 2015

mdickinson commented Oct 20, 2015

stevendaprano commented Nov 6, 2015

stevendaprano commented Nov 6, 2015

mdickinson commented Nov 6, 2015

mdickinson commented Nov 6, 2015

bharel mannequin commented Nov 7, 2015

python-dev mannequin commented Dec 1, 2015

stevendaprano commented Dec 2, 2015

python-dev mannequin commented Dec 2, 2015

wm75 mannequin commented Apr 8, 2018

OverflowError in statistics.mean when summing large floats #69364

OverflowError in statistics.mean when summing large floats #69364

Comments

DavidMacIver mannequin commented Sep 19, 2015

DavidMacIver mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

DavidMacIver mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

bharel mannequin commented Sep 19, 2015

mdickinson commented Sep 19, 2015

stevendaprano commented Sep 22, 2015

bharel mannequin commented Sep 27, 2015

bharel mannequin commented Sep 27, 2015

bharel mannequin commented Oct 10, 2015

stevendaprano commented Oct 10, 2015

MojoVampire mannequin commented Oct 20, 2015

mdickinson commented Oct 20, 2015

stevendaprano commented Nov 6, 2015

stevendaprano commented Nov 6, 2015

mdickinson commented Nov 6, 2015

mdickinson commented Nov 6, 2015

bharel mannequin commented Nov 7, 2015

python-dev mannequin commented Dec 1, 2015

stevendaprano commented Dec 2, 2015

python-dev mannequin commented Dec 2, 2015

wm75 mannequin commented Apr 8, 2018