classification
Title: OverflowError in statistics.mean when summing large floats
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: David MacIver, bar.harel, benjamin.peterson, josh.r, larry, mark.dickinson, python-dev, steven.daprano
Priority: normal Keywords: patch

Created on 2015-09-19 11:40 by David MacIver, last changed 2016-05-05 19:23 by berker.peksag. This issue is now closed.

Files
File name Uploaded Description Edit
issue25177.patch bar.harel, 2015-09-19 18:58 review
Issue25177_rev2.patch bar.harel, 2015-09-27 14:03 review
Messages (23)
msg251076 - (view) Author: David MacIver (David MacIver) * Date: 2015-09-19 11:40
The following code produces an OverflowError:

import statistics
statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])

The error is:

  File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 293, in mean
    return _sum(data)/n
  File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 184, in _sum
    return T(total)
  File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/numbers.py", line 291, in __float__
    return self.numerator / self.denominator
OverflowError: integer division result too large for a float


If this is intended behaviour then it is not documented: https://docs.python.org/3/library/statistics.html#statistics.mean only specifies that it may raise StatisticsError.
msg251080 - (view) Author: Bar Harel (bar.harel) * Date: 2015-09-19 13:23
I believe it's an intended behavior as python's float has a limit after all. It's hard to reach it but definitely possible.
A workaround is to internally use Decimal (also take the advantage that it's implementation is now way faster) but it will cause a precision loss, and somewhat unexpected behavior.
Last thing we can do is catch that OverflowError and raise StatisticsError from exc.
I believe catching and re-raising as Statistics or just writing that it may also cause an Overflow error are the most viable ways.
Either way, I'd like another opinion before implementing either method as a patch.
msg251081 - (view) Author: David MacIver (David MacIver) * Date: 2015-09-19 13:58
I'm not sure what you mean by float having a limit here. It's certainly finite precision, but there is still a representable value with that finite precision closest to the mean.

As an example where there is an obvious correct answer that will trigger this error: statistics.mean([sys.float_info.max, sys.float_info.max]), this should return sys.float_info.max (which is definitely representable!), but instead raises this overflow error.
msg251084 - (view) Author: Bar Harel (bar.harel) * Date: 2015-09-19 15:08
Whoop! I see the reason for it now. By limit I don't mean the precision limit, I mean the top limit in which float converts to "inf".
Seems like this bug is due to the change of python 3's division operator.
Under numbers it states:
"It's important that this conversion use the integer's "true"
division rather than casting one side to float before dividing
so that ratios of huge integers convert without overflowing."

It is meant to not to overflow but the "/" operator is now "//".
Lemme patch it up and see if it fixes the problem. If it states the integer's "true" division, I believe this small fix will be sufficient as it has been tested before.
msg251085 - (view) Author: Bar Harel (bar.harel) * Date: 2015-09-19 15:25
Yup, it indeed fixes the problem. Sorry for thinking it's intended.
Seems like it's a small problem but it does affects all the uses of Integral  or Rational.
I'll test it with the suite hoping it has a sufficient coverage.
msg251088 - (view) Author: Bar Harel (bar.harel) * Date: 2015-09-19 16:11
Alright,
Seems like the problem is bigger than I thought. PEP 238 (https://www.python.org/dev/peps/pep-0238/) mentions that problem.
Using the new division, // will cause a floor while / will cause an OverflowError.
An ugly way around it involves type checking or reverting back to the classic division.
msg251103 - (view) Author: Bar Harel (bar.harel) * Date: 2015-09-19 18:58
Seems like this is the only viable option. It fixes the OverflowError but comes at the cost of precision and time.
Another option would be to try/except the overflow error and only then return the slower method.
Regarding the data [8.988465674311579e+307, 8.98846567431158e+307], float has lesser precision than other types like Decimal so the patched method would return 8.98846567431158e+307.
A dataset to test the fix is [8.99e+307, 8.989e+307] which gives a correct result.
msg251108 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-09-19 19:29
That patch doesn't really help, I'm afraid, since it introduces problems at the other end of the floating-point range: for example, `mean([5e-324, 5e-324])` would give `0.0` with that patch (instead of the `5e-324` it currently gives).

So currently, when computing the mean of a sequence of floats (possibly mixed with ints), the code:

1. Computes the sum as a Fraction.
2. Converts that Fraction to a float.
3. Divides the result by the number of elements.

Reversing steps 2 and 3 here would solve the issue, but would require some refactoring.  It's really up to Steven whether he thinks that that refactoring is worth it for these corner cases at the extremes of the floating-point range.  (It's difficult to imagine that such numbers would turn up frequently in practical applications.)
msg251331 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-09-22 16:58
Bar, thanks for the time you put into diagnosing this error, it is 
definitely a bug. The intention is for mean([huge, huge]) to return 
huge, not raise OverflowError.

I'm reluctant to say that mean() will *never* raise OverflowError, but 
it certainly shouldn't do so for this case if it can be avoided. I think 
Mark's diagnosis is probably correct that refactoring the code will fix 
this.
msg251697 - (view) Author: Bar Harel (bar.harel) * Date: 2015-09-27 10:38
Alright, I issued a fix, now testing it
msg251702 - (view) Author: Bar Harel (bar.harel) * Date: 2015-09-27 14:03
Alright, this patch passed all tests.
I've changed the function sum to include a new parameter called fraction which decides whether to return a fraction or not. It might be useful in more scenarios due to the fact fractions are more accurate.
I'm still unsure why not to support mixed types, as we can just again, return a mixed type as fraction but I left it like this in order not to cause unexpected behavior. I still think however that supporting mixed types should be fine as the bottleneck regarding precision is fraction, which is used as the base for conversion anyway.
I've added a few doctests on the way including the float max and min, which did not previously pass on the current implementation, and changed the unittest to match.
msg252726 - (view) Author: Bar Harel (bar.harel) * Date: 2015-10-10 16:28
Any comments on the patch?
msg252748 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-10-10 22:05
On Sat, Oct 10, 2015 at 04:28:22PM +0000, Bar Harel wrote:
> Any comments on the patch?

Not yet, I've been unable to look at it, but thank you. If I haven't 
responded by the 17th of this month, please ping me again.
msg253201 - (view) Author: Josh Rosenberg (josh.r) * Date: 2015-10-20 02:18
Do you have any benchmarks on the before and after? I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case, particularly for longer input sequences, but that's a hunch only.
msg253211 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-10-20 06:44
> I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case

The existing code already converts each of the input items to Fraction; the only difference is that the old code converts the sum of those Fractions to float (or whatever the target type is) *before* dividing by the count, while the new code performs the sum/count division in Fraction-land, and only *then* converts to float.  That is, it's the difference between "float(exact_sum) / count" and "float(exact_sum / count)".

IOW, the performance is already dead.  Or rather, it's just resting: IIUC, the module design prioritises correctness over speed.  I'm sure Steven would be open to suggestions for faster algorithms that maintain the current accuracy.
msg254154 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-11-06 01:06
Has anyone confirmed that this bug actually exists? I'm afraid that I cannot verify it. I get these results on three different computers:

py> x = 8.988465674311579e+307
py> statistics.mean([x, x])
8.988465674311579e+307
py> statistics.mean([x, x]) == x
True

running Python 3.4.3, a backport on 3.3.0rc3, and the default branch in the repo (3.6.0a).
msg254157 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-11-06 01:16
> Has anyone confirmed that this bug actually exists? 

Confirmed. The initial report is not quite correct: you need three 
values to trigger the overflow, not two:

py> x = 8.988465674311579e+307
py> statistics.mean([x]*2) == x
True
py> statistics.mean([x]*3) == x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./statistics.py", line 289, in mean
    return _sum(data)/n
  File "./statistics.py", line 184, in _sum
    return T(total)
  File "/usr/local/lib/python3.3/numbers.py", line 296, in __float__
    return self.numerator / self.denominator
OverflowError: integer division result too large for a float
msg254208 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-11-06 19:38
I can reproduce here (OS X 10.9, Python 3.5), exactly as described in the original post.

Python 3.5.0 (default, Sep 22 2015, 18:26:54) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import statistics
>>> statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 293, in mean
    return _sum(data)/n
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 184, in _sum
    return T(total)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/numbers.py", line 291, in __float__
    return self.numerator / self.denominator
OverflowError: integer division result too large for a float
msg254209 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-11-06 19:40
Note that the two input values given in the original report are not the same:

[8.988465674311579e+307, 8.98846567431158e+307] != [8.988465674311579e+307] * 2.
msg254312 - (view) Author: Bar Harel (bar.harel) * Date: 2015-11-07 22:48
Anyway, yes, it should be quite the same. I can provide some benchmarks tomorrow if you wish.
msg255651 - (view) Author: Roundup Robot (python-dev) Date: 2015-12-01 14:11
New changeset 4bc9405c4f7b by Steven D'Aprano in branch '3.4':
Fix for issue #25177 with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/4bc9405c4f7b

New changeset ed45a09e5a69 by Steven D'Aprano in branch '3.5':
Fixed issue #25177, problems with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/ed45a09e5a69

New changeset 0eeb39fc8ff5 by Steven D'Aprano in branch 'default':
Issue #25177: Fixed problem with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/0eeb39fc8ff5
msg255687 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-12-02 01:18
Larry,

Is it too late to get this into 3.5rc1?

changeset 99407:ed45a09e5a69

Thanks.
msg255714 - (view) Author: Roundup Robot (python-dev) Date: 2015-12-02 13:36
New changeset a7d2307055e7 by Victor Stinner in branch 'default':
Null merge 3.5, patch was already applied to default (isuse #25177)
https://hg.python.org/cpython/rev/a7d2307055e7
History
Date User Action Args
2016-05-05 19:23:25berker.peksagsetstatus: open -> closed
type: behavior
resolution: fixed
stage: resolved
2015-12-02 13:36:32python-devsetmessages: + msg255714
2015-12-02 01:18:33steven.dapranosetnosy: + larry
messages: + msg255687
2015-12-01 14:11:18python-devsetnosy: + python-dev
messages: + msg255651
2015-11-07 22:48:01bar.harelsetmessages: + msg254312
2015-11-06 19:40:54mark.dickinsonsetmessages: + msg254209
2015-11-06 19:38:27mark.dickinsonsetmessages: + msg254208
2015-11-06 01:16:07steven.dapranosetmessages: + msg254157
2015-11-06 01:06:43steven.dapranosetmessages: + msg254154
2015-10-20 06:44:40mark.dickinsonsetmessages: + msg253211
2015-10-20 02:18:41josh.rsetnosy: + josh.r
messages: + msg253201
2015-10-20 02:07:47steven.dapranosetassignee: steven.daprano
2015-10-10 22:05:02steven.dapranosetmessages: + msg252748
2015-10-10 16:28:22bar.harelsetmessages: + msg252726
2015-09-27 14:03:55bar.harelsetfiles: + Issue25177_rev2.patch

messages: + msg251702
2015-09-27 10:38:41bar.harelsetmessages: + msg251697
2015-09-22 16:58:44steven.dapranosetmessages: + msg251331
2015-09-19 19:29:15mark.dickinsonsetmessages: + msg251108
2015-09-19 18:58:01bar.harelsetfiles: + issue25177.patch
keywords: + patch
messages: + msg251103
2015-09-19 16:11:48bar.harelsetnosy: + mark.dickinson, benjamin.peterson
messages: + msg251088
2015-09-19 15:25:32bar.harelsetmessages: + msg251085
2015-09-19 15:08:45bar.harelsetmessages: + msg251084
2015-09-19 13:58:54David MacIversetmessages: + msg251081
2015-09-19 13:23:21bar.harelsetnosy: + bar.harel
messages: + msg251080
2015-09-19 12:09:31berker.peksagsetnosy: + steven.daprano
2015-09-19 11:40:40David MacIvercreate