msg251076 - (view) |
Author: David MacIver (David MacIver) * |
Date: 2015-09-19 11:40 |
The following code produces an OverflowError:
import statistics
statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])
The error is:
File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 293, in mean
return _sum(data)/n
File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 184, in _sum
return T(total)
File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/numbers.py", line 291, in __float__
return self.numerator / self.denominator
OverflowError: integer division result too large for a float
If this is intended behaviour then it is not documented: https://docs.python.org/3/library/statistics.html#statistics.mean only specifies that it may raise StatisticsError.
|
msg251080 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-09-19 13:23 |
I believe it's an intended behavior as python's float has a limit after all. It's hard to reach it but definitely possible.
A workaround is to internally use Decimal (also take the advantage that it's implementation is now way faster) but it will cause a precision loss, and somewhat unexpected behavior.
Last thing we can do is catch that OverflowError and raise StatisticsError from exc.
I believe catching and re-raising as Statistics or just writing that it may also cause an Overflow error are the most viable ways.
Either way, I'd like another opinion before implementing either method as a patch.
|
msg251081 - (view) |
Author: David MacIver (David MacIver) * |
Date: 2015-09-19 13:58 |
I'm not sure what you mean by float having a limit here. It's certainly finite precision, but there is still a representable value with that finite precision closest to the mean.
As an example where there is an obvious correct answer that will trigger this error: statistics.mean([sys.float_info.max, sys.float_info.max]), this should return sys.float_info.max (which is definitely representable!), but instead raises this overflow error.
|
msg251084 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-09-19 15:08 |
Whoop! I see the reason for it now. By limit I don't mean the precision limit, I mean the top limit in which float converts to "inf".
Seems like this bug is due to the change of python 3's division operator.
Under numbers it states:
"It's important that this conversion use the integer's "true"
division rather than casting one side to float before dividing
so that ratios of huge integers convert without overflowing."
It is meant to not to overflow but the "/" operator is now "//".
Lemme patch it up and see if it fixes the problem. If it states the integer's "true" division, I believe this small fix will be sufficient as it has been tested before.
|
msg251085 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-09-19 15:25 |
Yup, it indeed fixes the problem. Sorry for thinking it's intended.
Seems like it's a small problem but it does affects all the uses of Integral or Rational.
I'll test it with the suite hoping it has a sufficient coverage.
|
msg251088 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-09-19 16:11 |
Alright,
Seems like the problem is bigger than I thought. PEP 238 (https://www.python.org/dev/peps/pep-0238/) mentions that problem.
Using the new division, // will cause a floor while / will cause an OverflowError.
An ugly way around it involves type checking or reverting back to the classic division.
|
msg251103 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-09-19 18:58 |
Seems like this is the only viable option. It fixes the OverflowError but comes at the cost of precision and time.
Another option would be to try/except the overflow error and only then return the slower method.
Regarding the data [8.988465674311579e+307, 8.98846567431158e+307], float has lesser precision than other types like Decimal so the patched method would return 8.98846567431158e+307.
A dataset to test the fix is [8.99e+307, 8.989e+307] which gives a correct result.
|
msg251108 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2015-09-19 19:29 |
That patch doesn't really help, I'm afraid, since it introduces problems at the other end of the floating-point range: for example, `mean([5e-324, 5e-324])` would give `0.0` with that patch (instead of the `5e-324` it currently gives).
So currently, when computing the mean of a sequence of floats (possibly mixed with ints), the code:
1. Computes the sum as a Fraction.
2. Converts that Fraction to a float.
3. Divides the result by the number of elements.
Reversing steps 2 and 3 here would solve the issue, but would require some refactoring. It's really up to Steven whether he thinks that that refactoring is worth it for these corner cases at the extremes of the floating-point range. (It's difficult to imagine that such numbers would turn up frequently in practical applications.)
|
msg251331 - (view) |
Author: Steven D'Aprano (steven.daprano) *  |
Date: 2015-09-22 16:58 |
Bar, thanks for the time you put into diagnosing this error, it is
definitely a bug. The intention is for mean([huge, huge]) to return
huge, not raise OverflowError.
I'm reluctant to say that mean() will *never* raise OverflowError, but
it certainly shouldn't do so for this case if it can be avoided. I think
Mark's diagnosis is probably correct that refactoring the code will fix
this.
|
msg251697 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-09-27 10:38 |
Alright, I issued a fix, now testing it
|
msg251702 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-09-27 14:03 |
Alright, this patch passed all tests.
I've changed the function sum to include a new parameter called fraction which decides whether to return a fraction or not. It might be useful in more scenarios due to the fact fractions are more accurate.
I'm still unsure why not to support mixed types, as we can just again, return a mixed type as fraction but I left it like this in order not to cause unexpected behavior. I still think however that supporting mixed types should be fine as the bottleneck regarding precision is fraction, which is used as the base for conversion anyway.
I've added a few doctests on the way including the float max and min, which did not previously pass on the current implementation, and changed the unittest to match.
|
msg252726 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-10-10 16:28 |
Any comments on the patch?
|
msg252748 - (view) |
Author: Steven D'Aprano (steven.daprano) *  |
Date: 2015-10-10 22:05 |
On Sat, Oct 10, 2015 at 04:28:22PM +0000, Bar Harel wrote:
> Any comments on the patch?
Not yet, I've been unable to look at it, but thank you. If I haven't
responded by the 17th of this month, please ping me again.
|
msg253201 - (view) |
Author: Josh Rosenberg (josh.r) *  |
Date: 2015-10-20 02:18 |
Do you have any benchmarks on the before and after? I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case, particularly for longer input sequences, but that's a hunch only.
|
msg253211 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2015-10-20 06:44 |
> I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case
The existing code already converts each of the input items to Fraction; the only difference is that the old code converts the sum of those Fractions to float (or whatever the target type is) *before* dividing by the count, while the new code performs the sum/count division in Fraction-land, and only *then* converts to float. That is, it's the difference between "float(exact_sum) / count" and "float(exact_sum / count)".
IOW, the performance is already dead. Or rather, it's just resting: IIUC, the module design prioritises correctness over speed. I'm sure Steven would be open to suggestions for faster algorithms that maintain the current accuracy.
|
msg254154 - (view) |
Author: Steven D'Aprano (steven.daprano) *  |
Date: 2015-11-06 01:06 |
Has anyone confirmed that this bug actually exists? I'm afraid that I cannot verify it. I get these results on three different computers:
py> x = 8.988465674311579e+307
py> statistics.mean([x, x])
8.988465674311579e+307
py> statistics.mean([x, x]) == x
True
running Python 3.4.3, a backport on 3.3.0rc3, and the default branch in the repo (3.6.0a).
|
msg254157 - (view) |
Author: Steven D'Aprano (steven.daprano) *  |
Date: 2015-11-06 01:16 |
> Has anyone confirmed that this bug actually exists?
Confirmed. The initial report is not quite correct: you need three
values to trigger the overflow, not two:
py> x = 8.988465674311579e+307
py> statistics.mean([x]*2) == x
True
py> statistics.mean([x]*3) == x
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "./statistics.py", line 289, in mean
return _sum(data)/n
File "./statistics.py", line 184, in _sum
return T(total)
File "/usr/local/lib/python3.3/numbers.py", line 296, in __float__
return self.numerator / self.denominator
OverflowError: integer division result too large for a float
|
msg254208 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2015-11-06 19:38 |
I can reproduce here (OS X 10.9, Python 3.5), exactly as described in the original post.
Python 3.5.0 (default, Sep 22 2015, 18:26:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import statistics
>>> statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 293, in mean
return _sum(data)/n
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 184, in _sum
return T(total)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/numbers.py", line 291, in __float__
return self.numerator / self.denominator
OverflowError: integer division result too large for a float
|
msg254209 - (view) |
Author: Mark Dickinson (mark.dickinson) *  |
Date: 2015-11-06 19:40 |
Note that the two input values given in the original report are not the same:
[8.988465674311579e+307, 8.98846567431158e+307] != [8.988465674311579e+307] * 2.
|
msg254312 - (view) |
Author: Bar Harel (bar.harel) * |
Date: 2015-11-07 22:48 |
Anyway, yes, it should be quite the same. I can provide some benchmarks tomorrow if you wish.
|
msg255651 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2015-12-01 14:11 |
New changeset 4bc9405c4f7b by Steven D'Aprano in branch '3.4':
Fix for issue #25177 with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/4bc9405c4f7b
New changeset ed45a09e5a69 by Steven D'Aprano in branch '3.5':
Fixed issue #25177, problems with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/ed45a09e5a69
New changeset 0eeb39fc8ff5 by Steven D'Aprano in branch 'default':
Issue #25177: Fixed problem with the mean of very small and very large numbers.
https://hg.python.org/cpython/rev/0eeb39fc8ff5
|
msg255687 - (view) |
Author: Steven D'Aprano (steven.daprano) *  |
Date: 2015-12-02 01:18 |
Larry,
Is it too late to get this into 3.5rc1?
changeset 99407:ed45a09e5a69
Thanks.
|
msg255714 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2015-12-02 13:36 |
New changeset a7d2307055e7 by Victor Stinner in branch 'default':
Null merge 3.5, patch was already applied to default (isuse #25177)
https://hg.python.org/cpython/rev/a7d2307055e7
|
msg315094 - (view) |
Author: Wolfgang Maier (wolma) * |
Date: 2018-04-08 20:02 |
Steven's commit here also fixed issue 24068.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:21 | admin | set | github: 69364 |
2018-04-08 20:02:24 | wolma | set | nosy:
+ wolma messages:
+ msg315094
|
2016-05-05 19:23:25 | berker.peksag | set | status: open -> closed type: behavior resolution: fixed stage: resolved |
2015-12-02 13:36:32 | python-dev | set | messages:
+ msg255714 |
2015-12-02 01:18:33 | steven.daprano | set | nosy:
+ larry messages:
+ msg255687
|
2015-12-01 14:11:18 | python-dev | set | nosy:
+ python-dev messages:
+ msg255651
|
2015-11-07 22:48:01 | bar.harel | set | messages:
+ msg254312 |
2015-11-06 19:40:54 | mark.dickinson | set | messages:
+ msg254209 |
2015-11-06 19:38:27 | mark.dickinson | set | messages:
+ msg254208 |
2015-11-06 01:16:07 | steven.daprano | set | messages:
+ msg254157 |
2015-11-06 01:06:43 | steven.daprano | set | messages:
+ msg254154 |
2015-10-20 06:44:40 | mark.dickinson | set | messages:
+ msg253211 |
2015-10-20 02:18:41 | josh.r | set | nosy:
+ josh.r messages:
+ msg253201
|
2015-10-20 02:07:47 | steven.daprano | set | assignee: steven.daprano |
2015-10-10 22:05:02 | steven.daprano | set | messages:
+ msg252748 |
2015-10-10 16:28:22 | bar.harel | set | messages:
+ msg252726 |
2015-09-27 14:03:55 | bar.harel | set | files:
+ Issue25177_rev2.patch
messages:
+ msg251702 |
2015-09-27 10:38:41 | bar.harel | set | messages:
+ msg251697 |
2015-09-22 16:58:44 | steven.daprano | set | messages:
+ msg251331 |
2015-09-19 19:29:15 | mark.dickinson | set | messages:
+ msg251108 |
2015-09-19 18:58:01 | bar.harel | set | files:
+ issue25177.patch keywords:
+ patch messages:
+ msg251103
|
2015-09-19 16:11:48 | bar.harel | set | nosy:
+ mark.dickinson, benjamin.peterson messages:
+ msg251088
|
2015-09-19 15:25:32 | bar.harel | set | messages:
+ msg251085 |
2015-09-19 15:08:45 | bar.harel | set | messages:
+ msg251084 |
2015-09-19 13:58:54 | David MacIver | set | messages:
+ msg251081 |
2015-09-19 13:23:21 | bar.harel | set | nosy:
+ bar.harel messages:
+ msg251080
|
2015-09-19 12:09:31 | berker.peksag | set | nosy:
+ steven.daprano
|
2015-09-19 11:40:40 | David MacIver | create | |