Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OverflowError in statistics.mean when summing large floats #69364

Closed
DavidMacIver mannequin opened this issue Sep 19, 2015 · 24 comments
Closed

OverflowError in statistics.mean when summing large floats #69364

DavidMacIver mannequin opened this issue Sep 19, 2015 · 24 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@DavidMacIver
Copy link
Mannequin

DavidMacIver mannequin commented Sep 19, 2015

BPO 25177
Nosy @mdickinson, @larryhastings, @benjaminp, @stevendaprano, @wm75, @MojoVampire, @bharel
Files
  • issue25177.patch
  • Issue25177_rev2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/stevendaprano'
    closed_at = <Date 2016-05-05.19:23:25.858>
    created_at = <Date 2015-09-19.11:40:40.207>
    labels = ['type-bug', 'library']
    title = 'OverflowError in statistics.mean when summing large floats'
    updated_at = <Date 2018-04-08.20:02:24.810>
    user = 'https://bugs.python.org/DavidMacIver'

    bugs.python.org fields:

    activity = <Date 2018-04-08.20:02:24.810>
    actor = 'wolma'
    assignee = 'steven.daprano'
    closed = True
    closed_date = <Date 2016-05-05.19:23:25.858>
    closer = 'berker.peksag'
    components = ['Library (Lib)']
    creation = <Date 2015-09-19.11:40:40.207>
    creator = 'David MacIver'
    dependencies = []
    files = ['40519', '40597']
    hgrepos = []
    issue_num = 25177
    keywords = ['patch']
    message_count = 24.0
    messages = ['251076', '251080', '251081', '251084', '251085', '251088', '251103', '251108', '251331', '251697', '251702', '252726', '252748', '253201', '253211', '254154', '254157', '254208', '254209', '254312', '255651', '255687', '255714', '315094']
    nosy_count = 9.0
    nosy_names = ['mark.dickinson', 'larry', 'benjamin.peterson', 'steven.daprano', 'python-dev', 'wolma', 'josh.r', 'David MacIver', 'bar.harel']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25177'
    versions = ['Python 3.4', 'Python 3.5']

    @DavidMacIver
    Copy link
    Mannequin Author

    DavidMacIver mannequin commented Sep 19, 2015

    The following code produces an OverflowError:

    import statistics
    statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])

    The error is:

    File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 293, in mean
    return _sum(data)/n
    File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/statistics.py", line 184, in _sum
    return T(total)
    File "/home/david/.pyenv/versions/3.5.0/lib/python3.5/numbers.py", line 291, in __float__
    return self.numerator / self.denominator
    OverflowError: integer division result too large for a float

    If this is intended behaviour then it is not documented: https://docs.python.org/3/library/statistics.html#statistics.mean only specifies that it may raise StatisticsError.

    @DavidMacIver DavidMacIver mannequin added the stdlib Python modules in the Lib dir label Sep 19, 2015
    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Sep 19, 2015

    I believe it's an intended behavior as python's float has a limit after all. It's hard to reach it but definitely possible.
    A workaround is to internally use Decimal (also take the advantage that it's implementation is now way faster) but it will cause a precision loss, and somewhat unexpected behavior.
    Last thing we can do is catch that OverflowError and raise StatisticsError from exc.
    I believe catching and re-raising as Statistics or just writing that it may also cause an Overflow error are the most viable ways.
    Either way, I'd like another opinion before implementing either method as a patch.

    @DavidMacIver
    Copy link
    Mannequin Author

    DavidMacIver mannequin commented Sep 19, 2015

    I'm not sure what you mean by float having a limit here. It's certainly finite precision, but there is still a representable value with that finite precision closest to the mean.

    As an example where there is an obvious correct answer that will trigger this error: statistics.mean([sys.float_info.max, sys.float_info.max]), this should return sys.float_info.max (which is definitely representable!), but instead raises this overflow error.

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Sep 19, 2015

    Whoop! I see the reason for it now. By limit I don't mean the precision limit, I mean the top limit in which float converts to "inf".
    Seems like this bug is due to the change of python 3's division operator.
    Under numbers it states:
    "It's important that this conversion use the integer's "true"
    division rather than casting one side to float before dividing
    so that ratios of huge integers convert without overflowing."

    It is meant to not to overflow but the "/" operator is now "//".
    Lemme patch it up and see if it fixes the problem. If it states the integer's "true" division, I believe this small fix will be sufficient as it has been tested before.

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Sep 19, 2015

    Yup, it indeed fixes the problem. Sorry for thinking it's intended.
    Seems like it's a small problem but it does affects all the uses of Integral or Rational.
    I'll test it with the suite hoping it has a sufficient coverage.

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Sep 19, 2015

    Alright,
    Seems like the problem is bigger than I thought. PEP-238 (https://www.python.org/dev/peps/pep-0238/) mentions that problem.
    Using the new division, // will cause a floor while / will cause an OverflowError.
    An ugly way around it involves type checking or reverting back to the classic division.

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Sep 19, 2015

    Seems like this is the only viable option. It fixes the OverflowError but comes at the cost of precision and time.
    Another option would be to try/except the overflow error and only then return the slower method.
    Regarding the data [8.988465674311579e+307, 8.98846567431158e+307], float has lesser precision than other types like Decimal so the patched method would return 8.98846567431158e+307.
    A dataset to test the fix is [8.99e+307, 8.989e+307] which gives a correct result.

    @mdickinson
    Copy link
    Member

    That patch doesn't really help, I'm afraid, since it introduces problems at the other end of the floating-point range: for example, mean([5e-324, 5e-324]) would give 0.0 with that patch (instead of the 5e-324 it currently gives).

    So currently, when computing the mean of a sequence of floats (possibly mixed with ints), the code:

    1. Computes the sum as a Fraction.
    2. Converts that Fraction to a float.
    3. Divides the result by the number of elements.

    Reversing steps 2 and 3 here would solve the issue, but would require some refactoring. It's really up to Steven whether he thinks that that refactoring is worth it for these corner cases at the extremes of the floating-point range. (It's difficult to imagine that such numbers would turn up frequently in practical applications.)

    @stevendaprano
    Copy link
    Member

    Bar, thanks for the time you put into diagnosing this error, it is
    definitely a bug. The intention is for mean([huge, huge]) to return
    huge, not raise OverflowError.

    I'm reluctant to say that mean() will *never* raise OverflowError, but
    it certainly shouldn't do so for this case if it can be avoided. I think
    Mark's diagnosis is probably correct that refactoring the code will fix
    this.

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Sep 27, 2015

    Alright, I issued a fix, now testing it

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Sep 27, 2015

    Alright, this patch passed all tests.
    I've changed the function sum to include a new parameter called fraction which decides whether to return a fraction or not. It might be useful in more scenarios due to the fact fractions are more accurate.
    I'm still unsure why not to support mixed types, as we can just again, return a mixed type as fraction but I left it like this in order not to cause unexpected behavior. I still think however that supporting mixed types should be fine as the bottleneck regarding precision is fraction, which is used as the base for conversion anyway.
    I've added a few doctests on the way including the float max and min, which did not previously pass on the current implementation, and changed the unittest to match.

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Oct 10, 2015

    Any comments on the patch?

    @stevendaprano
    Copy link
    Member

    On Sat, Oct 10, 2015 at 04:28:22PM +0000, Bar Harel wrote:

    Any comments on the patch?

    Not yet, I've been unable to look at it, but thank you. If I haven't
    responded by the 17th of this month, please ping me again.

    @stevendaprano stevendaprano self-assigned this Oct 20, 2015
    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Oct 20, 2015

    Do you have any benchmarks on the before and after? I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case, particularly for longer input sequences, but that's a hunch only.

    @mdickinson
    Copy link
    Member

    I strongly suspect that moving from float to Fraction-based ratios is going to kill performance in the common case

    The existing code already converts each of the input items to Fraction; the only difference is that the old code converts the sum of those Fractions to float (or whatever the target type is) *before* dividing by the count, while the new code performs the sum/count division in Fraction-land, and only *then* converts to float. That is, it's the difference between "float(exact_sum) / count" and "float(exact_sum / count)".

    IOW, the performance is already dead. Or rather, it's just resting: IIUC, the module design prioritises correctness over speed. I'm sure Steven would be open to suggestions for faster algorithms that maintain the current accuracy.

    @stevendaprano
    Copy link
    Member

    Has anyone confirmed that this bug actually exists? I'm afraid that I cannot verify it. I get these results on three different computers:

    py> x = 8.988465674311579e+307
    py> statistics.mean([x, x])
    8.988465674311579e+307
    py> statistics.mean([x, x]) == x
    True

    running Python 3.4.3, a backport on 3.3.0rc3, and the default branch in the repo (3.6.0a).

    @stevendaprano
    Copy link
    Member

    Has anyone confirmed that this bug actually exists?

    Confirmed. The initial report is not quite correct: you need three
    values to trigger the overflow, not two:

    py> x = 8.988465674311579e+307
    py> statistics.mean([x]*2) == x
    True
    py> statistics.mean([x]*3) == x
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "./statistics.py", line 289, in mean
        return _sum(data)/n
      File "./statistics.py", line 184, in _sum
        return T(total)
      File "/usr/local/lib/python3.3/numbers.py", line 296, in __float__
        return self.numerator / self.denominator
    OverflowError: integer division result too large for a float

    @mdickinson
    Copy link
    Member

    I can reproduce here (OS X 10.9, Python 3.5), exactly as described in the original post.

    Python 3.5.0 (default, Sep 22 2015, 18:26:54) 
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import statistics
    >>> statistics.mean([8.988465674311579e+307, 8.98846567431158e+307])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 293, in mean
        return _sum(data)/n
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/statistics.py", line 184, in _sum
        return T(total)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/numbers.py", line 291, in __float__
        return self.numerator / self.denominator
    OverflowError: integer division result too large for a float

    @mdickinson
    Copy link
    Member

    Note that the two input values given in the original report are not the same:

    [8.988465674311579e+307, 8.98846567431158e+307] != [8.988465674311579e+307] * 2.

    @bharel
    Copy link
    Mannequin

    bharel mannequin commented Nov 7, 2015

    Anyway, yes, it should be quite the same. I can provide some benchmarks tomorrow if you wish.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Dec 1, 2015

    New changeset 4bc9405c4f7b by Steven D'Aprano in branch '3.4':
    Fix for issue bpo-25177 with the mean of very small and very large numbers.
    https://hg.python.org/cpython/rev/4bc9405c4f7b

    New changeset ed45a09e5a69 by Steven D'Aprano in branch '3.5':
    Fixed issue bpo-25177, problems with the mean of very small and very large numbers.
    https://hg.python.org/cpython/rev/ed45a09e5a69

    New changeset 0eeb39fc8ff5 by Steven D'Aprano in branch 'default':
    Issue bpo-25177: Fixed problem with the mean of very small and very large numbers.
    https://hg.python.org/cpython/rev/0eeb39fc8ff5

    @stevendaprano
    Copy link
    Member

    Larry,

    Is it too late to get this into 3.5rc1?

    changeset 99407:ed45a09e5a69

    Thanks.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Dec 2, 2015

    New changeset a7d2307055e7 by Victor Stinner in branch 'default':
    Null merge 3.5, patch was already applied to default (isuse bpo-25177)
    https://hg.python.org/cpython/rev/a7d2307055e7

    @berkerpeksag berkerpeksag added the type-bug An unexpected behavior, bug, or error label May 5, 2016
    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Apr 8, 2018

    Steven's commit here also fixed bpo-24068.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants