
Rounding errors with statistics.variance #64698

Closed
oscarbenjamin mannequin opened this issue Feb 3, 2014 · 16 comments
Assignee: rhettinger
Labels: 3.9 (only security fixes), 3.10 (only security fixes), 3.11 (only security fixes), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)


oscarbenjamin mannequin commented Feb 3, 2014

BPO 20499
Nosy @rhettinger, @stevendaprano, @wm75, @iritkatriel
PRs
  • bpo-20499: Rounding error in statistics.pvariance #28230
  • [3.10] bpo-20499: Rounding error in statistics.pvariance (GH-28230) #28248
Files
  • statistics_with_exact_ratios.py: full re-implementation of the statistics module
  • statistics_with_exact_ratios.patch
  • statistics_with_exact_ratios_latest.py: v2 of the re-implementation
  • decimalsum.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2021-09-09.03:06:34.176>
    created_at = <Date 2014-02-03.12:46:19.312>
    labels = ['type-feature', 'library', '3.9', '3.10', '3.11']
    title = 'Rounding errors with statistics.variance'
    updated_at = <Date 2021-09-09.12:02:18.658>
    user = 'https://bugs.python.org/oscarbenjamin'

    bugs.python.org fields:

    activity = <Date 2021-09-09.12:02:18.658>
    actor = 'steven.daprano'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2021-09-09.03:06:34.176>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2014-02-03.12:46:19.312>
    creator = 'oscarbenjamin'
    dependencies = []
    files = ['33955', '33958', '33959', '33960']
    hgrepos = []
    issue_num = 20499
    keywords = ['patch']
    message_count = 15.0
    messages = ['210121', '210452', '210456', '210459', '210479', '210481', '210491', '210495', '210700', '210702', '399940', '401432', '401433', '401464', '401465']
    nosy_count = 5.0
    nosy_names = ['rhettinger', 'steven.daprano', 'oscarbenjamin', 'wolma', 'iritkatriel']
    pr_nums = ['28230', '28248']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue20499'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']


    The mean/variance functions in the statistics module don't quite round correctly.

    The reason for this is that although exact rational arithmetic is used internally in the _sum function, it is not used throughout the module. In particular, _sum should be changed to return an exact result, and exact arithmetic should be used right up to the point of returning the answer to the user, at which point a single rounding/coercion should give the user their answer in the appropriate type, correctly rounded once.

    Using exact arithmetic everywhere makes it possible to replace all of the variance* functions with single-pass algorithms based on the computational formula for variance which should be more efficient as well.

    For example the following implements pvariance so that it returns a perfectly rounded result for float input and output in a single pass:

    from fractions import Fraction

    def pvariance(data):
        sx = 0    # exact running sum of x
        sx2 = 0   # exact running sum of x**2
        for n, x in enumerate(map(Fraction, data), 1):
            sx += x
            sx2 += x ** 2
        Ex = sx / n
        Ex2 = sx2 / n
        var = Ex2 - Ex ** 2    # computational formula: E[x**2] - E[x]**2
        return float(var)      # round exactly once, at the end

    Comparing the above with the statistics module:

    >>> pvariance([0, 0, 1])
    0.2222222222222222
    >>> statistics.pvariance([0, 0, 1])
    0.22222222222222224

    The true answer is:

    >>> from fractions import Fraction as F
    >>> float(statistics.pvariance([F(0), F(0), F(1)]))
    0.2222222222222222

    The logic in the _sum function for computing exact integer ratios and coercing back to the output type could be moved into utility functions so that it does not need to be duplicated.
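    For illustration only, such helpers might look roughly like this (the names _exact_ratio and _coerce are hypothetical, and calling as_integer_ratio() on Decimal and Fraction assumes a modern Python; it did not exist when this was written):

    from fractions import Fraction

    def _exact_ratio(x):
        # Hypothetical helper: return x as an exact (numerator, denominator) pair.
        try:
            return x.as_integer_ratio()   # int and float; Decimal/Fraction on newer Pythons
        except AttributeError:
            f = Fraction(x)
            return f.numerator, f.denominator

    def _coerce(T, num, den):
        # Hypothetical helper: round the exact ratio num/den exactly once into type T.
        if T is float:
            return num / den              # int/int true division is correctly rounded
        return T(num) / T(den)            # e.g. Decimal/Decimal rounds per the current context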

    Some examples of rounding issues:

    >>> from statistics import variance, mean
    >>> from decimal import Decimal as D, getcontext
    >>> from fractions import Fraction as F

    Variance with ints or floats returns a float but the float is not quite the nearest possible float:

    >>> variance([0, 0, 2])
    1.3333333333333335
    >>> float(variance([F(0), F(0), F(2)]))  # true result rounded once
    1.3333333333333333

    Another example with Decimal:

    >>> getcontext().prec = 5
    >>> getcontext()
    Context(prec=5, rounding=ROUND_HALF_EVEN, Emin=-999999999, Emax=999999999, capitals=1, clamp=0, flags=[Rounded, Inexact], traps=[DivisionByZero, Overflow, InvalidOperation])
    
    >>> variance([D(0), D(0), D(2)] * 2)  # Rounded down instead of up
    Decimal('1.0666')
    >>> r = (variance([F(0), F(0), F(2)] * 2))
    >>> D(r.numerator) / r.denominator  # Correctly rounded
    Decimal('1.0667')

    The mean function may also not be correctly rounded:

    >>> getcontext().prec = 2
    >>> r = mean((F('1.2'), F('1.3'), F('1.55')))
    >>> r
    Fraction(27, 20)
    >>> D(r.numerator) / r.denominator # Correctly rounded
    Decimal('1.4')
    >>> mean([D('1.2'), D('1.3'), D('1.55')])
    Decimal('1.3')

    oscarbenjamin mannequin added the stdlib (Python modules in the Lib dir) label on Feb 3, 2014

    wm75 mannequin commented Feb 7, 2014

    I have written a patch for this issue (I'm uploading the complete new code for everyone to try it - importing it into Python 3.3 works fine; a diff with additional tests against Oscar's examples will follow soon).
    Just as Oscar suggested, this new version performs all calculations using exact rational arithmetic and rounds/coerces only before returning the final result to the user. Its precision is thus limited only by that of the input data sequence.
    It passes Oscar's examples 1-3, as you can easily test yourself. It also gives the correct answer in the fourth example - mean([D('1.2'), D('1.3'), D('1.55')]) - although on my system the original statistics module gets this one right already.

    The implementation I chose for this is a bit different from Oscar's suggestion. Essentially, it introduces a dedicated module-private class _ExactRatio that represents numbers as exact ratios and is passed between the different functions in the module. This class borrows many of its algorithms from fractions.Fraction, but has some specialized methods for central tasks in the statistics module, making it much more efficient in this context than fractions.Fraction. This class is currently really minimal, but can easily be extended if necessary.
    In my implementation this new class is used throughout the module whenever calculations with or conversions to exact ratios have to be performed, which allowed me to preserve almost all of the original code and to factor out the changes to the class.
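    Purely for illustration (this is a sketch, not the attached patch), a minimal exact-ratio holder along those lines might look like:

    class _ExactRatio:
        # Minimal illustrative exact-ratio container; method names are not
        # taken from the attached patch.
        __slots__ = ('num', 'den')

        def __init__(self, num, den=1):
            self.num = num
            self.den = den

        def __add__(self, other):
            # a/b + c/d = (a*d + c*b) / (b*d); skip gcd reduction for speed
            return _ExactRatio(self.num * other.den + other.num * self.den,
                               self.den * other.den)

        def divided_by_count(self, n):
            # divide by an integer count (e.g. the number of data points)
            return _ExactRatio(self.num, self.den * n)

        def as_float(self):
            # coerce back to float, rounding exactly once
            return self.num / self.den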

    As for performance, the gain imagined by Oscar is not always realized even though the variance functions are now using single passes over the data. Specifically, in the case of floats the overhead of having to convert everything to exact ratios first eats up all the savings.
    In the case of fractional input, there is a dramatic performance boost though. I compiled a small table comparing (kind of) average performance of the two versions with various input data types. Take this with a grain of salt because the differences can vary quite a bit depending on the exact data:

    data type        performance gain(+)/loss(-) over original module
    ---------------  -------------------------------------------------
    float            -10 %
    short Decimal    +10 %
    long Decimal     -25 %
    Fraction         +80 % (!!)
    MyFloat          +25 %

    With Decimal input, the cost of conversion to exact ratios depends on the number of digits in the Decimals, so with short Decimals the savings from the single-pass algorithm are larger than the conversion costs, but this reverses for very long Decimals.
    MyFloat is a minimal class inheriting from float and overriding just its arithmetic methods to return MyFloat instances again.
    The performance gain with Fraction input comes from two changes, the single-pass algorithm and an optimization in _sum (with Fraction, more than with any other type, the dictionary built by _sum can grow quite large and the optimization is in the conversion of the dictionary elements to exact ratios). This is why the extent of this gain can sometimes be significantly higher than the 80% listed in the table.

    Try this, for example:

    from statistics import variance as v
    from statistics_with_exact_ratios import variance as v2
    from fractions import Fraction
    
    data = [Fraction(1,x) for x in range(1,2000)]
    
    print('calculating variance using original statistics module ...')
    print(float(v(data)))
    print('now using exact ratio calculations ...')
    print(float(v2(data)))

    I invite everybody to test my implementation, which is very unlikely to be free of bugs at this stage.

    wm75 mannequin added the type-feature (A feature request or enhancement) label on Feb 7, 2014

    skrah mannequin commented Feb 7, 2014

    We can add a fast Decimal.as_integer_ratio() in C.

    That said, why is the sum of Decimals not done in decimal arithmetic
    with a very high context precision? It would be exact and with usual exponents in the range [-384, 383] it should be very fast.

    >>> from decimal import Decimal as D, getcontext
    >>> c = getcontext()
    >>> c.prec = 50
    >>> sum([D("1e50"), D(1), D("-1e50")] * 1000)
    Decimal('0E+1')
    >>>
    >>> c.prec = 51
    >>> sum([D("1e50"), D(1), D("-1e50")] * 1000)
    Decimal('1000')


    wm75 mannequin commented Feb 7, 2014

    We can add a fast Decimal.as_integer_ratio() in C.

    That would be a terrific thing to have! Currently, Decimals perform really poorly with the statistics functions, and this is entirely due to this bottleneck.

    With regard to calculating the sum of Decimals in decimal arithmetic, the problem is how you'd detect that the input is all Decimals (or contains enough Decimals to justify a switch in the algorithm).
    The current statistics module accepts input data consisting of mixed types, e.g., you could have Decimal mixed with ints or floats. Much of the complexity of the module stems from this fact. You can also read about some of the complications under this issue: http://bugs.python.org/issue20481.
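
    (As a purely illustrative sketch of the detection problem, assuming the data is a reusable sequence rather than a one-shot iterator:)

    from decimal import Decimal

    def _all_decimal(data):
        # Naive illustration only: True if every element is a Decimal (or a subclass).
        return all(isinstance(x, Decimal) for x in data)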


    wm75 mannequin commented Feb 7, 2014

    Ok, I finally managed to get the test suite right for this, so here is the patch diff for everything.
    I made a few final changes to the module itself (mostly to please the test suite), so I've edited the complete code as well.


    oscarbenjamin mannequin commented Feb 7, 2014

    A fast Decimal.as_integer_ratio() would be useful in any case.

    If you're going to use decimals though then you can trap inexact and
    keep increasing the precision until it becomes exact. The problem is
    with rationals that cannot be expressed in a finite number of decimal
    digits - these need to be handled separately. I've attached
    decimalsum.py that shows how to compute an exact sum of any mix of
    int, float and Decimal, but not Fraction.
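
    (A rough sketch of that trap-Inexact-and-retry idea, for illustration only; it is
    not the attached decimalsum.py and assumes the input is a reusable sequence of
    ints, floats and Decimals:)

    from decimal import Decimal, Inexact, localcontext

    def exact_decimal_sum(numbers):
        # Keep doubling the precision until the whole sum is computed exactly.
        prec = 28
        while True:
            try:
                with localcontext() as ctx:
                    ctx.prec = prec
                    ctx.traps[Inexact] = True
                    return sum(map(Decimal, numbers), Decimal(0))
            except Inexact:
                prec *= 2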

    When I looked at this before, having special cases for everything from
    int to float to Decimal to Fraction made the code really complicated.
    The common cases are int and float. For these cases sum() and fsum()
    are much faster. However, you also need code that checks everything
    in the iterable.

    One option is to do something like:

    import math
    import itertools
    from decimal import Decimal
    from decimalsum import decimalsum
    
    def _sum(numbers):
        subtotals = []
        for T, nums in itertools.groupby(numbers, type):
            if T is int:
                subtotals.append(sum(nums))
            elif T is float:
                subtotals.append(math.fsum(nums))
            elif T is Decimal:
                subtotals.append(decimalsum(nums))
            else:
                raise NotImplementedError
        return decimalsum(subtotals)

    The main problem here is that fsum rounds every time it returns,
    meaning that this sum is order-dependent if there is a mix of floats
    and other types (see bpo-19086 where I asked for a way to change that).

    Also having separate code blocks to manage all the different types
    internally in e.g. the less trivial variance calculations is tedious.


    wm75 mannequin commented Feb 7, 2014

    In principle, your approach using itertools.groupby, the existing _sum, your decimalsum, and my _ExactRatio class could be combined.

    You could call decimalsum for Decimal and subclasses, the current _sum for everything else.
    _sum would return an _ExactRatio instance now anyway; just modify decimalsum accordingly and you are done.

    The problem I see is that it would cause a slow down in many cases where no Decimals or just a few are involved (think of mixes of ints and floats as a realistic scenario, and consider also that you would have to do an isinstance check to catch subclasses of Decimal).


    skrah mannequin commented Feb 7, 2014

    Oscar Benjamin <report@bugs.python.org> wrote:

    If you're going to use decimals though then you can trap inexact and
    keep increasing the precision until it becomes exact.

    For sums that is not necessary. Create a context with MAX_EMAX, MIN_EMIN and
    MAX_PREC and mpd_add() -- the underlying libmpdec function -- will only use
    as many digits as necessary.

    Of course, calculating 1/3 in MAX_PREC would be catastrophic.

    The problem is with rationals that cannot be expressed in a finite number
    of decimal digits - these need to be handled separately. I've attached
    decimalsum.py that shows how to compute an exact sum of any mix of
    int, float and Decimal, but not Fraction.

    Yes, I was thinking of a "don't do that" approach. Do people have mixed
    Fractions and Decimals in their data?
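
    (For reference, a minimal sketch of such an exact-sum context, built from the
    documented decimal.MAX_PREC, MAX_EMAX and MIN_EMIN constants:)

    from decimal import Context, Decimal as D, MAX_EMAX, MAX_PREC, MIN_EMIN

    exact = Context(prec=MAX_PREC, Emax=MAX_EMAX, Emin=MIN_EMIN)
    total = D(0)
    for x in [D("1e50"), D(1), D("-1e50")] * 1000:
        total = exact.add(total, x)   # with MAX_PREC these additions never round
    print(total)                      # Decimal('1000')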


    wm75 mannequin commented Feb 8, 2014

    I worked out a slightly speedier version of decimal_to_ratio today (Stefan: that's when I duplicated your bug report):

    from decimal import Context

    def _decimal_to_ratio(x):
        _, digits, exp = x.as_tuple()
        if exp in _ExactRatio.decimal_infinite:  # INF, NAN, sNAN
            assert not x.is_finite()
            raise ValueError
        if exp < 0:
            exp = -exp
            return int(x.scaleb(exp, Context(prec=len(digits)))), 10**exp
        return int(x), 1

    makes the variance functions in my re-implementation about 20-30% faster for Decimal. It's not a big breakthrough, but still.


    wm75 mannequin commented Feb 8, 2014

    oops,

    if exp in _ExactRatio.decimal_infinite: # INF, NAN, sNAN

    should read

    if exp in ('F', 'n', 'N'):  # INF, NAN, sNAN

    of course. What I pasted comes from a micro-optimization I tried, but which doesn't give any benefit.
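
    (With that correction applied, a quick illustrative check of the helper might look like:)

    >>> from decimal import Decimal
    >>> _decimal_to_ratio(Decimal('1.25'))
    (125, 100)
    >>> _decimal_to_ratio(Decimal('42'))
    (42, 1)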

    stevendaprano self-assigned this on Oct 20, 2015
    iritkatriel (Member) commented:

    Reproduced on 3.11:

    >>> statistics.pvariance([0,0,1])
    0.22222222222222224
    
    >>> statistics.variance([0,0,2])
    1.3333333333333335

    iritkatriel added the 3.9, 3.10, and 3.11 (only security fixes) labels on Aug 20, 2021
    rhettinger (Contributor) commented:

    New changeset 4a5cccb by Raymond Hettinger in branch 'main':
    bpo-20499: Rounding error in statistics.pvariance (GH-28230)

    rhettinger (Contributor) commented:

    New changeset 3c30805 by Raymond Hettinger in branch '3.10':
    [3.10] bpo-20499: Rounding error in statistics.pvariance (GH-28230) (GH-28248)

    stevendaprano (Member) commented:

    rhettinger requested a review from stevendaprano yesterday

    This is not a complaint, Raymond, but you're too quick for me!

    Your changes look good to me, thank you.

    stevendaprano (Member) commented:

    Regarding the call to sorted() that you removed:

    I just realised that this was buggy! In my testing, I found that there was a consistent speed-up by summing fractions in order of increasing denominators (small denominators first):

    >>> from fractions import Fraction
    >>> from timeit import Timer
    >>> from random import shuffle
    >>>
    >>> d1 = [Fraction(1, d) for d in range(1, 1000)]
    >>> d2 = d1[:]
    >>> shuffle(d2)
    >>>
    >>> t1 = Timer('sum(d)', setup='from __main__ import d1 as d')
    >>> t2 = Timer('sum(d)', setup='from __main__ import d2 as d')
    >>> 
    >>> min(t1.repeat(number=100, repeat=7)) # small dens first
    1.048042511974927
    >>> min(t1.repeat(number=100, repeat=7))
    1.0449080819962546
    >>>
    >>> min(t2.repeat(number=100, repeat=7)) # random order
    1.2761094039888121
    >>> min(t2.repeat(number=100, repeat=7))
    1.2763739289948717

    Summing larger denominators before smaller denominators is even slower:

    >>> d3 = list(reversed(d1))
    >>> t3 = Timer('sum(d)', setup='from __main__ import d3 as d')
    >>> 
    >>> min(t3.repeat(number=100, repeat=7)) # large dens first
    1.6581684999982826
    >>> min(t3.repeat(number=100, repeat=7))
    1.6580656860023737

    But then I screwed up by sorting the *tuples* (num, den) which would have the effect of summing small *numerators* first, not denominators.

    Thanks for catching that; the way I had written it, the sort would have had at best no real performance benefit.
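
    (For illustration, sorting by denominator rather than by (num, den) tuple, reusing the d2 list from the timings above, would be:)

    d2_sorted = sorted(d2, key=lambda f: f.denominator)   # small denominators first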

    ezio-melotti transferred this issue from another repository on Apr 10, 2022

    jneb commented Feb 19, 2023

    Why don't you use Welford's algorithm?
    This algorithm has good numerical stability, can be computed in one pass, uses no more memory than the "sum of squares" method, and is only a little more work (see Wikipedia).
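
    (A minimal sketch of Welford's method for the population variance; unlike the exact-Fraction approach discussed above, it is numerically stable but not guaranteed to be correctly rounded to the nearest float:)

    def pvariance_welford(data):
        # Single-pass population variance via Welford's method (sketch only).
        n = 0
        mean = 0.0
        m2 = 0.0                    # running sum of squared deviations from the mean
        for x in data:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        if n < 1:
            raise ValueError("pvariance requires at least one data point")
        return m2 / n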
