
Assertion failure when calling statistics.variance() on a float32 Numpy array #83399

Closed
reed mannequin opened this issue Jan 5, 2020 · 12 comments
Labels
3.9 (only security fixes) · 3.10 (only security fixes) · 3.11 (only security fixes) · stdlib (Python modules in the Lib dir) · type-bug (An unexpected behavior, bug, or error)

Comments


reed mannequin commented Jan 5, 2020

BPO 39218
Nosy @rhettinger, @mdickinson, @stevendaprano, @tirkarthi, @iritkatriel
PRs
  • bpo-39218: Improve accuracy of variance calculation #27960

    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-08-31.01:58:53.610>
    created_at = <Date 2020-01-05.05:34:52.233>
    labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
    title = 'Assertion failure when calling statistics.variance() on a float32 Numpy array'
    updated_at = <Date 2021-08-31.01:58:53.609>
    user = 'https://bugs.python.org/reed'

    bugs.python.org fields:

    activity = <Date 2021-08-31.01:58:53.609>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-08-31.01:58:53.610>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2020-01-05.05:34:52.233>
    creator = 'reed'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39218
    keywords = ['patch']
    message_count = 12.0
    messages = ['359323', '359324', '359325', '359329', '359381', '399964', '400006', '400303', '400321', '400323', '400359', '400681']
    nosy_count = 6.0
    nosy_names = ['rhettinger', 'mark.dickinson', 'steven.daprano', 'xtreak', 'reed', 'iritkatriel']
    pr_nums = ['27960']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue39218'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    reed mannequin commented Jan 5, 2020

    If a float32 Numpy array is passed to statistics.variance(), an assertion failure occurs. For example:

        import statistics
        import numpy as np
        x = np.array([1, 2], dtype=np.float32)
        statistics.variance(x)

    The assertion error is:

    assert T == U and count == count2
    

    Even if you convert x to a list with x = list(x), the issue still occurs. The issue is caused by the following lines in statistics.py (Lib/statistics.py, lines 687 to 691 at commit ec007cb):

        T, total, count = _sum((x-c)**2 for x in data)
        # The following sum should mathematically equal zero, but due to rounding
        # error may not.
        U, total2, count2 = _sum((x-c) for x in data)
        assert T == U and count == count2

    When a float32 Numpy value is squared in the term (x-c)**2, it turns into a float64 value, causing the T == U assertion to fail. I think the best way to fix this would be to replace (x-c)**2 with (x-c)*(x-c). This fix would no longer assume the input's ** operator returns the same type.
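
    For reference, a minimal check of the promotion behavior described above. The float64 result reflects NumPy's value-based promotion rules current at the time of this report; NumPy 2.0's revised (NEP 50) rules changed scalar promotion, so newer versions may behave differently:

        import numpy as np

        x = np.float32(1.5)
        print(type(x ** 2))   # <class 'numpy.float64'>: mixed float32/int op promotes (pre-NEP 50)
        print(type(x * x))    # <class 'numpy.float32'>: pure float32 op stays float32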

    @reed reed mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 5, 2020
    @tirkarthi
    Member

    I think it's more of an implementation artifact of NumPy's type-promotion behavior for float32 and float64, and it could possibly break again if (x-c) * (x-c) were also changed to return float64 in the future.

    @stevendaprano
    Member

    Nice analysis and bug report, thank you! That's pretty strange behaviour for float32, but I guess we're stuck with it.

    I wonder if the type assertion has outlived its usefulness? I.e. drop the T == U part and change the assertion to assert count == count2 only.

    If we removed the failing part of the assertion, and changed the final line to return (U, total), that ought to keep the exact sum but convert to float32 later, rather than float64.

    I am inclined to have the stdev of float32 return a float32 if possible. What do you think?

    We should check the numpy docs to see what the conversion rules for numpy floats are.

    @mdickinson
    Member

    [Karthikeyan]

    can possibly break again if (x-c) * (x-c) was also changed to return float64 in future

    I think it's safe to assume that multiplying two NumPy float32's will continue to give a float32 back in the future; NumPy has no reason to give back a different type, and changing this would be a big breaking change.

    The big difference between (x-c)**2 and (x-c)*(x-c) in this respect is that the latter is purely an operation on float32 operands, while the former is a mixed-type operation: NumPy sees a binary operation between a float32 and a Python int, and has to figure out a suitable type for the result. The int to float32 conversion is regarded as introducing unacceptable precision loss, so it chooses float64 as an acceptable output type, converts both input operands to that type, and then does the computation.

    Some relevant parts of the NumPy docs:

    https://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules
    https://docs.scipy.org/doc/numpy/reference/generated/numpy.result_type.html

    Even for pure Python floats, x*x is a simpler, more accurate, and likely faster operation than x**2. On a typical system, the former (eventually) maps to a hardware floating-point multiplication, which is highly likely to be correctly rounded, while the latter, after conversion of the r.h.s. to a float, maps to a libm pow call. That libm pow call *could* conceivably have a fast/accurate path for a right-hand-side of 2.0, but it could equally conceivably not have such a path.
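
    As a rough check of the speed claim (illustrative only; absolute timings vary by platform and CPython version):

        import timeit

        setup = "x = 1.7370462"
        print(timeit.timeit("x * x", setup=setup))   # a single FP multiply
        print(timeit.timeit("x ** 2", setup=setup))  # typically routed through a pow call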

    OTOH, (x-c)*(x-c) repeats the subtraction unnecessarily, but perhaps assignment expressions could rescue us? For example, replace:

    sum((x-c)**2 for x in data)
    

    with:

    sum((y:=(x-c))*y for x in data)
    

    [Steven]

    I am inclined to have the stdev of float32 return a float32 if possible.

    Would that also imply intermediate calculations being performed only with float32, or would intermediate calculations be performed with a more precise type? float32 has small enough precision to run into real accuracy problems with modestly large datasets. For example:

        >>> import numpy as np
        >>> sum(np.ones(10**8, dtype=np.float32), start=np.float32(0))
        16777216.0  # should be 10**8

    or, less dramatically:

        >>> sum(np.full(10**6, 1.73, dtype=np.float32), start=np.float32(0)) / 10**6
        1.74242125  # only two accurate significant figures


    reed mannequin commented Jan 5, 2020

    Thank you all for the comments! Either using (x-c)*(x-c), or removing the assertion and changing the final line to return (U, total), seems reasonable. I slightly prefer the latter case, due to Mark's comments about x*x being faster and simpler than x**2. But I am not an expert on this.

    I am inclined to have the stdev of float32 return a float32 if possible. What do you think?

    Agreed.

    OTOH, (x-c)*(x-c) repeats the subtraction unnecessarily, but perhaps assignment expressions could rescue us?

    Yeah, we should avoid repeating the subtraction. Another method of doing so is to define a square function. For example:

        def square(y):
            return y*y
        sum(square(x-c) for x in data)

    Would that also imply intermediate calculations being performed only with float32, or would intermediate calculations be performed with a more precise type?

    Currently, statistics.py computes sums in infinite precision for any type, via _sum(data, start=0) in Lib/statistics.py. The multiplications (and exponents, if we go that route) would still be float32.
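
    For illustration, a simplified sketch of that exact-summation idea (not the real _sum, which also tracks the result type and handles Decimal inputs):

        from fractions import Fraction

        def exact_sum(data):
            # Each float converts losslessly via as_integer_ratio(), so
            # nothing is rounded until the final conversion back to float.
            total = Fraction(0)
            for x in data:
                total += Fraction(*x.as_integer_ratio())
            return total

        print(sum([0.1] * 10))               # 0.9999999999999999 (naive float sum)
        print(float(exact_sum([0.1] * 10)))  # 1.0 (a single rounding at the end)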

    @iritkatriel
    Member

    I've reproduced this on 3.9 and 3.10. This part of the code is still the same in main, so the issue is probably present there too, even though we don't have numpy to test with.

    @iritkatriel iritkatriel added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes and removed 3.8 only security fixes labels Aug 20, 2021
    @rhettinger
    Contributor

    Removing the assertion and implementing Steven's idea seems like the best way to go:

    sum((y:=(x-c)) * y for x in data)
    

    @rhettinger
    Contributor

    The rounding correction in _ss() looks mathematically incorrect to me. It assumes:

    ∑ (xᵢ - x̅)² = ∑ (xᵢ - x̅ + εᵢ)² − (∑ εᵢ)² ÷ n

    If we drop this logic (which seems completely bogus), all the tests still pass and the code becomes cleaner:

        def _ss(data, c=None):
            if c is None:
                c = mean(data)
            T, total, count = _sum((y := x - c) * y for x in data)
            return (T, total)

    -- Algebraic form of the current code ----------------------

    from sympy import symbols, simplify

    x1, x2, x3, e1, e2, e3 = symbols('x1 x2 x3 e1 e2 e3')
    n = 3

    # high accuracy mean
    c = (x1 + x2 + x3) / n
    
    # sum of squared deviations with subtraction errors
    total = (x1 - c + e1)**2 + (x2 - c + e2)**2 + (x3 - c + e3)**2
    
    # sum of subtraction errors = e1 + e2 + e3
    total2 = (x1 - c + e1) + (x2 - c + e2) + (x3 - c + e3)

    # corrected sum of squared deviations
    total -= total2 ** 2 / n

    # exact sum of squared deviations
    desired = (x1 - c)**2 + (x2 - c)**2 + (x3 - c)**2
    
    # expected versus actual
    print(simplify(desired - total))

    This gives:

    (e1 + e2 + e3)**2/3
    + (-2*x1 + x2 + x3)**2/9
    + (x1 - 2*x2 + x3)**2/9
    + (x1 + x2 - 2*x3)**2/9
    - (3*e1 + 2*x1 - x2 - x3)**2/9
    - (3*e2 - x1 + 2*x2 - x3)**2/9
    - (3*e3 - x1 - x2 + 2*x3)**2/9
    

    -- Substituting in concrete values ----------------------

    x1, x2, x3, e1, e2, e3 = 11, 17, 5, 0.3, 0.1, -0.2

    This gives:

    75.74000000000001  uncorrected total
    75.72666666666667  "corrected" total
    72.0               desired result
    

    @mdickinson
    Member

    The rounding correction in _ss() looks mathematically incorrect to me [...]

    I don't think it was intended as a rounding correction - I think it's just computing the variance (prior to the division by n or n-1) of the (x - c) terms using the standard "expectation of x^2 - (expectation of x)^2" formula:

    sum((x - c)**2 for x in data) - (sum(x - c for x in data)**2) / n

    So I guess it *can* be thought of as a rounding correction, but what it's correcting for is an inaccurate value of "c"; it's not correcting for inaccuracies in the subtraction results. That is, if you were to add an artificial error into c at some point before computing "total" and "total2", that correction term should take you back to something approaching the true sum of squares of deviations.

    So mathematically, I think it's correct, but not useful, because mathematically "total2" will be zero. Numerically, it's probably not helpful.

    @mdickinson
    Member

    what it's correcting for is an inaccurate value of "c" [...]

    In more detail:

    Suppose "m" is the true mean of the x in data, but all we have is an approximate mean "c" to work with. Write "e" for the error in that approximation, so that c = m + e. Then (using Python notation, but treating the expressions as exact mathematical expressions computed in the reals):

    sum((x-c)**2 for x in data)

    == sum((x-m-e)**2 for x in data)

    == sum((x - m)**2 for x in data) - 2 * sum((x - m)*e for x in data)
    + sum(e**2 for x in data)

    == sum((x - m)**2 for x in data) - 2 * e * sum((x - m) for x in data)
    + sum(e**2 for x in data)

    == sum((x - m)**2 for x in data) + sum(e**2 for x in data)
    (because sum((x - m) for x in data) is 0)

    == sum((x - m)**2 for x in data) + n*e**2

    So the error in our result arising from the error in computing m is that n*e**2 term. And that's the term that's being subtracted here, because

    sum(x - c for x in data) ** 2 / n
    == sum(x - m - e for x in data) ** 2 / n
    == (sum(x - m for x in data) - sum(e for x in data))**2 / n
    == (0 - n * e)**2 / n
    == n * e**2
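
    A quick numeric check of that derivation, with made-up values (n = 3, true mean m = 11, injected mean error e = 0.25):

        data = [11.0, 17.0, 5.0]
        n = len(data)
        m = sum(data) / n                          # true mean: 11.0
        e = 0.25                                   # artificial error in the mean
        c = m + e

        total = sum((x - c) ** 2 for x in data)    # inflated by n*e**2 = 0.1875
        total2 = sum(x - c for x in data)          # equals -n*e = -0.75
        print(total - total2 ** 2 / n)             # 72.0: correction cancels n*e**2
        print(sum((x - m) ** 2 for x in data))     # 72.0: the true sum of squares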

    @rhettinger
    Contributor

    what it's correcting for is an inaccurate value of "c" [...]

    I'll leave the logic as-is and just add a note about what is being corrected.

    Numerically, it's probably not helpful.

    To make a difference, the mean would have to have a huge magnitude relative to the variance; otherwise, squaring the tiny error drives the correction term to zero.

    The mean() should already be accurate to within a 1/2 ulp. The summation and division are exact. There is only a single rounding when the result converts from Fraction to a float or decimal.
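
    Illustrating that exactness with a made-up example, where a naive float sum loses a small addend entirely but mean()'s exact accumulation rounds only once at the end:

        import statistics

        data = [1e16, 1.0, -1e16]
        print(sum(data) / len(data))   # 0.0 -- the 1.0 is absorbed by float rounding
        print(statistics.mean(data))   # 0.3333333333333333 -- exact sum, one rounding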

    @rhettinger
    Contributor

    New changeset 793f55b by Raymond Hettinger in branch 'main':
    bpo-39218: Improve accuracy of variance calculation (GH-27960)

    ezio-melotti transferred this issue from another repository on Apr 10, 2022