Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

statistics.stdev ignore xbar argument #85032

Closed
Twokubikmeter mannequin opened this issue Jun 3, 2020 · 13 comments
Closed

statistics.stdev ignore xbar argument #85032

Twokubikmeter mannequin opened this issue Jun 3, 2020 · 13 comments
Assignees
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@Twokubikmeter
Copy link
Mannequin

Twokubikmeter mannequin commented Jun 3, 2020

BPO 40855
Nosy @rhettinger, @stevendaprano, @miss-islington, @Twokubikmeter
PRs
  • bpo-40855: Fix ignored mu and xbar parameters #20835
  • [3.9] bpo-40855: Fix ignored mu and xbar parameters (GH-20835) #20862
  • [3.8] bpo-40855: Fix ignored mu and xbar parameters (GH-20835) #20863
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/stevendaprano'
    closed_at = <Date 2020-06-13.23:58:57.556>
    created_at = <Date 2020-06-03.12:31:21.567>
    labels = ['3.8', 'type-bug', 'library', '3.9', '3.10']
    title = 'statistics.stdev ignore xbar argument'
    updated_at = <Date 2020-06-17.09:38:49.489>
    user = 'https://github.com/Twokubikmeter'

    bugs.python.org fields:

    activity = <Date 2020-06-17.09:38:49.489>
    actor = 'Folket'
    assignee = 'steven.daprano'
    closed = True
    closed_date = <Date 2020-06-13.23:58:57.556>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2020-06-03.12:31:21.567>
    creator = 'Folket'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40855
    keywords = ['patch']
    message_count = 13.0
    messages = ['370659', '370680', '370684', '370687', '370701', '371262', '371313', '371471', '371475', '371476', '371477', '371723', '371724']
    nosy_count = 4.0
    nosy_names = ['rhettinger', 'steven.daprano', 'miss-islington', 'Folket']
    pr_nums = ['20835', '20862', '20863']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue40855'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @Twokubikmeter
    Copy link
    Mannequin Author

    Twokubikmeter mannequin commented Jun 3, 2020

    statistics.variance also has the same problem.

    >>> import statistics
    >>> statistics.stdev([1,2])
    0.7071067811865476
    >>> statistics.stdev([1,2], 3)
    0.7071067811865476
    >>> statistics.stdev([1,2], 1.5)
    0.7071067811865476

    should be
    0.7071067811865476
    2.23606797749979
    0.5

    @Twokubikmeter Twokubikmeter mannequin added the type-bug An unexpected behavior, bug, or error label Jun 3, 2020
    @rhettinger
    Copy link
    Contributor

    The relevant code is in the _ss() helper function:

    # The following sum should mathematically equal zero, but due to rounding
    # error may not.
    U, total2, count2 = _sum((x-c) for x in data)
    assert T == U and count == count2
    total -=  total2**2/len(data)
    

    The intent was to correct for small rounding errors, but the effect is to undo any xbar value that differs from the true mean.

    From a user point-of-view the xbar parameter should have two effects, saving the computation time for the mean and also giving the ability to recenter the stdev/variance around a different point. It does save a call to mean; however, that effort is mostly throw-away by the rounding adjustment code which does even more work than computing the mean.

    Likely, the fix for this is skip the rounding adjustment code if the user supplies an xbar value.

    @rhettinger
    Copy link
    Contributor

    Perhaps this would work:

    diff --git a/Lib/statistics.py b/Lib/statistics.py
    index c76a6ca519..93a4633464 100644
    --- a/Lib/statistics.py
    +++ b/Lib/statistics.py
    @@ -682,8 +682,10 @@ def _ss(data, c=None):
         calculated from ``c`` as given. Use the second case with care, as it can
         lead to garbage results.
         """
    -    if c is None:
    -        c = mean(data)
    +    if c is not None:
    +        T, total, count = _sum((x-c)**2 for x in data)
    +        return (T, total)
    +    c = mean(data)
         T, total, count = _sum((x-c)**2 for x in data)
         # The following sum should mathematically equal zero, but due to rounding
         # error may not.

    Matti, where do you get 0.5 as the expected outcome for the third example? The actual mean is 1.5, so I would expect the third case to give sqrt(2)/2 or 0.707.

    @rhettinger rhettinger added stdlib Python modules in the Lib dir 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels Jun 3, 2020
    @stevendaprano
    Copy link
    Member

    Thanks Raymond, that is the intended effect, and your analysis seems
    plausible.

    @Twokubikmeter
    Copy link
    Mannequin Author

    Twokubikmeter mannequin commented Jun 4, 2020

    If we estimate the mean using a sample we loose one degree of freedom so it will be divided by N-1, while if we have the mean independent of the sample it should be divided by N to be unbiased.

    i.e.
    example 1
    sqrt(((1-1.5)²+(2-1.5)²)/(2-1)) = 0.7...
    example 3
    sqrt(((1-1.5)²+(2-1.5)²)/(2)) = 0.5

    @Twokubikmeter
    Copy link
    Mannequin Author

    Twokubikmeter mannequin commented Jun 11, 2020

    Hi Raymond and Steven!

    I'm happy that you are solving this issue but do you have any comment on my previous answer?

    @rhettinger
    Copy link
    Contributor

    do you have any comment on my previous answer?

    I see what you're trying to do but think that interpretation is surprising
    and is at odds with the existing and intended uses of the *xbar* argument.

    The goals were to allow the mean to be precomputed (common case) or to be recentered (uncommon). Neither case should have the effect of changing the divisor.

    We can't break existing code that assumes that stdev(data) is equal to stdev(data, xbar=mean(data)).

    >>> data = [1, 2]
    >>> stdev(data)
    0.7071067811865476
    >>> stdev(data, xbar=mean(data))
    0.7071067811865476

    @rhettinger
    Copy link
    Contributor

    New changeset d71ab4f by Raymond Hettinger in branch 'master':
    bpo-40855: Fix ignored mu and xbar parameters (GH-20835)
    d71ab4f

    @rhettinger
    Copy link
    Contributor

    New changeset 55c1d21 by Miss Islington (bot) in branch '3.9':
    bpo-40855: Fix ignored mu and xbar parameters (GH-20835) (#GH-20862)
    55c1d21

    @rhettinger
    Copy link
    Contributor

    New changeset 811e040 by Miss Islington (bot) in branch '3.8':
    bpo-40855: Fix ignored mu and xbar parameters (GH-20835) (GH-20863)
    811e040

    @rhettinger
    Copy link
    Contributor

    Thanks for the bug report 😊

    @Twokubikmeter
    Copy link
    Mannequin Author

    Twokubikmeter mannequin commented Jun 17, 2020

    I see what you're trying to do but think that interpretation is surprising
    and is at odds with the existing and intended uses of the *xbar* argument.

    The goals were to allow the mean to be precomputed (common case) or to be recentered (uncommon). Neither case should have the effect of changing the divisor.

    We can't break existing code that assumes that stdev(data) is equal to stdev(data, xbar=mean(data)).

    Maybe the requirement are buged? It seems to me that recalculating the mean is a very niche use case. You will very little time on a call you do once.

    But what good is it to supply a re-centered mean if you get a wrong estimation of the standard deviation? If the mean is not the mean of the sample it was not calculated using the sample so there is no loos of degrees of freedom.

    @Twokubikmeter
    Copy link
    Mannequin Author

    Twokubikmeter mannequin commented Jun 17, 2020

    I meant to write "pre-calculate".

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes 3.9 only security fixes 3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants