Add direct proportion option to statistics.linear_regression() #89927

Closed
rhettinger opened this issue Nov 9, 2021 · 7 comments
Labels
3.11 (only security fixes), stdlib (Python modules in the Lib dir)

Comments

@rhettinger
Contributor

BPO 45766
Nosy @rhettinger, @stevendaprano
PRs
  • bpo-45766: Add direct proportion option to linear_regression(). #29490
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-11-21.14:40:28.189>
    created_at = <Date 2021-11-09.15:16:23.189>
    labels = ['library', '3.11']
    title = 'Add direct proportion option to statistics.linear_regression()'
    updated_at = <Date 2021-11-21.14:40:28.188>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2021-11-21.14:40:28.188>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-11-21.14:40:28.189>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2021-11-09.15:16:23.189>
    creator = 'rhettinger'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 45766
    keywords = ['patch']
    message_count = 7.0
    messages = ['406026', '406141', '406146', '406169', '406707', '406718', '406719']
    nosy_count = 2.0
    nosy_names = ['rhettinger', 'steven.daprano']
    pr_nums = ['29490']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue45766'
    versions = ['Python 3.11']

    @rhettinger
    Contributor Author

    Signature:

        def linear_regression(x, y, /, *, proportional=False):

    Additional docstring with example:

    If *proportional* is true, the independent variable *x* and the
    dependent variable *y* are assumed to be directly proportional.
    The data is fit to a line passing through the origin.
    
    Since the *intercept* will always be 0.0, the underlying linear
    function simplifies to:
    
            y = slope * x + noise

        >>> y = [3 * x[i] + noise[i] for i in range(5)]
        >>> linear_regression(x, y, proportional=True)  #doctest: +ELLIPSIS
        LinearRegression(slope=3.0244754248461283, intercept=0.0)

    See Wikipedia entry for regression without an intercept term:
    https://en.wikipedia.org/wiki/Simple_linear_regression#Simple_linear_regression_without_the_intercept_term_(single_regressor)

    Compare with the *const* parameter in MS Excel's linest() function:
    https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d

    Compare with the *IncludeConstantBasis* option in Mathematica:
    https://reference.wolfram.com/language/ref/IncludeConstantBasis.html
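
    A minimal sketch of what the flag computes, assuming the usual least-squares result for a line through the origin (slope = sum(x*y) / sum(x*x)). The helper name `proportional_slope` is hypothetical, and the definitions of `x` and `noise` are assumptions, since the docstring excerpt above does not show them. The comparison against the standard library runs only on Python 3.11 or later, where the *proportional* keyword exists.

```python
import sys
from math import isclose
from statistics import NormalDist


def proportional_slope(x, y):
    """Least-squares slope of a line forced through the origin (hypothetical helper)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx


# Assumed inputs: the excerpt above does not define x or noise.
x = [1, 2, 3, 4, 5]
noise = NormalDist().samples(5, seed=42)
y = [3 * x[i] + noise[i] for i in range(5)]

slope = proportional_slope(x, y)
print("slope through the origin:", slope)

# Cross-check against the standard library where the keyword is available.
if sys.version_info >= (3, 11):
    from statistics import linear_regression

    fit = linear_regression(x, y, proportional=True)
    print("stdlib agrees:", isclose(fit.slope, slope, rel_tol=1e-9))
    print("stdlib intercept:", fit.intercept)
```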

    @rhettinger added the 3.11 (only security fixes) and stdlib (Python modules in the Lib dir) labels on Nov 9, 2021

    @stevendaprano
    Member

    Hi Raymond,

    I'm conflicted about this. Regression through the origin is clearly a thing which is often desired. In that sense, I'm happy to see it added, and thank you.

    But on the other hand, this may open a can of worms that I personally don't feel entirely competent to deal with. Are you happy to hold off a few days while I consult with some statistics experts?

    • There is some uncertainty as to the correct method of calculation, with many stats packages giving different results for the same data, e.g.

    https://web.ist.utl.pt/~ist11038/compute/errtheory/,regression/regrthroughorigin.pdf

    • Forcing the intercept through the origin is a dubious thing to do, even if you think it is theoretically justified, see for example the above paper, also:

    https://dynamicecology.wordpress.com/2017/04/13/dont-force-your-regression-through-zero-just-because-you-know-the-true-intercept-has-to-be-zero/

    https://www.theanalysisfactor.com/regression-through-the-origin/

    • Regression through the origin needs a revised calculation for the coefficient of determination (Pearson's R squared):

    https://pubs.cif-ifc.org/doi/pdf/10.5558/tfc71326-3

    https://www.researchgate.net/publication/283333191_Re-interpreting_R-squared_regression_through_the_origin_and_weighted_least_squares

    but it's not clear how to revise the calculation, with some methods giving R squared negative or greater than 1.

    • Regression through the origin is only one of a number of variants of least-squares linear regression that we might also wish to offer, e.g. intercept-only, Deming or orthogonal regression.

    https://en.wikipedia.org/wiki/Deming_regression

    @rhettinger
    Contributor Author

    Sure, I’m happy to wait.

    My thoughts:

    • The first link you provided does give the same slope across packages. Where they differ is in how they choose to report statistics for assessing goodness of fit or for informing hypothesis testing. Neither of those applies to us.

    • The compared stats packages offer this functionality because some models don’t benefit from a non-zero constant.

    • The second link is of low quality and reads like a hastily typed, stream-of-consciousness rant that roughly translates to “As a blanket statement applicable to all RTO, I don’t believe the underlying process is linear and I don’t believe that a person could have a priori knowledge of a directly proportional relationship.” This is bunk: a cold caller makes sales in direct proportion to the number of calls they make, and zero calls means zero sales.

    • The last point is a distractor. Dealing with error analysis or input error models is beyond the scope of the package. Doing something I could easily do with my HP-12C is within scope.

    • We’re offering users something simple. If you need to fit data to a directly proportional model, set a flag.

    • If we don’t offer the option, users have to do too much work to bridge from what we have to what they need:

      (covariance(x, y) + mean(x)*mean(y)) / (variance(x) + mean(x)**2)

    @rhettinger
    Contributor Author

    It usually isn't wise to be preachy in the docs, but we could add a suggestion that proportional=True be used only when (0, 0) is known to be in the dataset and when it is in the same neighborhood as the other data points. A reasonable cross-check would be to verify that a plain OLS regression would produce an intercept near zero.

        linear_regression(hours_since_poll_started, number_of_respondents, proportional=True)

    @stevendaprano
    Member

    Hi Raymond,

    I'm satisfied that this should be approved. The code looks good to me
    and in my tests it matches the results from other software.

    I don't think there is any need to verify that plain OLS regression
    produces an intercept close to zero. (What counts as close to zero?) If
    users want to check that, they can do so themselves.

    Regarding my concern with the coefficient of determination, I don't
    think that's enough of a problem that it should delay adding this
    functionality. I don't know what, if anything, should be done, but in
    the meantime we should approve this new feature.

    For the record, an example of the problem can be seen on the last slide
    here:

    https://www.azdhs.gov/documents/preparedness/state-laboratory/lab-licensure-certification/technical-resources/calibration-training/09-linear-forced-through-zero-calib.pdf

    The computed r**2 of 1.0 is clearly too high for the RTO line.
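
    The R-squared ambiguity can be reproduced with a small sketch. It assumes two common definitions: a "centered" one that measures residuals against the spread around mean(y), and an "uncentered" one, reported by some packages for no-intercept fits, that measures against the spread around zero. The data is hypothetical and deliberately has a true intercept far from zero.

```python
from statistics import fmean

# Hypothetical data lying exactly on y = x + 9, so the true intercept is not zero.
x = [1, 2, 3, 4, 5]
y = [10, 11, 12, 13, 14]

# Least-squares slope for a line forced through the origin.
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Sum of squared residuals for the through-origin line.
sse = sum((b - slope * a) ** 2 for a, b in zip(x, y))

# Two candidate baselines for the total sum of squares.
mean_y = fmean(y)
sst_centered = sum((b - mean_y) ** 2 for b in y)
sst_uncentered = sum(b ** 2 for b in y)

r2_centered = 1 - sse / sst_centered
r2_uncentered = 1 - sse / sst_uncentered

print("slope:", slope)
print("centered R squared:", r2_centered)      # negative for this data
print("uncentered R squared:", r2_uncentered)  # between 0 and 1 for this data
```

    The same fit looks terrible or respectable depending on the baseline, which is why returning only the slope and intercept sidesteps the question.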

    @rhettinger
    Contributor Author

    New changeset d2b55b0 by Raymond Hettinger in branch 'main':
    bpo-45766: Add direct proportion option to linear_regression(). (GH-29490)
    d2b55b0

    @rhettinger
    Contributor Author

    Thanks for looking at this and giving it some good thought.

    @ezio-melotti transferred this issue from another repository on Apr 10, 2022