Issue 45766

This issue tracker **has been migrated to GitHub**,
and is currently **read-only**.

For more information,
see the GitHub FAQs in Python's Developer Guide.

Created on **2021-11-09 15:16** by **rhettinger**, last changed **2022-04-11 14:59** by **admin**. This issue is now **closed**.

Pull Requests

URL | Status | Linked | Edit
---|---|---|---
PR 29490 | merged | rhettinger, 2021-11-09 15:21 |

Messages (7)

msg406026 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-09 15:16

Signature:

    def linear_regression(x, y, /, *, proportional=False)

Additional docstring with example:

    If *proportional* is true, the independent variable *x* and the
    dependent variable *y* are assumed to be directly proportional.
    The data is fit to a line passing through the origin.  Since the
    *intercept* will always be 0.0, the underlying linear function
    simplifies to:

        y = slope * x + noise

    >>> y = [3 * x[i] + noise[i] for i in range(5)]
    >>> linear_regression(x, y, proportional=True)  #doctest: +ELLIPSIS
    LinearRegression(slope=3.0244754248461283, intercept=0.0)

See the Wikipedia entry for regression without an intercept term:
https://en.wikipedia.org/wiki/Simple_linear_regression#Simple_linear_regression_without_the_intercept_term_(single_regressor)

Compare with the *const* parameter in MS Excel's linest() function:
https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d

Compare with the *IncludeConstantBasis* option in Mathematica:
https://reference.wolfram.com/language/ref/IncludeConstantBasis.html

msg406141 | Author: Steven D'Aprano (steven.daprano) * | Date: 2021-11-11 01:53

Hi Raymond,

I'm conflicted by this. Regression through the origin is clearly a thing which is often desired. In that sense, I'm happy to see it added, and thank you. But on the other hand, this may open a can of worms that I personally don't feel entirely competent to deal with. Are you happy to hold off a few days while I consult with some statistics experts?

- There is some uncertainty as to the correct method of calculation, with many stats packages giving different results for the same data, e.g. https://web.ist.utl.pt/~ist11038/compute/errtheory/,regression/regrthroughorigin.pdf

- Forcing the intercept through the origin is a dubious thing to do, even if you think it is theoretically justified; see for example the above paper, also:
  https://dynamicecology.wordpress.com/2017/04/13/dont-force-your-regression-through-zero-just-because-you-know-the-true-intercept-has-to-be-zero/
  https://www.theanalysisfactor.com/regression-through-the-origin/

- Regression through the origin needs a revised calculation for the coefficient of determination (Pearson's R squared):
  https://pubs.cif-ifc.org/doi/pdf/10.5558/tfc71326-3
  https://www.researchgate.net/publication/283333191_Re-interpreting_R-squared_regression_through_the_origin_and_weighted_least_squares
  but it's not clear how to revise the calculation, with some methods giving R squared negative or greater than 1.

- Regression through the origin is only one of a number of variants of least-squares linear regression that we might also wish to offer, e.g. intercept-only, Deming, or orthogonal regression.
  https://en.wikipedia.org/wiki/Deming_regression

msg406146 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-11 06:54

Sure, I'm happy to wait. My thoughts:

* The first link you provided does give the same slope across packages. Where they differ is in how they choose to report statistics for assessing goodness of fit or for informing hypothesis testing. Neither of those apply to us.

* The compared stats packages offer this functionality because some models don't benefit from a non-zero constant.

* The second link is of low quality and reads like a hastily typed, stream-of-consciousness rant that roughly translates to "As a blanket statement applicable to all RTO, I don't believe the underlying process is linear and I don't believe that a person could have a priori knowledge of a directly proportional relationship." This is bunk — a cold caller makes sales in direct proportion to the number of calls they make, and zero calls means zero sales.

* The last point is a distractor. Dealing with error analysis or input error models is beyond the scope of the package. Doing something I could easily do with my HP-12C is within scope.

* We're offering users something simple. If you have a need to fit data to a directly proportional model, set a flag.

* If we don't offer the option, users have to do too much work to bridge from what we have to what they need:

      (covariance(x, y) + mean(x)*mean(y)) / (variance(x) + mean(x)**2)
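The bridge expression at the end of this message is exact when the covariance and variance are the population (divide-by-n) versions, in which case it collapses algebraically to the textbook through-origin slope `sum(x*y) / sum(x*x)`. A sketch under that assumption, with illustrative data:

```python
from math import fsum
from statistics import fmean

def rto_slope(x, y):
    """Least-squares slope of a line forced through the origin."""
    return fsum(xi * yi for xi, yi in zip(x, y)) / fsum(xi * xi for xi in x)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

mx, my = fmean(x), fmean(y)
cov_p = fmean(xi * yi for xi, yi in zip(x, y)) - mx * my   # population covariance
var_p = fmean(xi * xi for xi in x) - mx * mx               # population variance

# The quoted identity: (cov + mx*my) / (var + mx**2) == sum(x*y) / sum(x*x)
bridged = (cov_p + mx * my) / (var_p + mx * mx)
print(abs(bridged - rto_slope(x, y)) < 1e-9)   # True
```

Note that `statistics.covariance` and `statistics.variance` use the sample (n−1) denominators, so the identity holds only up to a small correction unless the population moments are used as above.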

msg406169 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-11 18:25

It usually isn't wise to be preachy in the docs, but we could add a suggestion that proportional=True be used only when (0, 0) is known to be in the dataset and when it is in the same neighborhood as the other data points. A reasonable cross-check would be to verify that a plain OLS regression would produce an intercept near zero.

    linear_regression(hours_since_poll_started, number_of_respondents, proportional=True)

msg406707 | Author: Steven D'Aprano (steven.daprano) * | Date: 2021-11-21 10:29

Hi Raymond,

I'm satisfied that this should be approved. The code looks good to me and in my tests it matches the results from other software.

I don't think there is any need to verify that plain OLS regression produces an intercept close to zero. (What counts as close to zero?) If users want to check that, they can do so themselves.

Regarding my concern with the coefficient of determination, I don't think that's enough of a problem that it should delay adding this functionality. I don't know what, if anything, should be done, but in the meantime we should approve this new feature.

For the record, an example of the problem can be seen on the last slide here:
https://www.azdhs.gov/documents/preparedness/state-laboratory/lab-licensure-certification/technical-resources/calibration-training/09-linear-forced-through-zero-calib.pdf
The computed r**2 of 1.0 is clearly too high for the RTO line.
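A sketch of the R² ambiguity being set aside here, using made-up data whose true intercept is far from zero. The conventional centered definition can go negative for a through-origin fit, while the uncentered variant (SS_tot = Σy²) that some packages report for RTO looks misleadingly high:

```python
from math import fsum

def r2_centered(y, y_hat):
    """Conventional R**2 = 1 - SS_res / SS_tot, with SS_tot about the mean."""
    ybar = fsum(y) / len(y)
    ss_res = fsum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = fsum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def r2_uncentered(y, y_hat):
    """Variant used by some packages for RTO: SS_tot = sum(y**2)."""
    ss_res = fsum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = fsum(yi ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Data with a large true intercept, deliberately a poor fit for RTO:
x = [1.0, 2.0, 3.0, 4.0]
y = [11.0, 12.0, 13.0, 14.0]

# Through-origin slope: sum(x*y) / sum(x*x)
slope = fsum(xi * yi for xi, yi in zip(x, y)) / fsum(xi * xi for xi in x)
y_hat = [slope * xi for xi in x]

print(r2_centered(y, y_hat) < 0)      # True: worse than predicting the mean
print(r2_uncentered(y, y_hat) > 0.8)  # True: looks deceptively good
```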

msg406718 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-21 14:39

New changeset d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f by Raymond Hettinger in branch 'main':
bpo-45766: Add direct proportion option to linear_regression(). (#29490)
https://github.com/python/cpython/commit/d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f

msg406719 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-21 14:40

Thanks for looking at this and giving it some good thought.

History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:59:52 | admin | set | github: 89927
2021-11-21 14:40:28 | rhettinger | set | status: open -> closed; resolution: fixed; messages: + msg406719; stage: patch review -> resolved
2021-11-21 14:39:29 | rhettinger | set | messages: + msg406718
2021-11-21 10:29:56 | steven.daprano | set | messages: + msg406707
2021-11-11 18:25:45 | rhettinger | set | messages: + msg406169
2021-11-11 06:54:54 | rhettinger | set | messages: + msg406146
2021-11-11 01:53:52 | steven.daprano | set | messages: + msg406141
2021-11-09 15:21:36 | rhettinger | set | keywords: + patch; stage: patch review; pull_requests: + pull_request27741
2021-11-09 15:16:23 | rhettinger | create |