classification
Title: Add direct proportion option to statistics.linear_regression()
Stage: resolved
Components: Library (Lib)
Versions: Python 3.11
process
Status: closed
Resolution: fixed
Nosy List: rhettinger, steven.daprano
Priority: normal
Keywords: patch

Created on 2021-11-09 15:16 by rhettinger, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
PR 29490 merged rhettinger, 2021-11-09 15:21
Messages (7)
msg406026 - Author: Raymond Hettinger (rhettinger) * Date: 2021-11-09 15:16
```Signature:

def linear_regression(x, y, /, *, proportional=False):

If *proportional* is true, the independent variable *x* and the
dependent variable *y* are assumed to be directly proportional.
The data is fit to a line passing through the origin.

Since the *intercept* will always be 0.0, the underlying linear
function simplifies to:

y = slope * x + noise

>>> y = [3 * x[i] + noise[i] for i in range(5)]
>>> linear_regression(x, y, proportional=True)  #doctest: +ELLIPSIS
LinearRegression(slope=3.0244754248461283, intercept=0.0)

See Wikipedia entry for regression without an intercept term:
https://en.wikipedia.org/wiki/Simple_linear_regression#Simple_linear_regression_without_the_intercept_term_(single_regressor)

Compare with the *const* parameter in MS Excel's linest() function:
https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d

Compare with the *IncludeConstantBasis* option in Mathematica:
https://reference.wolfram.com/language/ref/IncludeConstantBasis.html```
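For readers who want to see what the proposed flag computes, here is a minimal sketch of the no-intercept least-squares slope described in the Wikipedia entry linked above, sum(x*y) / sum(x*x). The helper name `slope_through_origin` and the sample data are made up for illustration and are not taken from the patch; on Python 3.11+ the result should agree with `linear_regression(x, y, proportional=True).slope`.

```python
from math import fsum

def slope_through_origin(x, y):
    """Least-squares slope for the model y = slope * x (intercept fixed at 0)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    sxx = fsum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least one non-zero value")
    sxy = fsum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx

# Illustrative data only: y is roughly 3 * x plus small hand-picked noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.1, -0.2, 0.05, 0.0, -0.1]
y = [3 * xi + ni for xi, ni in zip(x, noise)]

slope = slope_through_origin(x, y)
print(slope)  # close to 3; the intercept is 0.0 by construction
```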
msg406141 - Author: Steven D'Aprano (steven.daprano) * Date: 2021-11-11 01:53
```Hi Raymond,

I'm conflicted by this. Regression through the origin is clearly a thing which is often desired. In that sense, I'm happy to see it added, and thank you.

But on the other hand, this may open a can of worms that I personally don't feel entirely competent to deal with. Are you happy to hold off a few days while I consult with some statistics experts?

- There is some uncertainty as to the correct method of calculation, with many stats packages giving different results for the same data, e.g.

https://web.ist.utl.pt/~ist11038/compute/errtheory/,regression/regrthroughorigin.pdf

- Forcing the intercept through the origin is a dubious thing to do, even if you think it is theoretically justified, see for example the above paper, also:

https://dynamicecology.wordpress.com/2017/04/13/dont-force-your-regression-through-zero-just-because-you-know-the-true-intercept-has-to-be-zero/

https://www.theanalysisfactor.com/regression-through-the-origin/

- Regression through the origin needs a revised calculation for the coefficient of determination (Pearson's R squared):

https://pubs.cif-ifc.org/doi/pdf/10.5558/tfc71326-3

https://www.researchgate.net/publication/283333191_Re-interpreting_R-squared_regression_through_the_origin_and_weighted_least_squares

but it's not clear how to revise the calculation, with some methods giving R squared negative or greater than 1.

- Regression through the origin is only one of a number of variants of least-squares linear regression that we might also wish to offer, e.g. intercept-only, Deming or orthogonal regression.

https://en.wikipedia.org/wiki/Deming_regression```
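The R squared concern can be made concrete with a small sketch. It assumes two common conventions for the total sum of squares: centered on the mean of y (the usual OLS definition) and uncentered (measured from zero, often used for regression through the origin). The function name and the data are hypothetical, chosen so that the true intercept is far from zero.

```python
from math import fsum

def rto_r_squared(x, y):
    """Return (centered, uncentered) R squared for a through-origin fit."""
    slope = fsum(xi * yi for xi, yi in zip(x, y)) / fsum(xi * xi for xi in x)
    ss_res = fsum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    y_bar = fsum(y) / len(y)
    ss_tot_centered = fsum((yi - y_bar) ** 2 for yi in y)   # about the mean
    ss_tot_uncentered = fsum(yi * yi for yi in y)           # about zero
    return 1 - ss_res / ss_tot_centered, 1 - ss_res / ss_tot_uncentered

# Made-up data that clearly does not pass near the origin.
x = [1.0, 2.0, 3.0]
y = [10.0, 10.5, 11.0]

centered, uncentered = rto_r_squared(x, y)
# The centered value is negative here; the uncentered value lies in (0, 1).
print(centered, uncentered)
```

The two numbers answer different questions, which is one reason packages can report different goodness-of-fit figures for the same slope.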
msg406146 - Author: Raymond Hettinger (rhettinger) * Date: 2021-11-11 06:54
```Sure, I’m happy to wait.

My thoughts:

* The first link you provided does give the same slope across packages.  Where they differ is in how they choose to report statistics for assessing goodness of fit or for informing hypothesis testing. Neither of those applies to us.

* The compared stats packages offer this functionality because some models don’t benefit from a non-zero constant.

* The second link is of low quality and reads like a hastily typed, stream-of-consciousness rant that roughly translates to “As a blanket statement applicable to all RTO, I don’t believe the underlying process is linear and I don’t believe that a person could have a priori knowledge of a directly proportional relationship.”  This is bunk: a cold caller makes sales in direct proportion to the number of calls they make, and zero calls means zero sales.

* The last point is a distractor.  Dealing with error analysis or input error models is beyond the scope of the package. Doing something I could easily do with my HP-12C is within scope.

* We’re offering users something simple. If you have a need to fit data to a directly proportional model, set a flag.

* If we don’t offer the option, users have to do too much work to bridge from what we have to what they need:

(covariance(x, y) + mean(x)*mean(y)) / (variance(x) + mean(x)**2)```
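The workaround expression above can be checked numerically. One caveat: `statistics.covariance()` and `statistics.variance()` are sample (n - 1) estimators, so the expression as written appears to match the through-origin slope only approximately; with population moments the identity is exact. The sketch below uses population moments. The helper names and the data are hypothetical.

```python
from math import fsum
from statistics import mean, pvariance

def population_covariance(x, y):
    """Covariance with an n (not n - 1) denominator."""
    mx, my = mean(x), mean(y)
    return fsum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

def slope_from_moments(x, y):
    """Through-origin slope written as E[xy] / E[x^2] using population moments."""
    mx, my = mean(x), mean(y)
    return (population_covariance(x, y) + mx * my) / (pvariance(x) + mx ** 2)

def slope_direct(x, y):
    """Through-origin slope as sum(x*y) / sum(x*x)."""
    return fsum(xi * yi for xi, yi in zip(x, y)) / fsum(xi * xi for xi in x)

# Made-up data for the comparison.
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 4.0]

print(slope_from_moments(x, y), slope_direct(x, y))  # both equal 17/14 up to rounding
```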
msg406169 - Author: Raymond Hettinger (rhettinger) * Date: 2021-11-11 18:25
```It usually isn't wise to be preachy in the docs, but we could add a suggestion that proportional=True be used only when (0, 0) is known to be in the dataset and when it is in the same neighborhood as the other data points.  A reasonable cross-check would be to verify that a plain OLS regression would produce an intercept near zero.

linear_regression(hours_since_poll_started, number_of_respondents, proportional=True)```
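A minimal sketch of the suggested cross-check follows. The thread does not define what "near zero" means, so the tolerance here (a fraction of the range of y) is purely an assumption, and the helper names and poll data are hypothetical. The OLS fit is written out by hand so the snippet runs on older Pythons; on 3.10+ `statistics.linear_regression(x, y)` should give the same slope and intercept.

```python
from math import fsum

def ols_fit(x, y):
    """Ordinary least squares with an intercept; returns (slope, intercept)."""
    n = len(x)
    mx, my = fsum(x) / n, fsum(y) / n
    sxx = fsum((xi - mx) ** 2 for xi in x)
    sxy = fsum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def intercept_looks_negligible(x, y, rel_tol=0.05):
    """Assumed rule of thumb: |intercept| is small relative to the range of y."""
    _, intercept = ols_fit(x, y)
    return abs(intercept) <= rel_tol * (max(y) - min(y))

# Made-up poll data: hours since the poll started, cumulative respondents.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
respondents = [11.0, 19.0, 32.0, 39.0, 51.0]

slope, intercept = ols_fit(hours, respondents)
ok = intercept_looks_negligible(hours, respondents)
print(slope, intercept, ok)
```

Any such threshold is a judgment call, so a check like this is better treated as a sanity test than as a rule.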
msg406707 - Author: Steven D'Aprano (steven.daprano) * Date: 2021-11-21 10:29
```Hi Raymond,

I'm satisfied that this should be approved. The code looks good to me
and in my tests it matches the results from other software.

I don't think there is any need to verify that plain OLS regression
produces an intercept close to zero. (What counts as close to zero?) If
users want to check that, they can do so themselves.

Regarding my concern with the coefficient of determination, I don't
think that's enough of a problem that it should delay adding this
functionality. I don't know what, if anything, should be done, but in
the meantime we should approve this new feature.

For the record, an example of the problem can be seen on the last slide
here:

https://www.azdhs.gov/documents/preparedness/state-laboratory/lab-licensure-certification/technical-resources/calibration-training/09-linear-forced-through-zero-calib.pdf

The computed r**2 of 1.0 is clearly too high for the RTO line.```
msg406718 - Author: Raymond Hettinger (rhettinger) * Date: 2021-11-21 14:39
```New changeset d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f by Raymond Hettinger in branch 'main':
bpo-45766: Add direct proportion option to linear_regression(). (#29490)
https://github.com/python/cpython/commit/d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f
```
msg406719 - Author: Raymond Hettinger (rhettinger) * Date: 2021-11-21 14:40
`Thanks for looking at this and giving it some good thought.`
History
Date User Action Args
2021-11-21 14:40:28 rhettinger set status: open -> closed; resolution: fixed; messages: + msg406719; stage: patch review -> resolved
2021-11-21 14:39:29 rhettinger set messages: + msg406718
2021-11-21 10:29:56 steven.daprano set messages: + msg406707
2021-11-11 18:25:45 rhettinger set messages: + msg406169
2021-11-11 06:54:54 rhettinger set messages: + msg406146
2021-11-11 01:53:52 steven.daprano set messages: + msg406141
2021-11-09 15:21:36 rhettinger set keywords: + patch; stage: patch review; pull_requests: + pull_request27741
2021-11-09 15:16:23 rhettinger create