Issue 45766

This issue tracker **has been migrated to GitHub**,
and is currently **read-only**.

For more information,
see the GitHub FAQs in Python's Developer Guide.

Created on **2021-11-09 15:16** by **rhettinger**, last changed **2022-04-11 14:59** by **admin**. This issue is now **closed**.

Pull Requests

URL | Status | Linked | Edit
---|---|---|---
PR 29490 | merged | rhettinger, 2021-11-09 15:21 |

Messages (7)

msg406026 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-09 15:16

Signature:

    def linear_regression(x, y, /, *, proportional=False)

Additional docstring with example:

    If *proportional* is true, the independent variable *x* and the
    dependent variable *y* are assumed to be directly proportional.
    The data is fit to a line passing through the origin.  Since the
    *intercept* will always be 0.0, the underlying linear function
    simplifies to:

        y = slope * x + noise

    >>> y = [3 * x[i] + noise[i] for i in range(5)]
    >>> linear_regression(x, y, proportional=True)  #doctest: +ELLIPSIS
    LinearRegression(slope=3.0244754248461283, intercept=0.0)

See the Wikipedia entry for regression without an intercept term:
https://en.wikipedia.org/wiki/Simple_linear_regression#Simple_linear_regression_without_the_intercept_term_(single_regressor)

Compare with the *const* parameter in MS Excel's linest() function:
https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d

Compare with the *IncludeConstantBasis* option in Mathematica:
https://reference.wolfram.com/language/ref/IncludeConstantBasis.html

msg406141 | Author: Steven D'Aprano (steven.daprano) * | Date: 2021-11-11 01:53

Hi Raymond,

I'm conflicted by this. Regression through the origin is clearly a thing which is often desired. In that sense, I'm happy to see it added, and thank you. But on the other hand, this may open a can of worms that I personally don't feel entirely competent to deal with. Are you happy to hold off a few days while I consult with some statistics experts?

- There is some uncertainty as to the correct method of calculation, with many stats packages giving different results for the same data, e.g. https://web.ist.utl.pt/~ist11038/compute/errtheory/,regression/regrthroughorigin.pdf

- Forcing the intercept through the origin is a dubious thing to do, even if you think it is theoretically justified; see for example the above paper, also:
  https://dynamicecology.wordpress.com/2017/04/13/dont-force-your-regression-through-zero-just-because-you-know-the-true-intercept-has-to-be-zero/
  https://www.theanalysisfactor.com/regression-through-the-origin/

- Regression through the origin needs a revised calculation for the coefficient of determination (Pearson's R squared):
  https://pubs.cif-ifc.org/doi/pdf/10.5558/tfc71326-3
  https://www.researchgate.net/publication/283333191_Re-interpreting_R-squared_regression_through_the_origin_and_weighted_least_squares
  but it's not clear how to revise the calculation, with some methods giving R squared negative or greater than 1.

- Regression through the origin is only one of a number of variants of least-squares linear regression that we might also wish to offer, e.g. intercept-only, Deming, or orthogonal regression.
  https://en.wikipedia.org/wiki/Deming_regression

msg406146 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-11 06:54

Sure, I'm happy to wait. My thoughts:

* The first link you provided does give the same slope across packages. Where they differ is in how they choose to report statistics for assessing goodness of fit or for informing hypothesis testing. Neither of those apply to us.

* The compared stats packages offer this functionality because some models don't benefit from a non-zero constant.

* The second link is of low quality and reads like a hastily typed, stream-of-consciousness rant that roughly translates to "As a blanket statement applicable to all RTO, I don't believe the underlying process is linear and I don't believe that a person could have a priori knowledge of a directly proportional relationship." This is bunk — a cold caller makes sales in direct proportion to the number of calls they make, and zero calls means zero sales.

* The last point is a distractor. Dealing with error analysis or input error models is beyond the scope of the package. Doing something I could easily do with my HP-12C is within scope.

* We're offering users something simple. If you have a need to fit data to a directly proportional model, set a flag.

* If we don't offer the option, users have to do too much work to bridge from what we have to what they need:

      (covariance(x, y) + mean(x)*mean(y)) / (variance(x) + mean(x)**2)
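The bridge expression at the end of this message is exact when the covariance and variance are the population (divide-by-n) versions, in which case it collapses algebraically to the textbook through-origin slope `sum(x*y) / sum(x*x)`. A sketch under that assumption, with illustrative data:

```python
from math import fsum
from statistics import fmean

def rto_slope(x, y):
    """Least-squares slope of a line forced through the origin."""
    return fsum(xi * yi for xi, yi in zip(x, y)) / fsum(xi * xi for xi in x)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

mx, my = fmean(x), fmean(y)
cov_p = fmean(xi * yi for xi, yi in zip(x, y)) - mx * my   # population covariance
var_p = fmean(xi * xi for xi in x) - mx * mx               # population variance

# The quoted identity: (cov + mx*my) / (var + mx**2) == sum(x*y) / sum(x*x)
bridged = (cov_p + mx * my) / (var_p + mx * mx)
print(abs(bridged - rto_slope(x, y)) < 1e-9)   # True
```

Note that `statistics.covariance` and `statistics.variance` use the sample (n−1) denominators, so the identity holds only up to a small correction unless the population moments are used as above.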

msg406169 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-11 18:25

It usually isn't wise to be preachy in the docs, but we could add a suggestion that proportional=True be used only when (0, 0) is known to be in the dataset and when it is in the same neighborhood as the other data points. A reasonable cross-check would be to verify that a plain OLS regression would produce an intercept near zero.

    linear_regression(hours_since_poll_started, number_of_respondents, proportional=True)

msg406707 | Author: Steven D'Aprano (steven.daprano) * | Date: 2021-11-21 10:29

Hi Raymond,

I'm satisfied that this should be approved. The code looks good to me and in my tests it matches the results from other software.

I don't think there is any need to verify that plain OLS regression produces an intercept close to zero. (What counts as close to zero?) If users want to check that, they can do so themselves.

Regarding my concern with the coefficient of determination, I don't think that's enough of a problem that it should delay adding this functionality. I don't know what, if anything, should be done, but in the meantime we should approve this new feature.

For the record, an example of the problem can be seen on the last slide here:
https://www.azdhs.gov/documents/preparedness/state-laboratory/lab-licensure-certification/technical-resources/calibration-training/09-linear-forced-through-zero-calib.pdf
The computed r**2 of 1.0 is clearly too high for the RTO line.
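A sketch of the R² ambiguity being set aside here, using made-up data whose true intercept is far from zero. The conventional centered definition can go negative for a through-origin fit, while the uncentered variant (SS_tot = Σy²) that some packages report for RTO looks misleadingly high:

```python
from math import fsum

def r2_centered(y, y_hat):
    """Conventional R**2 = 1 - SS_res / SS_tot, with SS_tot about the mean."""
    ybar = fsum(y) / len(y)
    ss_res = fsum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = fsum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def r2_uncentered(y, y_hat):
    """Variant used by some packages for RTO: SS_tot = sum(y**2)."""
    ss_res = fsum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = fsum(yi ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Data with a large true intercept, deliberately a poor fit for RTO:
x = [1.0, 2.0, 3.0, 4.0]
y = [11.0, 12.0, 13.0, 14.0]

# Through-origin slope: sum(x*y) / sum(x*x)
slope = fsum(xi * yi for xi, yi in zip(x, y)) / fsum(xi * xi for xi in x)
y_hat = [slope * xi for xi in x]

print(r2_centered(y, y_hat) < 0)      # True: worse than predicting the mean
print(r2_uncentered(y, y_hat) > 0.8)  # True: looks deceptively good
```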

msg406718 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-21 14:39

New changeset d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f by Raymond Hettinger in branch 'main':
bpo-45766: Add direct proportion option to linear_regression(). (#29490)
https://github.com/python/cpython/commit/d2b55b07d2b503dcd3b5c0e2753efa835cff8e8f

msg406719 | Author: Raymond Hettinger (rhettinger) * | Date: 2021-11-21 14:40

Thanks for looking at this and giving it some good thought.

History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:59:52 | admin | set | github: 89927
2021-11-21 14:40:28 | rhettinger | set | status: open -> closed; resolution: fixed; messages: + msg406719; stage: patch review -> resolved
2021-11-21 14:39:29 | rhettinger | set | messages: + msg406718
2021-11-21 10:29:56 | steven.daprano | set | messages: + msg406707
2021-11-11 18:25:45 | rhettinger | set | messages: + msg406169
2021-11-11 06:54:54 | rhettinger | set | messages: + msg406146
2021-11-11 01:53:52 | steven.daprano | set | messages: + msg406141
2021-11-09 15:21:36 | rhettinger | set | keywords: + patch; stage: patch review; pull_requests: + pull_request27741
2021-11-09 15:16:23 | rhettinger | create |