msg393754 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-16 19:48 |
The current signature is:
linear_regression(regressor, dependent_variable)
While the term "regressor" is used in some problem domains, it isn't well known outside of those domains. The term "independent_variable" would be better because it is common to all domains and because it is the natural counterpart to "dependent_variable".
Another issue is that the return value is a named tuple in the form:
LinearRegression(intercept, slope)
While that order is seen in multiple linear regression, most people first learn it in algebra as the slope/intercept form: y = mx + b. That will be the natural order for a majority of users, especially given that we aren't supporting multiple linear regression.
The named tuple is called LinearRegression, which describes how the result was obtained rather than the result itself. The output of any process that fits data to a line is a line, so the named tuple should be called Line because that is what it describes. Also, a Line class would be reusable for purposes other than linear regression.
Proposed signature:
linear_regression(independent_variable, dependent_variable) -> Line(slope, intercept)
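A minimal sketch of the proposed shape (the `Line` namedtuple and the toy least-squares fit below are illustrative only, not the stdlib implementation):

```python
from collections import namedtuple

# Hypothetical names, sketching the proposed signature.
Line = namedtuple("Line", ("slope", "intercept"))

def linear_regression(independent_variable, dependent_variable):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x, y = independent_variable, dependent_variable
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return Line(slope, intercept)

# Points lying exactly on y = 2x + 1:
print(linear_regression([1, 2, 3, 4], [3, 5, 7, 9]))  # Line(slope=2.0, intercept=1.0)
```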
|
msg393758 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-16 20:49 |
Related links:
* https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d
* https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/introduction-to-trend-lines/a/linear-regression-review
* TI-84: LinReg(ax + b) See attached screen shot
|
msg393771 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2021-05-16 23:49 |
I agree with you that "regressor" is too obscure and should be changed.
I disagree about the "y = mx + c". Haven't we already discussed this? That form is used in linear algebra, but not used in statistics. Quoting from Yale:
"A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0)."
http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
This function is being used for statistics, not linear algebra. The users of the module are not people doing linear algebra, and most users of statistics will be familiar with the Y = a + bX form (or possibly reversed order bX + a).
The TI-84 offers two linear regression functions, ax+b and a+bx. So does the Casio Classpad. The Nspire calls them a+bx and mx+b.
https://www.statology.org/linear-regression-ti-84-calculator/
I've seen:
a + bx
ax + b
bx + a
mx + c
mx + b
among others. I don't think that there is any justification for claiming that a majority of users will be most familiar with mx+b.
|
msg393772 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2021-05-16 23:53 |
> The named tuple should be called Line because that is what it describes. Also, a Line class would be reusable for purposes other than linear regression.
I think that most people would expect that a Line class would represent a straight line widget in a GUI, not the coefficients of a linear equation.
The result tuple represents an equation (or at least the coefficients of such), not the line itself. If it doesn't have a .draw() method, I don't think we should call it "Line".
|
msg393778 - (view) |
Author: Matt Harrison (matthewharrison) |
Date: 2021-05-17 02:06 |
The ML world has converged on the terms X and y (with that capitalization). Moreover, most Python libraries follow the interface of scikit-learn [0].
Training a model looks like this:
model = LinearRegression()
model.fit(X, y)
After that, the model instance has attributes that end in "_" that were learned from fitting. For linear regression[1] you get:
model.coef_ # slope
model.intercept_ # intercept
To make predictions you call .predict:
y_hat = model.predict(X)
One bonus of leveraging the .fit/.predict interface (which other libraries such as XGBoost have also adopted) is that if your model is in the correct layout, you can trivially try different models.
0 - https://scikit-learn.org/stable/tutorial/basic/tutorial.html#learning-and-predicting
1 - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
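The fit/predict pattern described above can be sketched without scikit-learn; this toy single-feature class mimics the interface (the class name and trailing-underscore attributes follow scikit-learn's convention, but this is a simplified stand-in, not the real estimator):

```python
class LinearRegression:
    """Toy single-feature OLS estimator mimicking the scikit-learn interface."""

    def fit(self, X, y):
        # X is a sequence of single-element rows, since scikit-learn
        # expects a 2-D input even for one feature.
        xs = [row[0] for row in X]
        n = len(xs)
        xbar = sum(xs) / n
        ybar = sum(y) / n
        sxx = sum((v - xbar) ** 2 for v in xs)
        sxy = sum((v - xbar) * (w - ybar) for v, w in zip(xs, y))
        self.coef_ = [sxy / sxx]                    # learned slope(s)
        self.intercept_ = ybar - self.coef_[0] * xbar
        return self                                 # fit() conventionally returns self

    def predict(self, X):
        return [self.intercept_ + self.coef_[0] * row[0] for row in X]

model = LinearRegression().fit([[1], [2], [3]], [2, 4, 6])
print(model.coef_, model.intercept_)   # [2.0] 0.0
print(model.predict([[10]]))           # [20.0]
```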
|
msg393779 - (view) |
Author: Matt Harrison (matthewharrison) |
Date: 2021-05-17 02:08 |
And by "if your model is in the correct layout", I meant "if your data is in the correct layout"
|
msg393780 - (view) |
Author: Miki Tebeka (tebeka) * |
Date: 2021-05-17 03:52 |
I'm +1 on the changes proposed by Raymond.
In my teaching experience, most developers who will use the built-in statistics package have high school level math experience.
On the other hand, they'll probably go to Wikipedia, and the entry there uses dependent variable and regressor (https://en.wikipedia.org/wiki/Linear_regression#Introduction).
On the third hand :) scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) uses slope and intercept.
I think "Line" is a good name for the result.
|
msg393844 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2021-05-17 23:33 |
> The ML world has converged on the terms X and y. (With that
> capitalization).
I just googled for "ML linear regression" and there is no consistency
in either the variable names used or the parameters. But most seem to
use lowercase x, y. Out of the top 6 links I checked, only one seems
to use X,y and even there the y has a hat (circumflex) on it: X, ŷ.
More importantly, the ML community has no consistency about the
parameters either. I see:
y = B0 + B1*x
ŷ = X W + b
y = a_0 + a_1 * x
y = m x + b
y = θ1 + θ2 x
y = b0 + b1 x
I'm going to give the URLs since the search page results are not
reproducible from person to person. See below.
The bottom line here is that I don't think the ML community is going
to give us much guidance here. And it's probably not our target user
base either; this module is aimed more at high school and
undergraduate users of basic statistics than at ML practitioners.
https://www.geeksforgeeks.org/ml-linear-regression/
https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html
https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
https://machinelearningmastery.com/linear-regression-for-machine-learning/
https://www.analytixlabs.co.in/blog/linear-regression-machine-learning/
https://madewithml.com/courses/basics/linear-regression/
|
msg393850 - (view) |
Author: Zachary Kneupper (zkneupper) * |
Date: 2021-05-18 00:09 |
> The ML world has converged on the terms X and y. (With that
> capitalization).
The ML community will probably use 3rd party packages for their linear regressions in any case.
In my estimation, the ML community would be comfortable with any of these pairs of terms:
Fine:
+ regressor, dependent_variable
+ independent_variable, dependent_variable
+ x, y
Bad:
+ X, y <- this wouldn't make sense here since the first argument is always a vector, never a matrix.
Often, capital letters indicate matrices, and lowercase letters indicate vectors (or scalars). X is often capitalized because it is an m-by-n matrix of several independent variables, whereas y is lowercase because it is a single vector for the dependent variable. Since linear_regression(regressor, dependent_variable) takes a vector for the independent variable (as opposed to allowing a matrix of multiple regressors), it's probably not appropriate to use `X` (capitalized).
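A quick illustration of the shape convention, using plain lists and made-up data:

```python
# Lowercase x: a single regressor, a flat vector of observations.
x = [1.0, 2.0, 3.0, 4.0]

# Uppercase X: multiple regressors, an m-by-n matrix
# (one row per observation, one column per independent variable).
X = [[1.0, 0.5],
     [2.0, 0.1],
     [3.0, 0.9],
     [4.0, 0.3]]

# The stdlib function takes only the vector form: each observation
# contributes a single number, not a row of features.
rows, cols = len(X), len(X[0])
print(len(x), rows, cols)  # 4 4 2
```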
|
msg393852 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-18 02:55 |
Looking over the comments so far, it looks like (x, y) would be best and (independent variable, dependent variable) would be second best. The (x, y) also has the advantage of matching correlation() and covariance().
For output order, it seems that algebraic formulas sometimes have the intercept first and sometimes have it last. That said, whenever the words "slope" and "intercept" are used in text, it seems that slope almost always comes first (as in slope/intercept form of a line).
FWIW, MS Excel's LINEST function returns {mn, mn-1, ..., m1, b}, and the documentation describes it in terms of y = mx + b.
|
msg393913 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-19 00:39 |
Any objections to linear_regression(x, y) -> (slope, intercept)?
|
msg393917 - (view) |
Author: Zachary Kneupper (zkneupper) * |
Date: 2021-05-19 02:25 |
> Any objections to linear_regression(x, y) -> (slope, intercept)?
I think `linear_regression(x, y)` would be intuitive for a wide range of users.
Just to clarify, is the proposal to return a regular tuple instead of named tuple?
Would we do this:
return (slope, intercept)
and not do this:
return LinearRegression(intercept=intercept, slope=slope)
and not do this:
return Line(intercept=intercept, slope=slope)
?
|
msg393955 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-19 14:53 |
> Just to clarify, is the proposal to return a
> regular tuple instead of named tuple?
No, it should still have named fields. Either Line or LinearRegression would suffice.
|
msg393990 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-20 00:10 |
Steven, do you approve of this?
linear_regression(x, y) -> LinearRegression(slope, intercept)
|
msg394159 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-21 21:23 |
Zachary, unless someone steps up with an objection, I think you can go forward with the PR to implement this signature:
linear_regression(x, y, /) -> LinearRegression(slope, intercept)
|
msg394281 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-25 01:11 |
New changeset 86779878dfc0bcb74b4721aba7fd9a84e9cbd5c7 by Miss Islington (bot) in branch '3.10':
bpo-44151: linear_regression() minor API improvements (GH-26199) (GH-26338)
https://github.com/python/cpython/commit/86779878dfc0bcb74b4721aba7fd9a84e9cbd5c7
|
msg394296 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-25 06:23 |
New changeset a6825197e9f2bd730d8da38f223608411e508695 by Miss Islington (bot) in branch '3.10':
bpo-44151: Various grammar, word order, and markup fixes (GH-26344) (GH-26345)
https://github.com/python/cpython/commit/a6825197e9f2bd730d8da38f223608411e508695
|
Date | User | Action | Args
2022-04-11 14:59:45 | admin | set | github: 88317
2021-05-25 06:23:19 | rhettinger | set | messages: + msg394296
2021-05-25 06:04:13 | miss-islington | set | pull_requests: + pull_request24937
2021-05-25 04:11:34 | rhettinger | set | pull_requests: + pull_request24936
2021-05-25 01:12:41 | rhettinger | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-05-25 01:11:20 | rhettinger | set | messages: + msg394281
2021-05-25 00:31:07 | miss-islington | set | nosy: + miss-islington; pull_requests: + pull_request24930
2021-05-21 21:23:07 | rhettinger | set | messages: + msg394159
2021-05-20 00:10:41 | rhettinger | set | messages: + msg393990
2021-05-19 14:53:54 | rhettinger | set | messages: + msg393955
2021-05-19 02:25:06 | zkneupper | set | messages: + msg393917
2021-05-19 00:39:23 | rhettinger | set | messages: + msg393913
2021-05-18 02:55:55 | rhettinger | set | messages: + msg393852
2021-05-18 00:09:08 | zkneupper | set | messages: + msg393850
2021-05-17 23:33:06 | steven.daprano | set | messages: + msg393844
2021-05-17 20:33:43 | zkneupper | set | keywords: + patch; nosy: + zkneupper; pull_requests: + pull_request24816; stage: patch review
2021-05-17 03:52:59 | tebeka | set | nosy: + tebeka; messages: + msg393780
2021-05-17 02:08:33 | matthewharrison | set | messages: + msg393779
2021-05-17 02:06:30 | matthewharrison | set | nosy: + matthewharrison; messages: + msg393778
2021-05-16 23:53:01 | steven.daprano | set | messages: + msg393772
2021-05-16 23:49:56 | steven.daprano | set | messages: + msg393771
2021-05-16 20:49:04 | rhettinger | set | files: + Screen Shot 2021-05-16 at 1.46.45 PM.png; messages: + msg393758
2021-05-16 19:48:29 | rhettinger | create |