This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Improve parameter names and return value ordering for linear_regression
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.11, Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: matthewharrison, miss-islington, pablogsal, rhettinger, steven.daprano, tebeka, zkneupper
Priority: normal Keywords: patch

Created on 2021-05-16 19:48 by rhettinger, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
Screen Shot 2021-05-16 at 1.46.45 PM.png rhettinger, 2021-05-16 20:49 Screen shot for TI-84
Pull Requests
URL Status Linked
PR 26199 merged zkneupper, 2021-05-17 20:33
PR 26338 merged miss-islington, 2021-05-25 00:31
PR 26344 merged rhettinger, 2021-05-25 04:11
PR 26345 merged miss-islington, 2021-05-25 06:04
Messages (17)
msg393754 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-16 19:48
The current signature is:

    linear_regression(regressor, dependent_variable)

While the term "regressor" is used in some problem domains, it isn't well known outside of those domains.   The term "independent_variable" would be better because it is common to all domains and because it is the natural counterpart to "dependent_variable".

Another issue is that the return value is a named tuple in the form:

    LinearRegression(intercept, slope)

While that order is seen in multiple linear regression, most people first learn it in algebra as the slope/intercept form:  y = mx + b.   That will be the natural order for a majority of users, especially given that we aren't supporting multiple linear regression.

The named tuple is called LinearRegression which describes how the result was obtained rather than the result itself.  The output of any process that fits data to a line is a line.  The named tuple should be called Line because that is what it describes.  Also, a Line class would be reusable for purposes other than linear regression.

Proposed signature:

  linear_regression(independent_variable, dependent_variable) -> Line(slope, intercept)
msg393758 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-16 20:49
Related links:

* https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d

* https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/introduction-to-trend-lines/a/linear-regression-review

* TI-84:  LinReg(ax + b)    See attached screen shot
msg393771 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-05-16 23:49
I agree with you that "regressor" is too obscure and should be changed.

I disagree about the "y = mx + c". Haven't we already discussed this? That form is used in linear algebra, but not used in statistics. Quoting from Yale:

"A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0)."

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

This function is being used for statistics, not linear algebra. The users of the module are not people doing linear algebra, and most users of statistics will be familiar with the Y = a + bX form (or possibly reversed order bX + a).

The TI-84 offers two linear regression functions, ax+b and a+bx. So does the Casio Classpad. The Nspire calls them a+bx and mx+b.

https://www.statology.org/linear-regression-ti-84-calculator/

I've seen:

a + bx
ax + b
bx + a
mx + c
mx + b

among others. I don't think that there is any justification for claiming that a majority of users will be most familiar with mx+b.
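
For reference, every form listed above describes the same fitted line; only the letters and the order of the terms differ. A minimal sketch of the ordinary least-squares computation, assuming a single explanatory variable (the helper name `least_squares` and the sample data are illustrative, not part of the statistics module):

```python
from statistics import fmean


def least_squares(x, y):
    """Return (slope, intercept) of the ordinary least-squares line.

    Hypothetical helper for illustration only.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    x_bar, y_bar = fmean(x), fmean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x is constant, so the slope is undefined")
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]  # made-up data lying exactly on y = 3x + 2
slope, intercept = least_squares(x, y)

# The same predictions, written in the two orderings discussed above.
stats_form = [intercept + slope * xi for xi in x]    # Y = a + bX
algebra_form = [slope * xi + intercept for xi in x]  # y = mx + b
```

The choice between the forms therefore affects only naming and the order of the returned fields, not the fit itself.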
msg393772 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-05-16 23:53
> The named tuple should be called Line because that is what it describes.  Also, a Line class would be reusable for purposes other than linear regression.

I think that most people would expect that a Line class would represent a straight line widget in a GUI, not the coefficients of a linear equation.

The result tuple represents an equation (or at least the coefficients of such), not the line itself. If it doesn't have a .draw() method, I don't think we should call it "Line".
msg393778 - Author: Matt Harrison (matthewharrison) Date: 2021-05-17 02:06
The ML world has collapsed on the terms X and y. (With that capitalization). Moreover, most (Python libraries) follow the interface of scikit-learn [0].

Training a model looks like this:

    model = LinearRegression()
    model.fit(X, y)

After that, the model instance has attributes that end in "_" that were learned from fitting. For linear regression[1] you get:

    model.coef_        # slope
    model.intercept_   # intercept

To make predictions you call .predict:

    y_hat = model.predict(X)

One bonus of leveraging the .fit/.predict interface (which other libraries such as XGBoost have also adopted) is that if your model is in the correct layout, you can trivially try different models.


0 - https://scikit-learn.org/stable/tutorial/basic/tutorial.html#learning-and-predicting

1 - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
msg393779 - Author: Matt Harrison (matthewharrison) Date: 2021-05-17 02:08
And by "if your model is in the correct layout", I meant "if your data is in the correct layout"
msg393780 - Author: Miki Tebeka (tebeka) * Date: 2021-05-17 03:52
I'm +1 on the changes proposed by Raymond.

In my teaching experience, most developers who will use the built-in statistics package will have high-school-level math experience.

On the other hand, they'll probably go to Wikipedia, and the entry there uses dependent variable and regressor (https://en.wikipedia.org/wiki/Linear_regression#Introduction)

On the third hand :) scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) uses slope and intercept.

I think "Line" is a good name for the result.
msg393844 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-05-17 23:33
> The ML world has collapsed on the terms X and y. (With that 
> capitalization).

I just googled for "ML linear regression" and there is no consistency in 
either the variable used or the parameters. But most seem to use 
lowercase x,y. Out of the top 6 links I checked, only one seems to use 
X,y and even there the y has a hat (circumflex) on it: X,ŷ.

More importantly, the ML community has no consistency about the 
parameters either. I see:

y = B0 + B1*x
ŷ = X W + b
y = a_0 + a_1 * x
y = m x + b
y = θ1 + θ2 x
y = b0 + b1 x

I'm going to give the URLs since the search page results are not 
reproducible from person to person. See below.

The bottom line here is that I don't think the ML community is going to 
give us much guidance here. And it's probably not our target user base 
either, which is more aimed at high school and undergraduate users of 
basic level statistics, not ML algorithms.

https://www.geeksforgeeks.org/ml-linear-regression/

https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html

https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a

https://machinelearningmastery.com/linear-regression-for-machine-learning/

https://www.analytixlabs.co.in/blog/linear-regression-machine-learning/

https://madewithml.com/courses/basics/linear-regression/
msg393850 - Author: Zachary Kneupper (zkneupper) * Date: 2021-05-18 00:09
> The ML world has collapsed on the terms X and y. (With that 
> capitalization).

The ML community will probably use 3rd party packages for their linear regressions in any case.

In my estimation, the ML community would be comfortable with any of these pairs of terms:

Fine:
+ regressor, dependent_variable
+ independent_variable, dependent_variable
+ x, y

Bad:
+ X, y <- this wouldn't make sense here since the first argument is always a vector and is never a matrix.


Often, capital letters indicate matrices, and lower case letters indicate vectors (or scalars). The reason that X is often capitalized is because it indicates that X is an m-by-n matrix of several independent variables; whereas y is lowercase because it is a single vector for the dependent variable. Since this linear_regression(regressor, dependent_variable) function takes a vector for the independent variable (as opposed to allowing a matrix of multiple regressors), it's probably not appropriate to use `X` (capitalized).
msg393852 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-18 02:55
Looking over the comments so far, it looks like (x, y) would be best and (independent variable, dependent variable) would be second best.  The (x, y) also has the advantage of matching correlation() and covariance().

For output order, it seems that algebraic formulas sometimes have the intercept first and sometimes have it last.  That said, whenever the words "slope" and "intercept" are used in text, it seems that slope almost always comes first (as in slope/intercept form of a line).

FWIW, MS Excel's LINEST function returns {mn,mn-1,...,m1,b} and the documentation describes it in terms of y = mx + b.
msg393913 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-19 00:39
Any objections to linear_regression(x, y) -> (slope, intercept)?
msg393917 - Author: Zachary Kneupper (zkneupper) * Date: 2021-05-19 02:25
> Any objections to linear_regression(x, y) -> (slope, intercept)?


I think `linear_regression(x, y)` would be intuitive for a wide range of users.

Just to clarify, is the proposal to return a regular tuple instead of named tuple?

Would we do this:

    return (slope, intercept)

and not do this:

    return LinearRegression(intercept=intercept, slope=slope)
 
and not do this:

    return Line(intercept=intercept, slope=slope)
 
?
msg393955 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-19 14:53
> Just to clarify, is the proposal to return a 
> regular tuple instead of named tuple?

No, it should still have named fields.  Either Line or LinearRegression would suffice.
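
As a small illustration of why named fields do not rule out plain tuple usage, the sketch below defines a stand-in result type. The class name `Line` is borrowed from the discussion above and is used here only as a hypothetical example, not as the type the module returns:

```python
from typing import NamedTuple


class Line(NamedTuple):
    """Hypothetical result type with named fields (illustration only)."""
    slope: float
    intercept: float


result = Line(slope=3.0, intercept=2.0)

# A named tuple still unpacks and indexes like a regular tuple...
slope, intercept = result
first = result[0]

# ...while also allowing access by field name.
by_name = (result.slope, result.intercept)
```

Callers who write `slope, intercept = linear_regression(x, y)` and callers who prefer attribute access are both served by the same return value.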
msg393990 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-20 00:10
Steven, do you approve of this?

    linear_regression(x, y) -> LinearRegression(slope, intercept)
msg394159 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-21 21:23
Zachary, unless someone steps in with an objection, I think you can go forward with the PR to implement this signature:

    linear_regression(x, y, /) -> LinearRegression(slope, intercept)
msg394281 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-25 01:11
New changeset 86779878dfc0bcb74b4721aba7fd9a84e9cbd5c7 by Miss Islington (bot) in branch '3.10':
bpo-44151: linear_regression() minor API improvements (GH-26199) (GH-26338)
https://github.com/python/cpython/commit/86779878dfc0bcb74b4721aba7fd9a84e9cbd5c7
msg394296 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-25 06:23
New changeset a6825197e9f2bd730d8da38f223608411e508695 by Miss Islington (bot) in branch '3.10':
bpo-44151: Various grammar, word order, and markup fixes (GH-26344) (GH-26345)
https://github.com/python/cpython/commit/a6825197e9f2bd730d8da38f223608411e508695
History
Date User Action Args
2022-04-11 14:59:45 admin set github: 88317
2021-05-25 06:23:19 rhettinger set messages: + msg394296
2021-05-25 06:04:13 miss-islington set pull_requests: + pull_request24937
2021-05-25 04:11:34 rhettinger set pull_requests: + pull_request24936
2021-05-25 01:12:41 rhettinger set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-05-25 01:11:20 rhettinger set messages: + msg394281
2021-05-25 00:31:07 miss-islington set nosy: + miss-islington; pull_requests: + pull_request24930
2021-05-21 21:23:07 rhettinger set messages: + msg394159
2021-05-20 00:10:41 rhettinger set messages: + msg393990
2021-05-19 14:53:54 rhettinger set messages: + msg393955
2021-05-19 02:25:06 zkneupper set messages: + msg393917
2021-05-19 00:39:23 rhettinger set messages: + msg393913
2021-05-18 02:55:55 rhettinger set messages: + msg393852
2021-05-18 00:09:08 zkneupper set messages: + msg393850
2021-05-17 23:33:06 steven.daprano set messages: + msg393844
2021-05-17 20:33:43 zkneupper set keywords: + patch; nosy: + zkneupper; pull_requests: + pull_request24816; stage: patch review
2021-05-17 03:52:59 tebeka set nosy: + tebeka; messages: + msg393780
2021-05-17 02:08:33 matthewharrison set messages: + msg393779
2021-05-17 02:06:30 matthewharrison set nosy: + matthewharrison; messages: + msg393778
2021-05-16 23:53:01 steven.daprano set messages: + msg393772
2021-05-16 23:49:56 steven.daprano set messages: + msg393771
2021-05-16 20:49:04 rhettinger set files: + Screen Shot 2021-05-16 at 1.46.45 PM.png; messages: + msg393758
2021-05-16 19:48:29 rhettinger create