msg393754 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-16 19:48 |
The current signature is:
linear_regression(regressor, dependent_variable)
While the term "regressor" is used in some problem domains, it isn't well known outside of those domains. The term "independent_variable" would be better because it is common to all domains and because it is the natural counterpart to "dependent_variable".
Another issue is that the return value is a named tuple in the form:
LinearRegression(intercept, slope)
While that order is seen in multiple linear regression, most people first learn it in algebra as the slope/intercept form: y = mx + b. That will be the natural order for a majority of users, especially given that we aren't supporting multiple linear regression.
The named tuple is called LinearRegression, which describes how the result was obtained rather than the result itself. The output of any process that fits data to a line is a line, so the named tuple should be called Line because that is what it describes. Also, a Line class would be reusable for purposes other than linear regression.
Proposed signature:
linear_regression(independent_variable, dependent_variable) -> Line(slope, intercept)
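A minimal sketch of the proposed shape (the `Line` namedtuple and the toy least-squares fit below are illustrative only, not the stdlib implementation):

```python
from collections import namedtuple

# Hypothetical names, sketching the proposed signature.
Line = namedtuple("Line", ("slope", "intercept"))

def linear_regression(independent_variable, dependent_variable):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x, y = independent_variable, dependent_variable
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return Line(slope, intercept)

# Points lying exactly on y = 2x + 1:
print(linear_regression([1, 2, 3, 4], [3, 5, 7, 9]))  # Line(slope=2.0, intercept=1.0)
```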
|
msg393758 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-16 20:49 |
Related links:
* https://support.microsoft.com/en-us/office/linest-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d
* https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/introduction-to-trend-lines/a/linear-regression-review
* TI-84: LinReg(ax + b) See attached screen shot
|
msg393771 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2021-05-16 23:49 |
I agree with you that "regressor" is too obscure and should be changed.
I disagree about the "y = mx + c". Haven't we already discussed this? That form is used in linear algebra, but not used in statistics. Quoting from Yale:
"A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0)."
http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
This function is being used for statistics, not linear algebra. The users of the module are not people doing linear algebra, and most users of statistics will be familiar with the Y = a + bX form (or possibly reversed order bX + a).
The TI-84 offers two linear regression functions, ax+b and a+bx. So does the Casio Classpad. The Nspire calls them a+bx and mx+b.
https://www.statology.org/linear-regression-ti-84-calculator/
I've seen:
a + bx
ax + b
bx + a
mx + c
mx + b
among others. I don't think that there is any justification for claiming that a majority of users will be most familiar with mx+b.
|
msg393772 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2021-05-16 23:53 |
> The named tuple should be called Line because that is what it describes. Also, a Line class would be reusable for purposes other than linear regression.
I think that most people would expect that a Line class would represent a straight line widget in a GUI, not the coefficients of a linear equation.
The result tuple represents an equation (or at least the coefficients of such), not the line itself. If it doesn't have a .draw() method, I don't think we should call it "Line".
|
msg393778 - (view) |
Author: Matt Harrison (matthewharrison) |
Date: 2021-05-17 02:06 |
The ML world has converged on the terms X and y (with that capitalization). Moreover, most Python libraries follow the interface of scikit-learn [0].
Training a model looks like this:
model = LinearRegression()
model.fit(X, y)
After that, the model instance has attributes that end in "_" that were learned from fitting. For linear regression[1] you get:
model.coef_ # slope
model.intercept_ # intercept
To make predictions you call .predict:
y_hat = model.predict(X)
One bonus of leveraging the .fit/.predict interface (which other libraries such as XGBoost have also adopted) is that if your model is in the correct layout, you can trivially try different models.
0 - https://scikit-learn.org/stable/tutorial/basic/tutorial.html#learning-and-predicting
1 - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
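The fit/predict pattern described above can be sketched without scikit-learn; this toy single-feature class mimics the interface (the class name and trailing-underscore attributes follow scikit-learn's convention, but this is a simplified stand-in, not the real estimator):

```python
class LinearRegression:
    """Toy single-feature OLS estimator mimicking the scikit-learn interface."""

    def fit(self, X, y):
        # X is a sequence of single-element rows, since scikit-learn
        # expects a 2-D input even for one feature.
        xs = [row[0] for row in X]
        n = len(xs)
        xbar = sum(xs) / n
        ybar = sum(y) / n
        sxx = sum((v - xbar) ** 2 for v in xs)
        sxy = sum((v - xbar) * (w - ybar) for v, w in zip(xs, y))
        self.coef_ = [sxy / sxx]                    # learned slope(s)
        self.intercept_ = ybar - self.coef_[0] * xbar
        return self                                 # fit() conventionally returns self

    def predict(self, X):
        return [self.intercept_ + self.coef_[0] * row[0] for row in X]

model = LinearRegression().fit([[1], [2], [3]], [2, 4, 6])
print(model.coef_, model.intercept_)   # [2.0] 0.0
print(model.predict([[10]]))           # [20.0]
```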
|
msg393779 - (view) |
Author: Matt Harrison (matthewharrison) |
Date: 2021-05-17 02:08 |
And by "if your model is in the correct layout", I meant "if your data is in the correct layout"
|
msg393780 - (view) |
Author: Miki Tebeka (tebeka) * |
Date: 2021-05-17 03:52 |
I'm +1 on the changes proposed by Raymond.
In my teaching experience, most developers who will use the built-in statistics package have high school level math experience.
On the other hand, they'll probably go to Wikipedia, and the entry there uses dependent variable and regressor (https://en.wikipedia.org/wiki/Linear_regression#Introduction).
On the third hand :) scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) uses slope and intercept.
I think "Line" is a good name for the result.
|
msg393844 - (view) |
Author: Steven D'Aprano (steven.daprano) * |
Date: 2021-05-17 23:33 |
> The ML world has converged on the terms X and y. (With that
> capitalization).
I just googled for "ML linear regression" and there is no consistency
in either the variable names used or the parameters. But most seem to
use lowercase x, y. Out of the top 6 links I checked, only one seems
to use X,y and even there the y has a hat (circumflex) on it: X, ŷ.
More importantly, the ML community has no consistency about the
parameters either. I see:
y = B0 + B1*x
ŷ = X W + b
y = a_0 + a_1 * x
y = m x + b
y = θ1 + θ2 x
y = b0 + b1 x
I'm going to give the URLs since the search page results are not
reproducible from person to person. See below.
The bottom line here is that I don't think the ML community is going
to give us much guidance here. And it's probably not our target user
base either; this module is aimed more at high school and
undergraduate users of basic statistics than at ML practitioners.
https://www.geeksforgeeks.org/ml-linear-regression/
https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html
https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
https://machinelearningmastery.com/linear-regression-for-machine-learning/
https://www.analytixlabs.co.in/blog/linear-regression-machine-learning/
https://madewithml.com/courses/basics/linear-regression/
|
msg393850 - (view) |
Author: Zachary Kneupper (zkneupper) * |
Date: 2021-05-18 00:09 |
> The ML world has converged on the terms X and y. (With that
> capitalization).
The ML community will probably use 3rd party packages for their linear regressions in any case.
In my estimation, the ML community would be comfortable with any of these pairs of terms:
Fine:
+ regressor, dependent_variable
+ independent_variable, dependent_variable
+ x, y
Bad:
+ X, y <- this wouldn't make sense here since the first argument is always a vector, never a matrix.
Often, capital letters indicate matrices, and lowercase letters indicate vectors (or scalars). X is often capitalized because it is an m-by-n matrix of several independent variables, whereas y is lowercase because it is a single vector for the dependent variable. Since linear_regression(regressor, dependent_variable) takes a vector for the independent variable (as opposed to allowing a matrix of multiple regressors), it's probably not appropriate to use `X` (capitalized).
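A quick illustration of the shape convention, using plain lists and made-up data:

```python
# Lowercase x: a single regressor, a flat vector of observations.
x = [1.0, 2.0, 3.0, 4.0]

# Uppercase X: multiple regressors, an m-by-n matrix
# (one row per observation, one column per independent variable).
X = [[1.0, 0.5],
     [2.0, 0.1],
     [3.0, 0.9],
     [4.0, 0.3]]

# The stdlib function takes only the vector form: each observation
# contributes a single number, not a row of features.
rows, cols = len(X), len(X[0])
print(len(x), rows, cols)  # 4 4 2
```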
|
msg393852 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-18 02:55 |
Looking over the comments so far, it looks like (x, y) would be best and (independent variable, dependent variable) would be second best. The (x, y) also has the advantage of matching correlation() and covariance().
For output order, it seems that algebraic formulas sometimes have the intercept first and sometimes have it last. That said, whenever the words "slope" and "intercept" are used in text, it seems that slope almost always comes first (as in slope/intercept form of a line).
FWIW, MS Excel's LINEST function returns {mn, mn-1, ..., m1, b}, and the documentation describes it in terms of y = mx + b.
|
msg393913 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-19 00:39 |
Any objections to linear_regression(x, y) -> (slope, intercept)?
|
msg393917 - (view) |
Author: Zachary Kneupper (zkneupper) * |
Date: 2021-05-19 02:25 |
> Any objections to linear_regression(x, y) -> (slope, intercept)?
I think `linear_regression(x, y)` would be intuitive for a wide range of users.
Just to clarify, is the proposal to return a regular tuple instead of named tuple?
Would we do this:
return (slope, intercept)
and not do this:
return LinearRegression(intercept=intercept, slope=slope)
and not do this:
return Line(intercept=intercept, slope=slope)
?
|
msg393955 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-19 14:53 |
> Just to clarify, is the proposal to return a
> regular tuple instead of named tuple?
No, it should still have named fields. Either Line or LinearRegression would suffice.
|
msg393990 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-20 00:10 |
Steven, do you approve of this?
linear_regression(x, y) -> LinearRegression(slope, intercept)
|
msg394159 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-21 21:23 |
Zachary, unless someone steps up with an objection, I think you can go forward with the PR to implement this signature:
linear_regression(x, y, /) -> LinearRegression(slope, intercept)
|
msg394281 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-25 01:11 |
New changeset 86779878dfc0bcb74b4721aba7fd9a84e9cbd5c7 by Miss Islington (bot) in branch '3.10':
bpo-44151: linear_regression() minor API improvements (GH-26199) (GH-26338)
https://github.com/python/cpython/commit/86779878dfc0bcb74b4721aba7fd9a84e9cbd5c7
|
msg394296 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2021-05-25 06:23 |
New changeset a6825197e9f2bd730d8da38f223608411e508695 by Miss Islington (bot) in branch '3.10':
bpo-44151: Various grammar, word order, and markup fixes (GH-26344) (GH-26345)
https://github.com/python/cpython/commit/a6825197e9f2bd730d8da38f223608411e508695
|
Date | User | Action | Args
2022-04-11 14:59:45 | admin | set | github: 88317
2021-05-25 06:23:19 | rhettinger | set | messages: + msg394296
2021-05-25 06:04:13 | miss-islington | set | pull_requests: + pull_request24937
2021-05-25 04:11:34 | rhettinger | set | pull_requests: + pull_request24936
2021-05-25 01:12:41 | rhettinger | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-05-25 01:11:20 | rhettinger | set | messages: + msg394281
2021-05-25 00:31:07 | miss-islington | set | nosy: + miss-islington; pull_requests: + pull_request24930
2021-05-21 21:23:07 | rhettinger | set | messages: + msg394159
2021-05-20 00:10:41 | rhettinger | set | messages: + msg393990
2021-05-19 14:53:54 | rhettinger | set | messages: + msg393955
2021-05-19 02:25:06 | zkneupper | set | messages: + msg393917
2021-05-19 00:39:23 | rhettinger | set | messages: + msg393913
2021-05-18 02:55:55 | rhettinger | set | messages: + msg393852
2021-05-18 00:09:08 | zkneupper | set | messages: + msg393850
2021-05-17 23:33:06 | steven.daprano | set | messages: + msg393844
2021-05-17 20:33:43 | zkneupper | set | keywords: + patch; nosy: + zkneupper; pull_requests: + pull_request24816; stage: patch review
2021-05-17 03:52:59 | tebeka | set | nosy: + tebeka; messages: + msg393780
2021-05-17 02:08:33 | matthewharrison | set | messages: + msg393779
2021-05-17 02:06:30 | matthewharrison | set | nosy: + matthewharrison; messages: + msg393778
2021-05-16 23:53:01 | steven.daprano | set | messages: + msg393772
2021-05-16 23:49:56 | steven.daprano | set | messages: + msg393771
2021-05-16 20:49:04 | rhettinger | set | files: + Screen Shot 2021-05-16 at 1.46.45 PM.png; messages: + msg393758
2021-05-16 19:48:29 | rhettinger | create |