Question:
I have noticed by accident that the OLS models implemented by sklearn and statsmodels yield different values of R^2 when the intercept is not fitted. Otherwise they seem to work fine. The following code:
import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm
np.random.seed(42)
N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))  # statsmodels needs the constant column added explicitly
statsmodelsNoIntercept = sm.OLS(Y, X)
print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
print(sklearn.__version__, statsmodels.__version__)
prints:
0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0
Where does the difference come from?
This question differs from Different Linear Regression Coefficients with statsmodels and sklearn, as there sklearn.linear_model.LinearRegression (with intercept) was fit on an X prepared as for statsmodels.api.OLS.
This question also differs from Statsmodels: Calculate fitted values and R squared, as it addresses the difference between two Python packages (statsmodels and scikit-learn), while the linked question is about statsmodels and the common R^2 definition. Both are resolved by the same answer, but that issue has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?
Answer 1:
As pointed out by @user333700 in the comments, the definition of R^2 used in statsmodels' OLS implementation differs from the one in scikit-learn's.
From the documentation of the RegressionResults class (emphasis mine):
rsquared
R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
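Both totals are exposed as attributes on the fitted results, so the two definitions can be checked directly. A minimal sketch, assuming the statsmodelsNoIntercept model from the question's snippet is in scope:
res = statsmodelsNoIntercept.fit()
# without a constant, rsquared is computed against the uncentered total sum of squares
print(res.rsquared, 1 - res.ssr / res.uncentered_tss)  # both ~0.7832
# dividing by the centered total instead reproduces the sklearn-style definition
print(1 - res.ssr / res.centered_tss)                  # ~-0.9508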
From the documentation of LinearRegression.score():
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
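In other words, LinearRegression.score() always centers the total sum of squares, even when fit_intercept=False, which is why the score goes negative here. A minimal sketch reproducing both printed values by hand, assuming X, Y and sklearnNoIntercept from the question's snippet are in scope:
y_pred = sklearnNoIntercept.predict(X)
ssr = ((Y - y_pred) ** 2).sum()             # residual sum of squares
centered_tss = ((Y - Y.mean()) ** 2).sum()  # sklearn's denominator, always centered
uncentered_tss = (Y ** 2).sum()             # statsmodels' denominator without a constant
print(1 - ssr / centered_tss)    # ~-0.9508, matches sklearn's score
print(1 - ssr / uncentered_tss)  # ~0.7832, matches statsmodels' rsquared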
Source: https://stackoverflow.com/questions/48832925/why-sklearn-and-statsmodels-implementation-of-ols-regression-give-different