Question:
I have noticed by accident that the OLS models implemented by sklearn and statsmodels yield different values of R^2 when the intercept is not fitted. Otherwise they seem to work fine. The following code:
import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm
np.random.seed(42)
N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))  # statsmodels needs the constant column added explicitly
statsmodelsNoIntercept = sm.OLS(Y, X)
print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
print(sklearn.__version__, statsmodels.__version__)
prints:
0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0
Where does the difference come from?
This question differs from Different Linear Regression Coefficients with statsmodels and sklearn, as there sklearn.linear_model.LinearRegression (with intercept) was fit on an X prepared as for statsmodels.api.OLS.
This question also differs from Statsmodels: Calculate fitted values and R squared, as it addresses the difference between two Python packages (statsmodels and scikit-learn), while the linked question is about statsmodels and the common R^2 definition. Both are resolved by the same answer, but that issue has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?
Answer 1:
As pointed out by @user333700 in the comments, the definition of R^2 used in statsmodels' OLS implementation differs from the one in scikit-learn's.
From the documentation of the RegressionResults class (emphasis mine):
rsquared
R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
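Both totals are exposed as attributes on the fitted results, so the two definitions can be checked directly. A minimal sketch, assuming the statsmodelsNoIntercept model from the question's snippet is in scope:
res = statsmodelsNoIntercept.fit()
# without a constant, rsquared is computed against the uncentered total sum of squares
print(res.rsquared, 1 - res.ssr / res.uncentered_tss)  # both ~0.7832
# dividing by the centered total instead reproduces the sklearn-style definition
print(1 - res.ssr / res.centered_tss)                  # ~-0.9508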
From the documentation of LinearRegression.score():
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
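In other words, LinearRegression.score() always centers the total sum of squares, even when fit_intercept=False, which is why the score goes negative here. A minimal sketch reproducing both printed values by hand, assuming X, Y and sklearnNoIntercept from the question's snippet are in scope:
y_pred = sklearnNoIntercept.predict(X)
ssr = ((Y - y_pred) ** 2).sum()             # residual sum of squares
centered_tss = ((Y - Y.mean()) ** 2).sum()  # sklearn's denominator, always centered
uncentered_tss = (Y ** 2).sum()             # statsmodels' denominator without a constant
print(1 - ssr / centered_tss)    # ~-0.9508, matches sklearn's score
print(1 - ssr / uncentered_tss)  # ~0.7832, matches statsmodels' rsquared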
Source: https://stackoverflow.com/questions/48832925/why-sklearn-and-statsmodels-implementation-of-ols-regression-give-different