Python scikit learn Linear Model Parameter Standard Error

太阳男子 2021-02-18 18:51

I am working with sklearn, specifically the linear_model module. After fitting a simple linear model as in

import pandas as pd
import numpy as np
from sklearn impo         


        
2 answers
  •  野的像风
    2021-02-18 19:28

    tl;dr

    scikit-learn doesn't report standard errors for you, but you can compute them manually with some linear algebra. i do this for your example below.

    also here's a jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31

    what and why

    the standard errors of your estimates are just the square root of the variances of your estimates. what's the variance of your estimate? if you assume your model has gaussian error, it's:

    Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat

    and then the standard error of beta_hat[i] is Var(beta_hat)[i, i] ** 0.5.
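    as a minimal sketch of that formula end to end, here is a NumPy-only version on made-up data. the data, seed and variable names are assumptions for illustration, not the data from the question:

```python
import numpy as np

# Synthetic data, purely illustrative (not the data from the question).
rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.3, size=N)

p = X.shape[1]  # number of parameters, intercept included
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimate the error variance from the residuals.
residuals = y - X @ beta_hat
sigma_squared_hat = residuals @ residuals / (N - p)

# Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat
var_beta_hat = XtX_inv * sigma_squared_hat

# Standard errors are the square roots of the diagonal entries.
standard_errors = np.sqrt(np.diag(var_beta_hat))
print(standard_errors)
```

    the off-diagonal entries of var_beta_hat are the covariances between estimates; only the diagonal is needed for standard errors.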

    All that's left is to compute sigma_squared_hat, the estimate of your model's error variance. It's not known a priori, but you can estimate it from your residuals: divide the residual sum of squares by N - p, where p is the number of parameters including the intercept.
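    To make the estimate concrete, here is a small sketch on made-up data (the data and names are assumptions, not from the question). It divides the residual sum of squares by N - p, and shows that this is the same as the sample variance of the residuals taken with ddof=p when the model has an intercept:

```python
import numpy as np

# Synthetic data, purely illustrative (not the data from the question).
rng = np.random.default_rng(1)
N, p = 40, 3  # p counts the intercept column
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.7, size=N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Residual sum of squares over the residual degrees of freedom.
sigma_squared_hat = residuals @ residuals / (N - p)

# Same number via the sample variance with ddof=p. This only matches because
# the model has an intercept, which forces the residuals to average to zero.
via_var = np.var(residuals, ddof=p)

print(sigma_squared_hat, via_var)
```

    Dividing by N instead of N - p would underestimate the error variance, and so the standard errors too.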

    You also need to add an intercept term to your data matrix. Scikit-learn's LinearRegression does this automatically (fit_intercept=True is the default), so to reproduce its results yourself you need to add a column of ones to your X matrix or dataframe.
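    A quick way to see what the intercept column does is a noiseless toy example (the data and names here are assumptions for illustration). With a column of ones prepended, the first fitted coefficient is the intercept:

```python
import numpy as np

# Synthetic, noiseless data so the recovered coefficients are exact:
# y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(2)
N = 20
X = rng.normal(size=(N, 2))
y = 3.0 + X @ np.array([2.0, -1.0])

# Prepend a column of ones; its coefficient plays the role of the intercept.
X_with_intercept = np.column_stack([np.ones(N), X])

beta_hat, *_ = np.linalg.lstsq(X_with_intercept, y, rcond=None)
intercept, coefs = beta_hat[0], beta_hat[1:]
print(intercept, coefs)
```

    np.column_stack is one option here; filling a preallocated array column by column gives the same matrix.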

    how

    Starting after your code,

    show your scikit-learn results

    print(model.intercept_)
    print(model.coef_)
    
    [-0.28671532]
    [[ 0.17501115 -0.6928708   0.22336584]]
    

    reproduce this with linear algebra

    N = len(X)
    p = len(X.columns) + 1  # plus one because LinearRegression adds an intercept term
    
    X_with_intercept = np.empty(shape=(N, p), dtype=float)  # np.float was removed in NumPy 1.24
    X_with_intercept[:, 0] = 1
    X_with_intercept[:, 1:p] = X.values
    
    beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y.values
    print(beta_hat)
    
    [[-0.28671532]
     [ 0.17501115]
     [-0.6928708 ]
     [ 0.22336584]]
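    if you'd rather not form the inverse explicitly, np.linalg.lstsq solves the same least-squares problem and is usually the numerically safer choice. a sketch on made-up data (the data and seed are assumptions, not the question's data) showing the two agree:

```python
import numpy as np

# Synthetic data, purely illustrative (not the data from the question).
rng = np.random.default_rng(3)
N = 30
X_with_intercept = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
true_beta = np.array([[-0.3], [0.2], [-0.7], [0.2]])
y = X_with_intercept @ true_beta + rng.normal(scale=0.5, size=(N, 1))

# Normal equations with an explicit inverse, as in the answer.
beta_inv = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y

# Least-squares solver: same estimate without forming the inverse.
beta_lstsq, *_ = np.linalg.lstsq(X_with_intercept, y, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))
```

    you still need inverse(X.T @ X) later for the variances, so this only changes how beta_hat itself is obtained.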
    

    compute standard errors of the parameter estimates

    y_hat = model.predict(X)
    residuals = y.values - y_hat
    residual_sum_of_squares = residuals.T @ residuals
    sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p)
    var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * sigma_squared_hat
    for p_ in range(p):
        standard_error = var_beta_hat[p_, p_] ** 0.5
        print(f"SE(beta_hat[{p_}]): {standard_error}")
    
    SE(beta_hat[0]): 0.2468580488280805
    SE(beta_hat[1]): 0.2965501221823944
    SE(beta_hat[2]): 0.3518847753610169
    SE(beta_hat[3]): 0.3250760291745124
    

    confirm with statsmodels

    import statsmodels.api as sm
    ols = sm.OLS(y.values, X_with_intercept)
    ols_result = ols.fit()
    print(ols_result.summary())  # print() so the table also shows outside a notebook
    
    ...
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const         -0.2867      0.247     -1.161      0.290      -0.891       0.317
    x1             0.1750      0.297      0.590      0.577      -0.551       0.901
    x2            -0.6929      0.352     -1.969      0.096      -1.554       0.168
    x3             0.2234      0.325      0.687      0.518      -0.572       1.019
    ==============================================================================
    

    yay, done!
