How to compute Studentized Residuals in Python?

青春惊慌失措 2020-12-11 17:42

I've tried searching for an answer to this problem but so far I haven't found any. I used statsmodels to implement an Ordinary Least Squares regression model on a mean-imputed dataset. How can I compute the studentized residuals for this model?

4 Answers
  • 2020-12-11 18:11

    I was dealing with the same issue. The solution is to use the statsmodels library:

    from statsmodels.stats.outliers_influence import OLSInfluence
    

    It exposes a resid_studentized_internal attribute (and resid_studentized_external for the externally studentized version).
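
    For example, a minimal sketch on a small made-up dataset (the data and column names are only for illustration):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import OLSInfluence
    
    #hypothetical example data
    df = pd.DataFrame({'rating': [90, 85, 82, 88, 94, 90, 76, 75, 87, 86],
                       'points': [25, 20, 14, 16, 27, 20, 12, 15, 14, 19]})
    
    #fit an OLS model with an intercept term
    X = sm.add_constant(df['points'])
    model = sm.OLS(df['rating'], X).fit()
    
    #OLSInfluence gives both flavours of studentized residuals
    influence = OLSInfluence(model)
    print(influence.resid_studentized_internal)   #internally studentized
    print(influence.resid_studentized_external)   #externally studentized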

  • 2020-12-11 18:13

    Use the OLSResults.outlier_test() function to produce a DataFrame that contains the studentized residual for each observation.

    For example:

    #import necessary packages and functions
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    
    #create dataset
    df = pd.DataFrame({'rating': [90, 85, 82, 88, 94, 90, 76, 75, 87, 86],
                       'points': [25, 20, 14, 16, 27, 20, 12, 15, 14, 19]})
    
    #fit simple linear regression model
    model = ols('rating ~ points', data=df).fit()
    
    #calculate studentized residuals
    stud_res = model.outlier_test()
    
    #display studentized residuals
    print(stud_res)
    
    student_resid    unadj_p     bonf(p)
    0   -0.486471   0.641494    1.000000
    1   -0.491937   0.637814    1.000000
    2    0.172006   0.868300    1.000000
    3    1.287711   0.238781    1.000000
    4    0.106923   0.917850    1.000000
    5    0.748842   0.478355    1.000000
    6   -0.968124   0.365234    1.000000
    7   -2.409911   0.046780    0.467801
    8    1.688046   0.135258    1.000000
    9   -0.014163   0.989095    1.000000
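
    The student_resid column holds the studentized residuals themselves; if that is all you need, it can be pulled out of the returned DataFrame (a small sketch reusing the stud_res object from above):

    #keep only the column of studentized residuals
    print(stud_res['student_resid'])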

    This tutorial provides a full explanation: https://www.statology.org/studentized-residuals-in-python/

  • 2020-12-11 18:15

    For a simple linear regression, you can calculate studentized residuals using the following steps.

    Define the means of X and Y as:

    mean_X = sum(X) / len(X) 
    mean_Y = sum(Y) / len(Y) 
    

    Now you have to estimate the coefficients beta0 and beta1:

    beta1 = sum([(X[i] - mean_X)*(Y[i] - mean_Y) for i in range(len(X))]) / sum([(X[i] - mean_X)**2 for i in range(len(X))]) 
    beta0 = mean_Y - beta1 * mean_X
    

    Now compute the fitted values:

    y_hat = [beta0 + beta1*X[i] for i in range(len(X))]
    

    Now compute the residuals, which are Y - y_hat:

    residuals = [Y[i] - y_hat[i] for i in range(len(Y))]
    

    We need to find the hat matrix H = X(X'X)^-1 X', where X is the design matrix of our independent variables (including a column of ones for the intercept).

    To find the leverages, we take the diagonal elements of the H matrix:

    leverage = numpy.diagonal(H)
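
    As a concrete sketch (assuming here that X is a plain Python list of predictor values, with the intercept column added explicitly), the hat matrix and the leverages can be built with numpy:

    import numpy
    
    #design matrix with a column of ones for the intercept
    X_mat = numpy.column_stack((numpy.ones(len(X)), numpy.array(X, dtype=float)))
    #hat matrix H = X (X'X)^-1 X'
    H = X_mat @ numpy.linalg.inv(X_mat.T @ X_mat) @ X_mat.T
    #leverages are the diagonal elements of H
    leverage = numpy.diagonal(H)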
    

    Find the standard error of the regression as:

    import math
    
    Var_e = sum([(Y[i] - y_hat[i])**2 for i in range(len(Y))]) / (len(Y) - 2)
    SE_regression = [math.sqrt(Var_e * (1 - leverage[i])) for i in range(len(leverage))]
    

    Now you can compute the studentized residuals:

    studentized_residuals = [residuals[i] / SE_regression[i] for i in range(len(residuals))]
    

    Note that there are two types of studentized residuals: internally studentized residuals and externally studentized residuals.

    My solution finds Internally Studentized Residuals.

    I made corrections to my calculation. For externally studentized residuals, refer to @kkawabat's answer.

  • 2020-12-11 18:31

    Nodar's implementation is incorrect. Here is the corrected formula from https://newonlinecourses.science.psu.edu/stat501/node/339/, as well as the deleted (externally) studentized residual, in case people don't want to use the statsmodels package. Both formulas return the same results as the examples in the link above.

    import math
    import numpy as np
    
    def internally_studentized_residual(X, Y):
        #internally studentized residuals for a simple linear regression of Y on X
        X = np.array(X, dtype=float)
        Y = np.array(Y, dtype=float)
        mean_X = np.mean(X)
        mean_Y = np.mean(Y)
        n = len(X)
        diff_mean_sqr = np.dot((X - mean_X), (X - mean_X))
        beta1 = np.dot((X - mean_X), (Y - mean_Y)) / diff_mean_sqr
        beta0 = mean_Y - beta1 * mean_X
        y_hat = beta0 + beta1 * X
        residuals = Y - y_hat
        #leverages for simple linear regression
        h_ii = (X - mean_X) ** 2 / diff_mean_sqr + (1 / n)
        #residual standard error (square root of the error variance estimate)
        Var_e = math.sqrt(sum((Y - y_hat) ** 2) / (n - 2))
        SE_regression = Var_e * ((1 - h_ii) ** 0.5)
        studentized_residuals = residuals / SE_regression
        return studentized_residuals
    
    def deleted_studentized_residual(X,Y):
        #formula from https://newonlinecourses.science.psu.edu/stat501/node/401/
        r = internally_studentized_residual(X,Y)
        n = len(r)
        return [r_i*math.sqrt((n-2-1)/(n-2-r_i**2)) for r_i in r]
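
    For example, calling the two functions on the rating/points data from the earlier answer (just to illustrate usage; any two equal-length sequences of numbers work):

    X = [25, 20, 14, 16, 27, 20, 12, 15, 14, 19]
    Y = [90, 85, 82, 88, 94, 90, 76, 75, 87, 86]
    
    print(internally_studentized_residual(X, Y))
    print(deleted_studentized_residual(X, Y))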
    