Stepwise Regression in Python

青春惊慌失措 2020-12-24 14:26

How do I perform stepwise regression in Python? There are methods for OLS in SciPy, but I am not able to do stepwise selection with them. Any help in this regard would be appreciated.

7 Answers
  • 2020-12-24 14:34

    You can do forward-backward selection based on a statsmodels.api.OLS model, as shown in this answer.
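
    For reference, here is a minimal sketch of that p-value-based forward-backward approach, assuming X is a pandas DataFrame of candidate predictors and y is the response (both hypothetical names). Forward steps add the predictor with the smallest p-value below threshold_in; backward steps drop the predictor with the largest p-value above threshold_out:

    import statsmodels.api as sm

    def stepwise_selection(X, y, threshold_in=0.01, threshold_out=0.05):
        included = []
        while True:
            changed = False
            # forward step: find the excluded predictor with the smallest p-value
            excluded = [c for c in X.columns if c not in included]
            new_pvals = {}
            for col in excluded:
                model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
                new_pvals[col] = model.pvalues[col]
            if new_pvals and min(new_pvals.values()) < threshold_in:
                included.append(min(new_pvals, key=new_pvals.get))
                changed = True
            # backward step: drop the included predictor with the largest p-value
            if included:
                model = sm.OLS(y, sm.add_constant(X[included])).fit()
                pvals = model.pvalues.drop('const')
                if pvals.max() > threshold_out:
                    included.remove(pvals.idxmax())
                    changed = True
            if not changed:
                return included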

    However, this answer describes why you should not use stepwise selection for econometric models in the first place.

  • 2020-12-24 14:37

    Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
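
    For illustration, a minimal sketch in the spirit of that function, assuming a pandas DataFrame data whose columns include the response (hypothetical names); it greedily adds the predictor that most improves adjusted R-squared, using the statsmodels formula API:

    import statsmodels.formula.api as smf

    def forward_selected(data, response):
        remaining = set(data.columns) - {response}
        selected = []
        best_score = 0.0
        while remaining:
            scores = []
            for candidate in remaining:
                formula = "{} ~ {}".format(response, " + ".join(selected + [candidate]))
                scores.append((smf.ols(formula, data).fit().rsquared_adj, candidate))
            score, candidate = max(scores)
            if score <= best_score:
                break  # no remaining candidate improves adjusted R^2
            remaining.remove(candidate)
            selected.append(candidate)
            best_score = score
        rhs = " + ".join(selected) if selected else "1"
        return smf.ols("{} ~ {}".format(response, rhs), data).fit()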

  • 2020-12-24 14:47

    Statsmodels has additional examples of regression methods: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think they will help you implement stepwise regression.
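
    For orientation, a minimal OLS fit showing the two quantities the stepwise procedures in the other answers rely on, assuming X is a DataFrame of predictors and y is the response (hypothetical names):

    import statsmodels.api as sm

    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(model.rsquared_adj)  # adjusted R^2, used by R^2-based selection
    print(model.pvalues)       # per-coefficient p-values, used by p-value-based selection
    print(model.summary())     # full regression table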

  • 2020-12-24 14:51

    Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:

    • lm, a fitted intercept-only model, sm.OLS(Y, X).fit(), where X is a column of n ones (n is the number of data points) and Y is the response in the training data

    • curr_preds, a list initialized to ['const']

    • potential_preds, a list of all potential predictors

    • y and X_mix: the training response, and a pandas DataFrame that has all of the data, including 'const' and a column for each potential predictor (the version below takes these as explicit arguments rather than globals)

    • tol, optional: the maximum allowed p-value, 0.05 if not specified

    import statsmodels.api as sm

    def mixed_selection(lm, curr_preds, potential_preds, y, X_mix, tol=.05):
      while len(potential_preds) > 0:
        index_best = -1  # records the index of the best predictor this pass
        curr = -1  # records the current index
        best_r_squared = lm.rsquared_adj  # adjusted r-squared of the current model
        # forward step: check whether any predictor improves the adjusted r-squared
        for pred in potential_preds:
          curr += 1  # increment the current index
          preds = curr_preds.copy()  # grab the current predictors
          preds.append(pred)
          # fit a model with the current predictors plus one additional potential predictor
          lm_new = sm.OLS(y, X_mix[preds]).fit()
          new_r_sq = lm_new.rsquared_adj  # adjusted r-squared of the new model
          if new_r_sq > best_r_squared:
            best_r_squared = new_r_sq
            index_best = curr

        if index_best != -1:  # a predictor improved the adjusted r-squared; move it from potential_preds to curr_preds
          curr_preds.append(potential_preds.pop(index_best))
        else:  # none of the remaining predictors improved the adjusted r-squared; exit the loop
          break

        # backward step: refit with the new predictor set and inspect the p-values
        lm = sm.OLS(y, X_mix[curr_preds]).fit()
        pvals = lm.pvalues
        # collect every predictor whose p-value exceeds the tolerance
        pval_too_big = [feat for feat in pvals.index if pvals[feat] > tol and feat != 'const']

        # remove those predictors from curr_preds, then refit so the next
        # pass compares candidates against the updated model
        for feat in pval_too_big:
          curr_preds.remove(feat)
        if pval_too_big:
          lm = sm.OLS(y, X_mix[curr_preds]).fit()

      return curr_preds
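
    A hypothetical call, assuming a DataFrame df of predictors and a response y (illustrative names only):

    import statsmodels.api as sm

    X_mix = sm.add_constant(df)              # adds the 'const' column of ones
    lm0 = sm.OLS(y, X_mix[['const']]).fit()  # intercept-only starting model
    chosen = mixed_selection(lm0, ['const'], list(df.columns), y, X_mix)
    print(chosen)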
    
  • 2020-12-24 14:53
    """Importing the api class from statsmodels"""
    import statsmodels.formula.api as sm
    
    """X_opt variable has all the columns of independent variables of matrix X 
    in this case we have 5 independent variables"""
    X_opt = X[:,[0,1,2,3,4]]
    
    """Running the OLS method on X_opt and storing results in regressor_OLS"""
    regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
    regressor_OLS.summary()
    

    Using the summary method, you can check the p-values of your variables in the output, listed under 'P>|t|'. Then find the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove its column from your array and repeat all the steps.

    X_opt = X[:, [0, 1, 3, 4]]
    regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
    regressor_OLS.summary()
    

    Repeat these steps until you have removed every column whose p-value is higher than the significance level (e.g. 0.05). At the end, X_opt will contain only the variables with p-values below the significance level.
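
    Here is a minimal sketch automating that manual loop, assuming X is a NumPy array whose first column is the intercept of ones and y is the response (hypothetical names). Note that, like the manual procedure, it may drop the intercept column if its p-value is the worst:

    import numpy as np
    import statsmodels.api as sm

    def backward_elimination(X, y, sl=0.05):
        cols = list(range(X.shape[1]))  # start with every column
        while len(cols) > 1:
            model = sm.OLS(endog=y, exog=X[:, cols]).fit()
            pvals = np.asarray(model.pvalues)
            worst = int(np.argmax(pvals))
            if pvals[worst] <= sl:
                break  # every remaining p-value is at or below the significance level
            cols.pop(worst)  # drop the least significant column and refit
        return cols, sm.OLS(endog=y, exog=X[:, cols]).fit()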

  • 2020-12-24 14:54

    You may try mlxtend, which provides various selection methods.

    from sklearn.linear_model import LinearRegression
    from mlxtend.feature_selection import SequentialFeatureSelector as sfs

    clf = LinearRegression()

    # Build the step-forward feature selector
    # (floating=False gives plain sequential forward selection; floating=True would give SFFS)
    sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

    # Perform sequential forward selection
    sfs1 = sfs1.fit(X_train, y_train)
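
    After fitting, the chosen features are available on the selector object (continuing the example above):

    print(sfs1.k_feature_idx_)  # indices of the selected features
    print(sfs1.k_score_)        # cross-validated r2 of the selected subset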
    