How do I perform stepwise regression in Python? There are methods for OLS in SciPy, but I have not found a way to do stepwise selection. Any help in this regard would be appreciated.
You can implement forward-backward selection on top of the statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
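For readers who cannot follow the link, here is a minimal sketch of greedy forward selection on adjusted R-squared with the statsmodels formula API. It is not the linked function: the name forward_select, the synthetic DataFrame, and its column names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def forward_select(data, response):
    """Greedy forward selection on adjusted R-squared (illustrative sketch)."""
    remaining = [c for c in data.columns if c != response]
    selected = []
    best_score = 0.0  # adjusted R-squared of the intercept-only model
    while remaining:
        # Score each candidate when added to the terms selected so far.
        scores = []
        for candidate in remaining:
            formula = f"{response} ~ {' + '.join(selected + [candidate])}"
            scores.append((smf.ols(formula, data).fit().rsquared_adj, candidate))
        score, candidate = max(scores)
        if score <= best_score:
            break  # no remaining candidate improves adjusted R-squared
        best_score = score
        selected.append(candidate)
        remaining.remove(candidate)
    return selected


# Synthetic data (hypothetical): y depends on a and b only; c is pure noise.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df["y"] = 3.0 * df["a"] - 2.0 * df["b"] + rng.normal(scale=0.5, size=200)

selected = forward_select(df, "y")
print(selected)
```

Swapping rsquared_adj for another criterion (AIC, BIC, or a p-value) only changes the scoring line and the direction of the comparison.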
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm: a fitted statsmodels OLS result, sm.OLS(y, X).fit(), where X is a column of n ones (n is the number of data points) and y is the response in the training data
curr_preds: a list containing only ['const']; it is modified in place
potential_preds: a list of all potential predictors. The function also reads two variables from the enclosing scope: y, the response, and X_mix, a pandas DataFrame that holds all of the data, including 'const' and a column for each potential predictor
tol: optional, the maximum p-value; .05 if not specified
import statsmodels.api as sm

def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    # y and X_mix are read from the enclosing scope (see the input list above)
    while len(potential_preds) > 0:
        index_best = -1  # this will record the index of the best predictor
        curr = -1  # this will record the current index
        best_r_squared = lm.rsquared_adj  # adjusted r-squared of the current model
        # loop to determine if any of the predictors can better the adjusted r-squared
        for pred in potential_preds:
            curr += 1  # increment current
            preds = curr_preds.copy()  # grab the current predictors
            preds.append(pred)
            # fit a model with the current predictors plus one additional potential predictor
            lm_new = sm.OLS(y, X_mix[preds]).fit()
            new_r_sq = lm_new.rsquared_adj  # adjusted r-squared of the new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:
            # a potential predictor improved the adjusted r-squared;
            # remove it from potential_preds and add it to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:
            # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # fit a new lm using the new predictors, look at the p-values
        pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
        pval_too_big = []
        # make a list of all the features whose p-value is greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':
                pval_too_big.append(feat)
        # now remove all the features from curr_preds that have a p-value that is too large
        for feat in pval_too_big:
            curr_preds.remove(feat)
        # refit so the next pass compares against the current model, not the initial one
        lm = sm.OLS(y, X_mix[curr_preds]).fit()
    return curr_preds
"""Importing the api class from statsmodels"""
import statsmodels.api as sm  # the array-based OLS class lives in statsmodels.api, not formula.api
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can read the p-values of your variables in the column labelled 'P>|t|'. Find the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Remove that column from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these steps until you have removed every column whose p-value is higher than the significance level (e.g. 0.05). In the end, X_opt will contain only the variables with p-values below the significance level.
You may try mlxtend, which offers various selection methods.
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
clf = LinearRegression()
# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)
# Perform SFS (plain sequential forward selection, since floating=False)
sfs1 = sfs1.fit(X_train, y_train)