How to iterate over columns of pandas dataframe to run regression

Happy的楠姐 2021-01-29 18:07

I'm sure this is simple, but as a complete newbie to Python, I'm having trouble figuring out how to iterate over variables in a pandas dataframe and run a regress

8 answers
  •  野的像风
    2021-01-29 18:24

    I'm a bit late but here's how I did this. The steps:

    1. Create a list of all candidate predictor columns
    2. Use itertools.combinations to generate every combination of x of those columns
    3. Fit a regression on each combination and append its R squared value to a result dataframe, along with the list of excluded columns
    4. Sort the result DF in descending order of R squared to see which combination is the best fit.

    This is the code I used on a DataFrame called aft_tmt. Feel free to adapt it to your use case.

    import pandas as pd
    # setting options to print without truncating output
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)
    
    import statsmodels.formula.api as smf
    import itertools
    
    # This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
    itercols = aft_tmt.columns.tolist()
    itercols.remove("sc97")
    itercols.remove("sc")
    itercols.remove("grc")
    itercols.remove("grc97")
    print(itercols)
    print(len(itercols))
    
    # Collect one row per fitted model in a list; building the results DF once
    # at the end is faster than growing it row by row.
    rows = []
    
    # change 9 to the number of columns you want to combine from N columns.
    # Possibly run an outer loop from 0 to N/2?
    for x in itertools.combinations(itercols, 9):
        lmstr = "+".join(x)
        m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
        f = m.fit()
        # columns left out of this model
        exc = "+".join([y for y in itercols if y not in x])
        rows.append([f.rsquared, lmstr, exc])
    
    # results DF, sorted so the best fit (highest R squared) comes first
    regression_res = pd.DataFrame(rows, columns = ["Rsq", "predictors", "excluded"])
    regression_res = regression_res.sort_values(by="Rsq", ascending = False)
    print(regression_res)
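
    The outer loop hinted at in the code comment could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper name best_subsets, its max_size parameter, and the synthetic demo frame (standing in for aft_tmt) are all made up for the example. It also records adjusted R squared, since plain R squared never decreases when a predictor is added and so tends to favour the larger models when subset sizes are mixed.

```python
import itertools

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def best_subsets(df, response, candidates, max_size):
    """Fit `response ~ subset` for every subset of 1..max_size candidate columns.

    Hypothetical helper for illustration; not part of pandas or statsmodels.
    """
    rows = []
    for k in range(1, max_size + 1):  # outer loop over subset sizes
        for combo in itertools.combinations(candidates, k):
            predictors = "+".join(combo)
            fit = smf.ols(formula=response + " ~ " + predictors, data=df).fit()
            rows.append({
                "n_predictors": k,
                "Rsq": fit.rsquared,
                "adj_Rsq": fit.rsquared_adj,
                "predictors": predictors,
            })
    # Sort on adjusted R squared so models of different sizes are comparable.
    return pd.DataFrame(rows).sort_values(
        by="adj_Rsq", ascending=False, ignore_index=True
    )


# Synthetic stand-in for aft_tmt (assumption): y depends on a and b only,
# while c and d are pure noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
demo["y"] = 2.0 * demo["a"] - 3.0 * demo["b"] + rng.normal(scale=0.5, size=200)

results = best_subsets(demo, "y", ["a", "b", "c", "d"], max_size=2)
print(results.head())
```

    Keep in mind that the number of models grows quickly: N candidate columns give 2^N - 1 non-empty subsets, so capping max_size is what keeps the run time manageable.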
    
