How to iterate over columns of pandas dataframe to run regression

Happy的楠姐 2021-01-29 18:07

I'm sure this is simple, but as a complete newbie to Python, I'm having trouble figuring out how to iterate over variables in a pandas dataframe and run a regress

8 Answers
  • 2021-01-29 18:08

    You can index dataframe columns by position using iloc. (Older code used ix for this, but ix was deprecated in pandas 0.20 and removed in 1.0.)

    df1.iloc[:, 1]
    

    This returns the column at position 1 (positions start at 0, so this is the second column of the frame).

    df1.iloc[0, :]
    

    This returns the first row.

    This would be the value at the intersection of row 0 and column 1:

    df1.iloc[0, 1]
    

    and so on. So you can loop over enumerate(returns.keys()) and use the number to index the dataframe.
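
    A minimal sketch of that idea, assuming a small made-up dataframe named returns (its column names and values are illustrative only; the real data is not shown in the question). It walks the column positions with enumerate() and looks each column up with iloc:

```python
import pandas as pd

# Hypothetical data standing in for the question's `returns` dataframe.
returns = pd.DataFrame({
    "a": [0.01, 0.02, -0.01],
    "b": [0.03, 0.00, 0.02],
})

first_values = {}
for position, name in enumerate(returns.keys()):
    column = returns.iloc[:, position]   # Series for the column at this position
    first_values[name] = column.iloc[0]  # value in row 0 of that column

print(first_values)
```

    If the position is not needed, iterating over the dataframe directly yields the column names and avoids the positional lookup.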

  • 2021-01-29 18:08

    Based on the accepted answer, if an index corresponding to each column is also desired:

    for i, column in enumerate(df):
        print(i, df[column])
    

    Each df[column] above is a pandas Series, which can be converted to a NumPy ndarray if needed:

    import numpy as np

    for i, column in enumerate(df):
        print(i, np.asarray(df[column]))
    
  • 2021-01-29 18:10

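    You can use items() (called iteritems() in older pandas; that name was removed in pandas 2.0):

    for name, values in df.items():
        # values is the column as a Series; iloc[0] is its first element
        print('{name}: {value}'.format(name=name, value=values.iloc[0]))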
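
    To tie this back to the regression in the question, here is a hedged sketch that regresses every other column on one chosen column. The dataframe, its column names, and the use of np.polyfit as the fitting routine are all assumptions for illustration; the question as shown does not say which model or library was intended.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y1 = 2*x + 1 and y2 = -x, exactly.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
df["y1"] = 2 * df["x"] + 1
df["y2"] = -df["x"]

x = df["x"].to_numpy()
fits = {}
for name, values in df.items():
    if name == "x":
        continue  # skip the predictor itself
    # Degree-1 least-squares fit: values ~ slope * x + intercept
    slope, intercept = np.polyfit(x, values.to_numpy(), 1)
    fits[name] = (slope, intercept)

for name, (slope, intercept) in fits.items():
    print(f"{name}: slope={slope:.3f}, intercept={intercept:.3f}")
```

    Storing the results in a dict keyed by column name keeps each fit easy to look up after the loop.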
    
  • 2021-01-29 18:24

    I'm a bit late but here's how I did this. The steps:

    1. Create a list of all columns
    2. Use itertools to take every combination of x columns
    3. Append each model's R squared value to a result DataFrame, along with the list of excluded columns
    4. Sort the result DataFrame in descending order of R squared to see which combination fits best.

    This is the code I used on a DataFrame called aft_tmt. Feel free to adapt it to your use case.

    import itertools

    import pandas as pd
    import statsmodels.formula.api as smf

    # Print DataFrames without truncating columns or cell contents
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

    # Candidate predictors: every column of the DF except the ones I don't want to use.
    itercols = aft_tmt.columns.tolist()
    for col in ("sc97", "sc", "grc", "grc97"):
        itercols.remove(col)
    print(itercols)
    print(len(itercols))

    # One row per fitted model, collected in a list and turned into the
    # results DF at the end (DataFrame.append was removed in pandas 2.0).
    rows = []

    # Change 9 to the number of columns you want to combine from N columns.
    # Possibly run an outer loop from 0 to N/2?
    for x in itertools.combinations(itercols, 9):
        lmstr = "+".join(x)
        f = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt).fit()
        excluded = "+".join(y for y in itercols if y not in x)
        rows.append([f.rsquared, lmstr, excluded])

    regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])
    # Best fit first; sort_values returns a new DataFrame, so keep the result
    regression_res = regression_res.sort_values(by="Rsq", ascending=False)
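
    Since aft_tmt is not available here, the following is a self-contained sketch of the same combination search on made-up data. Everything in it is an assumption for illustration: the column names, the synthetic target, and the helper r_squared, which swaps in a plain NumPy least-squares fit instead of statsmodels so the example has fewer dependencies.

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical data: the target depends only on a and b; c is pure noise.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
data["target"] = 3 * data["a"] - 2 * data["b"]

def r_squared(frame, predictors, response):
    """R squared of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(frame)), frame[list(predictors)].to_numpy()])
    y = frame[response].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1 - residuals.var() / y.var()

candidates = ["a", "b", "c"]
rows = []
for combo in itertools.combinations(candidates, 2):
    excluded = [c for c in candidates if c not in combo]
    rows.append({
        "Rsq": r_squared(data, combo, "target"),
        "predictors": "+".join(combo),
        "excluded": "+".join(excluded),
    })

results = pd.DataFrame(rows).sort_values(by="Rsq", ascending=False)
print(results)
```

    Keep in mind that the number of combinations grows quickly with the number of candidate columns, so an exhaustive search like this is only practical for small column counts.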
    
  • 2021-01-29 18:25

    Using a list comprehension, you can get all the column names (the header):

    [column for column in df]

  • 2021-01-29 18:29
    for column in df:
        print(df[column])
    