I'm sure this is simple, but as a complete newbie to Python, I'm having trouble figuring out how to iterate over variables in a pandas
dataframe and run a regress
You can index dataframe columns by position using iloc (older answers use ix, which was removed in pandas 1.0).
df1.iloc[:,1]
This returns the column at position 1, that is, the second column, since positions start at 0. (The index is not counted as a column.)
df1.iloc[0]
This returns the first row.
This would be the value at the intersection of row 0 and column 1:
df1.iloc[0,1]
and so on. So you can enumerate() over returns.keys() and use the number to index the dataframe.
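A minimal sketch of the positional indexing described above, written with iloc, the positional indexer in current pandas. The frame df1 and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

first_row = df1.iloc[0]       # row at position 0, returned as a Series
second_col = df1.iloc[:, 1]   # column at position 1, here "b"
cell = df1.iloc[0, 1]         # value at row 0, column 1

# Walk the columns by position using enumerate().
col_sums = {}
for i, name in enumerate(df1.columns):
    col_sums[name] = int(df1.iloc[:, i].sum())
```

Positions are zero-based and the index does not count as a column, so position 1 is the second data column.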
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The type of df[column] above is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
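As a self-contained sketch of the conversion above, assuming a small made-up frame, each column can be collected as a NumPy array like this:

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

arrays = {}
for i, column in enumerate(df):
    # Iterating a DataFrame yields column labels; df[column] is a Series.
    arr = np.asarray(df[column])  # df[column].to_numpy() gives the same result here
    arrays[column] = arr
```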
You can use items() (called iteritems() in older pandas; that name was removed in pandas 2.0):
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values.iloc[0]))
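A small sketch of the same loop using items(), the spelling available in current pandas, on made-up data. It stores the first value of each column instead of printing it.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"height": [1.5, 1.8], "weight": [60, 80]})

first_values = {}
for name, values in df.items():
    # name is the column label, values is the column as a Series.
    first_values[name] = values.iloc[0]
```

Using values.iloc[0] rather than values[0] keeps the lookup positional even when the index labels are not integers starting at 0.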
I'm a bit late but here's how I did this. The steps:
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
print(len(itercols))
# collect one result row per combination of predictors
results = []
# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    # columns left out of this model
    exc = [y for y in itercols if y not in x]
    results.append([f.rsquared, lmstr, "+".join(exc)])
# results DF, sorted so the best fit comes first
regression_res = pd.DataFrame(results, columns=["Rsq", "predictors", "excluded"])
regression_res = regression_res.sort_values(by="Rsq", ascending=False)
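For readers without the aft_tmt data, here is a rough sketch of the same idea (fit every combination of predictors, record R squared, sort) on made-up data. It swaps statsmodels for a plain least-squares fit with numpy.linalg.lstsq so it runs with only NumPy and pandas; the helper r_squared and all column names are hypothetical.

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical data: y depends exactly on x1 and x2; x3 is unrelated.
data = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})
data["y"] = 2.0 * data["x1"] - 1.0 * data["x2"] + 0.5


def r_squared(frame, response, predictors):
    """Fit ordinary least squares with an intercept and return R squared."""
    columns = [np.ones(len(frame))] + [frame[p].to_numpy() for p in predictors]
    X = np.column_stack(columns)
    y = frame[response].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


predictors = [c for c in data.columns if c != "y"]
rows = []
# Try every pair of predictors; change 2 to fit larger combinations.
for combo in itertools.combinations(predictors, 2):
    rows.append({"Rsq": r_squared(data, "y", combo), "predictors": "+".join(combo)})

results = (
    pd.DataFrame(rows)
    .sort_values(by="Rsq", ascending=False)
    .reset_index(drop=True)
)
```

Collecting rows in a list and building the DataFrame once avoids growing a DataFrame inside the loop. Note that the number of combinations grows quickly with the number of columns.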
Using a list comprehension, you can get all the column names (header):
[column for column in df]
for column in df:
    print(df[column])
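A short sketch, on a made-up frame, showing that iterating a DataFrame yields its column labels, so the list comprehension above gives the same result as df.columns.tolist():

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

names = [column for column in df]   # iterating a DataFrame yields column labels
also_names = df.columns.tolist()    # the same labels, taken from the columns Index
```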