问题
I have a data frame like this:
Index ID Industry years_spend asset
6646 892 4 4 144.977037
2347 315 10 8 137.749138
7342 985 1 5 104.310217
137 18 5 5 156.593396
2840 381 11 2 229.538828
6579 883 11 1 171.380125
1776 235 4 7 217.734377
2691 361 1 2 148.865341
815 110 15 4 233.309491
2932 393 17 5 187.281724
I want to create dummy variables for Industry X years_spend which creates len(df.Industry.value_counts()) * len(df.years_spend.value_counts())
varaible, for example d_11_4 = 1 for all rows that has industry==1 and years spend=4 otherwise d_11_4 = 0. Then I can use these vars for some regression works.
I know I can make groups like what I want using df.groupby(['Industry','years_spend']) and I know I can create such variable for one column using patsy
syntax in statsmodels
:
import statsmodels.formula.api as smf
mod = smf.ols("income ~ C(Industry)", data=df).fit()
but If I want to do with 2 columns I get an error that:
IndexError: tuple index out of range
How can I do that with pandas or using some function inside statsmodels?
回答1:
You could do something like this where you have to first create a calculated field that encapsulates the Industry
and years_spend
:
df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})
df['industry_years'] = df['Industry'].astype('str') + '_' + df['years_spend'].astype('str') # this is the calculated field
Here's what the df
looks like:
Industry years_spend industry_years
0 4 4 4_4
1 3 5 3_5
2 11 8 11_8
3 4 4 4_4
4 1 4 1_4
5 1 1 1_1
Now you can apply get_dummies:
df = pd.get_dummies(df, columns=['industry_years'])
That'll get you what you want :)
回答2:
Using patsy syntax it's just:
import statsmodels.formula.api as smf
mod = smf.ols("income ~ C(Industry):C(years_spend)", data=df).fit()
The :
character means "interaction"; you can also generalize this to interactions of more than two items (C(a):C(b):C(c)
), interactions between numerical and categorical values, etc. You might find the patsy docs useful.
来源:https://stackoverflow.com/questions/45056401/creating-dummy-variable-using-pandas-or-statsmodel-for-interaction-of-two-column