Creating dummy variables using pandas or statsmodels for the interaction of two columns

Posted by 点点圈 on 2019-12-04 19:40:35

One way to do this is to first create a calculated field that combines Industry and years_spend:

import pandas as pd

df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})
df['industry_years'] = df['Industry'].astype('str') + '_' + df['years_spend'].astype('str')  # this is the calculated field

Here's what the df looks like:

   Industry  years_spend industry_years
0         4            4            4_4
1         3            5            3_5
2        11            8           11_8
3         4            4            4_4
4         1            4            1_4
5         1            1            1_1

Now you can apply get_dummies:

df = pd.get_dummies(df, columns=['industry_years'])

That'll get you what you want :)
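To see what the steps above produce, here is a minimal, self-contained sketch that rebuilds the same frame and lists the generated dummy columns. The names `dummies` and `dummy_cols` are only illustrative and not part of the original answer.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Industry": [4, 3, 11, 4, 1, 1],
    "years_spend": [4, 5, 8, 4, 4, 1],
})

# Combined key: one string per (Industry, years_spend) pair
df["industry_years"] = df["Industry"].astype(str) + "_" + df["years_spend"].astype(str)

# One indicator column per distinct combined key; the key column itself is replaced
dummies = pd.get_dummies(df, columns=["industry_years"])

# Collect just the generated indicator columns (illustrative helper name)
dummy_cols = [c for c in dummies.columns if c.startswith("industry_years_")]
print(dummy_cols)
print(dummies[dummy_cols].sum(axis=1))  # each row belongs to exactly one combination
```

Rows 0 and 3 share the same pair (4, 4), so the six rows yield five indicator columns. Note that only combinations that actually occur in the data get a column.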

Alternatively, using patsy formula syntax (this assumes df also contains an income column to use as the response), it's just:

import statsmodels.formula.api as smf

mod = smf.ols("income ~ C(Industry):C(years_spend)", data=df).fit()

The : character means "interaction"; you can also generalize this to interactions of more than two items (C(a):C(b):C(c)), interactions between numerical and categorical values, etc. You might find the patsy docs useful.
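If you want to inspect the columns that the interaction term expands to without fitting a model, a rough sketch is to build the design matrix directly with patsy's `dmatrix`. This assumes patsy is installed (it ships as a statsmodels dependency in the versions I'm aware of); the variable name `design` is only illustrative.

```python
import pandas as pd
from patsy import dmatrix

# Same sample predictors as in the pandas answer above
df = pd.DataFrame({
    "Industry": [4, 3, 11, 4, 1, 1],
    "years_spend": [4, 5, 8, 4, 4, 1],
})

# Right-hand side of the formula only; C(...) marks a column as categorical,
# and ":" requests the interaction of the two categorical terms
design = dmatrix("C(Industry):C(years_spend)", df, return_type="dataframe")

# Column names show how patsy coded the interaction
print(list(design.columns))
print(design.shape)
```

How many columns you get, and how they are named, depends on patsy's contrast coding and on whether the intercept is kept. Unlike the get_dummies approach, patsy builds columns from the levels of each factor, so combinations that never occur in the data can still appear as all-zero columns.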
