Difference in Differences in Python + Pandas

馋奶兔 提交于 2019-12-03 07:32:57

It seems that what you need are not difference in differences (DD) regressions. DD regressions are relevant when you can distinguish a control group and a treatment group. A standard simplified example would be the evaluation of a medecine. You split a population of sick people in two groups. Half of them are given nothing: they are the control group. The other half are given a medicine: they are the treatment group. Essentially, the DD regression will capture the fact that the real effect of the medicine is not directly measurable in terms of how many people who were given the medicine got healthy. Intuitively, you want to know if these people did better than the ones who were not given any medicine. This result could be refined by adding yet another category: a placebo one i.e. people who are given something which looks like a medicine but actually isn't... but again this would be a well defined group. Last but not least, for a DD regression to be really appropriate, you need to make sure groups are not heterogeneous in a way that could bias results. A bad situation for your medicine test would be if the treatment group includes only people who are young and super fit (hence more likely to heal in general), while the control group is a bunch of old alcoholics...

In your case, if I'm not mistaken, everybody gets "treated" to some extent... so you are closer to a standard regression framework where the impact of X on Y (e.g. IQ on wage) is to be measured. I understand that you want to measure the impact of the number of permits on the score (or is it the other way? -_-), and you have classical endogeneity to deal with i.e. if Peter is more skilled than Paul, he'll typically obtain more permits AND a higher score. So what you actually want to use is the fact that with the same level of skill over time, Peter (respectively Paul) will be "given" different levels of permits over years... and there you'll really measure the influence of permits on score...

I might not be guessing well, but I want to insist on the fact that there are many ways to obtain biased, hence meaningless results, if you don't put enough efforts to understand/explain what's going on in the data. Regarding technical details, your estimation only have year fixed effects (likely not estimated but taken into account through demeaning, hence not returned in the output), so what you want to do is to add entity_effects = True. If you want to go further... I'm afraid panel data regressions are not well covered in any Python package so far, (including statsmodels which if the reference for econometrics) so if you're not willing to invest... I would rather suggest using R or Stata. Meanwhile, if a Fixed Effect regression is all you need, you can also get it with statsmodels (which also allows to cluster standard errors if needed...):

import statsmodels.formula.api as smf
df = s.reset_index(drop = False)
reg = smf.ols('y ~ x + C(date) + C(id)',
              data = df).fit()
print(reg.summary())
# clustering standard errors at individual level
reg_cl = smf.ols(formula='y ~ x + C(date) + C(id)',
                 data=df).fit(cov_type='cluster',
                              cov_kwds={'groups': df['id']})
print(reg_cl.summary())
# output only coeff and standard error of x
print(u'{:.3f} ({:.3f})'.format(reg.params.ix['x'], reg.bse.ix['x']))
print(u'{:.3f} ({:.3f})'.format(reg_cl.params.ix['x'], reg_cl.bse.ix['x']))

Regarding econometrics, you'll likely get more/better answers on Cross Validated than here.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!