Question
My dataset ("prob") is an unbalanced panel, looking like:
index                                    x1    x2     x3      y (dummy 0/1)
(100, Timestamp('2016-01-26 09:10:00'))  19.9  13.44  -0.006  0
(100, Timestamp('2016-01-26 09:15:00'))  17.2  13.25  -0.046  0
(200, Timestamp('2016-01-26 09:20:00'))  19.4  19.06   0.04   1
I would like to estimate a panel probit model in Python (y is my left-hand side variable; x1, x2, x3 are the right-hand side variables). The panel entity should be the IDs, which form the first level of the DataFrame index (100, 200, etc.). As far as I understand, I need a Python function similar to Stata's "xtprobit".
The only way I came up with is:
from linearmodels.panel import PanelOLS

mod = PanelOLS(prob.dummy, prob[['x1', 'x2', 'x3']], entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res)
Is this a panel probit model?
The output looks different from that of the probit model (obtained via the "sm.Probit" function from statsmodels), and I do not know how to estimate probit marginal effects. Or should I somehow modify "sm.Probit" to make it a panel probit? (At the moment I only know how to use it in a "time-series" manner, for a single entity.)
Answer 1:
Some background:
The behavior of models for panel data depends on whether we have a large number of observations within each entity or group, n_i (long panels), or a large number of groups each with only a small number of observations (wide panels).
statsmodels mostly uses the term groups to refer to the entities.
The asymptotic behavior of the models depends on whether all n_i become large, or whether n_i stays small while the number of groups becomes large. Additionally, the implementation of each estimator targets one of the two cases.
In the case of long panels we can use the standard estimators, and a fixed effect for each group can be estimated consistently.
So in this case we can just use dummy variables for the group or entity effect, for example creating the entity dummies automatically with patsy through the formula interface, where data is a pandas DataFrame or dict-like object with variable names as keys.

from statsmodels.formula.api import probit

mod = probit('y ~ x1 + x2 + x3 + C(group_id)', data)

Patsy creates fixed-effects dummies for C(group_id). If a constant is included, which it is by default, then one reference level is dropped to avoid the "dummy variable trap".
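A minimal, self-contained sketch of this approach on simulated data (the column names group_id, x1, x2, x3, y are assumptions matching the formula above). It also shows get_margeff(), which answers the marginal-effects part of the question:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-panel data: few entities, many observations each
rng = np.random.default_rng(0)
n_groups, n_obs = 5, 200
data = pd.DataFrame({
    'group_id': np.repeat(np.arange(n_groups), n_obs),
    'x1': rng.normal(size=n_groups * n_obs),
    'x2': rng.normal(size=n_groups * n_obs),
    'x3': rng.normal(size=n_groups * n_obs),
})
# Latent index includes a group-specific effect, so fixed effects matter
group_effect = np.repeat(rng.normal(size=n_groups), n_obs)
latent = 0.5 * data['x1'] - 0.3 * data['x2'] + 0.2 * data['x3'] + group_effect
data['y'] = (latent + rng.normal(size=len(data)) > 0).astype(int)

# Probit with entity fixed effects created by patsy's C() dummy coding
mod = smf.probit('y ~ x1 + x2 + x3 + C(group_id)', data)
res = mod.fit(disp=0)
print(res.summary())

# Average marginal effects
print(res.get_margeff().summary())
```

With 5 groups and an intercept, C(group_id) contributes 4 dummies, for 8 estimated parameters in total.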
A similar distinction between long and wide panels applies to standard errors that are robust to within group correlation.
cov_type='cluster' assumes the wide panel case, i.e. a large number of entities and only a few observations per entity. The computation assumes that the number of entities or clusters is larger than the number of observations within clusters, IIRC.
For long panels with serial correlation within entities we can use a HAC cov_type that accounts for correlation within entities. For this case statsmodels has the cov_types "hac-panel" and "hac-groupsum" available.
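A short sketch of cluster-robust standard errors for a pooled probit, assuming hypothetical column names group_id, x1, y; the cov_type and cov_kwds keywords are passed to the discrete model's fit method:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated wide-panel data: many entities, few observations each
rng = np.random.default_rng(0)
n_groups, n_obs = 200, 4
data = pd.DataFrame({
    'group_id': np.repeat(np.arange(n_groups), n_obs),
    'x1': rng.normal(size=n_groups * n_obs),
})
data['y'] = (0.5 * data['x1'] + rng.normal(size=len(data)) > 0).astype(int)

# Pooled probit; standard errors clustered by entity
mod = smf.probit('y ~ x1', data)
res = mod.fit(disp=0, cov_type='cluster',
              cov_kwds={'groups': data['group_id']})
print(res.bse)       # cluster-robust standard errors
print(res.cov_type)  # 'cluster'
```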
statsmodels still does not have a central location documenting the sandwich cov_types, but they work the same way in every model that supports them. The available cov_types and the required additional information are described here:
http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html
For wide panels, the main model available in statsmodels is GEE. Recently a Bayesian MixedGLM has been added. There are no frequentist MixedGLM models yet; the only mixed model available is the linear Gaussian MixedLM.
Source: https://stackoverflow.com/questions/50784406/panel-probit-in-python