Question
I'm trying to estimate child work based on a lagged variable of children's school aspirations.
I'm deciding whether I should use glm or clogit to run my models (I need fixed-effects logits). When I run my glm, the coefficients are very different from those of my clogit.
model1 <- glm(chldwork~lag_aspgrade_binned+age+as.factor(childid), data=finaletdtlag, family='binomial')
GLM Output:
Call:
glm(formula = chldwork ~ lag_aspgrade_binned + age + as.factor(childid),
family = "binomial", data = finaletdtlag)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.02350 0.00001 0.00002 0.17344 2.13769
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.037e+01 1.933e+04 0.002 0.9987
lag_aspgrade_binneddid not complete elementary 2.339e+00 1.083e+00 2.161 0.0307 *
lag_aspgrade_binnedhs 1.252e+00 6.082e-01 2.059 0.0395 *
lag_aspgrade_binnedprimary some hs 1.206e+00 6.739e-01 1.789 0.0735 .
lag_aspgrade_binnedsome college 2.081e+00 4.800e-01 4.335 1.46e-05 ***
age -6.123e-01 3.995e-02 -15.326 < 2e-16 ***
Also, when I ran my clogit, I didn't get an intercept in my output (like this example shows: https://data.princeton.edu/wws509/r/fixedRandom3).
My clogit output:
> modela <- clogit(chldwork~lag_aspgrade_binned+age+strata(childid), data=finaletdtlag, method = 'exact')
> summary(modela)
Call:
coxph(formula = Surv(rep(1, 2770L), chldwork) ~ lag_aspgrade_binned +
age + strata(childid), data = finaletdtlag, method = "exact")
n= 2770, number of events= 2358
coef exp(coef) se(coef) z Pr(>|z|)
lag_aspgrade_binneddid not complete elementary 1.09351 2.98473 0.83332 1.312 0.18944
lag_aspgrade_binnedhs 0.53032 1.69948 0.45095 1.176 0.23959
lag_aspgrade_binnedprimary some hs 0.49815 1.64567 0.50075 0.995 0.31983
lag_aspgrade_binnedsome college 1.00269 2.72560 0.34619 2.896 0.00377 **
age -0.36846 0.69180 0.02905 -12.684 < 2e-16 ***
Do I have an error in my code? Do y'all prefer one to the other?
Answer 1:
Calculating a nonlinear (binary) fixed-effects (FE) model with the least squares dummy variable (LSDV) approach, as you do with glm, does not take the incidental parameters problem (IPP) into account, and therefore the estimator is biased. Basically, the IPP occurs when the number of individuals, and thus the number of individual dummies, is large relative to the number of observed time periods. You can find a statistical explanation of the IPP in this answer on Cross Validated.
survival::clogit, as well as bife::bife combined with bife::bias_corr, take the IPP into account.
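Applied to your setting, a minimal sketch with bife (using the variable names and data frame from your question; untested on your data) would look like this:
library(bife)
fe.fit  <- bife(chldwork ~ lag_aspgrade_binned + age | childid, data = finaletdtlag)
fe.corr <- bias_corr(fe.fit)   # analytical bias correction for the IPP
summary(fe.corr)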
Let's attempt to calculate a binary FE model using glm (see "Data" at the bottom of the answer).
# LSDV approach: one dummy per individual via factor(id), no overall intercept
s <- summary(g.fit <- glm(y ~ 0 + x1 + x2 + factor(id), data=idc, family='binomial'))$coe
s[-grep("factor", rownames(s)), ]   # show only the structural parameters
# Estimate Std. Error z value Pr(>|z|)
# x1 0.80750181 0.32069729 2.517956 0.01180379
# x2 0.08353417 0.05040558 1.657240 0.09747086
Calculating the model with bife::bife initially gives the same result.
summary(b.fit <- bife::bife(y ~ x1 + x2 | id, data=idc))$cm
# Estimate Std. error z value Pr(> |z|)
# x1 0.80750173 0.32070309 2.517911 0.01180533
# x2 0.08353417 0.05040645 1.657212 0.09747662
However, the results are biased upwards and we get smaller estimates when we correct for the IPP:
summary(b.fit.corr <- bias_corr(b.fit))$cm
# Estimate Std. error z value Pr(> |z|)
# x1 0.65672233 0.3192673 2.056967 0.03968939
# x2 0.06672389 0.0498996 1.337163 0.18116946
clogit, by contrast, already takes the IPP into account:
summary(survival::clogit(y ~ x1 + x2 + strata(id), data=idc))$coe
# coef exp(coef) se(coef) z Pr(>|z|)
# x1 0.6533630 1.921994 0.2875215 2.272397 0.02306255
# x2 0.0659169 1.068138 0.0449555 1.466270 0.14257474
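To see the three sets of slope estimates side by side, a small comparison sketch (reusing s and b.fit.corr from above and storing the clogit fit in cl.fit):
cl.fit <- survival::clogit(y ~ x1 + x2 + strata(id), data = idc)
cbind(glm_LSDV  = s[c("x1", "x2"), "Estimate"],
      bife_corr = summary(b.fit.corr)$cm[, "Estimate"],
      clogit    = summary(cl.fit)$coe[, "coef"])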
Note that bife::bife calculates the standard errors more conservatively. However, using Stata we get the same result as with survival::clogit:
. use http://www.stata-press.com/data/r13/clogitid, clear
. clogit y x1 x2, group(id) noheader
note: multiple positive outcomes within groups encountered.
Iteration 0: log likelihood = -123.42828
Iteration 1: log likelihood = -123.41386
Iteration 2: log likelihood = -123.41386
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 | .653363 .2875215 2.27 0.023 .0898312 1.216895
x2 | .0659169 .0449555 1.47 0.143 -.0221943 .1540281
------------------------------------------------------------------------------
Hence, to answer your question, the results of your clogit approach should be preferred.
The issue with your "missing" intercept is explained as follows. In an ordinary regression model we have one overall intercept a, whereas in a fixed-effects model we have many individual intercepts a_i, one per individual. The conditional logit conditions these a_i out of the likelihood instead of estimating them, which is why clogit reports no intercept.
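In notation, a sketch of the two specifications (with \Lambda the logistic CDF, x_{it} the covariates, and \beta the common slopes):
\Pr(y_{it} = 1 \mid x_{it}) = \Lambda(a + x_{it}'\beta)      % pooled logit: one overall intercept a
\Pr(y_{it} = 1 \mid x_{it}) = \Lambda(a_i + x_{it}'\beta)    % FE logit: one intercept a_i per individual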
bife::bife stores the fixed effects in an object named alpha:
b.fit.corr$alpha
# 1014 1017 1019 1023 1025 1027
# -1.192037100 -0.740369821 -0.093573868 -0.740850707 0.374930145 -1.179830711
# 1029 1030 1031 1032 1039 1043
# -0.702326346 -0.460228413 -1.441325728 -1.286722837 -0.117864675 -2.140867994
# 1044 1045 1047 1050 1055 1060
# -2.114299987 0.645971508 -0.436457378 -1.165316816 -1.657762052 -0.720114822
# 1069 1073 1074 1075 1076 1077
# -1.637936117 -0.782373571 -1.395162657 -1.395167427 -1.637684316 -1.696587849
# 1078 1092 1094 1095 1097 1098
# -1.106264431 -0.127039684 -1.439563580 -1.310283872 -0.778302356 -1.982045758
# 1099 1104 1105 1108 1110 1113
# -0.005352088 -0.006881257 -1.802152970 -1.165296728 -0.314567361 -1.817725352
# 1118 1120 1121 1122 1125 1126
# -1.110847553 -2.128173826 -1.803025930 -1.164773956 -1.107030040 -2.251146286
# 1128 1133 1137 1141 1142 1144
# -0.940981858 -1.416409893 -1.441811848 -0.330724832 -1.657610560 -1.136508411
# 1146 1147 1148 1149 1150 1154
# -1.286055926 0.196135238 -1.107793309 -1.637240256 -2.226747493 -0.701304388
# 1155 1156 1157 1163 1169 1172
# -1.658472805 -1.654763658 -1.134978654 -2.024766764 -1.440093115 -0.940165139
# 1176 1181 1186 1187 1191 1195
# -0.481589127 -2.114877897 -1.137394808 -0.006881257 -1.636654053 -0.027152409
They correspond to the individual dummies in the glm model:
g.fit$coefficients[grep("factor", names(g.fit$coefficients))]
However, as explained above, those are biased due to the IPP.
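For a quick check that the two sets of fixed effects line up, a sketch (assuming the uncorrected fit b.fit also stores its fixed effects in an alpha element, as b.fit.corr does):
glm.fe <- g.fit$coefficients[grep("factor", names(g.fit$coefficients))]
names(glm.fe) <- sub("factor\\(id\\)", "", names(glm.fe))   # strip the dummy prefix to match the id labels
head(cbind(glm = glm.fe, bife = b.fit$alpha[names(glm.fe)]))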
Note: I couldn't find a method to extract the FE with survival::clogit; if someone knows how to do this, please let me know in the comments!
Data:
idc <- readstata13::read.dta13("http://www.stata-press.com/data/r11/clogitid.dta")
Source: https://stackoverflow.com/questions/65366779/how-to-calculate-nonlinear-binary-fixed-effects-logit-for-longitudinal-panel-d