问题
I'm unable to cluster standard errors using R and guidance based on this post. The cl function returns the error:
Error in tapply(x, cluster1, sum) : arguments must have same length
After reading up on tapply
I'm still not sure why my cluster argument is the wrong length, and what is causing this error.
Here is a link to the data set that I'm using.
https://www.dropbox.com/s/y2od7um9pp4vn0s/Ec%201820%20-%20DD%20Data%20with%20Controls.csv
Here is the R code:
# read in data
charter<-read.csv(file.choose())
View(charter)
colnames(charter)
# standardize NAEP scores
charter$naep.standardized <- (charter$naep - mean(charter$naep, na.rm=T))/sd(charter$naep, na.rm=T)
# change NAs in year.passed column to 2014
charter$year.passed[is.na(charter$year.passed)]<-2014
# Add column with indicator for in treatment (passed legislation)
charter$treatment<-ifelse(charter$year.passed<=charter$year,1,0)
# fit model
charter.model<-lm(naep ~ factor(year) + factor(state) + treatment, data = charter)
summary(charter.model)
# account for clustered standard errors by state
cl(dat=charter, fm=charter.model, cluster=charter$state)
# accounting for controls
charter.model.controls<-lm(naep~factor)
# clustered standard errors
# ---------
# function that calculates clustered standard errors
# source: http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/
cl <- function(dat, fm, cluster){
require(sandwich, quietly = TRUE)
require(lmtest, quietly = TRUE)
M <- length(unique(cluster))
N <- length(cluster)
K <- fm$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
print(K)
uj <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
coeftest(fm, vcovCL)
}
# calculate clustered standard errors
cl(charter, charter.model, charter$state)
The inner workings of the function are a little over my head.
回答1:
When you execute your code, notice that there are missing observations in the linear model:
> summary(charter.model)
Call:
lm(formula = naep ~ factor(year) + factor(state) + treatment,
data = charter)
Residuals:
Min 1Q Median 3Q Max
-15.2420 -1.6740 -0.2024 1.8345 12.3580
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 250.4983 1.2115 206.767 < 2e-16 ***
factor(year)1992 3.7970 0.7198 5.275 2.17e-07 ***
factor(year)1996 7.0436 0.8607 8.183 3.64e-15 ***
[..]
Residual standard error: 3.128 on 404 degrees of freedom
(759 observations deleted due to missingness)
Multiple R-squared: 0.9337, Adjusted R-squared: 0.9239
F-statistic: 94.85 on 60 and 404 DF, p-value: < 2.2e-16
This is what is causing the Error in tapply(x, cluster1, sum) : arguments must have same length
error message that you see.
In cl(dat=charter, fm=charter.model, cluster=charter$state)
the cluster variable charter$state
should have the exact same length as the number of observations effectively used in the regression estimation (which due to NAs is NOT the same as the number of rows in the original data frame).
To work around this you can do the following.
First off you're using an older version of Arai's function (
cl
) (see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R for references to both the old or the new versions, the latter being calledclx
).Second I think Arai's original approach to this function is a bit convoluted, and doesn't really follow the standard interface of
vcov*
functions fromsandwich
. That's why I came with a slightly modified version ofclx
. I made the code a bit more readable, and the interface more like what you would expect from asandwich
vcov*
function:vcovCL <- function(x, cluster.by, type="sss", dfcw=1){ # R-codes (www.r-project.org) for computing # clustered-standard errors. Mahmood Arai, Jan 26, 2008. # The arguments of the function are: # fitted model, cluster1 and cluster2 # You need to install libraries `sandwich' and `lmtest' # reweighting the var-cov matrix for the within model require(sandwich) cluster <- cluster.by M <- length(unique(cluster)) N <- length(cluster) stopifnot(N == length(x$residuals)) K <- x$rank ##only Stata small-sample correction supported right now ##see plm >= 1.5-4 stopifnot(type=="sss") if(type=="sss"){ dfc <- (M/(M-1))*((N-1)/(N-K)) } uj <- apply(estfun(x), 2, function(y) tapply(y, cluster, sum)) mycov <- dfc * sandwich(x, meat=crossprod(uj)/N) * dfcw return(mycov) }
If you try this function on the data you will see that it catches this specific issue:
> coeftest(charter.model, vcov=function(x) vcovCL(x, charter$state))
Error: N == length(x$residuals) is not TRUE
To avoid the issue you could proceed as follows:
> charter.x <- na.omit(charter[ , c("state",
all.vars(formula(charter.model)))])
> coeftest(charter.model, vcov=function(x) vcovCL(x, charter.x$state))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.5050e+02 9.3781e-01 2.6711e+02 < 2.2e-16 ***
factor(year)1992 3.7970e+00 5.6019e-01 6.7780e+00 4.330e-11 ***
factor(year)1996 7.0436e+00 8.8574e-01 7.9522e+00 1.856e-14 ***
factor(year)2000 8.4313e+00 1.0906e+00 7.7311e+00 8.560e-14 ***
factor(year)2003 1.2392e+01 1.1670e+00 1.0619e+01 < 2.2e-16 ***
factor(year)2005 1.3490e+01 1.1747e+00 1.1484e+01 < 2.2e-16 ***
factor(year)2007 1.6334e+01 1.2469e+00 1.3100e+01 < 2.2e-16 ***
factor(year)2009 1.8118e+01 1.2556e+00 1.4430e+01 < 2.2e-16 ***
factor(year)2011 1.9110e+01 1.3459e+00 1.4199e+01 < 2.2e-16 ***
factor(year)2013 1.9301e+01 1.4896e+00 1.2957e+01 < 2.2e-16 ***
factor(state)Alaska 1.4178e+01 8.7686e-01 1.6169e+01 < 2.2e-16 ***
factor(state)Arizona 8.6313e+00 8.1439e-01 1.0598e+01 < 2.2e-16 ***
factor(state)Arkansas 4.3313e+00 8.1439e-01 5.3185e+00 1.736e-07 ***
factor(state)California 3.1103e+00 9.1619e-01 3.3948e+00 0.0007549 ***
factor(state)Colorado 1.7939e+01 7.9736e-01 2.2498e+01 < 2.2e-16 ***
factor(state)Connecticut 1.8031e+01 8.1439e-01 2.2141e+01 < 2.2e-16 ***
factor(state)D.C. -1.8369e+01 8.1439e-01 -2.2555e+01 < 2.2e-16 ***
factor(state)Delaware 1.2050e+01 7.9736e-01 1.5113e+01 < 2.2e-16 ***
factor(state)Florida 7.3838e+00 7.9736e-01 9.2602e+00 < 2.2e-16 ***
factor(state)Georgia 6.4313e+00 8.1439e-01 7.8971e+00 2.724e-14 ***
factor(state)Hawaii 3.3313e+00 8.1439e-01 4.0906e+00 5.196e-05 ***
factor(state)Idaho 1.7118e+01 7.8321e-01 2.1857e+01 < 2.2e-16 ***
factor(state)Illinois 1.2670e+01 8.2224e-01 1.5409e+01 < 2.2e-16 ***
factor(state)Indianna 1.7174e+01 6.1079e-01 2.8117e+01 < 2.2e-16 ***
factor(state)Iowa 2.0074e+01 6.8460e-01 2.9322e+01 < 2.2e-16 ***
factor(state)Kansas 2.0123e+01 8.6796e-01 2.3184e+01 < 2.2e-16 ***
factor(state)Kentucky 1.0200e+01 4.1999e-14 2.4287e+14 < 2.2e-16 ***
factor(state)Louisiana -1.6866e-01 8.1439e-01 -2.0710e-01 0.8360322
factor(state)Maine 2.0231e+01 1.7564e-01 1.1518e+02 < 2.2e-16 ***
factor(state)Maryland 1.4274e+01 6.1079e-01 2.3369e+01 < 2.2e-16 ***
factor(state)Massachusetts 2.4868e+01 8.3960e-01 2.9619e+01 < 2.2e-16 ***
factor(state)Michigan 1.2031e+01 8.1439e-01 1.4773e+01 < 2.2e-16 ***
factor(state)Minnesota 2.5110e+01 9.1619e-01 2.7407e+01 < 2.2e-16 ***
factor(state)Mississippi -3.5470e+00 1.7564e-01 -2.0195e+01 < 2.2e-16 ***
factor(state)Missouri 1.3447e+01 7.2706e-01 1.8495e+01 < 2.2e-16 ***
factor(state)Montana 2.2512e+01 8.4814e-01 2.6543e+01 < 2.2e-16 ***
factor(state)Nebraska 1.9600e+01 4.3105e-14 4.5471e+14 < 2.2e-16 ***
factor(state)Nevada 4.9800e+00 8.6796e-01 5.7375e+00 1.887e-08 ***
factor(state)New Hampshire 2.2026e+01 7.6338e-01 2.8853e+01 < 2.2e-16 ***
factor(state)New Jersey 2.0651e+01 7.6338e-01 2.7052e+01 < 2.2e-16 ***
factor(state)New Mexico 1.5313e+00 8.1439e-01 1.8803e+00 0.0607809 .
factor(state)New York 1.2152e+01 7.1259e-01 1.7054e+01 < 2.2e-16 ***
factor(state)North Carolina 1.2231e+01 8.1439e-01 1.5019e+01 < 2.2e-16 ***
factor(state)North Dakota 2.4278e+01 1.0420e-01 2.3299e+02 < 2.2e-16 ***
factor(state)Ohio 1.7118e+01 7.8321e-01 2.1857e+01 < 2.2e-16 ***
factor(state)Oklahoma 8.4518e+00 7.8321e-01 1.0791e+01 < 2.2e-16 ***
factor(state)Oregon 1.6535e+01 7.3538e-01 2.2486e+01 < 2.2e-16 ***
factor(state)Pennsylvania 1.6651e+01 7.6338e-01 2.1812e+01 < 2.2e-16 ***
factor(state)Rhode Island 9.5313e+00 8.1439e-01 1.1704e+01 < 2.2e-16 ***
factor(state)South Carolina 9.5346e+00 8.3960e-01 1.1356e+01 < 2.2e-16 ***
factor(state)South Dakota 2.1211e+01 3.5103e-01 6.0425e+01 < 2.2e-16 ***
factor(state)Tennessee 4.9148e+00 6.1473e-01 7.9951e+00 1.375e-14 ***
factor(state)Texas 1.4231e+01 8.1439e-01 1.7475e+01 < 2.2e-16 ***
factor(state)Utah 1.5114e+01 7.2706e-01 2.0787e+01 < 2.2e-16 ***
factor(state)Vermont 2.3474e+01 2.0299e-01 1.1564e+02 < 2.2e-16 ***
factor(state)Virginia 1.6252e+01 7.1259e-01 2.2807e+01 < 2.2e-16 ***
factor(state)Washington 1.9073e+01 1.8183e-01 1.0489e+02 < 2.2e-16 ***
factor(state)West Virginia 5.0000e+00 4.2022e-14 1.1899e+14 < 2.2e-16 ***
factor(state)Wisconsin 1.9994e+01 8.2447e-01 2.4251e+01 < 2.2e-16 ***
factor(state)Wyoming 1.8231e+01 8.1439e-01 2.2386e+01 < 2.2e-16 ***
treatment 1.2108e+00 1.0180e+00 1.1894e+00 0.2349682
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It's not nice, but gets the job done. Now cl
will also work just fine and yield the same result as above:
cl(dat=charter, fm=charter.model, cluster=charter.x$state)
A better way to go about this is to use the multiwayvcov
package. As per the packages's website, it is an improvement upon Arai's code:
Transparent handling of observations dropped due to missingness
Using the Petersen data with simulated NAs and cluster.vcov()
:
library("lmtest")
library("multiwayvcov")
data(petersen)
set.seed(123)
petersen[ sample(1:5000, 15), 3] <- NA
m1 <- lm(y ~ x, data = petersen)
summary(m1)
##
## Call:
## lm(formula = y ~ x, data = petersen)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.759 -1.371 -0.018 1.340 8.680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02793 0.02842 0.983 0.326
## x 1.03635 0.02865 36.175 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 2.007 on 4983 degrees of freedom
## (15 observations deleted due to missingness)
## Multiple R-squared: 0.208, Adjusted R-squared: 0.2078
## F-statistic: 1309 on 1 and 4983 DF, p-value: < 2.2e-16
coeftest(m1, vcov=function(x) cluster.vcov(x, petersen$firmid))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.027932 0.067198 0.4157 0.6777
## x 1.036354 0.050700 20.4407 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For a different approach using the plm
package see:
- Double clustered standard errors for panel data
回答2:
For one-way clustering, the robcov
command, from the {rms}
package works really well. read this for more information
http://www.inside-r.org/packages/cran/rms/docs/robcov
来源:https://stackoverflow.com/questions/23313907/clustered-standard-errors-with-data-containing-nas