Clustered Standard Errors with data containing NAs

野性不改 · 2021-02-09 05:46

I'm unable to cluster standard errors in R following the guidance in this post. The cl function returns the error:

Error in tapply(x, cluster1, sum) : arguments must have same length

2 Answers

南方客 · 2021-02-09 06:39

    When you execute your code, notice that there are missing observations in the linear model:

    > summary(charter.model)
    
    Call:
    lm(formula = naep ~ factor(year) + factor(state) + treatment, 
        data = charter)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -15.2420  -1.6740  -0.2024   1.8345  12.3580 
    
    Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                 250.4983     1.2115 206.767  < 2e-16 ***
    factor(year)1992              3.7970     0.7198   5.275 2.17e-07 ***
    factor(year)1996              7.0436     0.8607   8.183 3.64e-15 ***
    
    [..]
    
    Residual standard error: 3.128 on 404 degrees of freedom
      (759 observations deleted due to missingness)
    Multiple R-squared:  0.9337,    Adjusted R-squared:  0.9239 
    F-statistic: 94.85 on 60 and 404 DF,  p-value: < 2.2e-16
    

    This is what is causing the Error in tapply(x, cluster1, sum) : arguments must have same length error message that you see.
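
    The same error is easy to reproduce in isolation: tapply() stops whenever its first two arguments differ in length (a toy illustration, not part of the original post):

    > tapply(1:4, c("a", "a", "b"), sum)
    Error in tapply(1:4, c("a", "a", "b"), sum) : arguments must have same length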

    In cl(dat=charter, fm=charter.model, cluster=charter$state) the cluster variable charter$state must have exactly the same length as the number of observations actually used in the regression (which, because of the NAs, is NOT the same as the number of rows in the original data frame).
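
    A quick way to see the mismatch is to compare the relevant lengths directly (assuming the same charter data frame and charter.model fit as in the question):

    nrow(charter)                      # rows in the original data frame
    length(residuals(charter.model))   # observations actually used by lm()
    length(charter$state)              # length of the cluster variable passed to cl()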


    To work around this you can do the following.

    1. First off, you're using an older version of Arai's function (cl) (see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R for references to both the old and the new versions, the latter being called clx).

    2. Second, I think Arai's original approach to this function is a bit convoluted and doesn't really follow the standard interface of the vcov* functions from sandwich. That's why I came up with a slightly modified version of clx. I made the code a bit more readable and the interface closer to what you would expect from a sandwich vcov* function:

      vcovCL <- function(x, cluster.by, type="sss", dfcw=1){
          # Cluster-robust variance-covariance matrix, adapted from
          # Mahmood Arai's clx() (Jan 26, 2008).
          #
          # Arguments:
          #   x          fitted model (e.g. an lm object)
          #   cluster.by cluster variable, same length as the estimation sample
          #   type       small-sample correction; only "sss" (Stata-style) supported
          #   dfcw       reweighting factor for the within model
          #
          # Requires the `sandwich' package (and `lmtest' for coeftest()).
          require(sandwich)
          cluster <- cluster.by
          M <- length(unique(cluster))         # number of clusters
          N <- length(cluster)                 # number of observations
          stopifnot(N == length(x$residuals))  # catch NA-induced length mismatches
          K <- x$rank
          ## only the Stata small-sample correction is supported right now
          ## (see plm >= 1.5-4)
          stopifnot(type == "sss")
          dfc <- (M/(M-1)) * ((N-1)/(N-K))
          # sum the estimating functions within each cluster, then build the "meat"
          uj <- apply(estfun(x), 2, function(y) tapply(y, cluster, sum))
          dfc * sandwich(x, meat = crossprod(uj)/N) * dfcw
      }
      

    If you try this function on the data you will see that it catches this specific issue:

    > coeftest(charter.model, vcov=function(x) vcovCL(x, charter$state))
     Error: N == length(x$residuals) is not TRUE
    

    To avoid the issue you could proceed as follows:

    > charter.x <- na.omit(charter[ , c("state", 
                                      all.vars(formula(charter.model)))])
    > coeftest(charter.model, vcov=function(x) vcovCL(x, charter.x$state)) 
    
    t test of coefficients:
    
                                   Estimate  Std. Error     t value  Pr(>|t|)    
    (Intercept)                  2.5050e+02  9.3781e-01  2.6711e+02 < 2.2e-16 ***
    factor(year)1992             3.7970e+00  5.6019e-01  6.7780e+00 4.330e-11 ***
    factor(year)1996             7.0436e+00  8.8574e-01  7.9522e+00 1.856e-14 ***
    factor(year)2000             8.4313e+00  1.0906e+00  7.7311e+00 8.560e-14 ***
    factor(year)2003             1.2392e+01  1.1670e+00  1.0619e+01 < 2.2e-16 ***
    factor(year)2005             1.3490e+01  1.1747e+00  1.1484e+01 < 2.2e-16 ***
    factor(year)2007             1.6334e+01  1.2469e+00  1.3100e+01 < 2.2e-16 ***
    factor(year)2009             1.8118e+01  1.2556e+00  1.4430e+01 < 2.2e-16 ***
    factor(year)2011             1.9110e+01  1.3459e+00  1.4199e+01 < 2.2e-16 ***
    factor(year)2013             1.9301e+01  1.4896e+00  1.2957e+01 < 2.2e-16 ***
    factor(state)Alaska          1.4178e+01  8.7686e-01  1.6169e+01 < 2.2e-16 ***
    factor(state)Arizona         8.6313e+00  8.1439e-01  1.0598e+01 < 2.2e-16 ***
    factor(state)Arkansas        4.3313e+00  8.1439e-01  5.3185e+00 1.736e-07 ***
    factor(state)California      3.1103e+00  9.1619e-01  3.3948e+00 0.0007549 ***
    factor(state)Colorado        1.7939e+01  7.9736e-01  2.2498e+01 < 2.2e-16 ***
    factor(state)Connecticut     1.8031e+01  8.1439e-01  2.2141e+01 < 2.2e-16 ***
    factor(state)D.C.           -1.8369e+01  8.1439e-01 -2.2555e+01 < 2.2e-16 ***
    factor(state)Delaware        1.2050e+01  7.9736e-01  1.5113e+01 < 2.2e-16 ***
    factor(state)Florida         7.3838e+00  7.9736e-01  9.2602e+00 < 2.2e-16 ***
    factor(state)Georgia         6.4313e+00  8.1439e-01  7.8971e+00 2.724e-14 ***
    factor(state)Hawaii          3.3313e+00  8.1439e-01  4.0906e+00 5.196e-05 ***
    factor(state)Idaho           1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
    factor(state)Illinois        1.2670e+01  8.2224e-01  1.5409e+01 < 2.2e-16 ***
    factor(state)Indianna        1.7174e+01  6.1079e-01  2.8117e+01 < 2.2e-16 ***
    factor(state)Iowa            2.0074e+01  6.8460e-01  2.9322e+01 < 2.2e-16 ***
    factor(state)Kansas          2.0123e+01  8.6796e-01  2.3184e+01 < 2.2e-16 ***
    factor(state)Kentucky        1.0200e+01  4.1999e-14  2.4287e+14 < 2.2e-16 ***
    factor(state)Louisiana      -1.6866e-01  8.1439e-01 -2.0710e-01 0.8360322    
    factor(state)Maine           2.0231e+01  1.7564e-01  1.1518e+02 < 2.2e-16 ***
    factor(state)Maryland        1.4274e+01  6.1079e-01  2.3369e+01 < 2.2e-16 ***
    factor(state)Massachusetts   2.4868e+01  8.3960e-01  2.9619e+01 < 2.2e-16 ***
    factor(state)Michigan        1.2031e+01  8.1439e-01  1.4773e+01 < 2.2e-16 ***
    factor(state)Minnesota       2.5110e+01  9.1619e-01  2.7407e+01 < 2.2e-16 ***
    factor(state)Mississippi    -3.5470e+00  1.7564e-01 -2.0195e+01 < 2.2e-16 ***
    factor(state)Missouri        1.3447e+01  7.2706e-01  1.8495e+01 < 2.2e-16 ***
    factor(state)Montana         2.2512e+01  8.4814e-01  2.6543e+01 < 2.2e-16 ***
    factor(state)Nebraska        1.9600e+01  4.3105e-14  4.5471e+14 < 2.2e-16 ***
    factor(state)Nevada          4.9800e+00  8.6796e-01  5.7375e+00 1.887e-08 ***
    factor(state)New Hampshire   2.2026e+01  7.6338e-01  2.8853e+01 < 2.2e-16 ***
    factor(state)New Jersey      2.0651e+01  7.6338e-01  2.7052e+01 < 2.2e-16 ***
    factor(state)New Mexico      1.5313e+00  8.1439e-01  1.8803e+00 0.0607809 .  
    factor(state)New York        1.2152e+01  7.1259e-01  1.7054e+01 < 2.2e-16 ***
    factor(state)North Carolina  1.2231e+01  8.1439e-01  1.5019e+01 < 2.2e-16 ***
    factor(state)North Dakota    2.4278e+01  1.0420e-01  2.3299e+02 < 2.2e-16 ***
    factor(state)Ohio            1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 ***
    factor(state)Oklahoma        8.4518e+00  7.8321e-01  1.0791e+01 < 2.2e-16 ***
    factor(state)Oregon          1.6535e+01  7.3538e-01  2.2486e+01 < 2.2e-16 ***
    factor(state)Pennsylvania    1.6651e+01  7.6338e-01  2.1812e+01 < 2.2e-16 ***
    factor(state)Rhode Island    9.5313e+00  8.1439e-01  1.1704e+01 < 2.2e-16 ***
    factor(state)South Carolina  9.5346e+00  8.3960e-01  1.1356e+01 < 2.2e-16 ***
    factor(state)South Dakota    2.1211e+01  3.5103e-01  6.0425e+01 < 2.2e-16 ***
    factor(state)Tennessee       4.9148e+00  6.1473e-01  7.9951e+00 1.375e-14 ***
    factor(state)Texas           1.4231e+01  8.1439e-01  1.7475e+01 < 2.2e-16 ***
    factor(state)Utah            1.5114e+01  7.2706e-01  2.0787e+01 < 2.2e-16 ***
    factor(state)Vermont         2.3474e+01  2.0299e-01  1.1564e+02 < 2.2e-16 ***
    factor(state)Virginia        1.6252e+01  7.1259e-01  2.2807e+01 < 2.2e-16 ***
    factor(state)Washington      1.9073e+01  1.8183e-01  1.0489e+02 < 2.2e-16 ***
    factor(state)West Virginia   5.0000e+00  4.2022e-14  1.1899e+14 < 2.2e-16 ***
    factor(state)Wisconsin       1.9994e+01  8.2447e-01  2.4251e+01 < 2.2e-16 ***
    factor(state)Wyoming         1.8231e+01  8.1439e-01  2.2386e+01 < 2.2e-16 ***
    treatment                    1.2108e+00  1.0180e+00  1.1894e+00 0.2349682    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

    It's not nice, but gets the job done. Now cl will also work just fine and yield the same result as above:

    cl(dat=charter, fm=charter.model, cluster=charter.x$state)
    

    A better way to go about this is to use the multiwayvcov package. As per the package's website, it is an improvement upon Arai's code:

    Transparent handling of observations dropped due to missingness

    Using the Petersen data with simulated NAs and cluster.vcov():

    library("lmtest")
    library("multiwayvcov")
    
    data(petersen)
    set.seed(123)
    petersen[ sample(1:5000, 15), 3] <- NA
    
    m1 <- lm(y ~ x, data = petersen)
    summary(m1)
    ## 
    ## Call:
    ## lm(formula = y ~ x, data = petersen)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -6.759 -1.371 -0.018  1.340  8.680 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  0.02793    0.02842   0.983    0.326    
    ## x            1.03635    0.02865  36.175   <2e-16 ***
    ## ---
    ## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    ## 
    ## Residual standard error: 2.007 on 4983 degrees of freedom
    ##   (15 observations deleted due to missingness)
    ## Multiple R-squared:  0.208,  Adjusted R-squared:  0.2078 
    ## F-statistic:  1309 on 1 and 4983 DF,  p-value: < 2.2e-16
    
    coeftest(m1, vcov=function(x) cluster.vcov(x, petersen$firmid))
    ## 
    ## t test of coefficients:
    ## 
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 0.027932   0.067198  0.4157   0.6777    
    ## x           1.036354   0.050700 20.4407   <2e-16 ***
    ## ---
    ## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

    For a different approach using the plm package see:

    • Double clustered standard errors for panel data
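
    For completeness, here is a minimal plm sketch on the same Petersen data used above (a hypothetical translation, not taken from the linked answer): pooled OLS with standard errors clustered by firm, where rows with NAs are again dropped automatically:

    library("plm")
    library("lmtest")

    # Pooled OLS on the firm-year panel; cluster="group" clusters by firmid,
    # and type="sss" applies the same Stata-style small-sample correction as above.
    p.m1 <- plm(y ~ x, data = petersen, index = c("firmid", "year"),
                model = "pooling")
    coeftest(p.m1, vcov = vcovHC(p.m1, type = "sss", cluster = "group"))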
